Engineering Manager, Site Reliability

Remote or NYC

Olo

Posted 3 weeks ago

Are you passionate about building highly available, performant and scalable systems? We are looking for an ambitious manager to lead our team of Site Reliability Engineers. Our mission is to instill the SRE culture deep within our engineering teams’ DNA, and to help us all realize that reliability is everyone’s job. Olo is experiencing tremendous growth, and Reliability at Scale has become our key mantra. As we enhance our platform to support the increased demand, it must be positioned for continued stability, reliability and resiliency, even at 10x scale! You will be challenged with complex yet interesting problems, and your passion to lead this team to success will be crucial.
You will partner with Engineering and Product Managers to continually learn, improve system availability and sharpen our execution skills as we deliver an amazing digital ordering platform. Your focus will be on helping us improve system reliability while building and maintaining our solutions. Your curiosity and passion for learning will help discover new ways for us to improve and deliver the best service to our customers.
At Olo, Site Reliability Engineering is a discipline that combines software and systems engineering to build and run web-scale, distributed, fault-tolerant and performant systems. As the leader of the Site Reliability team, you will ensure that Olo's internal and external systems have reliability and uptime appropriate to end users' needs and a feedback loop focused on improvement while keeping a watchful eye on capacity and performance. While we mature our presence in this space, you will evangelize and advocate for the core SRE principles and push others to grow in this area as well!
We expect our Engineering Managers to be seasoned engineers with the technical experience to both guide and challenge their teams to build robust, high-performing solutions. As a servant leader, your focus will be on facilitating strong team outcomes, hiring and developing engineering talent, and ensuring that our systems are ready to support emerging business priorities. While your primary focus will not be designing, developing and maintaining software, we expect our engineering managers to have a strong software and infrastructure engineering foundation so that they can effectively guide their teams. As such, we expect you to have demonstrable experience in software engineering that spans infrastructure, architecture and design, quality best practices, and production operational concerns.

What You’ll Be Doing

  • Facilitate strong team outcomes as a servant leader.
  • Recruit, hire, and develop a team of highly skilled Site Reliability and Chaos Engineering talent.
  • Take ownership of different SRE dimensions, from observability and SLIs/SLOs to Incident Response, postmortems and follow-up actions.
  • Work to define standards and best practices and help drive those out into the broader engineering organization.
  • Help implement and tailor our incident response tools in order to minimize outage durations and provide the best service to our response teams.
  • Brainstorm, define, and build collaborative monitoring solutions with members across multiple product teams.
  • Help evolve our L2 support capabilities, by working hand in hand with our product and engineering teams and establishing the resources needed for triage, assessment, escalation and remediation.
  • Contribute insights across teams to help us improve or re-architect existing systems to support scale, performance and extensibility.
  • Constantly re-evaluate our observability tooling to improve architecture, knowledge models, user experience, performance and stability.
  • Analyze and mature our processes around Incident Response, Observability, Postmortems and Predictive Monitoring.
  • Help evolve our use of Chaos Engineering to pressure test our services and architecture, allowing us to quantify the resiliency and robustness of our architecture.
  • Maintain production services by measuring and monitoring availability, latency and overall system health.
  • Influence an engineering culture of reliability, observability, and availability.
  • Strive to coach and mentor engineering teams through game days, SRE boot camps and other training and feedback channels.

What We'll Expect From You

  • A passion for computing that extends beyond work
  • Strong experience with monitoring systems like Datadog, Sumo Logic, Raygun, New Relic or similar.
  • Fluency in at least one Incident Management tool such as FireHydrant, OpsGenie, PagerDuty, VictorOps or similar.
  • Some past experience with build and deploy tools such as Jenkins, TeamCity, Octopus, CircleCI, etc.
  • You've been in the trenches building highly scalable, efficient, and resilient systems.
  • Self-starter: can take high-level direction and organize to achieve its objectives.
  • Highly motivated individual with a curiosity to learn as you grow.
  • Able to inspire and motivate both direct reports and peer teams towards a common goal.
  • Demonstrated ability to build and mentor a high-performance team while seeking continuous improvement
  • Ability to translate business priorities into viable technology solutions, and deliver
  • Experience in, and good understanding of, large-scale, highly performant, distributed systems architecture and principles
  • Experience developing realistic project plans, managing stakeholder expectations, and tracking team execution
  • Able to define, plan and drive a roadmap in an agile approach, while providing transparency and surfacing execution metrics along the way.
  • Responsible for ensuring core systems can support the growing demand of our customers and business priorities
  • Provide coaching and counseling via mentoring, one-on-one meetings, etc.
  • Legally able to work in the U.S.
  • Willing to roll up your sleeves, work hard and be scrappy!

Nice to Have

  • Prior hands-on software development experience.
  • Experience with Ansible, Terraform or other Infrastructure-as-Code tools.
  • Expertise in guiding Incident Response, in terms of both process and tooling.

What's Important to Olo

  • Our families come first. We know they make us who we are and they are who we live and work for every day. 
  • Olo is our extended family. We’re in this together, fighting for one another. We’re happy to be here. We will not let one another down. 
  • We learn from and fight through setbacks. We recognize and help one another with direct feedback. 
  • We care about you. We offer 20 days of paid time off, fully paid health, dental and vision care premiums, stock options, and a generous parental leave plan.
  • We value diversity. At Olo, we know a diverse and inclusive team not only makes our products better, but our workplace better. Many groups are consistently underrepresented across the tech sector and we are fully committed to doing our part to move the needle. 

COVID-19 Impact

Olo is committed to the well-being of candidates, employees and our community. The Olo NYC Headquarters will be closed for the foreseeable future because of the global outbreak of COVID-19. While an in-person interview is typical for many roles at Olo, we will conduct interviews via video conferencing while our HQ is closed. Because over half of our workforce is remote, we are accustomed to conducting interviews via video conferencing, and we anticipate no impact on our recruiting timelines. We encourage candidates to share any concerns or questions with Olo’s recruiting team.

About Olo

Olo powers digital ordering and delivery programs that connect restaurant brands to the on-demand world, placing orders directly into the restaurant through all order origination points – from a brand’s own website or app, third-party marketplaces, social media platforms, smart speakers, and home assistants. Olo serves as the on-demand ordering and delivery platform for over 300 brands, such as Applebee’s, Checkers & Rally’s, Cheesecake Factory, Chili’s, Dairy Queen, Denny’s, Five Guys Burgers & Fries, Jamba Juice, Noodles & Company, Portillo’s Hot Dogs, Shake Shack, sweetgreen, Wingstop, and more. Learn more at www.olo.com.
Olo's headquarters is located on the 82nd floor of One World Trade Center. We offer great benefits, such as 20 days of Paid Time Off, fully paid health, dental and vision care premiums, stock options, a generous parental leave plan, and perks like FitBits, rotating craft beers on tap in our kitchen, and food events featuring our clients' menu items (now you know why we give out FitBits!). Check out our culture map: https://www.olo.com/images/culture.jpg.
We encourage you to apply! 
Olo is an equal opportunity employer and diversity is highly valued at our company. All applicants receive consideration for employment. We do not discriminate on the basis of race, religion, color, national origin, gender identity, sexual orientation, pregnancy, age, marital status, veteran status, or disability status.
If you like what you read, hear, and/or know about Olo, and want to be a part of our team, please do not hesitate to apply! We are excited to hear from you!