Engineering Team Lead - Site Reliability
Posted 4 months ago
At Datadog, we’re on a mission to build the best monitoring platform in the world. We operate at high scale—trillions of data points per day—providing always-on alerting, metrics visualization, logs, and application tracing for tens of thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems the right way.
How do you keep a data-intensive, real-time service that monitors hundreds of thousands of servers up and running around the clock?
How do you respond to infrastructure failures or performance issues in a high-volume, low-latency computing environment?
What should the infrastructure look like when Datadog monitors millions of servers and containers? If you think you have the answers, join us on the Site Reliability team!
As an Engineering Team Lead for the SRE team, you will manage a team of engineers, own significant chunks of our architecture, design and build systems at scale, and shape product decisions. You'll work on challenging projects, make an impact, and grow as an engineer and a lead.
- Solve a scaling bottleneck in a critical service
- Mentor other engineers on your team
- Design a new service and write an architecture RFC
- Deploy a new feature to production, progressively rolling it out with feature flags
- Investigate and fix a production issue from a service your team owns
- Plan the most important projects to work on next
- You have been building applications for 4+ years and know the systems you’ve worked on from top to bottom
- You have significant backend programming experience
- You have managed a team of software engineers
- You have architected, built, and operated distributed systems to solve problems at high scale
- You have a BS/MS/PhD in a scientific field or equivalent experience
- You want to work in a fast-paced, high-growth startup environment that respects its engineers and customers
- You've shipped complex projects with teams of engineers
- You've worked at high scale with systems like Redis, Cassandra, and Kafka
- You have significant experience with Go, C, or Python
Is this you? Let's chat!