Staff Site Reliability Engineer
Duolingo is the most popular language learning application in the world, with over 300 million users. We are passionate about education, fact-based decision making, and elegant solutions to cross-functional problems. If that sounds like you, then come join us as we build the next-generation learning company!
As a Staff Site Reliability Engineer, you will work closely with cross-functional engineering teams to ensure Duolingo’s complex distributed systems and products are built and maintained with world-class quality, and operated in measurable and scalable ways.
- Collaborate with internal teams to identify sources of instability in distributed systems and drive operational excellence
- Own core infrastructure (i.e., manage, diagnose, and debug large-scale distributed systems in production)
- Provide system design consulting, develop software platforms/frameworks, and conduct launch reviews and root cause analysis
- Maintain and document sustainable postmortem/incident response practices
- Understand and resolve potential threats to performance or security
- Monitor and measure latency, availability, and overall system health once systems are live
- Advocate for and implement changes that improve reliability, scalability, and velocity
- Monitor and stress test systems to collect metrics for tuning and capacity planning
- Reduce the burden of toil with iterative development of tooling and automation
- Collaborate with engineering teams to release new features and become an authority on our services
- Participate in on-call rotation
- Bachelor’s Degree in Computer Science
- 5+ years of experience in site reliability engineering or DevOps for a product with millions of users
- Experience analyzing and troubleshooting large-scale distributed systems
- Proven knowledge of C, C++, Java, Kotlin, Python, or Go
- Fluency in networking protocols such as TCP/IP, HTTP, SSL, and DNS
- An understanding of containerization toolsets and container orchestration technologies (Docker, Mesos, Kubernetes, Nomad, etc.)
- Effective communication skills and understanding of best practices around tools/methodologies for Infrastructure, Automation, Capacity Planning, etc.
- Ability to be on call for critical incident response