Staff Site Reliability Engineer

Pittsburgh, PA

Duolingo

Posted 6 days ago

Duolingo is the most popular language learning application in the world, with over 300 million users. We are passionate about education, fact-based decision making, and elegant solutions to cross-functional problems. If that sounds like you, then come join us as we build the next-generation learning company!

As a Staff Site Reliability Engineer, you will work closely with cross-functional engineering teams to ensure Duolingo’s complex distributed systems and products are built and maintained with world-class quality, and operated in measurable and scalable ways.

You will...

  • Collaborate with internal teams to identify sources of instability in distributed systems and drive operational excellence
  • Own core infrastructure (i.e., manage, diagnose, and debug large-scale distributed systems in production)
  • Provide system design consulting, develop software platforms/frameworks, and conduct launch reviews and root cause analysis
  • Maintain and document sustainable postmortem/incident response practices
  • Understand and resolve potential threats to performance or security
  • Monitor and measure latency, availability, and overall system health once systems are live
  • Advocate for and implement changes that improve reliability, scalability, and velocity
  • Monitor and stress test systems to collect metrics for tuning and capacity planning
  • Reduce the burden of toil with iterative development of tooling and automation
  • Collaborate with engineering teams to release new features and become an authority on our services
  • Participate in on-call rotation

You have...

  • Bachelor’s Degree in Computer Science
  • 5+ years of experience in site reliability engineering/DevOps on a product with millions of users
  • Experience analyzing and troubleshooting large-scale distributed systems
  • Proven knowledge of C, C++, Java, Kotlin, Python or Go
  • Fluency in networking protocols such as TCP/IP, HTTP, SSL, and DNS
  • An understanding of containerization toolsets and container orchestration technologies (Docker, Mesos, Kubernetes, Nomad, etc.)
  • Effective communication skills and understanding of best practices around tools/methodologies for Infrastructure, Automation, Capacity Planning, etc.
  • Ability to be on-call for critical incident responses
Job tags: C, Docker, Go, Java, Kubernetes, Mesos, Python, Reliability engineering
Job region(s): North America