Site Reliability Engineer

Palo Alto, CA | San Francisco, CA


Posted 1 month ago

DFINITY is reimagining the Internet as a public network that hosts secure software and services. The Internet Computer is a new technology stack that will be unhackable and fast, scale to billions of users around the world, and support a new kind of autonomous software that promises to reverse Big Tech’s monopolization of the internet. DFINITY was founded in 2016 by Dominic Williams and is backed by top-tier institutions including Polychain Capital and Andreessen Horowitz.

As a Site Reliability Engineer, you will provide operational support for the Internet Computer components at the application layer. This includes ongoing development of systems that monitor the Internet Computer’s health, and corrective actions in case of incidents.


  • Select, design, build, deploy, and maintain the services used to ensure high availability of DFINITY's product
  • Identify opportunities to automate or improve processes by writing code, and then write the code
  • Bake reliability and operability into the product from the start by participating in design and code reviews and identifying risks, problems, and mitigations
  • Work with other engineering and security teams to define processes that preserve the goals of the Internet Computer while remaining operationally feasible and automatable
  • Work with product owners to set SLOs, then implement SLOs in code and observability infrastructure
  • On-call for production services: 12/7 (on-call is split across two sites), roughly 1 week in 6. Because issues may be caused by problems in wildly different areas of the code, the chief responsibility is to coordinate the response to the issue and ensure it is resolved, pulling in engineers from other teams as necessary. On-call work is compensated with generous time off
  • This is not a team that exists to be on-call. This is a team that elects to be on-call because it helps do the job better. Being on-call makes it easier and more motivating to identify opportunities to reduce the number of alerts the system generates.
  • Operating, troubleshooting, and deploying software to Unix systems


  • Think about things in a systemic, methodical way, especially when troubleshooting
  • Know when "This is good enough for the next 12 months" is appropriate
  • Coordinate incident response across multiple teams -- clearly understanding and communicating what is going on, next steps, who is responsible for what, and so on
  • Write code. We use Rust. You don't need to know Rust already, as there'll be opportunities to learn, but experience designing and writing moderately sized applications (up to ~10 kLOC) is necessary. Identifying opportunities to automate or improve processes, and then writing the code to do it, is key.

Within 1 month you will

  • Understand DFINITY's infrastructure and production environment
  • Have picked a suitable starter project
  • Have submitted improvements to our documentation and processes that you noticed were needed during onboarding

Within 3 months you will

  • Have delivered the starter project
  • Have shadowed other team members on-call, and be ready to join the on-call rotation from month 4 onwards
  • Have proactively identified other improvements and proposed projects to deliver them

All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status.

Job tags: High availability Reliability engineering Unix
Job region(s): North America