Software Engineer, Systems Reliability

San Francisco

Applications have closed

Posted 1 month ago

Join us in building some of the largest AI supercomputing clusters in the world! You will manage and scale the company's supercomputers (powered by Kubernetes), build our research platform, and work on cross-functional projects to accelerate progress in AI research. We work at the very cutting edge of speed and scale, combining the traditions of High-Performance Computing (HPC) with a modern cloud and containerized environment.
We recently launched our newest cluster, “Owl”, with over 250K cores, 10K GPUs, and 400Gbps of networking per node. That would place it in the top 5 of the TOP500 list of the world's supercomputers. See this blog post to get a sense of the kinds of challenges we solve in our day-to-day work.
In this role, you will work closely with machine learning researchers and directly accelerate their work, but you don't need to be a machine learning expert yourself. We value people who can quickly obtain a deep technical understanding of new domains and who enjoy being self-directed and identifying the most important problems to solve. Experience with high-performance computing or open-source contributions is a bonus.
We believe that increasing compute is a huge lever to AI progress. You will have a direct impact on our ability to grow to an unprecedented scale and likewise produce unprecedented results.

We look for a blend of:

  • Experience designing, implementing and running production services
  • Comfort managing and monitoring large-scale infrastructure deployments
  • Willingness to debug problems across the stack, such as networking issues, performance problems, or memory leaks
  • Ownership of problems end-to-end, and willingness to pick up whatever knowledge you're missing to get the job done

You might enjoy this work if you:

  • Know your way around bash, Terraform, Python, and/or Chef
  • Have experience running large Kubernetes clusters with GPU workloads
  • Can design a highly-available distributed system
  • Have helped a team mature with standardized tools and processes around stability, observability, and scaling

About OpenAI
We’re building safe Artificial General Intelligence (AGI), and ensuring it leads to a good outcome for humans. We believe that unreasonably great results are best delivered by a highly creative group working in concert. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
This position is subject to a background check for any convictions directly related to its duties and responsibilities. Only job-related convictions will be considered and will not automatically disqualify the candidate. Pursuant to the San Francisco Fair Chance Ordinance, we will consider for employment qualified applicants with arrest and conviction records.
  • Health, dental, and vision insurance for you and your family
  • Unlimited time off (we encourage 4+ weeks per year)
  • Parental leave
  • Flexible work hours
  • Lunch and dinner each day
  • 401(k) plan