Cloud - SRE - Reliability

Distributed, APJ

Applications have closed

Posted 1 month ago

Elastic is a search company with a simple goal: to solve the world's data problems with products that delight and inspire. As the creators of the Elastic Stack, we help thousands of organizations including Cisco, eBay, Goldman Sachs, Microsoft, The Mayo Clinic, NASA, The New York Times, Wikipedia, Verizon, and many more use Elastic to power mission-critical systems. From stock quotes to Twitter streams, Apache logs to WordPress blogs, our products are extending what's possible with data, delivering on the promise that good things come from connecting the dots. We have a distributed team of Elasticians across 30+ countries (and counting), and our diverse open source community spans over 100 countries. Learn more at

Thanks to our ongoing expansion, we have the opportunity to grow our Cloud SRE - Reliability team, the front-line owners of Incident Management, Investigation, and Response for the Elastic Cloud platform. We take a Site Reliability Engineering approach to addressing stability concerns, so we’re looking for people who are just as passionate about resolving distributed system issues as they are about coding and collaborating with others. In this role you’ll be responsible for the health of thousands of Elasticsearch clusters spread across all major cloud providers.

Who you are:

  • You have outstanding interpersonal skills, and can effectively coordinate incident response across globally distributed teams in a dynamic, growing environment
  • You are a software engineer at heart, with a compulsion to automate yourself out of a job
  • You have production-grade experience operating Linux systems, with the ability to methodically diagnose system, network, and application issues
  • Experience with GovCloud is welcome

What you’ll do:

In this role you will:

  • participate in a weekly on-call rotation, using a follow-the-sun model; on-call shifts are aligned with local business hours
  • provide low-latency response to incidents and service instability, coordinating with internal and external teams as needed
  • contribute to tooling, automation, and system engineering efforts, freeing yourself and others from day-to-day toil
  • lead blameless post-mortems, ensuring preventative actions are prioritised appropriately
  • be an advocate for Elastic Cloud customers, sharing your deep insight into our production systems with other engineering teams

What you’ve done:

You don't need experience with all of these items, but they represent the types of work you will do at Elastic Cloud.

  • You have operated a SaaS product in a public cloud (AWS, GCP, Azure, or SoftLayer preferred), and have some stories to share
  • You are adept at writing software to automate orchestration tasks at scale; we commonly use Python, Go, and Shell scripting
  • You can use metrics systems (e.g. Elastic, Graphite, Prometheus, Influx) effectively to diagnose issues and quantify impacts
  • You have worked with cloud infrastructure-as-code tooling such as Terraform, CloudFormation, or others
  • You've diagnosed and resolved Elastic Stack cluster issues
  • You are familiar with containerisation and container orchestration concepts


