Cloud SRE - Reliability

Distributed, APJ

Posted 1 month ago

We are looking for a new member of our Cloud Reliability team, whose goal is a more reliable and resilient Elastic Cloud. We take an engineering approach to operational problems and use data to focus attention on the issues that matter most. We draw on our operational experience to shorten feedback loops, connect all teams to production, and promote a culture of operational excellence. In doing so, we enable the future growth of Elastic Cloud.

Who you are:

  • You have outstanding interpersonal skills, and are able to build strong relationships across a dynamic, growing team
  • You have a background in software engineering, and can confidently collaborate with engineers to identify and resolve issues 
  • You have a holistic view of and true appreciation for reliability, born of real-world experience operating production services

What you’ll do:

In this role you will:

  • Lead initiatives aimed at improving the reliability of Elastic Cloud, through prevention, detection, and timely mitigation of issues
  • Contribute to a culture of mutual respect, collaboration, and operational excellence
  • Contribute to auto-remediation and system engineering efforts, freeing yourself and others from day-to-day toil
  • Participate in a weekly on-call rotation, using a follow-the-sun model

What you’ve done:

You don't need to have done all of these, but they represent the types of work you will do at Elastic Cloud:

  • Designed and implemented applications which leverage the Elastic Stack
  • Operated a SaaS product in a public cloud (AWS, GCP, Azure, or SoftLayer preferred)
  • Diagnosed and resolved issues with the Elastic Stack
  • Actively contributed to a self-organising and collaborative team environment
  • Mentored, coached, and grown team members to bring out the best in them
  • Automated orchestration tasks at scale (we commonly use Python, Go, and Shell scripting)
  • Used metrics systems (e.g. Elastic Stack, Graphite, Prometheus, Influx) effectively to diagnose issues and quantify impacts
  • Worked with containerised services

For this position, we can only accept applicants currently located in Australia, Japan, South Korea, Hong Kong, or New Zealand.

Job tags: AWS Azure GCP Go Prometheus Python
Job region(s): Asia/Pacific Remote/Anywhere