Cloud SRE - Senior Site Reliability Engineer (Observability)
Elastic is a free and open search company that powers enterprise search, observability, and security solutions built on one technology stack that can be deployed anywhere. From finding documents to monitoring infrastructure to hunting for threats, Elastic makes data usable in real-time and at scale. Thousands of organizations worldwide, including Barclays, Cisco, eBay, Fairfax, ING, Goldman Sachs, Microsoft, The Mayo Clinic, NASA, The New York Times, Wikipedia, and Verizon, use Elastic to power mission-critical systems. Founded in 2012, Elastic is a distributed company with Elasticians around the globe. Learn more at elastic.co.
Thanks to our ongoing expansion, we have the opportunity to grow our Cloud Observability team. As part of Elastic Cloud engineering, we focus on delivering a reliable and resilient Elastic Cloud. We draw upon our operational experience not just to solve issues with distributed systems, but also to steer the design of Elastic Cloud toward a stable and reliable service. We’re looking for people who are as passionate about taking an engineering approach to operational problems as they are about using data and feedback to solve problems collaboratively. For Observability in particular, we’d love to speak to candidates with a background in data processing, visualisation, and Observability concepts as a whole.
What you will be doing:
- Lead technical initiatives aimed at improving the reliability (and specifically predictability and observability) of Elastic Cloud, taking an engineering approach to the prevention, detection, and timely mitigation of issues.
- Contribute to SRE engineering through auto-remediation and systems engineering work that continues to reduce human intervention in processes and operational tasks.
- Respond to major incidents, then correct and improve systems to prevent recurrence and support growth at scale.
- Solve the operational problems that you find in Elastic Cloud with full support from your team. You will contribute to a culture of elevating others, collaboration, and operational excellence.
- Participate in a weekly on-call rotation, using a follow-the-sun model.
What you bring along:
- Holistic view of and true appreciation for reliability, born of real-world experience operating production services. You have examples of using software engineering and SRE practices to solve operational problems.
- A background in software engineering and the confidence to collaborate with engineers to identify and resolve issues, ideally with experience in public cloud (AWS, GCP, Azure) and preferably on distributed systems at scale.
- You have outstanding interpersonal skills and build strong relationships through inclusive communication. Examples of working in distributed teams or working remotely are desirable.
- Familiar with the core concepts surrounding Observability of large-scale SaaS platforms, including both trend analysis for operation of the service, and business intelligence related reporting.
- You have operated a SaaS product in a public cloud (AWS, GCP, Azure, or SoftLayer preferred).
- Experience in system administration with professional skills in Linux on distributed systems at scale.
- You have designed or implemented Elastic Stack deployments, or diagnosed and resolved issues with them.
- You have demonstrable experience in leading alerting and major incident management best practices.
- You are experienced in contributing in a self-organizing and collaborative team environment.
- You have mentored, coached, and grown team members to bring out the best in them.
- Comfortable writing software to automate orchestration tasks at scale (we commonly use Python, Go, and Shell scripting).
- Have used metrics systems (e.g. Elastic Stack, Graphite, Prometheus, Influx) effectively to diagnose issues and quantify impacts, sharing this information with others at varying levels in the organization.
- You have worked with containerized services (such as Docker).
Additional Information - We Take Care of Our People
As a distributed company, diversity drives our identity. Whether you’re looking to launch a new career or grow an existing one, Elastic is the type of company where you can balance great work with great life. Your age is only a number. It doesn’t matter if you’re just out of college or your children are; we need you for what you can do.
We strive to have parity of benefits across regions, and while regulations differ from place to place, we believe taking care of our people is the right thing to do.
- Competitive pay based on the work you do here and not your previous salary
- Health coverage for you and your family in many locations
- Ability to craft your calendar with flexible locations and schedules for many roles
- Generous number of vacation days each year
- Double your charitable giving - We match up to $1500 (or local currency equivalent)
- Up to 40 hours each year to use toward volunteer projects you love
- Embracing parenthood with a minimum of 16 weeks of parental leave
Different people approach problems differently. We need that. Elastic is committed to diversity as well as inclusion. We are an equal opportunity employer and committed to the principles of affirmative action. Qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, protected veteran status, or disability status or any other basis protected by federal, state or local law, ordinance or regulation. If you require any reasonable accessibility support, please email firstname.lastname@example.org.