Staff Site Reliability Engineer
Remote, US

Segment.io, Inc.
At Segment, we believe companies should be able to send their data wherever they want, whenever they want, with no fuss. Unfortunately, most product managers, analysts, and marketers spend too much time searching for the data they need, while engineers are stuck integrating the tools they want to use. Segment standardizes and streamlines data infrastructure with a single platform that collects, unifies, and sends data to hundreds of business tools with the flip of a switch. That way, our customers can focus on building amazing products and personalized messages for their customers, letting us take care of the complexities of processing their customer data reliably at scale. We’re in the running to power the entire customer data ecosystem, and we need the best people to take the market.
The Infrastructure Engineering group is central to Segment’s Platform strategy. The ecosystem of tools that your team creates and supports is the foundation for the services built by Product teams. To maintain our leadership position in the customer engagement space, we must continue to build innovative services that support our developers in seamlessly delivering value to customers. You will partner with some of the brightest minds in the industry to push the boundaries of web-scale service delivery.
As a member of the Site Reliability Engineering (SRE) team, you’ll help to empower our entire R&D organization. Alongside a diverse, distributed Infrastructure group, you’ll participate in building the next iteration of our service platform, focusing on the reliability, operability, observability, flexibility, and cost-effectiveness of our production infrastructure.
What you’ll do
- Write software to build, maintain, automate, and introspect our production systems
- Mentor teams to operate and maintain their services reliably and cost-effectively
- Build the next version of Segment’s Service Platform (focused on deployment and observability) to support teams in deploying hundreds of services across a multi-region cloud environment
- Take proactive steps to improve our availability, reliability, and efficiency
- Help drive Segment’s position as a market leader in Open Source Software development through projects like kafka-go, chamber, and kubeapply
- Participate in an on-call rotation to support our business-critical infrastructure
What you’ll bring
- Minimum of 5 years of experience as a Software Engineer, Systems Administrator, Operations Engineer, Site Reliability Engineer, or in a similar role
- A systematic problem-solving approach, coupled with good communication skills, sense of ownership, and drive
- Experience operating large-scale, distributed systems on top of cloud infrastructure such as Amazon Web Services (AWS) or Google Cloud Platform (GCP)
- Experience programming in one or more of the following: Go, Python, Node.js, Bash, or similar languages
- A proven grasp of Linux systems administration and programming concepts
We’re especially excited about candidates who:
- Have hands-on experience with container orchestration frameworks (e.g. Kubernetes, EKS, ECS)
- Have hands-on experience in operating event-based systems (e.g. Kafka) capable of processing millions of events per second and petabytes of data each month
- Possess a broad understanding of Linux kernel internals and networking protocols
- Are proficient in metrics tooling such as Datadog and Prometheus
- Have led teams or large projects, or have owned an important system
Job tags:
AWS
Bash
GCP
Go
JS
Kafka
Kubernetes
Linux
Node
Node.js
Open source
Prometheus
Python
Reliability engineering
Job region(s):
North America
Remote/Anywhere