Senior Site Reliability Engineer
San Francisco or Remote

Labelbox
Labelbox is building software infrastructure for industrial data science teams to do data labeling for the training of neural networks. When we build software, we take for granted the existence of collaborative tools to write and debug code. The machine learning workflow has no standard tooling for labeling data, storing it, debugging models and then continually improving model accuracy. Enter Labelbox. Labelbox's vision is to become the default software for data scientists to manage data and train neural networks in the same way that GitHub or text editors are defaults for software engineers.
We are backed by some of the finest people in Silicon Valley, who work at Andreessen Horowitz, Gradient Ventures (Google's AI fund), Kleiner Perkins and First Round Capital.
At Labelbox, we’re building a platform to accelerate the development of this future. Rather than requiring companies to create their own expensive and incomplete homegrown tools, we’ve created a training data platform that acts as a central hub for humans to interface with AI. When humans have better ways to input and manage data, machines have better ways to learn.
What you'll be doing
- Executing on an infrastructure roadmap, collaborating with team members across engineering, product, and design
- Maintaining and improving our monitoring and alerting for both our SaaS and on-premises offerings
- Managing log and metrics collection using tools such as the Elastic Stack, Datadog, and others
- Enabling development teams to monitor, analyze, and manage their services
- Building out tools or frameworks to improve the overall development experience
- Identifying and measuring key performance metrics for our infrastructure and defining service-level objectives (SLOs)
- Participating in our on-call rotation
We're looking for someone with
- 4+ years of relevant experience in an SRE or DevOps role
- Experience with modern Linux systems and running services in production
- Experience managing infrastructure in a major public cloud (AWS, GCP, Azure)
- Experience with Kubernetes or other container orchestration systems
- Experience with CI/CD tools and technologies such as Codefresh, Jenkins, TeamCity, etc.
- Experience with and an understanding of complex distributed systems
Bonus
- Experience with automation tools and technologies such as shell scripting, Terraform, Helm, etc.
- Experience deploying, maintaining, and automating services in on-premises environments
- Coding skills in languages such as Java or Golang
- Experience with database technologies such as PostgreSQL, MySQL, or other RDBMS
- Experience with other open source technologies such as Redis, Elasticsearch, and RabbitMQ
- Experience with SOC 2, FedRAMP, HIPAA, and other compliance-related programs
- Experience managing multiple Kubernetes clusters / clusters spanning multiple cloud providers
- Advanced knowledge of infrastructure management in GCP
Job tags:
AWS
Azure
CD
CI
Elasticsearch
GCP
Golang
Java
Kubernetes
Linux
MySQL
Open source
PostgreSQL
RabbitMQ
Redis
Terraform
Job region(s):
North America
Remote/Anywhere