Staff Software Engineer - Site Reliability
New York, Boston, Denver, Seattle, San Francisco, Remote
We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams. We operate at high scale (trillions of data points per day), providing always-on alerting, metrics visualization, logs, and application tracing for tens of thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems the right way.
The Site Reliability teams at Datadog are responsible for keeping our high-volume, low-latency environments performing around the clock. These teams collaborate closely with our product engineers so that Datadog can monitor millions of servers and containers and our customers always have dependable and actionable data at their fingertips. You’ll be responsible for shaping the infrastructure of our data-intensive, real-time services as we continue to grow at petabyte scale.
We are a globally distributed team with US offices in New York (HQ), Boston, and Denver and international offices in Paris, Dublin, London, Madrid, the Netherlands, and Singapore. About 33% of our engineering team is remote.
Datadog values people from all walks of life. We understand that not everyone will meet these requirements on day one. If you’re passionate about reliability engineering and want to grow these skills but don’t meet all of these qualifications, we encourage you to apply.
- Keep our services reliable, available, fast and cost-efficient.
- Respond to, investigate and fix service issues, whether they are deep in the OS kernel or in the application code.
- Build tools and production frameworks to make our engineering team’s lives easier.
- Design, build and maintain the infrastructure we need to support orders of magnitude more customers.
- Guide projects across many areas of organizational scope to drive reliability and resilience best practices.
- 8+ years of experience in software engineering
- You value correctness and efficiency; you leave no stone unturned when diagnosing production issues
- You handle infrastructure with code because automation lets you focus on the more difficult and rewarding problems
- You're an expert at running large-scale distributed compute and storage tools in production (we use Kubernetes, Cassandra, Postgres, Kafka, Elasticsearch, Redis)
- You value driving consensus and collaboration through technical expertise and strong communication and project management skills.
- You have an area of expertise that will level up the team you're joining
- You’ve worked in a cloud-native or multi-cloud environment (we use AWS, GCP and Azure)
- You have worked at a company with large-scale systems that handle large amounts of data
- You are fluent in Python, Ruby and Golang
Equal Opportunity at Datadog:
Datadog is an Affirmative Action and Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.
Any information you submit to Datadog as part of your application will be processed in accordance with Datadog’s Applicant and Candidate Privacy Notice.