Systems Reliability Engineer - Multicloud



Posted 4 weeks ago

About Datadog:

We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams. We operate at high scale—tens of trillions of data points per day—providing always-on alerting, metrics visualization, logs, application tracing, synthetics and more for thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems.


The opportunity:

We’re looking for Systems Reliability Engineers to join our new Multicloud Systems Reliability Engineering team. Today Datadog runs across a few vendors in a handful of regions. As we move towards becoming the first-choice telemetry platform no matter where our customers run, we need to greatly expand the footprint of where our infrastructure runs. Each cloud provider brings enough specific challenges that we need to start building focused core reliability teams for each one.

At Datadog, Systems Reliability Engineers are our systems-focused generalists, blending deep, practical knowledge of Linux, open source, cloud vendors, and system design. They are on the front line, maintaining and expanding the capabilities of our many and varied systems. They fill the gap between traditional systems administration and development, merging capabilities from both disciplines to run reliable systems at massive scale.

One of the first region builds we will support is for U.S. FedRAMP Moderate customers. This will require candidates to be a U.S. citizen or national, U.S. permanent resident (i.e., current Green Card holder), or lawfully admitted into the U.S. as a refugee or granted asylum.



Requirements:

  • Bachelor’s degree in Computer Science or related field, or relevant work experience
  • Experience as a software engineer
  • Experience working with AWS services (S3, DynamoDB, EC2)
  • Experience in 24x7 production environments
  • 2+ years Linux experience
  • 2+ years of DevOps, reliability, technical support, operations, or development experience


Bonus points:

  • Strong Linux skills
  • In-depth Python/Go programming ability, with a focus on automation
  • Java/JVM operations experience
  • Experience managing large server/container fleets
  • 2+ years working in a software-as-a-service environment
  • Excellent problem-solving skills with strong attention to detail
  • Ability to dive deep into complex technical problems



Equal Opportunity at Datadog:

Datadog is an Affirmative Action and Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements.


Your Privacy:

For more information on how we maintain the privacy of the information you submit as part of your application, please refer to our Applicant and Candidate Privacy Notice.

Job tags: AWS EC2 Go Java Linux Open source Python Reliability engineering S3
Job region(s): North America