Software Engineer - APM Reliability



About Datadog:

We're on a mission to build the best platform in the world for engineers to understand and scale their systems, applications, and teams. We operate at high scale—trillions of data points per day—providing always-on alerting, metrics visualization, logs, and application tracing for tens of thousands of companies. Our engineering culture values pragmatism, honesty, and simplicity to solve hard problems the right way.


The team:

At Datadog, APM Reliability Engineers are strong developers focused on improving the performance, stability, and release quality of our tracing libraries. Our libraries instrument critical paths of systems at scale, and the team's mission is to ensure they are non-intrusive and do not alter the performance or reliability of those systems.


The opportunity:

Datadog is building a world-class APM product that traces requests as they flow across complex systems at scale. As an APM Reliability Engineer, you will work with multiple teams to measure and analyze the performance impact our tracing and profiling tools may introduce into those systems. You will provide guidance and make improvements that push our tracing tools to the next level. Come and join us to build fast and reliable open source software.


You will:

  • Measure the performance of our tracing libraries to detect and solve performance issues that nobody else has been able to crack
  • Coach other engineers on validating the reliability of our tracing libraries by introducing methodologies and testing approaches such as defensive programming and fuzzing
  • Build high-leverage tools for your day-to-day work, such as tools that introduce chaos into our tracing libraries to validate whether they are resilient to unexpected errors
  • Own the key performance metrics across libraries to ensure we never introduce regressions

Requirements:

  • You have significant experience optimizing software in at least one of the following languages: Java, Go, or C++
  • You have a proven track record of understanding the performance of large-scale services
  • You have significant experience using profilers, debuggers, tracers, or similar tools to improve the quality of code you have written
  • You communicate well and your enthusiasm for the craft is contagious


Bonus points:

  • You have experience deploying services on Kubernetes or alternative orchestrators
  • You have experience in building distributed systems


Is this you? Let's chat! 
