Machine Learning Infrastructure Engineer, ML Platform
Remote, San Francisco
Stripe's mission is to increase the GDP of the internet. To do this, we need to fight fraud at scale and build great software products, which means assembling strong machine learning teams and equipping them with the technologies they need to be effective. Our mission on Machine Learning Platform is to make these teams more impactful by providing reliable and flexible infrastructure to enable Machine Learning at scale.
The Machine Learning Platform team does this by designing and engineering the underlying infrastructure that powers experimentation, training, and serving for Stripe's key machine learning systems. Our flagship products include Railyard and Diorama. Railyard provides an expressive and powerful interface for model training at scale. Diorama serves models in real time with strong reliability and latency guarantees. We work closely with ML engineers, data scientists, and platform infrastructure teams to build the powerful, flexible, and user-friendly systems that substantially increase ML velocity across the company.
You will work on:
- Building powerful, flexible, and user-friendly infrastructure that powers all of ML at Stripe
- Designing and building fast, reliable services for ML model training and serving, and distributing that infrastructure across multiple regions
- Creating services and libraries that enable ML engineers at Stripe to seamlessly transition from experimentation to production across Stripe’s systems
- Pairing with product teams and ML modeling engineers to develop easy-to-use infrastructure for production ML models
We are looking for:
- A strong engineering background and experience with data infrastructure and/or distributed systems
- Experience optimizing the end-to-end performance of distributed systems
- Experience developing and maintaining distributed systems built with open source tools
- Experience with or strong interest in developing ML models
Nice to haves:
- Experience with Scala and Python
- Experience with Kubernetes
- Experience with creating developer tools
- Experience with model training and serving in production and at scale
- Experience writing and debugging ETL jobs using a distributed data framework (such as Spark, Kafka, or Flink)
It's not expected that you'll have deep expertise in every dimension above, but you should be interested in learning the areas that are less familiar to you.