Machine Learning Infrastructure Engineer

San Francisco, California

Scale AI

Posted 1 month ago

As a member of the Machine Learning Infrastructure team at Scale, you will be responsible for building systems that accelerate the development and deployment of machine learning models built by our Machine Learning Research team. Our models span computer vision, deep learning, and natural language processing, and are trained on massive datasets to deliver improvements to our customers.
We are building a large hybrid human-machine system in service of ML pipelines for dozens of industry-leading customers. We currently complete millions of tasks a month, and will grow to complete billions of tasks monthly.

You will:

  • Build elastic data pipelines that process billions of events per day.
  • Build highly available and observable model inference services.
  • Work with our ML Research team to automate aspects of our pipeline and deploy research models in production.
  • Work with our Infrastructure team to build core abstractions and create standards and best practices for building systems.
  • Be a self-starter who can own projects end-to-end, from requirements, scoping, design, to implementation.
  • Have good taste in building systems and tools, know when to make build-vs.-buy tradeoffs, and have an eye for cost efficiency.
  • Have attention to detail and a good sense for automation, debugging, and troubleshooting.

This role could be a fit if you have:

  • Solid background in algorithms, data structures, and object-oriented programming.
  • Experience in building scalable and fault-tolerant distributed systems that process large volumes of data.
  • Degree in computer science or related field.

Nice to haves:

  • Experience working with a cloud technology stack (e.g., AWS or GCP).
  • Experience building machine learning training pipelines or inference services in a production setting.
  • Experience building, deploying, and monitoring complex microservice architectures.
  • Experience with machine learning frameworks and libraries (PyTorch, TensorFlow, Kubeflow, Seldon).
  • Experience with big data tools (Spark, Flink, Hadoop) and building ETL and streaming pipelines.
  • Experience with Python, Docker, Kubernetes, and infrastructure as code (e.g., Terraform).


About Us:

At Scale, our mission is to accelerate the development of Machine Learning and AI applications across multiple markets. Our first product is a suite of APIs that allow AI teams to generate high-quality ground truth data. Our customers include OpenAI, Zoox, Lyft, Pinterest, Airbnb, nuTonomy, and many more.
Scale AI is an equal opportunity employer. We aim for every person at Scale to feel like they matter, belong, and can be their authentic selves so they can do their best work. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
Scale AI is committed to working with and providing reasonable accommodations to applicants with physical and mental disabilities. If you need assistance and/or a reasonable accommodation in the application or recruiting process due to a disability, please contact us at accommodations@scale.com. Please see the United States Department of Labor's EEO poster and EEO poster supplement for additional information.
Job tags: AWS Docker GCP Hadoop Kubernetes Python Spark Streaming Terraform
Job region(s): North America