Site Reliability Engineer

New York or Remote

Frame.io

Posted 1 week ago


 

We’re looking for someone to join our Infrastructure team who can work closely with Backend Services to create more reliable and robust cloud infrastructure as we scale our product.

 

About Frame.io

Frame.io is changing the future of how videos are made by helping over 1 million creative professionals seamlessly collaborate from all over the world. 

We’re backed by Accel, FirstMark, Insight Partners, SignalFire, Jared Leto, and a host of other amazing investors. Our market-leading product is used and loved by companies such as Turner, Disney, NASA, Snapchat, BBC, BuzzFeed, TED, Adobe, Udemy, and many more.

We’re in an exciting period of growth and are always seeking extremely talented and passionate individuals who share our vision for helping visual content creators produce their best work.

 

About the Role

As a senior member of the Site Reliability Engineering team at Frame.io, you will work to transform and perfect our Kubernetes platform, develop our multi-cloud strategy, reduce infrastructure cost, and make our infrastructure reliable, performant, and competitive. You will have the opportunity to work cross-functionally to transform and maintain monitorable, reliable software systems that serve millions of users every day. We’re looking for someone with deep technical expertise and experience to join a fast-paced, growing team of SREs tackling challenging problems at scale.

 

Requirements

  • 8+ years of experience in managing cloud infrastructure, including hands-on experience with AWS (or another public cloud), Kubernetes, GitOps, Terraform, Docker, CI/CD.
  • You have worked in multi-cloud environments and developed migration and deployment strategies for them.
  • You have experience in setting up SLAs/SLOs/SLIs for key services and establishing the monitoring around them.
  • You have deep experience in collaborating with engineering teams and developing tools and technologies for them.
  • You have broad knowledge of cloud security and can facilitate close collaboration between our security and infrastructure teams.
  • You’re just as passionate about troubleshooting issues with distributed systems at scale as you are about automating, coding, and collaborating to solve problems.
  • You have materially improved the operability of the systems you've run - through monitoring, service level management, lifecycle management, performance tuning, and documentation.
  • You are passionate about reliable, scalable, observable software and have a strong sense of ownership.
  • You have substantial experience with a programming language like Python or Golang.
  • You have good knowledge of a configuration management tool like Chef, Puppet, or Ansible.
  • Experience in storage technologies and developing cost-effective storage solutions is a plus.

 

Responsibilities

  • Be a thought leader on the SRE team, generating new ideas for building next-generation, cost-efficient infrastructure to host Frame.io services.
  • Develop multi-cloud/storage provider strategy to increase availability and reduce cost.
  • Identify and bridge gaps to ensure Frame.io cloud infrastructure is reliable, scalable and secure.
  • Continue building, maintaining, and improving our Kubernetes and ECS platforms.
  • Run ChaosDays to continuously iterate on how we handle and respond to failure.
  • Ensure our platform's reliability by taking part in our periodic on-call duty.
  • Partner with product and engineering teams on design, development, and capacity planning to ensure Frame.io continues to scale and to maximize availability and observability.
  • Ensure sufficient logging, monitoring and alerting strategies around availability, latency and overall system health.
  • Scale systems sustainably through automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Continuously improve Incident Response policies, procedures, tools, automation, and implementation.
  • Reduce waste in the infrastructure by leading initiatives to cut cost without compromising the reliability and security of cloud systems.
  • Design and implement tools that let engineering teams interact with the infrastructure and deploy services easily.
  • Stay active in the infrastructure and security communities: we attend and speak at industry events like KubeCon and AWS re:Invent, and would love for you to join in if you're interested.

 

Benefits

  • Competitive salary and equity
  • Paid parental leave for primary or secondary caregivers
  • Unlimited PTO and designated Volunteering paid time off
  • Work From Anywhere Week
  • Yearly stipend for learning and development
  • Medical, dental, and vision insurance, plus a One Medical membership
  • Pre-tax commuter benefit and Flexible Spending Account
  • Daily catered lunch & fully stocked kitchen with cold brew on tap
  • Discounted gym membership, ClassPass discount, and free Citi Bike membership

 

Our Philosophy 

Our philosophy is simple. At Frame.io, we believe that working with people of different backgrounds and perspectives allows us to elevate each other and helps us build a better product for our users.

We’re proud to be an equal opportunity employer, and are committed to providing all employees with a work environment that celebrates individuality and remains free from any form of discrimination and harassment. We base our employment decisions on the needs of our business, job requirements, and applicants' qualifications. In other words, we only care that you’re the best person for the job.

Job tags: Ansible AWS CD Chef CI Docker Golang Kubernetes Puppet Python Reliability engineering Terraform