Data Reliability Engineer
New York, NY
Our mission at Vimeo is to help businesses drive impact through video. Thanks to our strong community of video professionals and the volume of video consumed and uploaded, data is a key ingredient for success and one of our economic moats.
We are looking for a Data Reliability Engineer to help us improve the reliability of our data platforms and pipelines, which serve billions of events and terabytes of data daily.
You’ll be working closely with different data engineering teams on their incident management process, post-mortems, root cause analysis, and prevention of recurring incidents.
If you are passionate about data reliability, scale, and automation, we should talk soon!
What You'll Do:
- You will collaborate with engineering teams to improve, maintain, performance tune, and capacity plan for Vimeo’s data platforms and infrastructure.
- Design business continuity and disaster recovery plans and processes, and work with the engineering team to implement them.
- You will drive the incident management process for our data platform, working with our partner teams to perform incident post-mortems and root cause analysis and to prevent recurring incidents.
- You will lead the standard change and release management process, automate and promote related best practices across engineering teams, and help Vimeo meet and maintain legal compliance.
- Build intelligent monitoring over data pipelines and infrastructure to achieve early and automated anomaly detection.
- You'll work closely with software developers to build an end-to-end automated testing framework and system-level testing environment.
- Participate in an on-call rotation.
Skills and knowledge you should possess:
- You have production experience with distributed datastores, e.g. HBase, ZooKeeper, Kafka (alternative experience such as RabbitMQ, Cassandra, Elasticsearch, etc. would also be relevant)
- Own, manage, monitor and optimize the reliability and overall health of our development and production environments
- Detailed problem-solving approach, coupled with a strong sense of ownership and drive
- A strong bias for action and a passion for delivering high-quality data solutions
- 3+ years of experience working in Linux environments, and proficiency with cloud environments (AWS, GCP)
- Experience coding in one or more of the following programming languages: Python, Java (mandatory), or Scala
- 3+ years of hands-on experience in Reliability Engineering for high-performance, scalable, distributed data systems with a focus on automation
- Experience with configuration management systems such as Chef, Puppet, Ansible, or Terraform.
- Deep understanding of CI/CD principles and familiarity with source control systems (Git)
- Work with peer SREs to roll out changes to our production environment and help mitigate data-related production incidents.
- Experience with a Change Data Capture system, such as Debezium, is a plus.
- Attention to detail and quality, with excellent problem-solving and interpersonal skills
- A bonus: you have some experience in data warehousing and data engineering
Vimeo is the world’s leading professional video platform and community. We empower over 175 million users — from creatives to entrepreneurs to the world’s largest brands — to grow their business with video. Our products make it easy to create high-quality, impactful videos and to reach teams, audiences and customers anywhere.
Vimeo is powered by a growing team of over 600 passionate, dedicated humans. We’re headquartered in New York City with offices around the world. We believe our impact is greatest when our workforce represents the diverse and global community that we serve, and we’re proud to be an equal opportunity employer where diversity, equity and inclusion are prioritized in how we build our products, leaders and culture. Learn more at www.vimeo.com/jobs.