Head of Site Reliability Engineering
Clarifai is a leading, full-lifecycle deep learning AI platform for computer vision and natural language processing. We help organizations transform unstructured image, video, and text data into structured data, significantly faster and more accurately than humans could on their own. Founded in 2013 by Matt Zeiler, Ph.D., Clarifai has been a market leader in AI since winning the top five places in image classification at the 2013 ImageNet Challenge. Clarifai continues to grow, with more than 100 employees and offices in New York, San Francisco, and Tallinn, Estonia.
We are looking for a Head of Site Reliability Engineering who will partner with our AI experts to scale a bleeding-edge platform across multi-cloud, bare metal and edge. You will ensure we maintain the highest security posture for commercial and public sector clients. You will help Clarifai researchers and engineers iterate quickly and ship high-quality products. You will build world-class training and inference clusters relied upon not only by our own researchers, but also by the world’s biggest organizations and developers everywhere. You will lead a program that enables us to develop software with the ease and speed of the world’s most successful startups. You will support the entire software development lifecycle, including core research & development tools, environments, build tools, and CI/CD pipelines.
- Lead: Drive cross-team and cross-org strategic direction, alignment, and oversight of reliability initiatives.
- Mentor: Mentor a highly skilled team of infrastructure, security and IT engineers. Hire, supervise, develop, evaluate and coach a geographically distributed team. Cultivate an environment of continual learning and growth for group members.
- Champion: Establish standard practices and processes for planning and prioritizing reliability work and champion a culture of reliability.
- Partner: Work with senior engineering, research and product leadership to plan key initiatives and ensure they are implemented for both on-premise and cloud services (GCP/AWS/Azure).
- Budget: Develop, track, and control the information technology annual operating and capital budgets.
- Optimize: Continuously identify opportunities for improvements, expansion, and/or reduction of services and/or costs.
- Storytell: Use effective communication strategies such as dashboards and visual analytics and create a data-driven culture.
- Innovate: Research current and new industry trends, technologies, and software development practices.
- Secure: Ensure mitigation of security vulnerabilities and risks across all our managed systems, applications, and services. Contribute, directly or indirectly, to the development of policies, standards and guidelines to ensure our product meets all security requirements.
- Operate: Develop, maintain and monitor effective operational processes to prevent failures of infrastructure, systems, applications and services.
- Commit: Lead incident investigations and drive efforts to identify and fix root causes within our SLAs. Promote and use a data-driven approach within your group to ensure SLAs are met and to drive understanding and improvements within your areas of responsibility.
- Recover: Lead the planning, implementation, and documentation of disaster recovery and business continuity efforts.
- A minimum of a Bachelor's degree required. Master’s degree preferred.
- At least 10 years of experience across core technical requirements.
- At least 4 years of experience in management of professional technical staff positions and formulating a team's technical strategy and roadmap.
- Proficiency in Python, Golang, C++, Java and/or shell scripting.
- Working understanding of modern security vulnerabilities and best practices.
- Experience with 24/7/365 distributed-site monitoring and first-response support for
- Experience deploying container orchestration (e.g. Kubernetes, GKE, EKS).
- Experience debugging and operating common cloud datastores (RDS, Cloud SQL, Redshift) or their open source alternatives.
- Experience with configuration management and infrastructure-as-code tools such as Ansible, Puppet or Terraform.
- Demonstrable knowledge of TCP/IP, Linux operating system internals, filesystems, disk/storage technologies and storage protocols.
- Expertise with CI/CD pipelines.
- Experience with distributed computing and storage (e.g. Hadoop, Spark, HDFS, Ceph).
- Experience with large server environments (1000+ servers) that are geographically dispersed.
- Deep understanding of advanced development engineering practices around automation, code testing, and SRE principles.
- Demonstrated experience with data centers, network design, application services and technologies, security and compliance.