Senior Site Reliability Engineer


Full Time Senior-level / Expert

Posted 1 month ago

CipherHealth is an award-winning healthcare technology company that delivers a comprehensive portfolio of scalable and flexible patient engagement solutions for healthcare organizations to keep patients, staff, families and communities up to date and informed about their preventative, acute or elective care -- whether it is in a hospital, clinic, facility, at home or anywhere in between.

In this historic time, when the entire world is facing a healthcare crisis, CipherHealth is out in front, helping hundreds of leading healthcare providers like UCSF, Johns Hopkins, and University of Pennsylvania manage through this pandemic and beyond with solutions that enable them to deliver remarkable in-care experiences and impactful around-care engagement that empower patients and staff, reduce friction and waste, and drive the best possible outcomes.

How can you join the movement?

We are seeking a smart, collaborative, cross-functional, and highly motivated Sr. SRE to join our growing Development Operations team. This is a key role reporting to the Director of Development Operations and will be instrumental in modernizing our deployment, monitoring, and logging methodologies. The right candidate will design systems to handle new functionalities and interfaces in a redundant, reliable, and repeatable fashion while delivering actionable metrics and telemetry for application maintenance and business intelligence purposes.

You will bring your SRE expertise to the DevOps team, modernizing the logging and telemetry architecture while enhancing our continuous integration and delivery pipelines in the cloud. Your knowledge of product integration and delivery processes will enable you to work closely with the Product and Customer Success teams to identify relevant key performance indicators (KPIs), and then write the logic to convert those KPIs into actionable metrics. You will bring an ‘automate everything’ mindset to the team with a focus on scalability, infrastructure as code, and high availability. Your efforts will help drive CipherHealth’s platform forward to maintain the utmost reliability and deliver functional iterations as quickly as possible.

Responsibilities



  • Develop, maintain, and constantly improve our on-premises infrastructure while building next-generation infrastructure as code (IaC) for our migration to the cloud
  • Identify and develop actionable metrics based on Key Performance Indicators relating to application stability, uptime, response, usability, and throughput 
  • Continuously improve the visibility of our stack with enhanced logging, metrics, tracing, and relevant statistics
  • Design custom dashboards that aggregate relevant data into easy-to-digest views, and configure custom alerts based on relevant thresholds
  • Work with our Software Architects to design highly available environments in the cloud through IaC and configuration management 
  • Work closely with the product team to deliver cutting-edge functionality as efficiently as possible, leveraging your continually improving CI/CD pipelines and deployment utilities
  • Write custom Terraform, Python, shell, and YAML scripts to automate the entire deployment and build process from staging to production
  • Assist in the migration of legacy applications from monolithic architectures to service-based containers for scalability, reliability, and quicker deployments
  • Share on-call responsibilities with other team members to help meet 24/7/365 SLAs
  • Troubleshoot, analyze, and assist product and customer success teams to identify client pain points and work to resolve them as quickly as possible, in a repeatable and automated manner
  • Work with agile teams to bring your DevOps expertise across disparate projects and disciplines

Requirements


  • 7+ years of Linux administration
  • 4+ years of experience in site reliability engineering, DevOps engineering, and CI/CD tooling
  • 2+ years working in a software operations production environment; bonus points for SaaS experience
  • 2+ years of experience working with cloud technologies (AWS, GCP, Azure)
  • Expertise in application monitoring, telemetry gathering, and associated utilities (Nagios, Icinga, Datadog, New Relic, Fluentd, AWS CloudWatch, GCP Cloud Logging/Stackdriver, etc.)
  • Experience with centralized logging pipelines (ELK, AWS CloudWatch Logs, Stackdriver, etc.) and exposure to parsing concepts (Grok filters, regex)
  • Experience with common CI/CD tools (GitLab, Jenkins, CircleCI, etc.)
  • Experience working within an Agile/Scrum team 
  • Familiarity with MongoDB, PostgreSQL, Redis
  • Experience with container technologies (Docker, Kubernetes) from building to deploying
  • Exposure to configuration management utilities (Ansible, Chef, Puppet)
  • Exposure to microservices and a conceptual understanding of application decoupling
  • Strong networking fundamentals
  • Cross-functional acumen
  • “Automate everything” mindset
  • Passion for building reliable systems
  • Fluent in English (written and spoken)
  • Candidates must reside in and be able to legally work in the US

Nice to haves

  • SaaS experience
  • Experience working in a highly regulated environment (Fintech, Health Care, Education)
  • GCP expertise 
  • Snowflake experience
  • Data Warehouse experience
  • ETL experience
  • DevSecOps experience
Job tags: Ansible AWS Azure CD Chef CI Docker ELK GCP Gitlab High availability Kubernetes Linux MongoDB PostgreSQL Puppet Python Redis Reliability engineering Terraform
Job region(s): North America