Lead Site Reliability Engineer
Remote - Brazil
The Lead Site Reliability Engineer will help design, implement, and optimize the tools and best practices that support a 24x7 Alert Response Service Offering. Ideal candidates will know the best ways to leverage and integrate tools for monitoring, alerting, on-call scheduling, escalation workflow, runbook documentation, root-cause analysis, and incident management. In addition, ideal candidates will know current thought leadership in Site Reliability Engineering well enough to help educate customers on how to effectively implement an SRE Program for their applications and systems. If you want to combine your passion for Site Reliability Engineering with the unique opportunity to help build the systems, processes, and philosophies that drive a best-of-breed new practice in this space, this might be just the challenge you are looking for!
Responsibilities and Duties
- Select, customize, and integrate the optimal tools to support our Alert Response Service as a part of helping customers adopt best-of-breed SRE fundamentals.
- Document our processes, procedures, practices, and methodologies that define our opinionated approach to achieving SRE objectives.
- Help develop materials that introduce SRE concepts to customers and guide them through an organizational readiness process that empowers their teams to take end-to-end ownership of their application components, backed by adequate component metrics and functional monitoring coverage.
- Develop processes and procedures for our Alert Response team, who will handle initial alerts, determine the presence or absence of a customer-impacting incident, attempt to address issues for which runbooks already exist, and direct targeted escalations to the right teams based on root cause analysis, minimizing both alert fatigue and the instances where on-call resources must be pulled in to address critical incidents.
- Help design and measure KPIs for our Services and for Customer Success following the adoption of our processes and best practices for site reliability engineering and alert/incident response.
- Perform monitoring/alerting readiness assessments and ensure appropriate work backlogs are generated for changes required to set customers and their applications up for success.
- Continue developing and improving our offering(s) focused on customer site availability and platform incident response.
Qualifications
- Past experience as a Site Reliability Engineer with the systems and tooling which facilitate monitoring, alerting, and incident response for production workloads.
- Extensive knowledge of Site Reliability Engineering theory and best practices, at a level where you can talk in depth about the state of the art and current thought leadership in this discipline.
- Past experience as a Systems Administrator responsible for Linux/Unix systems (desired).
- Past experience managing multi-team platform support/incident response.
- Past experience as Manager or Lead for a DevOps or SRE Team with responsibility for a team of Engineers supporting a production product or platform.
- Knowledge of current monitoring and alerting tools for Cloud Native workloads, such as Prometheus, Grafana, and Alertmanager.
- Knowledge of modern log aggregation tools for Cloud Native workloads, such as ELK/EFK stack implementations, Grafana Loki, Graylog, and Fluentd/Fluent Bit.
- Knowledge of current monitoring and alerting tools catering to Serverless technologies.
- Strong knowledge of one or more Alert Management tools, such as PagerDuty, Opsgenie, AlertOps, or VictorOps.
- Strong knowledge of other monitoring tools, such as CloudWatch, Nagios, MRTG, Zabbix, SolarWinds, and WhatsUp.
- Knowledge of Kubernetes and Container Orchestration
- Linux Systems Administration Fundamentals
- Experience managing Cloud infrastructure in AWS
- Fundamentals of containers, containerd, and Docker
- Knowledge of remote systems management using OpenSSH
- Understanding of Linux logging subsystems
- Troubleshooting and managing Linux services
- Understanding of DNS fundamentals
- Experience working with wiki-based or markdown-based documentation tools
- Intermediate-level knowledge of AWS Cloud (often represented by the AWS Solutions Architect Associate Certification)
Skills desired, but not required:
- Advanced-level knowledge of AWS Cloud (often represented by the AWS Solutions Architect Professional Certification)
- Linux/Unix Shell Scripting
- Knowledge of managing specific AWS Managed Services, including but not limited to ECS, EKS, S3, Elasticsearch, CloudWatch, CloudWatch Logs, EC2 Auto Scaling groups, and Route 53
- Hands-on experience with one or more CI/CD or Build/Release tools
- IP Networking fundamentals
- Knowledge of Git for version control
- Knowledge of Terraform or any other infrastructure-as-code tools
- SysAdmin or monitoring knowledge for any other platforms or technologies is a plus
Benefits
- 100% remote, work from home or in a shared workspace
- Competitive base salary plus commission and bonus program
- Medical, dental, vision, and life insurance benefits
- Generous holidays and paid time off
- State-of-the-art laptop and tools
- Individual professional development plan
- Fun corporate events
- Work with an amazing worldwide team and in an incredible corporate culture
Caylent is a specialty consultancy and DevOps/Cloud Managed Services provider (DevOps-as-a-Service), with an emphasis on Kubernetes, serving software-enabled companies, mostly high-growth software startups. Our clients rely on Caylent engineers to architect, develop, and integrate highly complex DevOps pipelines, including build automation, CI/CD, Infrastructure as Code, security, monitoring, logging, and alerting.
Part of our value proposition is that we structure our relationships with clients as real partnerships, where we become part of their team and share in their challenges and successes. That means every Caylent engineer is client-facing. Our engineers love the challenge of working on new stuff every single day without corporate headaches.
We are only as good as each person’s contribution to the delivery. We put enormous effort into nurturing each employee’s career development, professional curiosity, technical innovation, and client interaction. If these things matter to you, read on, and apply.
NOTE: We are unable to provide sponsorship for this position.