Site Reliability Engineering Manager
About the team
An engineer on our team works with global-scale infrastructure and has a great impact on millions of players. To guarantee the best possible experience, we rely on several Kubernetes clusters spread around the world and connected to each other. We are at the cutting edge of open-source infrastructure technology: we adopted Kubernetes in production shortly after the project was launched, and today we use technologies such as eBPF and Cilium in our network stack.
We handle billions of logs daily and run hundreds of nodes and thousands of containers to serve more than 1 million requests per minute. We know this number will only grow, and we're looking for engineers who can help with the challenges of provisioning and operating infrastructure at large scale.
About the role
Wildlife is searching for an SRE manager to join our SRE team. On the technical side, you’ll need to find bottlenecks and solve performance problems in distributed systems with ease. You’ll also make decisions about larger system design questions, such as when to build sustainable, high-impact projects and when to ship things quickly and with quality. You’ll play a fundamental role in helping our games reach the next billion people by guiding and developing highly skilled engineers and by working closely with other managers and leaders to improve our processes and ensure we have a world-class engineering organization.
More about you
- Player focused. We are player oriented, and infrastructure has a great impact on their experience. You have empathy for our players and focus on ensuring they have an amazing experience. You aim for top-level infrastructure, guaranteeing the highest availability possible.
- Automation is key to scaling. We look for engineers who have a history of designing and executing automation projects to eliminate manual, repetitive tasks.
- Calm and pragmatism. When everything seems to be falling apart around you, you have a plan and keep calm.
- Bleeding edge. You are curious and like to study new technologies, test new solutions, and measure the impact of changes. We want to ensure we are using the best stack possible.
What you’ll do
- Lead and contribute to the design of CI/CD pipelines, using best practices around automation, pushing changes that improve reliability and velocity.
- Own end-to-end availability and performance of key services and build automation to prevent problem recurrence. Automate response to all non-exceptional service conditions.
- Provide mentorship and training on CI/CD pipelines and processes. Drive education and knowledge transfer of design patterns.
- Provide leadership and prioritization to an experienced team of 4-5 site reliability engineers and help make the key trade-offs required to keep the team working effectively.
- Drive innovation through research, setting the direction and standards in the use of technical solutions.
- Have an enormous impact by working closely with teams across our organization. Be an advocate for SRE principles.
- Drive collaboration and agreement across disciplines and geographically dispersed teams, ensuring they use best practices.
- Participate actively in recruiting and process refinement. You’ll help improve internal practices and standards to bring new candidates to your team and the company.
What you'll need
- 4+ years of experience as an SRE.
- Willingness to be a hands-on contributor.
- Strong background in programming or experience as a systems administrator.
- Experience defining KPIs/SLAs and managing teams to meet them.
- Experience with public cloud; we have a significant presence in AWS.
- Experience with Kubernetes and/or other container orchestration platforms.
- Experience with incident response and on-call processes and tools.
- Willingness to challenge the status quo and work with the team to design simple and flexible solutions.
We welcome people from all backgrounds who seek the opportunity to help build the best gaming company, where everyone thrives.