Site Reliability Engineer
If you’re looking to work in a collaborative culture, solving engineering challenges at a global scale, and having a real impact in making our products better for our customers, we would love to talk to you!
We believe in providing trust and autonomy so everyone can do their best work. From how we work to how managers support you, our goal is to provide an environment that enables you to continuously grow, ask questions and not be afraid to fail - because when we do, we see it as an opportunity to learn.
Site Reliability Engineering at GoCardless
As a Site Reliability Engineer you'll be part of a small team that sets the direction of the GoCardless core stack. You'll think through the moving pieces that make up our infrastructure and the complex interactions between them. You'll work with every other team within engineering – from product to data – to help build and scale the global platform our product sits on.
We take pride in our work, and open-source as much as we can. We’ve picked some recent open-source work that illustrates our remit, along with GitHub discussions that show how our team operates in practice:
- Kubernetes extensions: We run a lot of our workloads in Kubernetes, and have extended our clusters with a suite of operators that we open-source at gocardless/theatre. We recently added support for authorised consoles, providing developers across GoCardless with a tool to securely administer and debug their application from within a production environment. See the PRs for this feature here.
- HA Postgres tooling: GoCardless prefers Postgres for applications that need relational databases. We run large HA Postgres clusters using a tool called Stolon. Our SRE team built an extension for Stolon called stolon-pgbouncer which can provide zero-downtime failover between Postgres primaries, along with first-class integration with PgBouncer, a common Postgres connection pooler.
- Ensuring service reliability: As SREs, we work to ensure the reliability of our services. We recently upgraded our larger Postgres clusters to use the latest Postgres version on more powerful hardware. To ensure this change went smoothly, we wrote a load testing tool for Postgres called pgreplay-go that we used to replay production queries, helping us discover performance regressions. We also wrote a blog post about it.
We use a long list of technologies, and we don’t expect you to have experience with all of them. We do expect you to be enthusiastic about the infrastructure space and the impact your work can have on others. To enjoy this work, you’ll need to be excited about learning new technologies and unafraid to tackle hard problems.
Ideally there’s a programming language you’ve become proficient in and have used to build a project you’re proud of. We don’t mind what language this is, as long as you’re up for joining a team that focuses mostly on Go and Ruby. We’ll probably ask you about this during the interview - we’d love to hear what you’ve done!
Finally, you take pride in your work and aim to build solutions that solve real problems, as simply as possible. You appreciate (and try to build!) tools that are easy to understand and run with minimal effort. You want to work with people who share these values, and will continually push you to improve. You value receiving and giving feedback, and trust in your team.
Technologies (not exhaustive, subject to change!)
For much of your time, you’ll be building software in Go and Ruby. We manage our cloud resources with Terraform, our VMs with Chef, and our Kubernetes configuration with Jsonnet. You’ll also help run Postgres and MySQL databases, Elasticsearch, Vault, Prometheus, and a whole host of Google Cloud products like Pub/Sub, BigQuery and GKE.
You'll rely on:
- Your fluency in one or more programming languages, and your ability to write clean and effective code
- Your ability to build reliable and well tested infrastructure on top of cloud computing systems
- Your experience in designing, analysing, and troubleshooting distributed systems
- Your knowledge of Unix fundamentals and TCP/IP networking
It's useful, but not essential, to have:
- Experience managing Kubernetes environments
- Experience with relational databases and other data stores, especially in optimising their performance
Our team comes from a variety of backgrounds and we embrace diversity – if you’re unsure, please apply.