Software Engineer - Site Reliability
Remote - United States
This is a remote opportunity, and we welcome applications from candidates in European or Americas time zones. (Other time zones will be considered on a case-by-case basis.)
We are looking for an experienced software or site reliability engineer to join the Grafana Labs R&D team. We are hiring for the Infrastructure squad that provides the platform on which Grafana Cloud delivers its services, as well as the Observability squad that builds workflows between metrics, logs, and traces.
Our Grafana Cloud pipeline moves millions of data points, log lines, and traces per second from our customers' environments into a highly available, low-latency stack that processes and stores the data, and serves it to dashboards and alerting tools. We aim to grow this to hundreds of millions per second, and it's critical that as we grow, we improve our performance, increase our reliability, and do it all more efficiently.
Infrastructure and observability roles at Grafana Labs require engineers who have a passion for performance and reliability and who enjoy taking projects from conception to production. Grafana Cloud hosts services in Kubernetes, and our Infrastructure squad owns and maintains the platform delivering Kubernetes and its required complementary services to Grafana Engineering.
Since we deploy production services, we have on-call rotations to ensure the health of the system. We dogfood our own services, so being on call is an important way to understand our system and how to use the products we create.
Our culture is remote-first, and our engineering organization is largely remote. We provide guidance and meet regularly using video calls, and we need people who can work independently and communicate well. Even if you are located near one of our small offices, working from home is both common and encouraged. Our teams also plan in-person team-building meetups and gather to attend industry conferences.
We care deeply about open source, and our projects are generally open source. Check them out: https://github.com/grafana.
We primarily use Go and Jsonnet.
Responsibilities
- Maintain and improve Grafana Labs’ provisioning tools, allowing rapid deployment of infrastructure and services
- Maintain and improve Grafana Labs’ monitoring tools and best practices to maximise system uptime and health
- Operate and manage the core infrastructure platform, Kubernetes
- Work with other engineering teams to help them deploy and run their software in production
Requirements
- Commercial experience as a site reliability and/or software engineer in a DevOps environment
- Experience with Linux system administration, cloud service providers, hardware, networking, and distributed architectures
- Experience with containers and orchestration -- we use Docker and Kubernetes
- Proficiency with infrastructure as code and/or configuration management, which may include Terraform, Puppet and Ansible
- Experience with dashboards and monitoring tools like Grafana and Prometheus
Nice to have
- Experience working in remote and/or distributed business environments, demonstrating self-motivation and communication skills
- Experience with HashiCorp Vault
- Experience with continuous integration systems
- Familiarity with Jsonnet and/or Tanka
Benefits
- Flexible hours
- Flexible location (EU or Americas time zones preferred)
- The equipment you need to get the job done
- Generous vacation policy of 30 days per annum with national holidays in your country of residence on top
- Grafana operates in 27+ countries. We try to operate as one team and focus on global benefits that our whole team can enjoy. Inevitably there are some regional variations, and we discuss the benefits offered in your country of residence during our interview process.
- We offer a competitive healthcare plan (Medical, Dental & Vision) for our US-based employees via our co-employer JustWorks.
- We offer a 4% employer contribution match on our 401(k)/pension plans or a one-time 4% salary increase after six months' tenure, depending on your location.