Site Reliability Engineer - Americas (Remote)
What you will do
Our customers trust us to provide critical infrastructure for their distributed IoT fleets, and we work hard to continuously improve the availability, resilience, and efficiency of our systems and services. Our reliability team takes an “Infrastructure as Product” approach and plays a key role in shaping the future of the balena platform. They are part operators and part product builders.
As a member of the team, you will ensure the smooth day-to-day running of the infrastructure powering the large and rapidly scaling “balena fleet”. You will facilitate frictionless deployments to production, develop monitoring solutions, create disaster recovery plans, investigate incidents, and manage outages. You will also be empowered to lead initiatives and develop systematic solutions to high-impact, high-complexity challenges such as building our self-service capabilities – enabling the success of both our product development teams and our end-users.
- Identify internal user needs, bottlenecks, and failure patterns in production, and build tools, solutions, and features to allow teams to self-serve, deploy, and manage services at scale
- Implement monitoring systems to collect health data, alert on errors, and increase visibility into application behavior
- Leverage data model definitions to automatically generate code for provisioning reliable infrastructure
- Support developers with seamless, fault-tolerant deployments and production debugging
- Conduct load tests to ensure applications are ready to handle projected traffic
- Respond to incidents, drive blameless postmortems, and leverage learnings to prevent future issues
- Participate in on-call rotation and customer support – be a source of reliability advice for peers
This is a fully remote position for candidates in Americas timezones.
What you will bring
- Background in software development, infrastructure, and/or platform operations
- Experience working with Docker containers and running production-grade Kubernetes clusters
- Firm grasp of Linux operating system internals (e.g., filesystems, system calls) and networking, including common networking failures and mitigations
- Proficiency in at least one programming language (we mostly use TypeScript)
- Desire to make yourself and others more effective through documentation and automation
- Ability to manage ambiguity, push through friction, and solve complex challenges while clearly explaining the tradeoffs
- Excellent verbal and written communication skills, and fluency in English
Nice to have
- Experience designing large-scale, distributed systems and server load balancing architectures
- Experience with modern SRE practices and the Twelve Factor App methodology
- Conversant with cloud automation, APM, and log management (we use Grafana, Prometheus, Loki)
- Contributions to OSS projects and community involvement
- Familiarity with IoT, embedded computing, developer tools, or the balena platform as a user/contributor
- Background in leading projects and working across functions to build resilient systems
Make sure to let us know if any of these items apply to you!
Who we are
Balena is a highly distributed company that has embraced a remote-first approach since 2013. We are a group of individuals from across the globe working together to achieve our mission: “reduce friction for fleet owners and unlock the power of physical computing”. For us, this means removing the barriers to entry for developing IoT products, whether that’s easing software deployments with balenaCloud, simplifying image flashing with balenaEtcher, or offering our own hardware based on our experience seeing thousands of devices running in production. We are developing an end-to-end solution that makes it easy for developers to build applications at the Edge.
How we work
- We place trust and autonomy in our team to own the outcome of their work.
- We practice radical candor and transparency with open, honest, and clear communications.
- We embrace first-principles thinking and constantly challenge our assumptions.
- We organize ourselves based on the best use of our collective abilities to solve our highest priority problems at any given time, rather than by a strict hierarchy or departments.
- We’re not afraid to fail as long as we learn from our mistakes.
- We’re always looking for common patterns that allow us to reduce complexity.
- We embrace short-term pain for long-term gain, building products that will stand the test of time.
What we offer
- Work with a talented and globally distributed team
- Equipment of your choice
- Flexible vacation policy
- Annual company gathering in an international location
- We send you hardware for side projects!