Senior Site Reliability Engineer
WHY BOX NEEDS YOU
The Observability Platforms team provides an end-to-end experience that enables Box engineers to better understand the behavior of the features, services, and infrastructure they own and maintain, using frameworks, tools, APIs, and visualizations. The team also educates product, infrastructure, and systems teams on how to appropriately monitor the features and services they own, provides visualizations for monitoring distributed systems, gives guidance for reducing operational overhead, and supports the delivery of unmatched availability to our customers.
We need a Sr. SRE who has designed, implemented, and operated observability frameworks at very large scale and is well versed in operating scaled architectures. You should have deep operational knowledge of distributed systems and how to avoid limitations through innovative design.
The main focus of the Observability Team is to build frameworks and systems that can manage the performance of Box systems while scaling to billions of events per second. We are also responsible for standardizing observability across engineering teams, driving designs for high-performing services, and fostering great observability practices. We build, scale, and operate low-latency, high-throughput data systems that power the resiliency of Box systems. You will help us execute on this vision and ensure that Box continues to ship scalable services that meet the high performance expectations of our customers.
We are looking for big thinkers and innovators who have experience working with scalable distributed systems and have a passion for high performance and reliability. We are a small team with big ambitions that values impact and is not afraid of huge, gnarly problems. If this excites you, come join us!
WHAT YOU'LL DO
You're going to have the unique opportunity to build, improve, and support our Observability (o11y) platform. You will get to work with cutting-edge technologies that are defining the future of Box's cloud platforms. You will have visibility and impact across all of Engineering.
Provide o11y products such as ELK, Splunk, Sensu, Prometheus, AppDynamics, and Dynatrace to engineering teams for centralized logging, APM tooling, monitoring and alerting, and distributed tracing.
You'll collaborate with other engineers on the team to foster solid engineering principles and represent our engineering values.
As a senior member of the team, you'll use both technical and relational skills to lead large-scale projects to completion.
Manage, maintain and scale the infrastructure responsible for telemetry frameworks used throughout Box's infrastructure, cloud services, and products to capture, transport, store and analyze the telemetry data. Scale the observability infrastructure to support petabytes of logs and billions of metric data points daily.
You'll collaborate with, influence, and drive improvement across scrum teams.
You'll provide additional support and run proofs of concept (POCs) for new observability projects and frameworks.
Define observability best practices from an SRE perspective and educate platform consumers on them.
Participate in deep technical design discussions within your team, across partner teams, and ensure that we’re building the right systems.
WHO YOU ARE
You take an SRE-centric approach to everything you build and manage, ensuring reliability, availability, and security
You act like an owner and strive to do work you're proud of, both technically and in your team interactions
You are a self-starter and a strong supporter of self-service and automation within O11y (Observability)
Deep knowledge of operating system fundamentals (Linux) and core internet technologies, including TCP/IP, DNS, NAT, and SDN
Proven production service troubleshooting skills that span applications, systems and network within a primarily Linux environment
Solid understanding of infrastructure automation tools (Puppet, Ansible, or the like)
Experience in using industry standard DevOps CI/CD frameworks (Jenkins/Spinnaker, or the like)
Solid experience in building automation and frameworks, preferably with Python and Go
Experience in running containerized services in private or public cloud (GCP, AWS)
Experience in building and managing metrics- and data-driven observability platforms and peripherals
Experience in managing O11y (Observability) is a plus
You have a fair understanding of technologies like Elasticsearch, Apache Storm or other DAG technologies, and streaming technologies such as Kafka, Pub/Sub, or Kinesis.
You have built distributed, high-throughput and low-latency systems with a strong focus on availability, resilience, and durability.
- Remote Friendly
- Visit this webpage to check out all of our exciting healthcare benefits: https://join.collectivehealth.com/box
- For all other benefits, please check out: Box Benefits + Perks