Systems Reliability Engineer - Hardware Platforms
About the department
The Core Metal Team will be the Hardware team's primary point of contact for new hardware specifications, and the two teams will work together to evaluate new hardware. The Hardware team will be our point of contact for escalations to the hardware manufacturers and for specific requirements for hardware orders (BIOS settings, firmware versions, DMI information). We will also work toward the eventual goal of having the manufacturers run our acceptance test suite prior to delivering hardware.
What you'll do
An engineering role at Cloudflare provides an opportunity to address some big challenges, at scale. We believe that with our talented team, we can solve some of the biggest security, reliability and performance problems facing the Internet. Just how big?
- We have in excess of 42 Terabits of network transit capacity
- We operate data centers in more than 200 cities in over 100 countries
- We serve 18 million HTTP requests per second on average, with more than 22 million HTTP requests per second at peak
- We interconnect with over 8,800 networks globally, including major ISPs, cloud services, and enterprises
- Any time we push code, it affects over 200 million Internet users
- More than 1 billion unique IP addresses pass through Cloudflare's network every day
We are looking for talented Systems Reliability Engineers to build and operate the platform that leads Cloudflare customers to place their trust in us. Our SREs come from a variety of technical backgrounds and have built up their knowledge working in different environments. The common factors across all of our reliability-focused engineers include a passion for automation, scalability, and operational excellence.
We are still a small team, well-funded, growing quickly and focused on building an extraordinary company. This is a superb opportunity to join a high-performing team and scale our high-growth network as Cloudflare’s business grows. You will build tools to constantly improve availability, performance, uptime and response times. You will nurture a passion for an “automate everything” approach that makes systems failure-resistant and ready-to-scale.
Cloudflare SREs work in either the Core organization or the Edge organization. This role is within a sub-team of Core SRE which is responsible for hardware and data center infrastructure automation and management, building the layer between the physical infrastructure and the services that Engineering and other SREs use on a day-to-day basis.
You will work on projects ranging from low-level work (e.g. BIOS, kernel, BMC) to full software development (APIs for hardware resource management, provisioning automation, etc.) to help manage and scale our infrastructure.
This is a highly collaborative role, working closely with other SRE, Software Engineering, Hardware Engineering, Capacity Planning, and Platform teams. Strong communication skills are required.
Examples of desirable skills, knowledge and experience
- Linux systems administration experience
- 3 years of relevant Site Reliability Engineering experience
- Proficient in one or more programming languages and willing to learn new ones when required (Go, Python, and Rust are the primary languages we use)
- Good understanding of software development fundamentals (e.g. OOP, design patterns)
- Understanding of network services, including DNS, TLS/SSL, HTTP; fundamentals of DHCP, ARP, subnetting, routing, firewalls, IPv6
- Experience with the Linux kernel and Linux software packaging
- Performance analysis and debugging with tools like perf, sar, strace, dtrace
- Configuration management systems such as Saltstack, Chef, Puppet or Ansible
- Load balancing and reverse proxies such as Nginx, Varnish, HAProxy, Apache
- Time series databases and monitoring tools (Prometheus, Graphite, Grafana)
- SQL databases (Postgres or MySQL)
- Experience with continuous / rapid release engineering
- Experience working in a 24/7/365 service environment
- Experience working at the interface between hardware and software
- Experience automating bare metal hardware at scale (provisioning, diagnosis and remediation, firmware, observability, etc.)
- Familiarity with data center infrastructure: power, cooling, fiber, DCIM
- Internetworking and BGP
Some tools that we use
- Python, Go, Rust
What Makes Cloudflare Special?
We’re not just a highly ambitious, large-scale technology company. We’re a highly ambitious, large-scale technology company with a soul. Fundamental to our mission to help build a better Internet is protecting the free and open Internet.
Project Galileo: We equip politically and artistically important organizations and journalists with powerful tools to defend themselves against attacks that would otherwise censor their work. This is technology already used by Cloudflare’s enterprise customers, provided at no cost.
Athenian Project: We created the Athenian Project to ensure that state and local governments have the highest level of protection and reliability for free, so that their constituents have access to election information and voter registration.
Path Forward Partnership: Since 2016, we have partnered with Path Forward, a nonprofit organization, to create 16-week positions for mid-career professionals who want to get back to the workplace after taking time off to care for a child, parent, or loved one.
Sound like something you’d like to be a part of? We’d love to hear from you!
This position may require access to information protected under U.S. export control laws, including the U.S. Export Administration Regulations. Please note that any offer of employment may be conditioned on your authorization to receive software or technology controlled under these U.S. export laws without sponsorship for an export license.
Cloudflare is proud to be an equal opportunity employer. We are committed to providing equal employment opportunity for all people and place great value in both diversity and inclusiveness. All qualified applicants will be considered for employment without regard to their, or any other person's, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law. We are an AA/Veterans/Disabled Employer.
Cloudflare provides reasonable accommodations to qualified individuals with disabilities. Please tell us if you require a reasonable accommodation to apply for a job. Examples of reasonable accommodations include, but are not limited to, changing the application process, providing documents in an alternate format, using a sign language interpreter, or using specialized equipment. If you require a reasonable accommodation to apply for a job, please contact us via e-mail at firstname.lastname@example.org or via mail at 101 Townsend St., San Francisco, CA 94107.