Systems Reliability Engineering Manager - New Products
At Cloudflare, we have our eyes set on an ambitious goal: to help build a better Internet. Today the company runs one of the world’s largest networks that powers approximately 25 million Internet properties, for customers ranging from individual bloggers to SMBs to Fortune 500 companies. Cloudflare protects and accelerates any Internet application online without adding hardware, installing software, or changing a line of code. Internet properties powered by Cloudflare all have web traffic routed through its intelligent global network, which gets smarter with every request. As a result, they see significant improvement in performance and a decrease in spam and other attacks. Cloudflare was named to Entrepreneur Magazine’s Top Company Cultures list and ranked among the World’s Most Innovative Companies by Fast Company.
We realize people do not fit into neat boxes. We are looking for curious and empathetic individuals who are committed to developing themselves and learning new skills, and we are ready to help you do that. We cannot complete our mission without building a diverse and inclusive team. We hire the best people based on an evaluation of their potential and support them throughout their time at Cloudflare. Come join us!
About the department
As part of the Cloudflare Engineering organization, SREs are primarily responsible for production reliability. SREs are based in San Francisco, London, Singapore, Austin and Lisbon, and use this global distribution to provide follow-the-sun coverage, which allows work to be focused in business hours in each location.
SREs are supported by all engineering teams at Cloudflare, who participate in on-call schedules for their services. The SRE teams facilitate remediation and follow-up of production issues and mature the tooling so that all engineering teams can self-serve in production. Incident follow-up work across all engineering teams is prioritized above product innovation, and the impact of a production incident influences the priority of its follow-up.
Currently SREs support two main environments. Edge SRE is focused on edge distribution, where most client traffic is served. Core SRE is focused on core services such as the control plane, the data pipeline and other supporting services.
Edge SRE project work is organized in four development areas: Platform Engineering, Production Tooling, Hardware Lifecycle and Observability.
About the Role
An engineering manager role at Cloudflare provides an opportunity to address some big challenges, at scale. We believe that with our talented team, we can solve some of the biggest security, reliability and performance problems facing the Internet. Just how big?
- We have in excess of 59 Terabits of network transit capacity.
- Our network spans more than 200 cities in over 100 countries, including 14 cities in mainland China.
- We serve 21 million HTTP requests per second on average, with more than 28 million HTTP requests per second at peak.
- We consistently do approximately 8.5 million DNS queries per second. That's around 738 billion queries per day, and 22 trillion queries a month.
- In Q4’20 Cloudflare blocked an average of 57 billion cyber threats each day. That is 7 billion more per day, on average, than in the fourth quarter of 2019.
What you'll do
We are looking for a talented Systems Reliability leader to build and operate automated systems and tools to help Cloudflare continue to scale. Our SREs come from a variety of technical backgrounds and have built up their knowledge working in different environments. But the common factors across all of our reliability-focused engineers include a passion for automation, scalability, and operational excellence.
You will build tools to automate operational tasks, streamline deployment processes and provide a platform for new unannounced products. You will nurture a passion for an “automate everything” approach that makes systems failure-resistant and ready-to-scale. You will be required to play a key role in system design and demonstrate the ability to bring an idea from design all the way to production.
Many of our SREs have had the opportunity to work at multiple offices on interim and long-term project assignments. The ideal SRE candidate has a passionate curiosity about how the Internet fundamentally works and has a strong knowledge of Linux and Internet protocols along with strong coding ability in Go, Python and Bash. Some other tools that we use: Nginx, Salt, Kubernetes, PostgreSQL, Docker, Prometheus, Consul & Nomad.
Examples of desirable skills, knowledge and experience
- Linux systems administration experience
- 3 years of relevant management or leadership experience
- Experience managing systems that store critical customer data
- Strong software development skills in Go and Python
- Strong skills in network services, including DNS, TLS/SSL and HTTP
- Network fundamentals: DHCP, ARP, subnetting, routing, firewalls, IPv6
- Ability to handle a project from design phase to completion
- Experience with the Linux kernel and Linux software packaging
- Experience dealing with bare metal hardware
- Performance analysis and debugging with tools like perf, sar, strace, dtrace
- Configuration management systems such as Saltstack, Chef, Puppet or Ansible
- Load balancing and reverse proxies such as Nginx, Varnish, HAProxy, Apache
- SQL databases (Postgres or MySQL)
- Time series databases and monitoring tools (Prometheus, Grafana, Thanos, ClickHouse)
- Internetworking and BGP
- Experience with network programming in C, C++ or Go
- Experience with continuous / rapid release engineering
- Strong tooling and automation development experience
- Experience working in a 24/7/365 service environment
- High-bandwidth transit internetworking and routing experience
What Makes Cloudflare Special?
We’re not just a highly ambitious, large-scale technology company. We’re a highly ambitious, large-scale technology company with a soul. Fundamental to our mission to help build a better Internet is protecting the free and open Internet.
Project Galileo: We equip politically and artistically important organizations and journalists with powerful tools to defend themselves against attacks that would otherwise censor their work. This is technology already used by Cloudflare’s enterprise customers, provided at no cost.
Athenian Project: We created Athenian Project to ensure that state and local governments have the highest level of protection and reliability for free, so that their constituents have access to election information and voter registration.
Path Forward Partnership: Since 2016, we have partnered with Path Forward, a nonprofit organization, to create 16-week positions for mid-career professionals who want to get back to the workplace after taking time off to care for a child, parent, or loved one.
1.1.1.1: We released 1.1.1.1 to help fix the foundation of the Internet by building a faster, more secure and privacy-centric public DNS resolver. This is available publicly for everyone to use - it is the first consumer-focused service Cloudflare has ever released. Here’s the deal - we never, ever store client IP addresses. We will continue to abide by our privacy commitment and ensure that no user data is sold to advertisers or used to target consumers.
Sound like something you’d like to be a part of? We’d love to hear from you!
This position may require access to information protected under U.S. export control laws, including the U.S. Export Administration Regulations. Please note that any offer of employment may be conditioned on your authorization to receive software or technology controlled under these U.S. export laws without sponsorship for an export license.
Cloudflare is proud to be an equal opportunity employer. We are committed to providing equal employment opportunity for all people and place great value in both diversity and inclusiveness. All qualified applicants will be considered for employment without regard to their, or any other person's, perceived or actual race, color, religion, sex, gender, gender identity, gender expression, sexual orientation, national origin, ancestry, citizenship, age, physical or mental disability, medical condition, family care status, or any other basis protected by law. We are an AA/Veterans/Disabled Employer.
Cloudflare provides reasonable accommodations to qualified individuals with disabilities. Please tell us if you require a reasonable accommodation to apply for a job. Examples of reasonable accommodations include, but are not limited to, changing the application process, providing documents in an alternate format, using a sign language interpreter, or using specialized equipment. If you require a reasonable accommodation to apply for a job, please contact us via e-mail at email@example.com or via mail at 101 Townsend St. San Francisco, CA 94107.