Incident Commander, Cloud Infrastructure

United States (Remote)

HashiCorp

HashiCorp delivers consistent workflows to provision, secure, connect, and run any infrastructure for any application.

About HashiCorp

HashiCorp is a fast-growing company that solves development, operations, and security challenges in infrastructure so organizations can focus on business-critical tasks. We build products to give organizations a consistent way to manage their move to cloud-based IT infrastructures for running their applications. Our products enable companies large and small to mix and match AWS, Microsoft Azure, Google Cloud, and other clouds as well as on-premises environments, easing their ability to deliver new applications for their business.

About the role...

HashiCorp is looking for an Incident Commander for our Global Support Engineering Organization. This highly visible position will be an integral part of the Support Engineering management team and will initially report to the SVP of Global Support. You are a fit if you thrive in a fast-paced environment that values crucial communication, alignment with our company's core principles, collaboration, and results.

This is a senior role at HashiCorp requiring an individual who can take charge in escalated code-red situations and give direction to both customer personnel and HashiCorp engineers to drive the resolution of critical incidents (catastrophic failures). We are looking for a natural leader and a confident decision-maker who has strong problem-solving skills in cloud environments and is experienced in SRE best practices.

As a member of our global Incident Response Team, this individual will be responsible for all aspects of our emergency response to critical outages occurring in our customer environments. This includes quickly developing incident objectives, managing all incident operations, allocating resources, and taking responsibility for everyone involved.

In this role, you can expect to...

  • Take command of incidents by setting up or taking over a cross-functional technical investigation with internal and external stakeholders. HashiCorp conducts its critical investigations both in asynchronous communication tools and via videoconference (e.g., Zoom). This role leads the Zoom call and coordinates the various Zoom rooms, working with highly technical subject matter experts internally and on the customer side.
  • Lead the effort to bring impacted systems back online by coordinating investigation and resolution of technical issues, from hands-on investigations with product engineering teams to directing workarounds and failovers for complex environments.
  • Work with HashiCorp SMEs (Support & Engineering) and with the customer Platform/DevOps teams to build an incident action plan and, if needed, a restoration plan
  • Provide direction and time management to keep the resolution effort on track and moving forward
  • Draft and send regular communications to keep all stakeholders, both internal and external, aware of the latest status, progress made thus far, and action items
  • Own the technical incident retrospective process by assembling the correct technical teams and working with HashiCorp Customer Success teams for permanent remediation and recommendations to the customer
  • Work closely with Engineering to improve our products' monitoring and observability capabilities and their debuggability to decrease the Time-To-Detection (TTD) and Time-To-Restoration (TTR)
  • Work closely with our Customer Success team to drive changes in customer environments aimed at improving their robustness and scale, ideally following our products' best practices
  • Develop and continuously update our Incident Playbooks 
  • Ensure internal readiness at all times by leading training sessions, simulations, and drills
  • Be part of the Incident Response Team on-call rotation and ensure flawless handover of critical issues to other regions 
  • Travel (<5%)

You may be a good fit for our team if you have...

  • 10 years of overall experience with 5+ years of proven experience within SRE, Operations, DevOps, Engineering, or Technical Support teams.
  • 3+ years of experience as an Incident Commander or Escalation Manager
  • Strong leadership skills, able to take command in a highly escalated situation
  • Executive-level communication skills, able to conduct high-level retrospectives and RCA discussions internally and with customers, able to communicate clearly and effectively to technical and business audiences, able to collaborate with various partners including senior leadership and multi-functional teams
  • Excellent problem-solving, analytical, and troubleshooting skills, especially in multi-cloud environments (AWS, Azure, GCP) with complex deployment architectures (multiple-cluster, HA, DR)
  • Strong influencing, negotiation, and mediation skills to be able to steer the customer towards the optimal solution
  • Demonstrable knowledge of incident management frameworks (e.g., ITIL) and best practices
  • Experience with major cloud platforms (AWS, Azure, GCP), distributed systems, microservice architecture, and containers
  • Experience with scripting tools (for example, Bash, Python), REST APIs, and command-line tools
  • Bachelor’s degree in Computer Science, IT, or equivalent professional experience

HashiCorp embraces diversity and equal opportunity. We are committed to building a team that represents a variety of backgrounds, perspectives, and skills. We believe the more inclusive we are, the better our company will be.

 

For more information regarding how HashiCorp collects, uses, and manages personal information, please review our Privacy Policy.

 

Job perks/benefits: Startup environment
Job region(s): Remote/Anywhere North America