Site Reliability Engineering - Incident & CIRT Manager

Remote, USA


Posted 1 month ago

This position involves critical duties and responsibilities that must continue to be performed during crisis situations and contingency operations, which may necessitate extended hours of work. This position is expected to perform incident response and forensic analysis on advanced cases while leading the team. You will establish Tier 1 and Tier 2 support teams to provide customers with world-class service. You will create appropriate off-hours coverage and escalation procedures to ensure that our support services exceed customer expectations and meet or exceed our service level agreements. You will coordinate with customers as well as internal stakeholders to ensure timely resolution and closure of issues. You will establish baseline performance metrics and monitoring capabilities, and drive continuous improvement efforts to increase customer satisfaction.


  • Own Addepar’s full incident response lifecycle, from defining and updating processes to coaching teams on incident management
  • Supervise, coach, and mentor Critical Incident Response Team Members
  • Establish and lead Tier 1, Tier 2, and off-hours support teams
  • Collaborate with teams to explore the changing limits of their systems and help drive prioritization decisions
  • Lead initiatives that focus on process improvements, risk mitigation, and improving customer experience
  • Collect data, analyse trends, and identify patterns of risks and vulnerabilities
  • Create detailed reports of investigative activity for consumption by internal and external organizations
  • Maintain, develop and report on metrics relative to Critical Incident Response Team activities for monthly business and flash reporting 
  • Socialize lessons learned among technology and business teams
  • Join our Incident Commander rotation leading incidents to completion
  • Drive post-incident investigations and analysis by conducting interviews, identifying contributing factors, reviewing incident response, and establishing remediation plans
  • Partner with Product & Technology leadership to help improve response during outages by advocating for the balance of reliability enhancements with feature work
  • Understand customer business processes and usage of our software
  • Build and maintain successful relationships with existing and prospective members
  • Ensure successful transition from implementation processes into core support
  • Excellent problem solving and critical thinking skills, and ability to function and communicate under pressure 
  • Coordinate and lead efforts involving incident response and root cause analysis for Site Reliability Engineering, Production Engineering organizations and represent Critical Incident Response Team in these engagements
  • Recommend improvements to policies, procedures, technologies, tools, techniques, and operational efficiencies
  • Establish, implement, and continuously improve processes for issue intake, management, progress, and resolution
  • Establish and promote a culture of problem solving within the support teams
  • Define, implement, and monitor baseline metrics and targets for appropriate support measurements including service level agreements
  • Create and maintain a knowledge base to drive customer self-help and quicker resolution on common issues
  • Implement chat capabilities and automation where possible within our support systems
  • Coordinate with key stakeholders in product and development to ensure customer issues are properly prioritized, managed, and resolved
  • Coordinate with quality assurance to eliminate bug duplication and enhance our overall testing processes
  • Regularly research issue data to identify potential areas of concern or opportunities for improvement
  • Implement overall customer health scorecard to gauge customer engagement and satisfaction
  • Establish and lead a Product Advisory Council
  • Lead and participate in quarterly leadership reviews to ensure alignment and value for our customers
  • Demonstrate outstanding communication, flexibility, teamwork and leadership
  • Participate in management and executive-level meetings and debriefs, presenting and speaking to KPIs, metrics, and uptime performance data
  • Other tasks and responsibilities as assigned  

Knowledge & Skills

  • Excellent interpersonal communication skills and professional demeanor
  • 10+ years of full-time experience in applications, root cause analysis, and incident response
  • 7+ years leading and managing people
  • Experience delivering technical presentations and reports and ability to articulate highly technical processes and information to a non-technical audience
  • Thorough understanding of advanced principles, theories, standards, practices, protocols, and procedures used in Digital Forensics / Incident Response
  • Understand various operating systems and cloud systems (Linux/Unix, AWS) and command line tools, network protocols, and TCP/IP fundamentals
  • Ability to conduct application and DevOps root cause analysis of single-instance, multi-tenant applications and decoupled cloud architectures
  • Understanding of incident command and operations principles
  • Experience with customer engagement/customer facing roles 
  • Project Management or Agile certification a plus
  • Ability to maintain strict confidentiality
  • BS or MS degree in Computer Science or a related technical field, or equivalent industry experience
  • Understanding of and experience with product/project management and issue tracking systems such as Jira and Smartsheet


Addepar is a wealth management platform that specializes in data aggregation, analytics and reporting for even the most complex investment portfolios. Founded in 2009 by Joe Lonsdale, who currently serves as an active Chairman of its Board of Directors and General Partner at 8VC, the company's platform aggregates portfolio, market and client data all in one place. It provides asset owners and advisors a clearer financial picture at every level, allowing them to make more informed and timely investment decisions. Addepar works with hundreds of leading financial advisors, family offices and large financial institutions that manage data for over $2 trillion of assets on the company's platform. In 2020, Addepar was named as a Forbes Fintech 50 company and honored as a member of the CB Insights Fintech 250. Addepar is headquartered in Silicon Valley and has offices in New York City and Salt Lake City. All brokerage services offered through Acervus Securities Inc., member FINRA / SIPC.

Addepar is proud to be an equal opportunity employer. We seek to bring together diverse ideas, experiences, skill sets, perspectives, backgrounds, and identities to drive innovative solutions. We commit to promoting a welcoming environment where inclusion and belonging are held as a shared responsibility.

In order to ensure the health and safety of all Addepeeps and our prospective candidates, we have instituted a virtual interview and onboarding experience.

Job tags: AWS Jira Linux Reliability engineering Salt Unix Vulnerabilities
Job region(s): North America Remote/Anywhere