Principal Site Reliability Engineer

Vancouver, BC

Arm Treasure Data

Posted 1 month ago

Job Description
Treasure Data began by offering data warehousing and processing services; since then we’ve moved further up the value chain with our Customer Data Platform application (CDP), which is seeing a lot of traction with customers new and old. This growth has prompted a greater focus on Site Reliability Engineering as we’ve grown past our current practices, and we’re looking to add a 9th member to our team. As such, you’ll play an essential role in maturing the company’s approach to service reliability and continuity.
You and the team will be directly responsible for platform solutions in these key areas: availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
This will require working with engineering teams on complex problems and projects where analyzing situations or data requires an in-depth evaluation of multiple factors, and wise trade-offs between competing factors, when arriving at a solution. Success in this role requires a passion for helping others and making their lives better; you do this by simplifying complex systems to make them understandable and operable. You are able to communicate decisions, ideas, designs, and the operation of systems and services clearly and concisely.
You are both a generalist, capable of picking up and working with multiple, disparate systems, and an expert, able to dive deep into specific topics and quickly master them. You move comfortably between system, service, and instance level views. You have a love of stateful systems containing Treasured data, ensuring we continue to protect customer data from loss caused by outages.

Things You Will Do

  • Build and maintain services, automation, and tooling that will positively impact the key areas (see above) and, with our team, be responsible for the systems you build.
  • Drive continuous improvement by measuring and reducing the amount of manual operational work.
  • Help us measure and improve reliability and performance across the product line by working with product owners and engineering teams.
  • Make wise decisions balancing availability and delivery, and communicate those decisions clearly.
  • Be an active participant in and internal evangelist for our shared processes, such as blameless post-mortems.
  • Work with engineering teams as a subject matter expert on operating software and systems at scale, teaching them from your experience or know-how, and helping them reach their goals.
  • Investigate system performance, errors, and problems.

Your Background and Skills Will Include

  • A minimum of 5 years of relevant working experience.
  • Experience building and maintaining software addressing key SRE areas of responsibility (see above).
  • Strong Software Engineering experience, with an ability to work in multiple programming languages.
  • Experience with Distributed Systems and operating them as they scale.
  • Experience operating services running in the cloud (AWS primarily) or virtualized API-driven platforms.
  • Articulate and personable with strong spoken and written English language abilities.
  • Knowledge and experience in Systems Engineering, Administration, and Operations.
  • Demonstrated ability to work independently and collaboratively as part of a specialized team.
  • Ability to slow down and communicate clearly and effectively across language barriers.

We Would Be Thrilled If You

  • Have experience automating datastore operations or datastores as a service.
  • Have crafted APIs and specifications that allow for future non-breaking changes while remaining backwards compatible for as long as possible.
  • Have experience analyzing system-wide performance: latency, throughput, and efficiency.
  • Are a student of complex systems theory and of how to build resilient and adaptive systems.
  • Are able to build services backed by BLOB, relational, and/or document data stores, currently: S3, PostgreSQL, and DynamoDB.
  • Have experience working as part of a distributed or partially distributed team and thrive in a highly collaborative and communicative work environment.
  • Pride yourself on giving back to your community: open source contributions, speaking, teaching, mentoring, or helping others.
  • Have experience speaking and/or writing Japanese.
Working at Treasure Data
You can expect a work environment where the team is collaborative and open to your ideas, while we keep our collective eye on supporting our customers’ needs. 
Our team is committed to technical innovation in our product and in the world through customer collaboration, open-source projects, and continuing to make our product an integral part of our customers’ growth and success.
We are an equal opportunity employer dedicated to building an inclusive and diverse workforce. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
About Us
Treasure Data provides an end-to-end, fully managed cloud service (data acquisition, storage, and analysis capability) for Big Data that is trusted and simple. As the original developers of Fluentd, an advanced open-source log collector specifically designed to solve the big data log collection problem, Treasure Data helps companies that want to manage their big data needs.
Agencies and recruiters, we cannot consider your candidate(s) without a contract in place. Any resumes received without having an active agreement will be considered gratis referrals to us. Thank you for your understanding and cooperation!
Job tags: AWS Open source PostgreSQL Reliability engineering S3
Job region(s): North America