Site Reliability Engineer - REMOTE
Remote - Charlotte, North Carolina, United States
Quimbee is growing! We’re looking to add a new full-time member to our core team.
This position is 100% remote (U.S. only). All you need is an internet connection and a quiet place to work.
Who We Are
Founded in 2007, Quimbee is one of the most widely used and respected study aids for law students. With a massive and growing library of case briefs, video lessons, definitions, and practice questions, Quimbee helps its members achieve academic success in law school and on the bar exam. With our newest products, we now are instrumental in helping attorneys meet their CLE requirements.
We prefer a small and highly effective engineering team, so every new team member is vital to the success of the company.
We are looking for an experienced site reliability engineer (SRE). As an SRE, you must have strong experience with Ruby on Rails-based applications. Ideally, you have experience ensuring uptime for such applications as well as driving DevOps tasks related to CI/CD. Experience contributing to application codebases is strongly preferred as well. Your primary focus will be on improving our deployment practices and on maintaining, troubleshooting, documenting, and improving the systems that keep our Heroku-hosted platform running securely and smoothly with the least downtime possible. We might also consider alternative hosting platforms in the future, and we expect you to help with that too. There will be a lot of monitoring, alerting, and prioritizing what is worth our attention and what's not. You're expected to investigate and mitigate single points of failure, performance bottlenecks, slow SQL queries, errors, and any other identified issues, solving them yourself or with the help of the other developers on the team.
You'll have the opportunity to help us define and shape processes, tools, and best practices in the context of our platform. You'll work closely with our team of developers to determine the current state of our platform and define its future. Strong candidates will bring solid engineering and operations acumen, combined with the ability to move fast (and fix things).
We're looking for collaborative, responsive, communicative, and detail-oriented people who are ready for a challenge. In this role, you'll be responsible for working on the critical task of ensuring our backend systems are rock solid and scalable.
You’ll join a small, 100% remote tech team. Your voice will be heard when we need to make new technical decisions as our product grows. We expect you to go beyond coding to give input on the product roadmap, design, and architecture.
Who You Are
- An experienced SRE. You have experience in a position of responsibility for a web application’s uptime with demonstrated success in ensuring the accessibility and performance of web applications.
- A Ruby developer. You have software engineering experience and are comfortable debugging, optimizing, and writing code in Ruby.
- A Heroku expert. You have experience optimizing its usage and configuration for a SaaS platform. You have experience monitoring and proactively recognizing performance concerns within Heroku.
- A DevOps advocate. You believe in the benefits of immutable infrastructure and understand what it takes to implement it from the operating-system level up to datacenter deployments.
- A data-driven engineer. You know the difference between MTTR and MTTD and have the skills necessary to optimize both.
- A great process and code debugger. You feel comfortable leading robust and thorough root cause analysis (RCA) sessions to attack problems at their core and ensure they don’t recur.
- A self-starter. You take responsibility for projects from idea to completion, proactively seeking assistance as needed while guiding the work to successful outcomes.
- A versatile engineer. You know what you don’t know and feel comfortable learning new skills. You’re not ashamed to acknowledge mistakes, and you take measures to avoid repeating them.
- A team player. You share code ownership as much as possible. You don't mind fixing other people’s code or stepping in to help a teammate.
- A minimalist. You believe a new feature should be built only when the evidence supports it. You’re willing to push back when you believe this rule is being ignored or violated.
- A great communicator. You communicate your ideas, feedback, and criticism thoroughly, clearly, and courteously. You believe there’s no such thing as over-explaining or over-clarifying because that’s how miscommunication is avoided.
Working with us, you could be asked to (solo or as part of a team):
- Create and maintain documentation about our platform and all the third-party services it depends on, defining a plan of development for failover mechanisms to improve our platform's resilience.
- Proactively monitor the health and performance of our infrastructure to identify, and ideally address, potential infrastructure-related issues before they affect more users.
- Investigate issues reported by our automated systems or by our customer support or QA teams, determine their impact and root cause, then prioritize and document them, and either solve them yourself when possible or work with our dev team to solve them.
- Streamline our deployment process so that deployments are as smooth as possible for both our users and our teams, including when we have to roll back.
- Educate engineers throughout the company on how to ensure their projects meet our reliability, performance, and security requirements.
- Reduce the server-side and front-end latency of our application to deliver a lightning-fast user experience.
- Optimize our hosting bill by increasing throughput and resource efficiency, while planning capacity for the next two years of growth.
- Determine and configure a core set of metrics and alerts to make sure our apps and servers are running smoothly and that we can react fast if something bad happens.
- Develop and maintain performance and load tests.
- Write code as it pertains to platform support or reducing technical debt, which includes upgrading dependencies, refactoring code, removing unused code, and resolving minor bugs.
- Possible on-call responsibilities.
Qualifications
- Experience hosting apps on Heroku, monitoring them, and scaling them up and down
- B.S. in computer science or a related field
- 2+ years of software-engineering experience
- 2+ years of site-reliability engineering (or similar) experience
- 1+ years of direct Ruby on Rails experience
- Experience with AWS tools such as S3, ECS, Route 53, Elastic Beanstalk, and CloudWatch
- Experience with Cloudflare or a similar CDN service
- Experience with CircleCI, Jenkins, Travis, or a similar CI/CD tool
- Strong experience profiling and optimizing applications for speed and resource consumption
- Extensive Git (or similar) experience, including resolving complex merge conflicts
- Know how the web works under the hood: TCP, HTTP, DNS, IP, caches, etc.
- Experience working on a SaaS application or with subscription-based businesses generally
- Native fluency in English
- U.S. based
- Experience with New Relic or a similar performance monitoring tool
- Experience configuring and managing third-party integrations
- Experience working on a remote team
- Experience with Rollbar or a similar automated error monitoring tool
- Experience and a strong interest in cybersecurity
Benefits
- 100% remote. That’s one of the biggies. No more commute!
- Profit share. We set aside a percentage of profits each year and then pay them out across the entire team.
- Group health insurance coverage.
- 401(k) matching up to 4% (100% matching on the first 3% of contributions and 50% matching between 3% and 5%).
- Unlimited paid time off. Our philosophy is that if you feel you need time off (for example, because of overwork, sickness, personal matters, etc.), we’re not going to question that. We just ask that you don’t abuse it and that you give us at least two weeks’ notice if you plan to be away.