Site Reliability Engineer
Bengaluru, Karnataka, India
6sense helps B2B marketing and sales organizations fully understand the complex ABM buyer journey. By combining intent signals from every channel with the industry’s most advanced AI predictive capabilities, 6sense makes it possible to predict account demand and optimize demand generation in an ABM world. Equipped with the power of AI and the 6sense Demand Platform™, marketing and sales professionals can uncover, accelerate, and capture buyer demand to drive more revenue.
Infrastructure Software Engineers at 6sense are true hybrid developers and operations engineers. They are responsible for ensuring our services and infrastructure are fast, stable, and scalable. They build out any services and tooling we need that are not readily available via third-party packages or services. They provide guidance on best practices to the overall Software Engineering team.
Operational tasks such as infrastructure, build/release, CI/CD, database administration, and systems administration also fall within their realm of responsibilities.
The Reliability team focuses on the automation, integration, operation, and overall improvement of our monitoring, logging, and alerting services to ensure we can deliver product quickly, safely, and reliably.
- Develop and deploy services to improve the availability, ease of use/management, and visibility of 6sense systems
- Build and scale out our services and infrastructure
- Learn and adopt technologies that may aid in solving our challenges
- Own our monitoring, logging, and alerting tools used by the overall Software Engineering team in order to ensure we are meeting reliability requirements
- Write/review/debug production code, develop documentation and capacity plans, and debug live production problems
- Contribute back to open-source projects when we need to add or patch functionality
- Support the overall Software Engineering team to resolve any issues they encounter
- Help respond to service issues and determine how to automatically alert the responsible parties, with enough context to make the service owner a self-sufficient first responder
- Act as first responder to issues with shared infrastructure and escalate to other team members as necessary
- Write configurations and scripts to pull data into our monitoring/logging/alerting systems
- Work with other teams to put automatic resolutions in place and reduce the need for human response
- Participate in on-call rotations to monitor platform/infrastructure issues
- 2+ years in a Software Engineering role or equivalent experience
- 2+ years in a reliability-type role (such as Site Reliability Engineering)
- 2+ years of experience with Linux/Unix system administration and networking fundamentals
- Strong coding fundamentals and good code-reading skills
- Good knowledge of Python and Java
- Experience monitoring and analyzing services/applications in a service-oriented architecture at the network/server level as well as in containerized environments (such as Kubernetes and Docker)
- Experience with high-availability systems
- Experience using and configuring monitoring systems such as Datadog, Grafana, Grafana Loki, Prometheus, and Sumo Logic
- Knowledge of the Hadoop ecosystem (e.g. Hadoop, Hive, Presto) including deployment, scaling, and maintenance
- Knowledge of standard security practices