Senior Cloud Engineer

Bengaluru, Karnataka, India

Full Time Senior-level / Expert
Springboard

Posted 3 weeks ago

The Company
Springboard is redefining professional education for the 21st century through immersive, mentor-supported courses in cutting-edge fields like data science and design. Our self-paced, online offerings give anyone, anywhere access to world-class learning resources, with an emphasis on project-based learning, industry-relevant curriculum, and tangible outcomes. Through this hybrid approach, we’ve helped thousands of learners revamp their careers and, by extension, their lives.
The Opportunity
As a Senior Site Reliability Engineer (SRE) at Springboard, you will be a key member of our cloud infrastructure and tech-operations initiatives. You will draw on your diverse background in operations, cloud, systems engineering, and monitoring to ensure the uptime, reliability, efficiency, and health of our web services in staging and production. You’ll learn quickly, be hands-on, own key processes, and continuously improve the quality of our services and operations as we scale.

Responsibilities

  • Being the primary owner of the reliability, health, and performance of our cloud infrastructure and services, and ensuring uptime.
  • Implementing cloud infrastructure strategies, network configurations, and Kubernetes cluster configurations for security, scale, performance, reliability, and efficiency.
  • Gaining a deep understanding of Springboard’s application ecosystem and services. Setting up monitoring and telemetry (logs, metrics, events) on production systems and deployment pipelines to make product rollouts smoother and execution more efficient.
  • Thinking, innovating, and engineering solutions to anticipate, detect, and solve complex problems. Conducting tests and validating observations. Developing scripts and custom tools where conventional tools fall short. Evaluating solutions and building proofs of concept to improve the system.
  • Owning infrastructure operations: managing and addressing requests from engineering teams, and defining, implementing, and streamlining processes for service and audit.
  • Learning and improving continuously, and setting a high bar for quality while advocating and adopting industry engineering best practices. Identifying bottlenecks and making recommendations to the engineering team to improve security and reliability.
  • Using your excellent communication, empathy, and training skills to grow an engineering team into a strong SRE function at Springboard.

You:

  • Must have 5+ years of experience in site reliability engineering, including cloud infrastructure management and administration responsibilities in a production environment.
  • Must be an expert in Kubernetes, with hands-on experience managing production clusters, including self-healing, auto-scaling, load balancing, probes, and volumes.
  • Must have experience with VPCs, IAM, load balancers, DNS, API gateways, firewalls, relational databases, blob stores, and managed services from a major cloud provider such as AWS, Azure, or GCP (preferred).
  • Must have excellent Linux system-administration knowledge, covering tools, system health and performance, services and daemons, and containers (Docker).
  • Must know shell scripting (Bash) and be highly familiar with Linux command-line tools.
  • Are experienced with infrastructure monitoring using tools such as Datadog, Grafana, and Splunk. You are comfortable setting them up from scratch and debugging issues with them.
  • Have operational experience with web-based applications, REST APIs, GraphQL, authentication, certificate management, CDNs, and similar technologies.
  • Have operational experience with Git, semantic versioning, CI/CD, and package-management tools for Linux distributions, languages, and Kubernetes, such as npm, pip, and Helm.
  • Like to automate repetitive tasks and follow the KISS and DRY principles. You either know, or are interested in learning, a language such as JavaScript or Python.
  • Operate with minimal supervision. You are meticulous, you prioritize tasks well, you reason objectively, you are decisive, and you are an excellent communicator.
  • Are a self-learner who seeks to improve constantly. You share your knowledge generously and mentor engineers to meet and exceed your standards. You strive to learn and implement best practices, and to define policies and guidelines for the team.
  • Are passionate about SRE, curious about technology, and keen to hone your skills. You aim to learn, grow, and excel as a site reliability engineer.
  • Will be a preferred candidate if you hold a Certified Kubernetes Administrator (CKA) certification or a cloud certification from AWS or, preferably, Google.
The Springboard team of 150 works out of offices in the heart of San Francisco and Bengaluru. We’re backed by top investors, including Costanoa Ventures, Learn Capital, 500 Startups, Rocketship.vc, and the founders of LinkedIn, Princeton Review, InMobi, and AppDynamics. Working with us, you’ll enjoy competitive compensation, medical insurance for you and your dependents, a generous learning budget, team lunches and snacks, and an opportunity to impact thousands of lives alongside a fun, dedicated and mission-driven team. To learn more about our team and culture, follow us on Instagram @springboardlife! We are an equal opportunity employer and value diversity at our company. We welcome applications from all backgrounds and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
Job tags: AWS Azure Bash CD CI Docker GCP Git GraphQL JavaScript Kubernetes Linux Load Balancing Python Reliability engineering REST
Job region(s): Asia/Pacific