Watson AI Site Reliability Engineering Team Manager

Bangalore, Karnataka, IN

Posted 2 weeks ago

Software Developers at IBM are the backbone of our strategic initiatives to design, code, test, and provide industry-leading solutions that make the world run today: planes and trains take off on time, bank transactions complete in the blink of an eye, and the world remains safe because of the work our software developers do. Whether you are working on projects internally or for a client, software development is critical to the success of IBM and our clients worldwide. At IBM, you will use the latest software development tools, techniques and approaches and work with leading minds in the industry to build solutions you can be proud of.

Your Role and Responsibilities
Ready to grow your career in the cloud? Do you like the feeling that you are making a difference?
This is your chance to be a leader of a dynamic team of talented professionals deploying and maintaining innovative, industry-leading, cloud-based software.

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE is a key role in our growing and dynamic IBM Watson Cognitive AI, Planning Analytics, and Cognos Analytics business on Cloud. This leadership role is focused on managing a group of SREs who deploy, maintain, and automate a wide range of operational tasks for the IBM Watson Cognitive AI, Planning Analytics, and Cognos Analytics services on IBM Cloud environments. You will work collaboratively with the entire global cloud organization and IBM vendors to support, maintain, and operationally improve the reliability of these services.

The Watson AI, Planning and Cognos Analytics Site Reliability Engineering Manager is responsible for:
  • Ensuring that the team of local SREs provides optimal production environment support and deployment for Watson AI, Planning and Cognos Analytics services in the IBM Cloud public regions and dedicated environments.
  • Driving the incident management process and supporting a blameless post-mortem culture.
  • Partnering with development teams to improve services via rigorous testing and release procedures.
  • Overseeing a team of developers working on automation for deployments, upgrades and self-remediation.
  • Ensuring that the local team is aware of and adheres to IBM Cloud processes and security/compliance initiatives.
  • Developing metrics and reports that drive improvements in the availability of the Watson AI, Planning and Cognos Analytics services and in SRE team effectiveness.

Required Technical and Professional Expertise
  • 2+ years of experience managing a team of developers working on software engineering, software development, or system operations
  • Ability to multi-task and solve operational issues prior to and during customer impacting events.
  • Strong communication skills: ability to share (often via Slack and Webex) observations and ideas for diagnosing and preventing issues, or for improving SRE processes to shorten diagnosis and resolution.
  • Ability to observe operational support techniques and make improvements to SRE processes.
  • Capability to work in a global, multicultural and diverse environment
  • Ability to work AP shift hours (22:00-06:00 UTC from March to October, 23:00-07:00 UTC from November to February)
  • Ability to serve as Emergency Response Manager regularly during the AP shift and on weekends on a rotation basis (once every 5 weeks)
  • Experience with Agile methodologies including sprint planning, GitHub Enterprise and ZenHub

Preferred Technical and Professional Expertise
  • Experience with customer escalations and/or operations war room.
  • Experience with troubleshooting issues in production systems
  • Experience with DevOps engineering or SRE
  • Experience using Watson AI services (especially Watson Assistant and Watson Discovery)
  • Experience with cloud technologies such as Docker, Kubernetes and OpenShift
  • Experience working with IBM Cloud (Bluemix) UI/CLI
  • Knowledge of the IBM Cloud stack (IAM, Cloud Foundry, ALB, Ingress, Cerberus, etc.)
  • Knowledge of COS and ICD database services (e.g. Postgres, etcd, RabbitMQ, Redis, Elastic)
  • Knowledge of Networking (HTTP, DataPower, TLS, Akamai, DNS) to troubleshoot network issues
  • Hands-on experience using source control (Git, GitHub) and CI/CD pipelines (Jenkins, Ghenkins, Tekton, etc.)
  • Experience with developing monitoring for production components and instrumenting code for observability using New Relic, LogDNA, Sysdig, Prometheus

About Business Unit
IBM’s Cloud and Cognitive software business is committed to bringing the power of IBM’s Cloud and Watson/AI technologies to life for our clients and ecosystem partners around the world. IBM provides you with the most comprehensive and consistent approach to development, security and operations across hybrid environments—with complete software solutions for business and IT operations, development, data science, security, and management. Our experts and software capabilities help organizations develop applications once and deploy them anywhere, integrate security across the breadth of their IT estate, and automate operations with management visibility. With IBM, you also have access to new skills and methods, governance and management approaches, and a deep ecosystem of industry experts and partners.

Your Life @ IBM
What matters to you when you’re looking for your next career challenge?

Maybe you want to get involved in work that really changes the world? What about somewhere with incredible and diverse career and development opportunities – where you can truly discover your passion? Are you looking for a culture of openness, collaboration and trust – where everyone has a voice? What about all of these? If so, then IBM could be your next career challenge. Join us, not to do something better, but to attempt things you never thought possible.

Impact. Inclusion. Infinite Experiences. Do your best work ever.

About IBM
IBM’s greatest invention is the IBMer. We believe that progress is made through progressive thinking, progressive leadership, progressive policy and progressive action. IBMers believe that the application of intelligence, reason and science can improve business, society and the human condition. Restlessly reinventing since 1911, we are the largest technology and consulting employer in the world, with more than 380,000 IBMers serving clients in 170 countries.

Location Statement
For additional information about location requirements, please discuss with the recruiter following submission of your application.

Being You @ IBM
IBM is committed to creating a diverse environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, pregnancy, disability, age, veteran status, or other characteristics. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status.

Job tags: CD CI CloudFoundry Docker Git Kubernetes Postgres RabbitMQ Redis Reliability engineering
Job region(s): Asia/Pacific