Senior Site Reliability Engineer II-UK
LogRhythm, a Thoma Bravo company, empowers more than 4,000 customers across the globe to measurably mature their security operations program. LogRhythm’s award-winning NextGen SIEM Platform delivers comprehensive security analytics; user and entity behavior analytics (UEBA); network detection and response (NDR); and security orchestration, automation, and response (SOAR) within a single, integrated platform for rapid detection, response, and neutralization of threats. Built by security professionals for security professionals, LogRhythm enables security professionals at leading organizations like NASA, Xcel Energy, and Temple University to promote visibility for their cybersecurity program and reduce risk to their organization each and every day. LogRhythm is the only provider to earn the Gartner Peer Insights Customers’ Choice for SIEM designation three years in a row.
Who we are looking for:
We are seeking an enthusiastic Site Reliability Engineer II to join our team!
Life is great at LogRhythm and we are growing our team! This is a challenging and dynamic opportunity, where you can use your creative problem solving, resourcefulness, and developer/operations experience to help us maintain and enhance a robust platform environment for our customers.
We are developing the Site Reliability Engineering discipline within our Engineering organization, and we need your help building the team.
You're someone who enjoys being directly accountable for the reliability of business-critical, large-scale enterprise systems. You're comfortable guiding and making decisions with limited information and are capable of operating within the trade-offs present when solving for immediate needs versus solving with larger-scale solutions. You might be considered a subject matter expert in systems reliability, and you feel rewarded by working to develop an operability culture in a quickly growing and changing environment. You're comfortable owning a wide and diverse set of problem areas and are willing to go out of your lane to effect change. You have developed one or more metrics, log aggregation, or performance analysis systems in your career.
This is a fantastic opportunity to work and collaborate closely with our software engineering, architecture, and operations teams at LogRhythm. Our Site Reliability Engineering team is responsible for ensuring LogRhythm products and services are highly available, reliable, secure, and scalable. The ideal candidates are fluent in systems programming and/or automation and can leverage their experience to solve complex problems associated with running production environments at massive scale in multi-tenant environments. We’re creating cool, disruptive products. Come join us!
Here’s an overview of the responsibilities & challenges ahead:
- Maintain a 24x7 production environment with a high level of service availability. Perform quality reviews and manage operational issues
- Create and monitor dashboards and alerts for key infrastructure metrics and business KPIs that relate to site reliability. Configure monitoring to alert on symptoms rather than on outages.
- Ensure services are designed with 24/7 availability and operational readiness and rigor
- Develop processes, tools, automation, and software changes to address operational issues
- Automate infrastructure management and maintenance with the aim of empowering the team and ensuring site reliability
- Implement automation and orchestration for manual processes required to operate and deploy cloud services. Be at the heart of developing new ideas into internal Ops/SRE tools by working closely with advanced technology
- Document every action so your findings turn into repeatable actions, and then into automation.
- Define non-functional requirements as part of the product lifecycle to influence the new designs, standards, and methods for scalable, highly available distributed systems
- Resolve issues stemming from product/service defects, design changes, infrastructure changes, or operational changes
- Identify, evaluate, and execute preventive measures to minimize or avoid impact to the customer experience, acting proactively rather than waiting for customer escalations
Qualifications:
- A self-starter who's comfortable working independently without a ton of supervision
- A software engineer with a curiosity for operations, or an operations engineer who wants to work closely with software engineers to help improve response times, scalability, and availability.
- You're obsessive compulsive, in a good way. Your systems and scripts are clean, well-documented and comprehensible.
- You hate doing the same thing twice; you'd rather take the time to automate a problem away than spend time on it again.
- You are collaborative and are excited to empower the engineering team to work better and faster
- Fluency with at least one current-generation scripting language used by DevOps professionals (Python, Perl, PHP, Ruby), plus Java development experience
- You have a passion for learning when it comes to working with new technologies or languages
- You live and breathe scalable web architectures.
- You're cool in a crisis and can align with others to ensure complex problems meet a timely and effective resolution.
- You've worked with Linux, containers/namespaces, and system automation tools for Unix and/or cloud platforms.
- You have 5+ years of relevant technical experience
- BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
BONUS POINTS FOR:
- Professional experience leveraging public cloud solutions, with an emphasis on AWS and GCP
- Professional experience in systems and operations
- Experience with Kubernetes
- You have experience with containerization and orchestration
- You have strong security and networking skills
- You have experience with infrastructure-as-code tooling and approaches
- Advanced knowledge of Unix/Linux systems, including comfort working at the command line
- Familiarity with configuration management and remote execution tools
- Understanding of Docker and automated deployment via pipeline
- In-depth understanding of web operations best practices
- Familiarity with infrastructure as code and the AWS cloud platform
- Experience with DevOps methodologies is a plus
Workplace equality & inclusion are not just words or topics for LogRhythm; they are part of our core values and beliefs and integral to our company culture. We hire the best of the best and do not discriminate based on race, gender, age, religion, sexual orientation, identity, or other personal factors. LogRhythm was built on the principles of innovation, dedication, creativity, and commitment. It is through these key areas that we were able to grow as an equal and inclusive workplace, one where our employees feel respected and safe.