Job - Site Reliability Engineer

Site Reliability Engineer

The vacancy has expired

Location
Industry

Retail & E-Commerce

Job Description

My client is India's largest omnichannel platform and a multi-platform tech company with expertise in retail tech and products in AI, ML, big data ops, gaming, crypto, image editing, and the learning space.

Title: Site Reliability Engineer

Roles & Responsibilities:

What will you do?

- Run the production environment by monitoring availability and taking a holistic view of system health.

- Improve reliability, quality, and time-to-market of our suite of software solutions

- Be the first person to report an incident.

- Debug production issues across services and levels of the stack.

- Envision the overall solution for defined functional and non-functional requirements, and define the technologies, patterns, and frameworks to realise it.

- Build automation tools in Python, Java, Go, Ruby, etc.

- Help Platform and Engineering teams gain visibility into our infrastructure.

- Lead design of software components and systems, to ensure availability, scalability, latency, and efficiency of our services.

- Participate actively in detecting, remediating and reporting on Production incidents, ensuring the SLAs are met and driving Problem Management for permanent remediation.

- Participate in on-call rotation to ensure coverage for planned/unplanned events.

- Perform other tasks such as load testing and generating system health reports.

- Periodically check the readiness of all dashboards.

- Engage with other Engineering organizations to implement processes, identify improvements, and drive consistent results.

- Work with your SRE and Engineering counterparts to drive game days, training, and other response-readiness efforts.

- Participate in 24x7 support coverage as needed.

- Troubleshoot and solve complex issues with thorough root cause analysis on customer and SRE production environments.

- Collaborate with Service Engineering organizations to build and automate tooling, implement best practices to observe and manage the services in production, and consistently achieve our market-leading SLA.

- Improve the scalability and reliability of our systems in production.

- Evaluate, design, and implement new system architectures.

Specific Requirements:

- B.E./B.Tech. in Engineering, Computer Science, technical degree, or equivalent work experience

- At least 3 years of experience managing production infrastructure. Experience leading or managing a team is a huge plus.

- Experience with cloud platforms such as AWS and GCP.

- Experience developing and operating large-scale distributed systems with Kubernetes, Docker, and serverless (Lambdas).

- Experience running real-time, low-latency, highly available applications (Kafka, gRPC, RTP).

- Comfortable with Python, Go, or any relevant programming language.

- Experience with monitoring and alerting using technologies such as New Relic / Zabbix / Prometheus / Grafana / CloudWatch / Kafka / PagerDuty etc.

- Experience with one or more orchestration and deployment tools, e.g. CloudFormation / Terraform / Ansible / Packer / Chef.

- Experience with configuration management systems such as Ansible / Chef / Puppet.

- Knowledge of load testing methodologies and tools such as Gatling and Apache JMeter.

- Know your way around the Unix shell.

- Experience running hybrid clouds and on-prem infrastructures on Red Hat Enterprise Linux / CentOS

- A focus on delivering high-quality code through strong testing practices.