- Location
-
IndustryRetail
My client is India's largest omnichannel platform and multi-platform tech company with expertise in retail tech and products in AI, ML, big data ops, gaming crypto, image editing and learning space.
Title : Site Reliability Engineer
Roles & Responsibility :
What will you do?
- Run the production environment by monitoring availability and taking a holistic view of system health.
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Be the 1st person to report the incident.
- Debug production issues across services and levels of the stack.
- Envisioning the overall solution for defined functional and non-functional requirements, and being able to define technologies, patterns and frameworks to realise it.
- Building automated tools in Python / Java / GoLang / Ruby etc.
- Help Platform and Engineering teams gain visibility into our infrastructure.
- Lead design of software components and systems, to ensure availability, scalability, latency, and efficiency of our services.
- Participate actively in detecting, remediating and reporting on Production incidents, ensuring the SLAs are met and driving Problem Management for permanent remediation.
- Participate in on-call rotation to ensure coverage for planned/unplanned events.
- Perform other task like load-test & generating system health reports.
- Periodically check for all dashboards readiness.
- Engage with other Engineering organizations to implement processes, identify improvements, and drive consistent results.
- Working with your SRE and Engineering counterparts for driving Game days, training and other response readiness efforts.
- Participate in the 24x7 support coverage as needed Troubleshooting and problem-solving complex issues with thorough root cause analysis on customer and SRE production environments
- Collaborate with Service Engineering organizations to build and automate tooling, implement best practices to observe and manage the services in production and consistently achieve our market leading SLA.
- Improving the scalability and reliability of our systems in production.
- Evaluating, designing and implementing new system architectures.
Some specific Requirements :
- B.E./B.Tech. in Engineering, Computer Science, technical degree, or equivalent work experience
- At least 3 years of managing production infrastructure. Leading / managing a team is a huge plus.
- Experience with cloud platforms like - AWS, GCP.
- Experience developing and operating large scale distributed systems with Kubernetes, Docker and and Serverless (Lambdas)
- Experience in running real-time and low latency high available applications (Kafka, gRPC, RTP)
- Comfortable with Python, Go, or any relevant programming language.
- Experience with monitoring alerting using technologies like Newrelic / zybix /Prometheus / Garafana / cloudwatch / Kafka / PagerDuty etc.
- Experience with one or more orchestration, deployment tools, e.g. CloudFormation / Terraform / Ansible / Packer / Chef.
- Experience with configuration management systems such as Ansible / Chef / Puppet.
- Knowledge of load testing methodologies, tools like Gating, Apache Jmeter.
- Work your way around Unix shell.
- Experience running hybrid clouds and on-prem infrastructures on Red Hat Enterprise Linux / CentOS
- A focus on delivering high-quality code through strong testing practices.
