Job - Datacenter Observability & Site Reliability Engineer

The vacancy has expired

Location

Tamil Nadu, India

Industry

Information Technology and Services

Job Description

Location: Open (candidates should be flexible to work in the Korea time zone)

Total Experience: 8+ Years

Notice Period: Immediate to 30 Days Preferred

Our client is looking for a skilled Observability & Site Reliability Engineer to join their team supporting large-scale, enterprise-grade infrastructure. The ideal candidate will have deep experience with observability tools (especially Grafana, Loki, and Mimir) and with Kubernetes metrics and logs, along with a passion for performance, scale, and uptime.

Key Must-Have Skills:

5+ years in Observability Engineering
Expertise in Grafana, Loki, Mimir, Alloy agent
Strong understanding of infrastructure metrics (GPU/CPU/K8s)
Familiarity with scripting (Python, Go, Bash)
Prior exposure to Prometheus, ELK, Docker, Terraform
Flexibility to work with Korean stakeholders and across Korean time zones

Role Highlights:

Design and manage the observability stack across large datacentre infrastructure
Build scalable telemetry systems, dashboards, alerts & reports
Apply SRE practices to ensure system reliability and performance
Troubleshoot issues in real time and support ongoing optimisation

Good to Have:

Prior experience working with Korean stakeholders
Knowledge of cloud platforms such as AWS, GCP, and Azure