Senior Site Reliability Engineer (SRE)
A leading client is looking for a Senior Site Reliability Engineer (SRE) to lead efforts in ensuring the reliability, scalability, and performance of critical production systems. This is a hybrid role based in New York, requiring onsite presence from Day 1.
As a Senior SRE, you’ll partner with architecture, engineering, and security teams to drive operational excellence through observability, automation, and incident response best practices.
Key Responsibilities:
Design and develop enterprise-grade APIs and configuration management solutions
Drive enterprise and application architecture improvements
Lead initiatives around monitoring, alerting, dashboarding, and incident response
Build and maintain observability tools: Grafana, Prometheus, Splunk
Develop and manage detailed runbooks for operational procedures
Define and monitor SLAs, SLOs, and KPIs for mission-critical services
Evaluate new tools and technologies to improve system performance and reliability
Collaborate cross-functionally with development, infrastructure, and security teams
Required Skills & Experience:
Strong background in IT infrastructure, cloud platforms (AWS, Azure, or GCP), and modern SRE practices
Proven experience in building APIs and backend systems
Solid understanding of enterprise/application architecture
Hands-on experience with:
Monitoring & Observability: Grafana, Prometheus, Splunk
ITSM & Operations Tools: ServiceNow, OpsRamp
Incident Tracking: JIRA
Experience in:
Managing large-scale distributed systems
Building alerts, dashboards, and operational runbooks
Excellent leadership, communication, and problem-solving skills
Preferred Qualifications:
Exposure to OpenShift and Azure
Certifications such as SRE Foundation, ITIL, or relevant cloud certifications (AWS, Azure, GCP)
Location: New York