Senior Site Reliability Engineer (SRE)


A leading client is looking for a Senior Site Reliability Engineer (SRE) to lead efforts in ensuring the reliability, scalability, and performance of critical production systems. This is a hybrid role based in New York, requiring onsite presence from Day 1. As a Senior SRE, you’ll partner with architecture, engineering, and security teams to drive operational excellence through observability, automation, and incident response best practices.

Key Responsibilities:
- Design and develop enterprise-grade APIs and configuration management solutions
- Drive enterprise and application architecture improvements
- Lead initiatives around monitoring, alerting, dashboarding, and incident response
- Build and maintain observability tools: Grafana, Prometheus, Splunk
- Develop and manage detailed runbooks for operational procedures
- Define and monitor SLAs, SLOs, and KPIs for mission-critical services
- Evaluate new tools and technologies to improve system performance and reliability
- Collaborate cross-functionally with development, infrastructure, and security teams

Required Skills & Experience:
- Strong background in IT infrastructure, cloud platforms (AWS, Azure, or GCP), and modern SRE practices
- Proven experience in building APIs and backend systems
- Solid understanding of enterprise/application architecture
- Hands-on experience with:
  - Monitoring & Observability: Grafana, Prometheus, Splunk
  - ITSM & Operations Tools: ServiceNow, OpsRamp
  - Incident Tracking: JIRA
- Experience in:
  - Managing large-scale distributed systems
  - Building alerts, dashboards, and operational runbooks
- Excellent leadership, communication, and problem-solving skills

Preferred Qualifications:
- Exposure to OpenShift and Azure
- Certifications such as: SRE Foundation, ITIL, relevant cloud certifications (AWS, Azure, GCP)

Location:
New York
