Principal Site Reliability Engineer - Enterprise AI Platform

NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing: an era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Join the team and see how you can make a lasting impact on the world.

NVIDIA is looking to hire a deeply technical and creative Site Reliability Engineer to build, support, and maintain the next generation of AI-powered enterprise products that improve engineering efficiency, strengthen data security, and power our product development. This role offers the opportunity to collaborate with the Cloud and AI/ML workforce in a dynamic, agile working environment.

What You Will Be Doing

- Collaborate on translating business objectives into actionable plans.
- Address operational challenges, automate processes, and iterate for efficiency.
- Tackle systemic reliability issues with multi-functional teams.
- Monitor, optimize, and manage system performance and resources.
- Institute validated practices for reliability, remediation, and troubleshooting.
- Design, deploy, and automate production support, documenting essential knowledge.
- Navigate intricate tasks with a deep understanding of SRE principles.
- Lead cross-organizational projects from inception to completion.
- Mentor and train junior engineers for professional development.
- Serve as a subject matter expert in core team functions.

What We Need To See

- 15+ years of working experience in cloud, platform, or SRE roles.
- A Bachelor's or Master's degree in Engineering, Computer Science, or a related field, or equivalent experience.
- Proficiency in one or more programming languages: Python, Go, Perl, or Ruby.
- Hands-on experience handling and scaling distributed systems in public, private, or hybrid cloud or on-prem environments running 24x7x365.
- A record of delivering software with a full understanding of deploying applications in Kubernetes clusters, including GPU and CPU pod scheduling (with the ability to understand on-prem deployments).
- Experience maintaining and managing microservices relating to AI platforms (Inference, Training, Evaluation, Ingestion).
- Hands-on experience in deploying, supporting, and supervising new and existing services, platforms, and application stacks.
- Experience with CI/CD systems such as Jenkins, GitHub Actions, etc.
- Background with Infrastructure as Code (IaC) methodologies and relevant tools.
- Extensive experience working with MS Windows Server and/or Linux operating systems.
- Solid communication skills, demonstrating the ability to comprehend and articulate technical issues to a non-technical audience.

Ways To Stand Out From The Crowd

- Cloud expertise in Azure and AWS.
- Passion for and experience with AI methodologies.
- Strong background in software design and development.
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and enjoy learning while having fun, then what are you waiting for? Apply today!

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 248,000 USD - 391,000 USD. You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until July 29, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

JR2000391

Seniority level: Director
Employment type: Full-time
Industries: Computer Hardware Manufacturing, Software Development, and Computers and Electronics Manufacturing
Location:
Santa Clara, CA, United States