Production Systems Engineer, Fleet AI Systems
Meta is seeking an experienced Production Systems Engineer to join our Release to Production (RTP) team. Our servers and data centers are the foundation upon which our rapidly scaling infrastructure operates efficiently to deliver our cutting-edge services. The RTP team is responsible for the hardware lifecycle of all Meta servers, including pre-production hands-on system and hardware debugging and stress testing, enabling production-ready system monitoring, automated provisioning, and automated remediation of issues. RTP Engineers work closely with hardware designers, system manufacturers, component vendors, capacity engineering, production engineering, Meta services, and data center operations teams to test systems before release to our production data centers, and to track the health and lifecycle of servers in production.
Responsibilities include:
Work with external vendors and internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand system architecture, then develop and execute test suites for various architectures
Contribute as a leading member of the team, owning and proactively creating experiments and tooling to detect and diagnose hardware/firmware/software health issues, in organized and collaborative efforts
Develop a test framework for large-scale test automation across the fleet, during product development and after mass production
Implement remediations across software and hardware stack according to plan, while keeping a thorough procedure record and data log
Develop and publish updates on resolutions and communicate findings internally
Troubleshoot, diagnose, and root-cause system failures, isolating the failing components and failure scenarios while working with internal and external stakeholders
Develop visibility through data visualization and implement systematic solutions to hardware health issues
Drive necessary discussion with external and internal teams on test specification and methodologies to improve test quality continuously
Develop robust, industry leading practices for supporting hardware infrastructure at scale
Minimum qualifications include:
Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
6+ years of experience in hardware server system support, troubleshooting server architecture and components, analyzing, triaging, and solving systems level issues
Expertise with Linux and scripting (Python or similar)
2+ years of experience in changing system configurations and measuring change impact, working through full lifecycle progressions of computer systems products
2+ years of experience engineering innovations in support of different server system/data center products
Preferred qualifications include:
2+ years of experience in post-production, hyperscale environments, delivering solutions to complex systems issues
Master's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
2+ years of experience working in a matrix organization, owning or driving initiatives as a leading contributor
3+ years of experience supporting AI or HPC systems and/or related systems, at scale
5+ years of experience in production support at scale (e.g., 10K storage servers and over 100K HDDs), working through full system technologies
- Location: Menlo Park, CA, United States
- Category: Arts, Design, Entertainment, Sports, and Media Occupations