Principal Software Engineer, Distributed Systems
At d-Matrix, we are focused on unleashing the potential of generative AI to power the transformation of technology. We are at the forefront of software and hardware innovation, pushing the boundaries of what is possible. Our culture is one of respect and collaboration.
We value humility and believe in direct communication. Our team is inclusive, and our differing perspectives allow for better solutions. We are seeking individuals who are passionate about tackling challenges and driven by execution. Ready to come find your playground? Together, we can help shape the endless possibilities of AI.
Location:
Hybrid, working onsite at our Santa Clara, CA headquarters 3-5 days per week.
The Role: Principal Engineer/Technical Director, Distributed Systems
What You Will Do:
The Principal Engineer/Director role involves helping productize the software stack for our AI compute engine. You will lead the development, enhancement, and maintenance of the distributed systems software stack for scale-out of the next-generation AI hardware. This includes working on runtime software, host drivers, inference engines, embedded chip software, and support for large language models across multiple cards and nodes in data centers. You will build software to enable scale-out, supporting data plane operations like protocol translation and hardware NICs, as well as control and management plane operations such as telemetry, monitoring, micro-services, container orchestration, and datacenter network tooling.
You should have experience building large-scale systems for novel hardware architectures, a strong understanding of scale-out and data communication collectives, and experience across the full stack tool chain for accelerators. Collaboration with hardware and software architecture teams, compiler teams, data science, testing, benchmarking, and simulation teams is essential.
What You Will Bring:
BS in Computer Science, Engineering, Math, Physics, or related field; MS preferred
Strong grasp of computer architecture, data structures, system software, and ML fundamentals
Leadership experience at manager or senior manager level with AI accelerator software
Proficiency in C/C++ and Python in a Linux environment with standard tools
Experience designing and implementing algorithms in C/C++ and Python
Experience with host bring-up of specialized hardware like NICs, network FPGAs, smart NICs
Experience with distributed systems software such as message passing, MPI
Experience designing systems for reliability, high availability, fault tolerance, failover
Experience with cluster orchestration, containers, Kubernetes
Performance benchmarking and tuning of large-scale distributed systems
Self-motivated team player with ownership and leadership qualities
Desired:
Startup, small team, or incubation experience
Experience scaling large ML systems, especially large language models with parallelism
Work experience at cloud providers or AI compute/sub-system companies
Equal Opportunity Employment Policy:
d-Matrix is proud to be an inclusive, equal opportunity employer. We are committed to fostering a welcoming environment where everyone can do their best work. We hire based on talent, regardless of race, religion, color, age, disability, sex, gender identity, sexual orientation, ancestry, genetic information, marital status, national origin, political affiliation, or veteran status. We value humility, kindness, dedication, and a willingness to learn and embrace challenges.
We do not accept resumes from external agencies. Interested candidates should apply directly through our official channels. Thank you for your understanding.
- Location: Santa Clara, CA, United States
- Salary: $250,000+
- Category: IT & Technology