Principal Software Engineer, Distributed Systems


At d-Matrix, we are focused on unleashing the potential of generative AI to power the transformation of technology. We are at the forefront of software and hardware innovation, pushing the boundaries of what is possible. Our culture is one of respect and collaboration. We value humility and believe in direct communication. Our team is inclusive, and our differing perspectives allow for better solutions. We are seeking individuals who are passionate about tackling challenges and driven by execution. Ready to come find your playground? Together, we can help shape the endless possibilities of AI.

Location: Hybrid, working onsite at our Santa Clara, CA headquarters 3-5 days per week.

The Role: Principal Engineer/Technical Director, Distributed Systems

What You Will Do:
The Principal Engineer/Director role involves helping productize the software stack for our AI compute engine. You will lead the development, enhancement, and maintenance of the distributed systems software stack for scale-out of the next-generation AI hardware. This includes working on runtime software, host drivers, inference engines, embedded chip software, and support for large language models across multiple cards and nodes in data centers.

You will build software to enable scale-out, supporting data plane operations like protocol translation and hardware NICs, as well as control and management plane operations such as telemetry, monitoring, micro-services, container orchestration, and datacenter network tooling.

You should have experience building large-scale systems for novel hardware architectures, a strong understanding of scale-out and data communication collectives, and experience across the full stack tool chain for accelerators. Collaboration with hardware and software architecture teams, compiler teams, data science, testing, benchmarking, and simulation teams is essential.

What You Will Bring:
BS in Computer Science, Engineering, Math, Physics, or a related field; MS preferred
Strong grasp of computer architecture, data structures, system software, and ML fundamentals
Leadership experience at manager or senior manager level with AI accelerator software
Proficiency in C/C++ and Python in a Linux environment with standard tools
Experience designing and implementing algorithms in C/C++ and Python
Experience with host bring-up of specialized hardware like NICs, network FPGAs, and smart NICs
Experience with distributed systems software such as message passing and MPI
Experience designing systems for reliability, high availability, fault tolerance, and failover
Experience with cluster orchestration, containers, and Kubernetes
Performance benchmarking and tuning of large-scale distributed systems
Self-motivated team player with ownership and leadership qualities

Desired:
Startup, small team, or incubation experience
Experience scaling large ML systems, especially large language models with parallelism
Work experience at cloud providers or AI compute/sub-system companies

Equal Opportunity Employment Policy:
d-Matrix is proud to be an inclusive, equal opportunity employer. We are committed to fostering a welcoming environment where everyone can do their best work. We hire based on talent, regardless of race, religion, color, age, disability, sex, gender identity, sexual orientation, ancestry, genetic information, marital status, national origin, political affiliation, or veteran status. We value humility, kindness, dedication, and a willingness to learn and embrace challenges.

We do not accept resumes from external agencies. Interested candidates should apply directly through our official channels. Thank you for your understanding.

Location:
Santa Clara, CA, United States
Salary:
$250,000 +
Category:
IT & Technology