
Meta
Job title:
AI/HPC Systems Production Engineer
Company:
Meta
Job description
Meta’s AI Training and Inference Infrastructure is growing exponentially to support ever-increasing use cases of AI. We need to build and evolve the network infrastructure that connects myriads of training accelerators, such as GPUs, together. In addition, we need to ensure that the network runs smoothly and meets the stringent performance, availability and reliability requirements of RDMA workloads that expect a lossless fabric interconnect. To improve the performance of these systems, we constantly look for opportunities across the stack: network fabric and host networking, communication libraries and scheduling infrastructure.

AI/HPC Systems Production Engineer Responsibilities
Responsible for the overall reliability of the communication system, including monitoring, troubleshooting and proactive identification of production issues.
Develop, extend and maintain CI/CD and testing pipelines for host components of the training stack infrastructure, e.g. collective communication libraries (NCCL, RCCL) and RDMA host stack dependencies.
Active member of a multi-disciplinary team to develop solutions for large-scale training systems.
Work with performance engineers to ensure safe and robust rollout of new features.

Minimum Qualifications
BS/MS/PhD in relevant fields (EE, CS), with 4+ years of work experience.
Python, C/C++ coding skills.
Knowledge of Linux and foundational networking principles.

Preferred Qualifications
Experience working with up-to-date AI training workload packaging, CI/CD and distribution processes, and containerization principles.
Understanding of RDMA network stack principles and pain points on InfiniBand and RoCE networks.
Experience in development of systems and applications utilizing RDMA technologies.
Experience with using communication libraries, such as MPI and the NVIDIA Collective Communication Library (NCCL).
Experience with GPU accelerator development frameworks, for example CUDA and OpenCL.
Experience in developing and troubleshooting system-level software.

About Meta
Meta builds technologies that help people connect, find communities, and grow businesses. When Facebook launched in 2004, it changed the way people connect. Apps like Messenger, Instagram and WhatsApp further empowered billions around the world. Now, Meta is moving beyond 2D screens toward immersive experiences like augmented and virtual reality to help build the next evolution in social technology. People who choose to build their careers by building with us at Meta help shape a future that will take us beyond what digital connection makes possible today, beyond the constraints of screens, the limits of distance, and even the rules of physics.

Equal Employment Opportunity
Meta is proud to be an Equal Employment Opportunity employer. We do not discriminate based upon race, religion, color, national origin, sex (including pregnancy, childbirth, reproductive health decisions, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, genetic information, political views or activity, or other applicable legally protected characteristics. You may view our Equal Employment Opportunity notice.
Meta is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If you need assistance or an accommodation due to a disability, fill out the .
Meta is committed to providing reasonable support (called accommodations) in our recruiting processes for candidates with disabilities, long term conditions, mental health conditions or sincerely held religious beliefs, or who are neurodivergent or require pregnancy-related support. If you need assistance or an accommodation due to a disability, fill out the
Expected salary
Location
London
Job date
Thu, 20 Mar 2025 00:10:59 GMT
To help us track our recruitment effort, please indicate in your email/cover letter where (vacanciesin.eu) you saw this job posting.