About the role:
We are looking for a talented Engineer with a focus on High-Performance Computing (HPC) who will work with a growing multidisciplinary team of research scientists and machine learning engineers to improve and scale the efficiency of our computing capacity. Stability AI operates a very large HPC cluster for training foundational AI models across several modalities. Operating, automating, monitoring, and troubleshooting the cluster is strategically important to the long-term success of the business. This HPC Engineer role is critically important to our company, and the ideal candidate will have a passion for making incremental, measurable improvements and for solving unique problems that have yet to be solved in our industry.
Responsibilities:
- Maintain HPC Cluster Operations: Ensure the smooth operation of HPC clusters, including routine maintenance, software updates, and hardware optimizations.
- Monitor and Recover Dead Nodes: Continuously monitor cluster nodes, identify dead nodes, and implement recovery procedures to minimize downtime.
- Documentation: Maintain detailed documentation of dead node incidents, their root causes, and resolutions for future reference and improvement.
- Shared Volumes Management: Monitor the health and usage of shared volumes, and collaborate with users to enforce cleanup procedures.
- POSIX Permissions Enforcement: Monitor shared storage for violations of POSIX permissions standards and contact users who do not adhere to them, to enhance security.
- HPC Help Center Support: Monitor and respond to user queries and issues submitted to the HPC Help Center, providing timely solutions and assistance.
- Job Launch Support: Assist users in launching jobs efficiently, reducing the need for constant supervision and ensuring optimal job execution.
- Optimizing Low-Priority Jobs: Guide users on maximizing the utilization of low-priority jobs through strategies such as preemption robustness and auto-requeueing.
- S3 Access Permissions: Maintain and troubleshoot S3 access permissions, resolving access issues and ensuring data integrity.
- Interactive Job Monitoring: Monitor all CPU clusters for interactive jobs that users have forgotten to end, and take appropriate action to maintain cluster availability.
- Authentication and Authorization: Develop and maintain processes related to authentication, authorization, and accounting for cluster usage, ensuring secure access management.
- Security Measures: Implement and enhance security protocols for HPC clusters, including tools for rapid access removal in case of security risks.
- Slurm Scheduling Deployment: Convert and deploy Slurm scheduling for various cloud resources, including Kubernetes (K8s), TPUs, and Trainium.
- Slurm Support: Open Slurm support tickets with external Slurm support providers and work them through to resolution to address scheduling and cluster management issues.
- AWS Resource Management: Maintain and manage AWS resources associated with HPC clusters, including login nodes, S3 buckets, FSx volumes, VPCs, subnets, NAT Gateways, S3 VPC Endpoints, and routing tables.
Requirements:
- Bachelor's degree in computer science, information technology, or a related field. Master's degree preferred.
- Proven experience in high-performance computing (HPC) administration and maintenance.
- Proficiency in HPC cluster management tools and technologies, with a strong focus on Slurm scheduling.
- Knowledge of cloud computing platforms, particularly AWS, and experience with managing associated resources.
- Strong scripting and programming skills (e.g., Bash, Python) for automation and system optimization.
- Familiarity with authentication, authorization, and accounting (AAA) processes for cluster usage.
- Understanding of security best practices and the ability to quickly respond to security threats.
- Excellent communication skills to effectively collaborate with users, solve issues, and provide guidance.
- Attention to detail and the ability to document processes and solutions effectively.
Equal Employment Opportunity:
We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.
Stability AI is a community- and mission-driven, open-source artificial intelligence company that cares deeply about real-world implications and applications. Our most significant advances grow from the diversity of working across multiple teams and disciplines. We are unafraid to go against established norms and explore creativity. We are motivated to generate breakthrough ideas and convert them into tangible solutions. Our vibrant communities consist of experts, leaders, and partners across the globe who are developing cutting-edge open AI models for Image, Language, Audio, Video, 3D, and Biology.