ML Performance Engineer - Deep Learning
Stability AI Japan
remote: London & Japan
added Tue Aug 08, 2023
Apply to Stability AI

About the role:

We are looking for a talented MLOps Engineer with a focus on Deep Learning and High-Performance Computing who will work with a growing multidisciplinary team of research scientists and machine learning engineers to improve and scale the efficient use of our computing capacity.

Responsibilities:

Optimizing Deep Learning Workflows:

  • Monitor reports and dashboards to detect low-utilization jobs, projects, and users
  • Partner with researchers to review their workflows when their jobs underperform
  • Identify bottlenecks and suggest scripting optimizations (see the profiling sketch after this list)
  • For high-scale jobs, introduce AWS's proprietary profiler and libraries to boost performance
  • Own the scale-up gating process: check script performance and vet requests to scale up
  • Build a knowledge base / best-practices documentation for all researchers
  • Implement monitoring of CPU usage levels for our CPU clusters; identify users who need assistance writing code that fully utilizes the CPUs
  • Train researchers on best practices for implementing automation strategies that minimize human oversight of jobs.
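
A minimal sketch of the kind of bottleneck analysis this involves, using PyTorch's built-in profiler. It assumes a CUDA device is available; the model, data, and profiling schedule below are placeholders standing in for a researcher's real training script.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# Placeholder model and data; a real workload would come from the researcher's script.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [torch.randn(64, 1024, device="cuda") for _ in range(10)]

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
) as prof:
    for batch in data:
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # advance the profiling schedule each iteration

# Summarize the most expensive ops; long idle gaps between kernels usually point to input or launch bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
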

Develop and Test Strategies for Future Workloads:

  • Benchmark the capabilities of new systems (H100, TRN2, TPUv5, Intel Gaudi) and identify strategies to utilize them properly
  • Define minimum storage-speed requirements and find better data-loading strategies that support the high processing demands of the new accelerators (see the sketch below)
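
A minimal sketch of measuring delivered data-loading throughput against accelerator demand, assuming PyTorch. The synthetic dataset, batch size, and worker counts are placeholders; a real benchmark would read from the actual storage backend (e.g. FSx or S3).

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset; in practice this would stream from shared storage.
dataset = TensorDataset(torch.randn(10_000, 3, 32, 32))

def measure_throughput(num_workers: int, batch_size: int = 256) -> float:
    """Return samples/second delivered by the DataLoader for a given worker count."""
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=True,  # helps host-to-device transfer when a GPU is present
    )
    start = time.perf_counter()
    samples = 0
    for (batch,) in loader:
        samples += batch.shape[0]
    return samples / (time.perf_counter() - start)

if __name__ == "__main__":
    # Compare configurations; the accelerator's required samples/sec sets the minimum storage speed.
    for workers in (0, 4, 8):
        print(f"{workers} workers: {measure_throughput(workers):,.0f} samples/s")
```
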

High-Performance Computing:

  • Maintain HPC cluster operations
  • Monitor dead nodes and recover them; document dead nodes and their fixes (see the monitoring sketch after this list)
  • Monitor shared volumes' health, usage, and clean-up needs; follow up with users to clean up
  • Partner with users who do not adequately use POSIX permissions on shared storage
  • Monitor the HPC Help Center and solve user problems
  • Assist users in properly launching their jobs
  • Maintain S3 access permissions going forward, debug problems, etc.
  • Monitor all CPU clusters for users
  • Create and maintain processes around authentication, authorization, and accounting for cluster usage
  • Develop processes around the security aspects of the HPC clusters, including tools to respond when security risks are identified (globally, by user, by team, by location, etc.)
  • Convert and deploy SLURM scheduling for all clouds and all resource types; integrate TPUs into our larger enterprise approach when SLURM support becomes available.
  • If k8s infrastructure is needed for research, maintain SLURM on top of Kubernetes
  • Solve SLURM support tickets with SchedMD's bug-management tools
  • Maintain AWS resources associated with the HPC clusters (login nodes, S3 buckets, FSx volumes, VPCs, subnets, NAT Gateways, S3 VPC Endpoints, routing tables)
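
A minimal sketch of how dead or drained nodes might be detected, assuming SLURM's sinfo is on the PATH. The output format string is standard sinfo; the final reporting step is a placeholder for whatever dashboard or ticketing integration is actually used.

```python
import subprocess

def find_unhealthy_nodes() -> list[tuple[str, str, str]]:
    """Return (node, state, reason) for nodes SLURM reports as down, drained, or failing."""
    # -N lists one line per node; %N=node name, %T=extended state, %E=unavailability reason.
    out = subprocess.run(
        ["sinfo", "-N", "--noheader", "--format=%N|%T|%E"],
        capture_output=True, text=True, check=True,
    ).stdout
    unhealthy = []
    for line in out.splitlines():
        node, state, reason = (field.strip() for field in line.split("|", 2))
        if any(bad in state.lower() for bad in ("down", "drain", "fail")):
            unhealthy.append((node, state, reason))
    return unhealthy

if __name__ == "__main__":
    for node, state, reason in find_unhealthy_nodes():
        # In practice this would feed a dashboard or ticket queue; here we just print.
        print(f"{node}: {state} ({reason})")
```
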

Qualifications:

  • 8+ years of relevant experience
  • Applied programming experience in Python, C, and/or C++
  • Experience with libraries and tools like PyTorch and CUDA
  • Experience building, productizing, and monitoring orchestration pipelines for AI and machine learning workloads
  • Experience with training frameworks like NVIDIA Megatron or similar
  • Experience leading more junior engineers
  • Experience with AWS and/or GCP
  • Experience/exposure to CI and infrastructure tools (e.g., Kubernetes) is a nice-to-have
  • Experience with Linux-based environments and scripting (shell scripting, Python, PowerShell)
  • Ability to work well as an individual contributor as well as within a multidisciplinary team environment
  • Strong communicator with excellent interpersonal skills and a can-do attitude; able to thrive in a fast-paced team environment

Equal Employment Opportunity:

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.

Stability AI is a community- and mission-driven, open-source artificial intelligence company that cares deeply about real-world implications and applications. Our most considerable advances grow from our diversity in working across multiple teams and disciplines. We are unafraid to go against established norms and explore creativity. We are motivated to generate breakthrough ideas and convert them into tangible solutions. Our vibrant communities consist of experts, leaders and partners across the globe who are developing cutting-edge open AI models for Image, Language, Audio, Video, 3D and Biology.