ML Performance Engineer - Deep Learning
Stability AI Japan
remote: London & Japan
added Tue Aug 08, 2023
Apply to Stability AI

About the role:

We are looking for a talented MLOps Engineer with a focus on Deep Learning and High-Performance Computing who will work with a growing multidisciplinary team of research scientists and machine learning engineers to improve and scale the efficient use of our computing capacity.

Responsibilities:

Optimizing Deep Learning Workflows:

  • Monitor reports and dashboards to detect low-utilization jobs, projects, and users
  • Partner with researchers to review their workflows when their jobs underperform
  • Identify bottlenecks and suggest scripting optimizations (see the profiling sketch after this list)
  • For high-scale jobs, introduce AWS's proprietary profiler and libraries to boost performance
  • Own the scale-up gating process: check script performance and vet requests to scale up
  • Build a knowledge base / best-practices documentation for all researchers
  • Implement monitoring of CPU usage levels for our CPU clusters; identify users who need assistance writing code that fully utilizes the CPUs
  • Train researchers on best practices for implementing automation strategies that minimize human oversight of jobs.
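
A minimal sketch of the kind of bottleneck analysis this involves, using PyTorch's built-in profiler. It assumes a CUDA device is available; the model, data, and profiling schedule below are placeholders standing in for a researcher's real training script.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# Placeholder model and data; a real workload would come from the researcher's script.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
data = [torch.randn(64, 1024, device="cuda") for _ in range(10)]

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
) as prof:
    for batch in data:
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # advance the profiling schedule each iteration

# Summarize the most expensive ops; long idle gaps between kernels usually point to input or launch bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
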

Develop and Test Strategies for Future Workloads:

  • Benchmark the capabilities of new systems (H100, TRN2, TPUv5, Intel Gaudi) and identify strategies to utilize them properly
  • Define minimum storage-speed requirements and find better data-loading strategies that support the high processing demands of the new accelerators (see the sketch below)
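
A minimal sketch of measuring delivered data-loading throughput against accelerator demand, assuming PyTorch. The synthetic dataset, batch size, and worker counts are placeholders; a real benchmark would read from the actual storage backend (e.g. FSx or S3).

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset; in practice this would stream from shared storage.
dataset = TensorDataset(torch.randn(10_000, 3, 32, 32))

def measure_throughput(num_workers: int, batch_size: int = 256) -> float:
    """Return samples/second delivered by the DataLoader for a given worker count."""
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        pin_memory=True,  # helps host-to-device transfer when a GPU is present
    )
    start = time.perf_counter()
    samples = 0
    for (batch,) in loader:
        samples += batch.shape[0]
    return samples / (time.perf_counter() - start)

if __name__ == "__main__":
    # Compare configurations; the accelerator's required samples/sec sets the minimum storage speed.
    for workers in (0, 4, 8):
        print(f"{workers} workers: {measure_throughput(workers):,.0f} samples/s")
```
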

High-Performance Computing:

  • Maintain HPC cluster operations
  • Monitor dead nodes and recover them; document dead nodes and their fixes (see the monitoring sketch after this list)
  • Monitor shared volumes' health, usage, and clean-up needs; follow up with users to clean up
  • Partner with users who do not adequately use POSIX permissions on shared storage
  • Monitor the HPC Help Center and solve user problems
  • Assist users in properly launching their jobs
  • Maintain S3 access permissions going forward, debug problems, etc.
  • Monitor all CPU clusters for users
  • Create and maintain processes around authentication, authorization, and accounting for cluster usage
  • Develop processes around the security aspects of the HPC clusters, including tools to respond when security risks are identified (globally, by user, by team, by location, etc.)
  • Convert and deploy SLURM scheduling for all clouds and all resource types; integrate TPUs into our larger enterprise approach when SLURM support becomes available.
  • If k8s infrastructure is needed for research, maintain SLURM on top of Kubernetes
  • Solve SLURM support tickets with SchedMD's bug-management tools
  • Maintain AWS resources associated with the HPC clusters (login nodes, S3 buckets, FSx volumes, VPCs, subnets, NAT Gateways, S3 VPC Endpoints, routing tables)
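
A minimal sketch of how dead or drained nodes might be detected, assuming SLURM's sinfo is on the PATH. The output format string is standard sinfo; the final reporting step is a placeholder for whatever dashboard or ticketing integration is actually used.

```python
import subprocess

def find_unhealthy_nodes() -> list[tuple[str, str, str]]:
    """Return (node, state, reason) for nodes SLURM reports as down, drained, or failing."""
    # -N lists one line per node; %N=node name, %T=extended state, %E=unavailability reason.
    out = subprocess.run(
        ["sinfo", "-N", "--noheader", "--format=%N|%T|%E"],
        capture_output=True, text=True, check=True,
    ).stdout
    unhealthy = []
    for line in out.splitlines():
        node, state, reason = (field.strip() for field in line.split("|", 2))
        if any(bad in state.lower() for bad in ("down", "drain", "fail")):
            unhealthy.append((node, state, reason))
    return unhealthy

if __name__ == "__main__":
    for node, state, reason in find_unhealthy_nodes():
        # In practice this would feed a dashboard or ticket queue; here we just print.
        print(f"{node}: {state} ({reason})")
```
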

Qualifications:

  • 8+ years of relevant experience
  • Applied programming experience in Python, C, and/or C++
  • Experience with libraries and tools like PyTorch and CUDA
  • Experience building, productizing, and monitoring orchestration pipelines for AI and machine learning workloads
  • Experience with training frameworks like NVIDIA Megatron or similar
  • Experience leading more junior engineers
  • Experience with AWS and/or GCP
  • Experience/exposure to CI and infrastructure tools (e.g., Kubernetes) is a nice-to-have
  • Experience with Linux-based environments and scripting (shell scripting, Python, PowerShell)
  • Ability to work well as an individual contributor as well as within a multidisciplinary team environment
  • Strong communicator with excellent interpersonal skills and a can-do attitude; able to thrive in a fast-paced team environment

Equal Employment Opportunity:

We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.

Stability AI is a community- and mission-driven, open-source artificial intelligence company that cares deeply about real-world implications and applications. Our most considerable advances grow from our diversity in working across multiple teams and disciplines. We are unafraid to go against established norms and explore creativity. We are motivated to generate breakthrough ideas and convert them into tangible solutions. Our vibrant communities consist of experts, leaders and partners across the globe who are developing cutting-edge open AI models for Image, Language, Audio, Video, 3D and Biology.