Site Reliability Engineer (ML Ops)
remote: Palo Alto, CA
added Tue Aug 15, 2023
Apply to Chai Research Corp.

Who we are looking for:

We need a focused engineer with 3+ years of experience overseeing and maintaining services to be responsible for our cluster of 500+ GPUs, ensuring they are healthy and running as expected. You will be working alongside equally talented and driven teammates implementing cutting-edge AI inference engines. We need someone who is reliable and has high standards.

Here's why we might not be the right fit for you:

  • We work hard and have a high-velocity environment with lots of growth opportunities.
  • We value exceptional performance and continuous improvement. We believe that if you aren't constantly learning, you aren't growing.
  • You will be responsible and accountable for making high-impact decisions that determine Chai's future.

Here are the top 2 reasons why you should join us:

  • Exponential growth. 1 million MAU. Be the 14th hire and get us to 100 million MAU.
  • Craftsmanship. Build something beautiful.

Requirements:

  • Bachelor's degree from a leading academic institution
  • 3+ years of experience in designing, analyzing, and troubleshooting large-scale distributed ML systems
  • Deep knowledge of GCP and Kubernetes
  • Knowledge of KServe is strongly preferred

Here is our tech stack:

  • Front end: Python, Flutter, Dart
  • Back end: Python, GCP, Redis, Kubernetes

Process:

Exceptionally fast: application to offer within 7 days.

  1. Apply here
  2. First round video interview, system design interview, then onsite
  3. Reference checks, negotiation, and offer

At Chai Research, we believe that in 2 years' time, 50% of people will have an AI best friend. We have developed ChaiGPT, the world's most entertaining language model, with over 4 billion completions served.