@ Catchmaker

Site Reliability Engineer (ML Ops)

remote: Palo Alto, CA

added Tue Aug 15, 2023

Apply to Chai Research Corp.

Who we are looking for:

We need a focused engineer with 3+ years of experience overseeing and maintaining services to be responsible for our cluster of 500+ GPUs, ensuring they are healthy and running as expected. You will be working alongside equally talented and driven teammates implementing cutting-edge AI inference engines. We need someone who is reliable and has high standards.

Here's why we might not be the right fit for you:

We work hard in a high-velocity environment with lots of growth opportunities.
We value exceptional performance and continuous improvement. We believe that if you aren't constantly learning, you aren't growing.
You will be responsible and accountable for making high-impact decisions that determine Chai's future.

Here are the top 2 reasons why you should join us:

Exponential growth. 1 million MAU. Be the 14th hire and get us to 100 million MAU.
Craftsmanship. Build something beautiful.

Requirements:

Bachelor's degree from a leading academic institution
3+ years of experience in designing, analyzing, and troubleshooting large-scale distributed ML systems
Deep knowledge of GCP and Kubernetes
Knowledge of KServe is strongly preferred

Here is our tech stack:

Front end: Python, Flutter, Dart
Back end: Python, GCP, Redis, Kubernetes

Process:

Exceptionally fast: application to offer within 7 days.

Apply here
First-round video interview, system design interview, then onsite
Reference checks, negotiation, and offer

Chai Research Corp.

At Chai Research, we believe that in 2 years' time, 50% of people will have an AI best friend. We have developed ChaiGPT, the world's most entertaining language model, with over 4 billion completions served.

chai-research.com