This company is looking for an engineer to push the limits of model inference performance at scale. You will work at the intersection of research and production, taking cutting-edge models and turning them into fast, reliable, and cost-efficient systems that serve real users. The role suits engineers who enjoy deep technical work, including profiling systems down to the kernel and GPU level.

Key Responsibilities

Performance Tuning: Optimize inference latency, throughput, and cost for large-scale ML models in production.
Deep Profiling: Profile and identify bottlenecks in GPU/CPU pipelines (memory, kernels, batching, IO).
Advanced Techniques: Implement optimizations such as reduced-precision and quantized inference (FP16, INT8, FP8), KV-cache reuse, and speculative decoding.
System Building: Build and maintain inference-serving systems using tools such as Triton Inference Server or custom runtimes.
Hardware Benchmarking: Benchmark performance across different hardware (NVIDIA vs. AMD) and cloud setups.

Requirements

Core Experience: Strong experience in ML inference optimization or high-performance ML systems.
Internals Knowledge: Solid understanding of deep learning internals (attention mechanisms, memory layout, compute graphs).
Tech Stack: Hands-on experience with PyTorch and familiarity with GPU tuning (CUDA, ROCm, Triton).
Scale: Experience scaling inference for real production traffic, not only research benchmarks.

Nice to Have

Experience with LLM or long-context model inference.
Knowledge of frameworks like TensorRT, ONNX Runtime, or vLLM.
Background in distributed systems or low-latency services.

Benefits

Equity: Meaningful equity at Series A stage.
Impact: Direct impact on unit economics: reducing compute costs is a major lever for the company.
Remote: Work from anywhere in the world.

Machine Learning Engineer — Inference Optimization

