Jan 15, 2025 · INFRA · 12 min read

Eliminating Cold Starts with Predictive Preloading

Cold starts are the silent killer of user experience in AI inference. When a model hasn't been used recently, it needs to be loaded from storage into GPU memory — a process that can take anywhere from 5 seconds to several minutes depending on model size.

The Problem

At Axiom, we serve thousands of models across hundreds of customers. Each model has its own traffic pattern — some are hit continuously, others only a few times per day. The traditional approach of keeping all models warm is economically infeasible at this scale.

The cost of keeping a 70B parameter model warm on an A100 is roughly $3/hour. Multiply by thousands of models and the economics break down.
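To make the economics concrete, here is a back-of-envelope calculation. Only the $3/hour A100 figure comes from the post; the fleet size and hours-per-month values are hypothetical placeholders:

```python
# Illustrative cost of keeping an entire fleet warm.
# Only the $3/hour figure is from the post; the rest are assumptions.
HOURLY_COST_PER_MODEL = 3.0   # USD, 70B-parameter model on an A100
MODELS = 2_000                # hypothetical fleet size
HOURS_PER_MONTH = 730

monthly_cost = HOURLY_COST_PER_MODEL * MODELS * HOURS_PER_MONTH
print(f"${monthly_cost:,.0f}/month")  # → $4,380,000/month
```

Even at a modest fleet size, always-warm serving runs into the millions of dollars per month, which is why selective warming is the only viable path.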

Predictive Preloading

Our solution uses temporal pattern analysis. By examining historical request patterns, we can predict with 94% accuracy when a model will be needed next. We start preloading the model 30 seconds before the predicted request time.

This approach reduced cold-start latency by 97% while only increasing warm-model costs by 12%. The key insight: most models have predictable usage patterns, even if they're not continuous.