
Reduce ML inference cost by 20–50% without retraining models

moco automatically converts existing classifiers into routed cascaded systems that avoid unnecessary computation while preserving model behavior.

Stop Running Full Inference on Easy Predictions

Most production ML systems send every input through the full model — even when many predictions are trivial to classify.

moco analyzes model behavior and automatically routes easy predictions through cheaper decision paths while reserving full inference for difficult cases.

Smaller GPU Fleet

Reduce unnecessary inference compute.

Lower Power Usage

Reduce energy consumed per prediction.

Higher Throughput

Process more requests on existing hardware.

Free Compute-Waste Audit

I will analyze historical inference logs, model outputs, or representative datasets to identify unnecessary model computation and estimate potential savings.

The audit evaluates:

  • Potential GPU-hour reduction
  • Inference latency improvements
  • Whether cascaded routing can preserve model accuracy
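The audit's savings estimate can be illustrated with a back-of-envelope calculation. This is a hedged sketch, not moco's actual audit code: it assumes logged top-class confidences are available and treats high-confidence predictions as a rough upper bound on "easy" traffic that a cheaper path could handle. The function name and threshold are illustrative.

```python
# Hypothetical back-of-envelope audit: estimate the share of logged
# predictions that a cheap routing path could plausibly handle.
def estimate_easy_fraction(confidences, threshold=0.95):
    """Fraction of logged predictions whose top-class confidence meets
    `threshold` -- a rough upper bound on routable "easy" traffic."""
    if not confidences:
        return 0.0
    easy = sum(1 for c in confidences if c >= threshold)
    return easy / len(confidences)

# Example: 6 of 8 logged predictions are high-confidence.
logged = [0.99, 0.97, 0.62, 0.99, 0.96, 0.88, 0.98, 0.99]
print(estimate_easy_fraction(logged))  # 0.75
```

The real audit works from inference logs, model outputs, or representative datasets, but the idea is the same: measure how much traffic never needed the full model.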
Book a Free Audit

Products and Services

I offer services and licensed algorithms to optimize ML inference pipelines.

License to Self-Serve API: you pay only based on the value it provides you (50% of first-year cost savings).

Self-Serve API + Support: you pay only based on the value it provides you (50% of first-year cost savings).

Enterprise: I will quantize and prune your models and apply my algorithms to build cascaded systems (50% of first-year cost savings).

The Product API docs are here.

moco in Action


  import moco
  import numpy as np

  # Analyze inputs and the model's predictions to mine routing rules.
  # `dataset` and `predictions` are both np.ndarray values.
  rules = moco.analyze(dataset, predictions)

  # Keep only rules that route predictions for classes 0, 1, and 2.
  # (`class_id` is used here because `class` is a reserved word in Python.)
  selected_rules = [rule for rule in rules if rule.class_id in (0, 1, 2)]

  # Build the routed system: cheap rules first, original model as fallback.
  optimized_model = moco.build_cascaded_system(original_model, selected_rules)

How moco differs from other ML system and model optimization techniques

Quantization / Pruning / Using a lower-fidelity model

moco can be applied in combination with quantized or pruned models. Both quantization and pruning risk accuracy degradation, and accepting lower accuracy to meet cost or energy constraints is a tough place to be.

Knowledge Distillation

Requires re-training, and thus significant development cost.

Cascaded Systems

Cascaded systems are great! Why evaluate a complex ML model if a simple rule will suffice? moco scales this idea: it automatically generates these rules and builds the routed system for you.
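To make the routing idea concrete, here is a minimal hand-rolled cascade. This is a sketch of the general technique, not moco's API: rules are assumed to be simple (condition, label) pairs, and the full model is any fallback callable.

```python
# A minimal rule-gated cascade: cheap rules handle confident easy cases;
# everything else falls through to the full model.
def make_cascade(rules, full_model):
    """rules: list of (condition, label) pairs; full_model: fallback callable."""
    def predict(x):
        for condition, label in rules:
            if condition(x):       # cheap check, e.g. a threshold on one feature
                return label       # short-circuit: skip full inference
        return full_model(x)       # hard case: run the expensive model
    return predict

# Toy example: inputs with a very low first feature are always class 0.
rules = [(lambda x: x[0] < 0.1, 0)]
full_model = lambda x: 1  # stand-in for an expensive classifier
cascade = make_cascade(rules, full_model)
print(cascade([0.05, 0.9]))  # 0  (handled by the rule, no full inference)
print(cascade([0.80, 0.2]))  # 1  (fell through to the full model)
```

Hand-writing such rules does not scale past a few classes, which is the gap moco's automatic rule generation is meant to fill.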

Horizontal Scaling

Adding more CPUs/GPUs to meet compute needs is expensive and disruptive. Delaying that until next quarter is a win.

Vertical Scaling

Migrating to faster CPUs/GPUs to meet compute needs is expensive and disruptive. Delaying that until next quarter is a win.

Benchmarked Results

Task                 | Dataset      | Base Model | Fine-Tuned Model                 | Compute Reduction | Accuracy Change
Image Classification | CIFAR-10     | ResNet-18  | HF model fine-tuned on CIFAR-10  | 34.6%             | -0.3%
Text Classification  | IMDB Reviews | BERT       | Fine-Tuned TinyBERT              | 21.5%             | +0.1%
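A compute-reduction figure follows from simple arithmetic: the fraction of traffic routed to the cheap path, times the per-request cost that path avoids. The numbers below are illustrative only, not measured moco results.

```python
# Illustrative arithmetic: overall compute reduction when a fraction of
# traffic is served by a cheaper routing path instead of the full model.
def compute_reduction(routed_fraction, cheap_cost_ratio):
    """Fractional compute saved when `routed_fraction` of requests run on a
    path costing `cheap_cost_ratio` of full inference (0 = free, 1 = same)."""
    return routed_fraction * (1.0 - cheap_cost_ratio)

# If 40% of requests route to a path costing 5% of full inference:
print(compute_reduction(0.40, 0.05))  # about 0.38, i.e. a ~38% reduction
```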

Why not give it a try?

Evaluation Steps

Book an evaluation