
Reduce Machine Learning system inference cost by 20–50% without retraining models

Determine which inputs need complex, deep analysis, and reserve compute-time for those inputs.

moco automatically converts existing classifiers into routed cascaded systems that avoid unnecessary computation while preserving model behavior.

Built for production-scale inference systems

Verticals

Content Moderation

Transformer moderation systems processing billions of events per day.

  • Typical models: BERT, RoBERTa, multimodal transformers
  • Typical latency: 5–20ms
  • Typical scale: 100k–750k+ requests/sec
  • Typical inference cost: millions per year
  • 20% lower average FLOPs/inference can save hundreds of thousands to millions per year in GPU infrastructure and compute-time costs.
  • Lower inference load also reduces peak GPU utilization, enabling smaller serving fleets and higher throughput on existing hardware.

Search + Tagging

Large-scale indexing and retrieval pipelines operating on billions of assets.

  • Typical models: BERT, CLIP, embedding models
  • Typical latency: 5–50ms
  • Typical scale: Billions of classifications/day
  • Typical inference cost: millions per year
  • 20% lower average FLOPs/inference can save hundreds of thousands annually in indexing and embedding compute costs.
  • Reduced compute per classification enables faster index refreshes, higher indexing throughput, and more models within the same infrastructure budget.
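
The savings claims in the verticals above come down to simple arithmetic. The sketch below is one rough way to estimate them. The function name, the `compute_bound_fraction` parameter, and the dollar figures are illustrative assumptions, not measured data or part of moco.

```python
def estimated_annual_savings(annual_inference_cost: float,
                             flops_reduction: float,
                             compute_bound_fraction: float = 1.0) -> float:
    """Estimate yearly savings from a lower average FLOPs/inference.

    compute_bound_fraction is a hypothetical knob: the share of the bill
    that actually scales with FLOPs (the rest may be storage, networking,
    or other fixed costs).
    """
    if not 0.0 <= flops_reduction <= 1.0:
        raise ValueError("flops_reduction must be between 0 and 1")
    if not 0.0 <= compute_bound_fraction <= 1.0:
        raise ValueError("compute_bound_fraction must be between 0 and 1")
    return annual_inference_cost * flops_reduction * compute_bound_fraction


# Illustrative numbers only: a 2M/year bill, 20% fewer FLOPs per inference,
# and 80% of the bill assumed to scale with compute.
savings = estimated_annual_savings(2_000_000, 0.20, compute_bound_fraction=0.8)
print(f"Estimated savings: {savings:,.0f} per year")
```

Real savings depend on how closely the serving bill tracks FLOPs, so treat the result as an upper-level estimate to check against actual utilization data.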

Stop Running Full Inference on Easy Predictions

Most production ML systems send every input through the full model — even when many predictions are trivial to classify.

moco analyzes model behavior and automatically routes easy predictions through cheaper decision paths while reserving full inference for difficult cases.
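
To make the routing idea concrete, here is a minimal sketch of a cascade: cheap rules are tried first, and the full model runs only when no rule answers. This is a generic illustration, not moco's implementation. `CascadedClassifier`, `expensive_model`, and `obvious_negative` are hypothetical names and stand-ins.

```python
from typing import Callable, Optional, Sequence

# A rule returns a class label when it can answer, or None to defer.
Rule = Callable[[Sequence[float]], Optional[int]]


class CascadedClassifier:
    """Try cheap rules first; fall back to the full model only when needed."""

    def __init__(self, rules: Sequence[Rule],
                 full_model: Callable[[Sequence[float]], int]) -> None:
        self.rules = list(rules)
        self.full_model = full_model
        self.full_model_calls = 0  # how often the expensive path was taken

    def predict(self, x: Sequence[float]) -> int:
        for rule in self.rules:
            label = rule(x)
            if label is not None:
                return label  # easy case: answered without full inference
        self.full_model_calls += 1
        return self.full_model(x)


# Hypothetical stand-in for an expensive classifier.
def expensive_model(x: Sequence[float]) -> int:
    return int(sum(x) > 1.0)


# Hypothetical cheap rule: all-zero inputs are class 0 under the stand-in model.
def obvious_negative(x: Sequence[float]) -> Optional[int]:
    return 0 if all(v == 0 for v in x) else None


cascade = CascadedClassifier([obvious_negative], expensive_model)
inputs = [[0, 0], [0.9, 0.4], [0, 0], [0.2, 0.1]]
predictions = [cascade.predict(x) for x in inputs]
print(predictions, "full model calls:", cascade.full_model_calls)
```

In this toy run the rule answers two of the four inputs, so the expensive path runs half as often while the predictions match what the stand-in model would have produced.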

Smaller GPU Fleet for Real-Time Applications

Cutting unnecessary inference compute lowers peak load, which directly reduces the number of GPUs needed at peak.

Smaller GPU Fleet for Batched Applications

Cutting unnecessary inference compute shrinks the batch that needs full inference, which directly reduces the compute time needed.
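
A back-of-the-envelope fleet-sizing calculation shows how a lower average compute per request translates into fewer GPUs at peak. The sketch assumes per-GPU throughput scales inversely with average compute per request, which is a simplification. The function name and traffic numbers are hypothetical.

```python
import math


def gpus_needed(peak_requests_per_sec: float,
                requests_per_sec_per_gpu: float,
                compute_reduction: float = 0.0) -> int:
    """GPUs required at peak load.

    Assumes each GPU serves 1 / (1 - compute_reduction) times more requests
    once the average compute per request drops by compute_reduction.
    """
    if not 0.0 <= compute_reduction < 1.0:
        raise ValueError("compute_reduction must be in [0, 1)")
    effective_throughput = requests_per_sec_per_gpu / (1.0 - compute_reduction)
    # Round before ceil so floating-point noise cannot add a spurious GPU.
    return math.ceil(round(peak_requests_per_sec / effective_throughput, 9))


# Illustrative numbers only.
baseline = gpus_needed(100_000, 500)
routed = gpus_needed(100_000, 500, compute_reduction=0.20)
print(f"baseline: {baseline} GPUs, with routing: {routed} GPUs")
```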

Lower Power Usage/Thermal Throttling for Data Centers and Edge Applications

Reduce the energy consumed per prediction, lowering energy costs and limiting hardware wear and thermal slowdowns.

Higher Throughput

Process more events/transactions/frames per second on existing hardware to keep up with a dynamic environment.

Shift Inference to the Edge

Avoid high-latency network calls and infrastructure costs for obviously safe transactions.

Make your compute-time go further -- fit more models in the same budget

Enable smarter systems that classify more objects/issues

Free Compute-Waste Audit

I will analyze historical inference logs, model outputs, or representative datasets to identify unnecessary model computation and estimate potential savings.

The audit evaluates:

  • Potential GPU-hour reduction
  • Inference latency improvements
  • Whether cascaded routing can preserve model accuracy
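
As a rough illustration of the last check, the sketch below scores a candidate rule against logged model outputs: coverage is the share of traffic the rule would answer, and agreement is how often it matches the full model on that traffic. The function name and log format are assumptions for illustration, not the audit's actual method.

```python
from typing import Dict, Optional, Sequence


def audit_rule(rule_labels: Sequence[Optional[int]],
               model_labels: Sequence[int]) -> Dict[str, Optional[float]]:
    """Summarize how a candidate rule behaves on logged traffic.

    rule_labels: the rule's answer per logged input, None meaning "defer".
    model_labels: the label the full model produced for the same input.
    """
    if len(rule_labels) != len(model_labels):
        raise ValueError("rule_labels and model_labels must have equal length")
    answered = [(r, m) for r, m in zip(rule_labels, model_labels) if r is not None]
    coverage = len(answered) / len(model_labels) if model_labels else 0.0
    # Agreement is only defined on inputs the rule actually answers.
    agreement = (sum(r == m for r, m in answered) / len(answered)
                 if answered else None)
    return {"coverage": coverage, "agreement": agreement}


# Hypothetical log of eight requests.
model_labels = [0, 0, 1, 2, 0, 1, 0, 0]
rule_labels = [0, 0, None, None, 0, None, 0, 1]
report = audit_rule(rule_labels, model_labels)
print(report)
```

High coverage with near-perfect agreement suggests the rule can skip full inference safely. Low agreement means routing through it would change model behavior.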

Book a Free Audit

Products and Services

I offer my services + algorithms to optimize ML inference pipelines.

License to Self-Serve API

Self-Serve API + Support

Enterprise: I will quantize and prune your models, then apply my algorithms to build cascaded systems.

The Product API docs are here.

Pricing depends on the value gained (cost reduction or added throughput).

moco in Action


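  import moco

  # dataset: np.ndarray of model inputs; predictions: np.ndarray of the model's outputs
  rules = moco.analyze(dataset, predictions)

  # `class` is a reserved word in Python, so the attribute is read with getattr
  selected_rules = [rule for rule in rules if getattr(rule, "class") in (0, 1, 2)]

  # original_model is the existing classifier being wrapped
  optimized_model = moco.build_cascaded_system(original_model, selected_rules)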

How the algorithms differ from other ML system and model optimization techniques

Quantization/Pruning/Just using a lower-fidelity model

moco can be applied on top of quantized or pruned models. Both quantization and pruning risk degrading accuracy, and accepting lower accuracy to meet cost or energy constraints is a tough place to be in.

Knowledge Distillation

Requires re-training, and thus significant development cost.

Cascaded Systems

Cascaded systems are great! Why evaluate a complex ML model if a simple rule will suffice? moco scales this idea: it generates the rules and builds the routed system automatically.
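
To give a feel for what generating a rule from model behavior can look like, here is a deliberately simple sketch: it scans one feature for the largest threshold below which the model always returned the same label. This is a toy technique chosen for illustration, not moco's algorithm. `find_threshold_rule` and the sample data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def find_threshold_rule(dataset: Sequence[Sequence[float]],
                        predictions: Sequence[int],
                        feature: int) -> Optional[Tuple[float, int]]:
    """Find the largest threshold t such that every logged input with
    dataset[i][feature] <= t received the same label from the model.

    Returns (t, label), or None when even the smallest feature value
    maps to more than one label.
    """
    rows = sorted((row[feature], label) for row, label in zip(dataset, predictions))
    if not rows:
        return None
    label = rows[0][1]
    threshold = None
    i = 0
    while i < len(rows):
        value = rows[i][0]
        j = i
        while j < len(rows) and rows[j][0] == value:
            j += 1  # group rows that share the same feature value
        if any(lab != label for _, lab in rows[i:j]):
            break  # the model's labels stop being uniform here
        threshold = value
        i = j
    return None if threshold is None else (threshold, label)


# Hypothetical logged inputs (two features each) and model predictions.
dataset = [[0.1, 5.0], [0.2, 3.0], [0.4, 1.0], [0.7, 2.0], [0.9, 4.0]]
predictions = [0, 0, 0, 1, 0]
rule = find_threshold_rule(dataset, predictions, feature=0)
print(rule)
```

A rule found this way only reflects the logged data, so it would still need validation on held-out traffic before being trusted to skip full inference.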

Horizontal Scaling

Adding more CPUs/GPUs to meet compute needs is expensive and disruptive. Delaying that until next quarter is a win.

Vertical Scaling

Migrating to faster CPUs/GPUs to meet compute needs is expensive and disruptive. Delaying that until next quarter is a win.

Benchmarked Results

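Task                 | Dataset      | Base Model | Fine-Tuned Model                | Compute Reduction | Accuracy Change
---------------------|--------------|------------|---------------------------------|-------------------|----------------
Image Classification | CIFAR-10     | ResNet-18  | HF model fine-tuned on CIFAR-10 | 34.6%             | -0.3%
Text Classification  | IMDB Reviews | BERT       | Fine-Tuned TinyBERT             | 21.5%             | +0.1%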

Why not give it a try?

Evaluation Steps

Book an evaluation