
Reduce ML inference cost by 20–50% without retraining models

moco automatically converts existing classifiers into routed cascaded systems that avoid unnecessary computation while preserving model behavior.

Stop Running Full Inference on Easy Predictions

Most production ML systems send every input through the full model — even when many predictions are trivial to classify.

moco analyzes model behavior and automatically routes easy predictions through cheaper decision paths while reserving full inference for difficult cases.

Smaller GPU Fleet

Reduce unnecessary inference compute.

Lower Power Usage

Reduce energy consumed per prediction.

Higher Throughput

Process more requests on existing hardware.

Free Compute-Waste Audit

Not sure whether your ML system is a fit?

I will analyze historical inference logs, model outputs, or representative datasets to identify unnecessary model computation and estimate potential savings.

The audit evaluates:

  • Potential GPU-hour reduction
  • Inference latency improvements
  • Inputs that may not require full model execution
  • Whether cascaded routing can preserve model accuracy
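
As a rough illustration of how a savings estimate can be reasoned about (a back-of-the-envelope sketch, not the audit's actual methodology): assume every input first pays a small rule-evaluation cost, and only the inputs the rules cannot resolve go on to the full model. The function name, parameters, and cost model below are all assumptions made for this example.

```python
def estimated_compute_reduction(coverage, rule_cost, model_cost):
    """Estimate the fractional compute saved by a two-stage cascade.

    coverage   -- fraction of inputs resolved by the cheap rules (0.0 to 1.0)
    rule_cost  -- cost of evaluating the rules on one input
    model_cost -- cost of one full-model inference (same unit as rule_cost)

    Assumed cost model: every input pays rule_cost, and the inputs
    not covered by a rule additionally pay model_cost.
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    if model_cost <= 0:
        raise ValueError("model_cost must be positive")
    cascade_cost = rule_cost + (1.0 - coverage) * model_cost
    return 1.0 - cascade_cost / model_cost


# Hypothetical numbers: rules resolve 40% of inputs at 1% of the model's cost.
reduction = estimated_compute_reduction(coverage=0.40, rule_cost=0.01, model_cost=1.0)
print(f"Estimated compute reduction: {reduction:.1%}")
```

Under this simple model the saving is the rule coverage minus the relative cost of the rules, so rules only pay off when they are much cheaper than the model they bypass.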

Ideal for fraud detection, NLP classification, cybersecurity, edge AI, and large batch inference workloads.

Book a Free Audit

Benefits

Edge Systems: Extend battery life and reduce GPU thermal load.

Large Batched Workloads: Reduce compute costs.

Real-Time Systems: Free up headroom for more complex analysis of the inputs that need it.

Services

I offer my services and algorithms to optimize ML inference pipelines. I will quantize and prune models, and apply my algorithms to build cascaded systems.

Product (Alpha)

API Endpoints

moco exposes a variety of API endpoints that optimize ML models.

These endpoints return routed systems that output the same predictions as the original model.

Teams can "plug-and-play" these routed systems into their pipelines.

A preview of the API docs is here.

The product is available for alpha testing.

Below is an example using the pip-installable Python package. Get in touch if you would prefer this option.


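  import moco

  # dataset: np.ndarray of calibration inputs; predictions: np.ndarray of the original model's outputs
  rules = moco.analyze(dataset, predictions)

  # `class` is a reserved word in Python, so `rule.class` is a syntax error; getattr reads the same attribute
  selected_rules = [rule for rule in rules if getattr(rule, "class") in (0, 1, 2)]

  optimized_model = moco.build_cascaded_system(original_model, selected_rules)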


How It Works

moco is a post-training optimization tool that improves classification efficiency. Given a calibration dataset and a chosen representation of the data, moco identifies low-complexity rules that simplify the decision problem itself by decomposing the decision into easier-to-solve subproblems.

Input

  • Pre-trained model
  • Calibration dataset

Output

  • Optimized cascaded system
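
To make the idea of a "low-complexity rule" concrete, here is a toy sketch of mining one from a calibration set. It is not moco's algorithm: it assumes the chosen representation is a plain feature matrix and searches only for single-feature threshold rules on which the original model always predicts the same class. The function name and the returned dictionary are hypothetical.

```python
import numpy as np


def find_threshold_rule(X, preds, min_purity=1.0):
    """Find the widest rule of the form `X[:, f] <= t  ->  class c`.

    X     -- calibration inputs, shape (n_samples, n_features)
    preds -- the original model's predictions on X, shape (n_samples,)

    A rule qualifies when at least `min_purity` of the calibration samples
    it covers received the same prediction from the model. Among qualifying
    rules, the one covering the most samples is returned (or None).
    """
    best = None
    n_samples = len(X)
    for f in range(X.shape[1]):
        # Every observed value of the feature is a candidate threshold.
        for t in np.unique(X[:, f]):
            mask = X[:, f] <= t
            labels, counts = np.unique(preds[mask], return_counts=True)
            top = counts.argmax()
            purity = counts[top] / mask.sum()
            coverage = mask.sum() / n_samples
            if purity >= min_purity and (best is None or coverage > best["coverage"]):
                best = {
                    "feature": f,
                    "threshold": float(t),
                    "label": int(labels[top]),
                    "coverage": float(coverage),
                }
    return best


# Tiny calibration set: the model predicts class 0 for every low-valued input.
X_calib = np.array([[0.1], [0.2], [0.3], [0.8], [0.9]])
model_preds = np.array([0, 0, 0, 1, 0])
rule = find_threshold_rule(X_calib, model_preds)
print(rule)
```

Requiring full purity on the calibration set means the rule never disagrees with the model on those samples; whether it generalizes to new inputs still has to be checked on held-out data.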

How moco differs from other ML system and model optimization techniques

Quantization/Pruning

moco can be applied in combination with quantized or pruned models. Quantization and pruning both risk accuracy degradation.

Knowledge Distillation

Knowledge distillation happens at training time and requires significant development effort to produce a deployable model that meets both latency and accuracy targets. moco is applied after training.

Cascaded Systems

Cascaded systems are great! Why evaluate a complex ML model if a simple rule will suffice? moco scales this idea by automatically generating these rules and building the routed system from them.
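
For readers new to cascades, the routing idea can be sketched in a few lines. This is a minimal illustration, not the system moco builds: the class name, the `(predicate, label)` rule format, and the call counter are assumptions made for this example.

```python
import numpy as np


class CascadedClassifier:
    """Try cheap rules first; send only unresolved inputs to the full model."""

    def __init__(self, rules, full_model):
        # rules: list of (predicate, label) pairs, where predicate maps a
        # batch X to a boolean mask of the rows the rule applies to.
        self.rules = rules
        self.full_model = full_model
        self.full_model_calls = 0  # number of inputs that needed full inference

    def predict(self, X):
        X = np.asarray(X)
        out = np.empty(len(X), dtype=int)
        pending = np.ones(len(X), dtype=bool)
        for predicate, label in self.rules:
            hit = pending & predicate(X)  # only rows not already resolved
            out[hit] = label
            pending &= ~hit
        if pending.any():
            out[pending] = self.full_model(X[pending])
            self.full_model_calls += int(pending.sum())
        return out


# Stand-in for an expensive model: class 1 when the first feature exceeds 0.5.
def full_model(X):
    return (X[:, 0] > 0.5).astype(int)


rules = [(lambda X: X[:, 0] <= 0.3, 0)]  # easy region, always class 0
cascade = CascadedClassifier(rules, full_model)

X_new = np.array([[0.1], [0.2], [0.7], [0.4]])
cascade_preds = cascade.predict(X_new)
print(cascade_preds, "full-model calls:", cascade.full_model_calls)
```

Here the rule resolves two of the four inputs, so the stand-in model runs on only the remaining two while the outputs match what it would have produced on the whole batch.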

Benchmarked Results

Task                  Dataset       Base Model  Fine-Tuned Model                 Compute Reduction  Accuracy Change
Image Classification  CIFAR-10      ResNet-18   HF model fine-tuned on CIFAR-10  34.6%              -0.3%
Text Classification   IMDB Reviews  BERT        Fine-tuned TinyBERT              21.5%              +0.1%

Why not give it a try?

Evaluation Steps

Book an evaluation