Determine which inputs need complex, deep analysis, and reserve compute-time for those inputs.
moco automatically converts existing classifiers into routed cascaded systems that avoid unnecessary computation while preserving model behavior.
Transformer moderation systems processing billions of events per day.
Large-scale indexing and retrieval pipelines operating on billions of assets.
Most production ML systems send every input through the full model — even when many predictions are trivial to classify.
moco analyzes model behavior and automatically routes easy predictions through cheaper decision paths while reserving full inference for difficult cases.
Reduce unnecessary inference compute, reducing peak load, directly reducing GPUs needed at peak.
Reduce unnecessary inference compute, reducing the size of the batch, directly reducing the compute time that's needed.
Reduce energy consumed per prediction, reducing energy costs, saving hardware deterioration and slowdowns
Process more events/transactions/frames per second on existing hardware to keep up with a dynamic environment.
Avoid high-latency network calls and infrastructure costs for obviously safe transactions.
Enable smarter systems that classify more objects/issues
I will analyze historical inference logs, model outputs, or representative datasets to identify unnecessary model computation and estimate potential savings.
The audit evaluates:
License to Self-Serve API
Self-Serve API + Support
Enterprise: I will quantize, prune models and apply my algorithms to build cascaded systems.
The Product API docs are here.
Pricing is dependent on value gained (cost reduction or throughput gained)
import moco
rules = moco.analyze(dataset: np.ndarray, predictions: np.ndarray)
selected_rules = [rule for rule in rules if rule.class in (0, 1, 2)]
optimized_model = moco.build_cascaded_system(original_model, selected_rules)
Can be applied in combination with quantization/pruned models. Quantization & pruning alike risk accuracy degradation. Accepting lower accuracy in favor of cost/energy constraints is a tough place to be in.
Requires re-training, and thus significant development cost.
Cascaded systems are great! Why evaluate a complex ML model if a simple rule will suffice? moco scales this idea and automatically generates these rules, and builds a routed system automatically.
Adding more CPUs/GPUs to meet compute needs is expensive and disruptive. Delaying that until next quarter is a win.
Migrating to faster CPU/GPUs to meet compute needs is expensive and disruptive. Delaying that until next quarter is a win.
| Task | Dataset | Base Model | Fine Tuned Model | Compute Reduction | Accuracy Change |
|---|---|---|---|---|---|
| Image Classification | CIFAR-10 | ResNet-18 | HF Model finetuned on CIFAR-10 | 34.6% | -0.3% |
| Text Classification | IMDB Reviews | BERT | Fine-Tuned TinyBERT | 21.5% | +0.1% |
Evaluation Steps