ML Inference is expensive and power-hungry

Reduce cost and energy per inference with no accuracy loss, no retraining, and no new hardware. Just math.

The Solution

moco automatically converts machine learning classifiers into validated cascaded systems by analyzing the dataset and model activations.

moco analyzes the dataset to find inputs (logs, transactions, events) that are easy to classify and can be handled by models far simpler than the original.

moco reduces the FLOPs per inference, saving expensive compute time and reducing the number of GPUs you need to keep powered on.

In high-throughput settings, models that are 20-50% more computationally efficient cut GPU costs by 20-50% or extend battery and hardware life.

Services

I offer my services and algorithms to optimize ML inference pipelines: I will quantize and prune your models and apply my algorithms to build cascaded systems.

Product (Alpha)

API Endpoints

moco exposes a variety of API endpoints that optimize ML models.

These endpoints return routed systems that output the same predictions as the original model.

Teams can "plug-and-play" these routed systems into their pipelines.

A preview of the API docs is here.

The product is available for alpha testing.
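
As a rough illustration of the plug-and-play idea, the sketch below posts a model and calibration dataset to a hypothetical endpoint and reads back a routed-system artifact. The URL, payload fields, and response keys are placeholders I made up for illustration, not the actual moco API; see the API docs preview for the real interface.

```python
# Hypothetical sketch only: the endpoint path, payload fields, and response
# keys below are assumptions for illustration, not the actual moco API.
import requests

MOCO_URL = "https://api.example.com/v1/optimize"  # placeholder URL

payload = {
    "model_uri": "s3://my-bucket/fraud-classifier.onnx",        # pre-trained model
    "calibration_data_uri": "s3://my-bucket/calibration.parquet",
    "task": "binary_classification",
}

resp = requests.post(MOCO_URL, json=payload, timeout=600)
resp.raise_for_status()
job = resp.json()

# The returned artifact would be a routed (cascaded) system that can be
# dropped in wherever the original model is served today.
print(job.get("routed_system_uri"))
```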

How It Works

moco is a post-training optimization tool that improves classification efficiency. Given a calibration dataset and a chosen representation of the data, moco identifies low-complexity rules that simplify the decision problem by decomposing it into easier-to-solve subproblems.

Input

  • Pre-trained model
  • Calibration dataset

Output

  • Optimized cascaded system
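
To make the input/output contract concrete, here is a minimal sketch of the cascade idea, assuming scikit-learn-style models: fit a cheap rule on the calibration set, raise its confidence gate until it agrees with the full model everywhere it fires, and route everything else to the full model. This is a simplified stand-in for illustration, not moco's actual algorithm.

```python
# Simplified illustration of the cascade idea, not moco's actual algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_cascade(full_model, X_calib, min_threshold=0.99):
    """Calibrate a cheap rule against the full model's own predictions."""
    y_full = full_model.predict(X_calib)              # big model treated as reference
    cheap = LogisticRegression(max_iter=1000).fit(X_calib, y_full)

    conf = cheap.predict_proba(X_calib).max(axis=1)
    # Raise the confidence gate until the cheap rule agrees with the full
    # model on every calibration input it handles.
    for t in np.linspace(min_threshold, 1.0, 20):
        handled = conf >= t
        if handled.any() and (cheap.predict(X_calib[handled]) == y_full[handled]).all():
            return cheap, float(t)
    return cheap, float("inf")                        # fall back: always use the full model

def cascaded_predict(full_model, cheap, threshold, X):
    proba = cheap.predict_proba(X)
    preds = cheap.classes_[proba.argmax(axis=1)]
    hard = proba.max(axis=1) < threshold              # low-confidence inputs
    if hard.any():
        preds[hard] = full_model.predict(X[hard])     # only these pay full cost
    return preds
```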

How the algorithms differ from other ML system and model optimization techniques

Quantization/Pruning

moco can be applied in combination with quantized or pruned models. Quantization and pruning alike risk accuracy degradation.

Knowledge Distillation

Knowledge distillation happens at training time and carries significant development cost to produce a deployed model that meets both latency and accuracy targets.

Cascaded Systems

Cascaded systems are great! Why evaluate a complex ML model if a simple rule will suffice? moco scales this idea, automatically generating such rules and assembling the routed system.
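
For intuition, here is the hand-written version of that idea; the domain, rule, and names are purely illustrative, and moco's contribution is generating and validating such rules automatically instead of relying on someone to write them.

```python
# Classic hand-written cascade: a trivial rule handles the obvious case,
# the expensive model only runs on everything else. Illustrative only.
TRUSTED_DOMAINS = {"corp.example.com"}  # assumed allow-list

def classify_email(email, expensive_model):
    # Rule: mail from a verified internal domain is never spam.
    if email["sender_domain"] in TRUSTED_DOMAINS:
        return "not_spam"
    # Hard case: pay for full model inference (scikit-learn-style classifier assumed).
    return expensive_model.predict([email["text"]])[0]
```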

Benchmarked Results

Task                 | Dataset      | Base Model | Fine-Tuned Model                | Compute Reduction | Accuracy Change
Image Classification | CIFAR-10     | ResNet-18  | HF model fine-tuned on CIFAR-10 | 34.6%             | -0.3%
Text Classification  | IMDB Reviews | BERT       | Fine-tuned TinyBERT             | 21.5%             | +0.1%

Use Cases by Industry

Fraud Detection

Transaction risk monitoring systems make real-time decisions (within ~100 ms) to avoid business losses from fraudulent activity, but suffer immense operational costs due to false positives.

  • Problem: Real-time systems' accuracy is constrained by the average latency budget.
  • With moco: Create rules that identify transactions able to skip full model execution (see the sketch after this list).
  • Impact: Reduced compute on easy transactions → more intelligence and better features where they're needed.
  • Outcome: Additional model capacity and feature complexity within existing latency constraints.
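
A back-of-envelope sketch of the latency math behind those bullets; the timings and routing fraction below are assumptions for illustration, not measured results.

```python
# Illustrative numbers only: if a cheap rule clears a fraction of transactions
# without running the model, average latency drops and the freed budget can
# fund a larger model or richer features on the remaining, harder cases.
def average_latency_ms(rule_ms, model_ms, fraction_routed_by_rule):
    p = fraction_routed_by_rule
    return p * rule_ms + (1 - p) * (rule_ms + model_ms)

budget_ms = 100.0                                  # real-time budget from the text above
baseline = average_latency_ms(0.1, 40.0, 0.0)      # every transaction hits the model
cascaded = average_latency_ms(0.1, 40.0, 0.6)      # assume the rule clears 60% of traffic

print(f"baseline avg latency: {baseline:.1f} ms")  # ~40.1 ms
print(f"cascaded avg latency: {cascaded:.1f} ms")  # ~16.1 ms
# The gap to the 100 ms budget is headroom for a bigger model on hard transactions.
```

Credit Card Fraud Detection — Inference Acceleration (case study)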

Delivered a 55% speed-up and major FLOP reduction on a production-style fraud classifier.

Cybersecurity

High-volume threat detection systems (phishing, malware, intrusion detection) must scan massive streams of events. There is a tradeoff between the cost of analyzing data at scale and detection accuracy.

  • Problem: Every event runs full model inference, even low-risk traffic
  • With moco: Early-exit on high-confidence benign or malicious patterns
  • Impact: Increased coverage under fixed compute budget
  • Outcome: More threats evaluated per second -> reduced risk for organizations

Autonomous Vehicles

Perception systems must process a large volume of continuously generated sensor data under strict real-time constraints. Critical decisions rely on perception systems that constantly scan the environment for pedestrians, stoplights, and obstacles.

  • Problem: Full model inference on every frame increases latency
  • With moco: Early exit on easy frames (clear roads, static scenes)
  • Impact: Reduced compute per frame
  • Outcome: Lower energy consumption → longer battery life

NLP/Chatbot/Agentic guardrails

Classifiers are commonplace in chatbots: customer sentiment analysis, intent classification for routing, topic tagging, and guardrails that protect the system from out-of-bounds user queries.

  • Problem: Each of these classifiers runs fully for every single user query.
  • With moco: Some of these classifiers only need to run partially (see the sketch after this list).
  • Impact: Reduced cost and improved energy efficiency
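
One way a classifier "runs partially" is an early-exit head part-way through the network. The sketch below assumes a PyTorch-style stack of encoder layers processing one query at a time; the layer structure, heads, and threshold are illustrative, not a specific library API or moco's internals.

```python
# Illustrative early-exit wrapper: if the intermediate head is confident,
# the remaining layers never run. Assumes input of shape (1, seq, hidden).
import torch
import torch.nn as nn

class EarlyExitClassifier(nn.Module):
    def __init__(self, layers, exit_head, final_head, exit_at, threshold=0.95):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.exit_head = exit_head      # small head attached after layer `exit_at`
        self.final_head = final_head    # original head after all layers
        self.exit_at = exit_at
        self.threshold = threshold

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == self.exit_at:
                probs = torch.softmax(self.exit_head(x.mean(dim=1)), dim=-1)
                if probs.max() >= self.threshold:
                    return probs        # confident: skip the remaining layers
        return torch.softmax(self.final_head(x.mean(dim=1)), dim=-1)
```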

Why not give it a try?

Evaluation Steps

  • Model audit: You provide historical logs, and I return a diagnostic audit of the rules I can design from that data.
  • You then provide the model, and I apply those rules to it.
  • We validate that there is no performance degradation on a withheld test set and verify the computational improvements on your production system (see the sketch after this list).
  • Run the optimized system side by side with the existing system for one retraining cycle.
  • Deploy once the team's confidence is high.
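
A sketch of what the validation step could look like, assuming both systems expose a scikit-learn-style predict(); the function and metric names are illustrative, not a prescribed harness.

```python
# Illustrative side-by-side check: prediction agreement on a withheld test
# set plus a coarse wall-clock comparison of the two systems.
import time
import numpy as np

def validate_side_by_side(original, optimized, X_test):
    t0 = time.perf_counter(); y_orig = original.predict(X_test)
    t1 = time.perf_counter(); y_opt = optimized.predict(X_test)
    t2 = time.perf_counter()

    agreement = float(np.mean(np.asarray(y_orig) == np.asarray(y_opt)))
    return {
        "agreement": agreement,                     # expect ~1.0 (no degradation)
        "original_s": t1 - t0,
        "optimized_s": t2 - t1,
        "speedup": (t1 - t0) / max(t2 - t1, 1e-9),
    }
```
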
Book an evaluation