Reduce compute per ML decision

Eliminate unnecessary inference to free up capacity in your existing infrastructure

Reallocate that capacity to:

  • Process more decisions, enabling higher throughput and better coverage
  • Reduce infrastructure usage, enabling lower cost at the same workload
  • Allocate more compute to complex decisions, enabling higher detection accuracy
  • Run more models in parallel, enabling more accurate systems

Pilot Program

Run a 30-day evaluation!

Book an evaluation

The Problem

Organizations must provision large GPU fleets to run at-scale classification in order to meet strict latency requirements.

Given a trained ML model, every request runs through the full model.

This wastes significant compute on inputs that are easy to classify.

The Solution

Many inputs can be classified correctly long before the final layer of a model.

moco adds early-exit paths to existing classifiers so that easy inputs skip the remaining layers or avoid model execution altogether.
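The early-exit idea can be illustrated with a minimal sketch. This is not moco's implementation: the names `run_with_early_exit`, `STAGES`, `EXIT_RULES`, and `final_head` are hypothetical, and the "model" is a toy pipeline on a single number. It only shows how a rule placed before a stage can return a label and skip everything after it.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types for this sketch: a stage transforms a representation,
# and an exit rule returns a label when it fires or None to continue.
Stage = Callable[[float], float]
ExitRule = Callable[[float], Optional[str]]


def run_with_early_exit(
    x: float,
    stages: List[Stage],
    exit_rules: List[Optional[ExitRule]],
    final_head: Callable[[float], str],
) -> Tuple[str, int]:
    """Return (label, number of stages actually executed)."""
    h = x
    for executed, (stage, rule) in enumerate(zip(stages, exit_rules)):
        # exit_rules[i] inspects the input to stage i, so exit_rules[0]
        # sees the raw input and can avoid the model entirely.
        if rule is not None:
            label = rule(h)
            if label is not None:
                return label, executed
        h = stage(h)
    return final_head(h), len(stages)


# Toy two-stage "model" on a scalar input (illustrative only).
STAGES: List[Stage] = [lambda h: h * 2.0, lambda h: h + 1.0]
EXIT_RULES: List[Optional[ExitRule]] = [
    lambda h: "negative" if h < -5.0 else None,  # checked before any stage runs
    lambda h: "positive" if h > 10.0 else None,  # checked after stage 0
]


def final_head(h: float) -> str:
    return "positive" if h > 0.0 else "negative"


for x in (-8.0, 6.0, 1.0):
    print(x, run_with_early_exit(x, STAGES, EXIT_RULES, final_head))
```

In this toy run the first input exits before any stage executes, the second exits after one stage, and the third runs the full pipeline.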

Example GPU savings

Typical Inference Cluster
$2.6M / year • 100 GPUs
With moco Optimization
$1.8M / year • 70 GPUs
↓ 30% GPU usage • ↓ $800K/year
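The arithmetic behind this example can be checked with a short sketch. It assumes cost scales linearly with GPU count, which the example does not state explicitly. The variable names are illustrative.

```python
# Figures taken from the example above.
baseline_gpus = 100
baseline_cost = 2_600_000  # USD per year
optimized_gpus = 70

# Assumption: every GPU costs the same amount per year.
cost_per_gpu = baseline_cost / baseline_gpus
optimized_cost = cost_per_gpu * optimized_gpus
savings = baseline_cost - optimized_cost
gpu_reduction = 1 - optimized_gpus / baseline_gpus

print(f"cost per GPU:   ${cost_per_gpu:,.0f} / year")
print(f"optimized cost: ${optimized_cost:,.0f} / year")
print(f"savings:        ${savings:,.0f} / year")
print(f"GPU reduction:  {gpu_reduction:.0%}")
```

Under that assumption the optimized cluster costs about $1.82M per year and saves about $780K, consistent with the rounded figures shown.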

How It Works

As a post-training optimization step, moco analyzes a calibration dataset (e.g., the training set), considers a specific representation of it (either the dataset itself or a representation pulled from an intermediate layer), and determines regions of that representation where only one class is present. Each such region becomes a rule. When a rule activates for an input, the input falls in a region where the calibration data contains a single class, so the rule is confident in its decision and the rest of the model does not need to run.
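One simple way to picture "regions where only one class is present" is the sketch below. It is an assumption-laden illustration, not moco's algorithm: it uses a one-dimensional representation, fixed-width bins as regions, and a hypothetical `min_count` threshold so that a region needs enough calibration points before it becomes a rule. The function names `fit_pure_bins` and `apply_rule` are invented for this example.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def fit_pure_bins(
    values: List[float], labels: List[str], bin_width: float, min_count: int = 3
) -> Dict[int, str]:
    """Map bin index -> class for bins whose calibration points share one class."""
    by_bin: Dict[int, List[str]] = defaultdict(list)
    for v, y in zip(values, labels):
        by_bin[int(v // bin_width)].append(y)
    return {
        b: ys[0]
        for b, ys in by_bin.items()
        if len(ys) >= min_count and len(set(ys)) == 1
    }


def apply_rule(rules: Dict[int, str], value: float, bin_width: float) -> Optional[str]:
    """Return the rule's class if the value lands in a pure bin, else None."""
    return rules.get(int(value // bin_width))


# Toy calibration set: bin [0, 1) is all "a", bin [1, 2) is mixed, bin [2, 3) is all "b".
values = [0.1, 0.2, 0.3, 1.1, 1.4, 1.6, 2.2, 2.5, 2.9]
labels = ["a", "a", "a", "a", "b", "a", "b", "b", "b"]
rules = fit_pure_bins(values, labels, bin_width=1.0)

for v in (0.5, 1.5, 2.7, 5.0):
    decision = apply_rule(rules, v, bin_width=1.0)
    print(v, decision if decision is not None else "run the full model")
```

Inputs that land in a mixed bin, or in a region the calibration data never covered, fall through to the full model.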

Input

  • Pre-trained model
  • Calibration dataset

Output

  • Optimized model
  • Lower compute per request

Integration

  • Standard PyTorch model artifact
  • Compatible with existing inference pipelines

How It Is Different

Quantization/Pruning

moco can be combined with quantized or pruned models. Quantization and pruning both risk accuracy degradation.

Knowledge Distillation

Knowledge distillation happens at training time and requires significant development effort to produce a deployable model that meets both latency and accuracy targets. moco is applied after training.

Cascaded Systems

Cascaded systems are great! Why evaluate a complex ML model if a simple rule will suffice? moco scales this idea: it generates these rules and builds the routed system automatically.

Benchmarked Results

Task                 | Dataset      | Base Model | Fine-Tuned Model                | Compute Reduction | Accuracy Change
Image Classification | CIFAR-10     | ResNet-18  | HF model fine-tuned on CIFAR-10 | 34.6%             | -0.3%
Text Classification  | IMDB Reviews | BERT       | Fine-tuned TinyBERT             | 21.5%             | +0.1%
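To build intuition for what a compute-reduction figure means, the sketch below computes the expected reduction from a mix of exit points. The exit fractions and costs are hypothetical and are not taken from the benchmarks above. The sketch also ignores the cost of evaluating the rules themselves, and `compute_reduction` is an illustrative name.

```python
from typing import List, Tuple


def compute_reduction(exits: List[Tuple[float, float]]) -> float:
    """Expected compute saved relative to always running the full model.

    exits: (fraction of requests, fraction of full-model cost spent) pairs.
    The request fractions must sum to 1.
    """
    total = sum(fraction for fraction, _ in exits)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("request fractions must sum to 1")
    expected_cost = sum(fraction * cost for fraction, cost in exits)
    return 1.0 - expected_cost


# Hypothetical mix: 20% of requests skip the model, 25% exit after 40% of
# the model's compute, and 55% run the full model.
mix = [(0.20, 0.0), (0.25, 0.4), (0.55, 1.0)]
print(f"compute reduction: {compute_reduction(mix):.1%}")
```

With this hypothetical mix, the expected cost is 65% of the full model, a 35% reduction.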

Use Cases by Industry

Fraud Detection

Transaction risk monitoring systems make real-time decisions (100 ms) to prevent business losses from fraudulent activity, but they incur high operational costs from false positives.

  • Problem: The accuracy of real-time systems is constrained by the average latency budget.
  • With moco: Create rules that identify transactions that can skip model execution.
  • Impact: Less compute spent on easy transactions, freeing capacity for more intelligence and better features where they are needed.
  • Outcome: Reduction in false positives

Credit Card Fraud Detection: Inference Acceleration

Delivered a 55% speed-up and major FLOP reduction on a production-style fraud classifier.
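The link between early exits and the average latency budget can be made concrete with a small calculation. If a fraction of transactions is resolved by a fast rule, the remaining transactions can use a slower, more capable path while the average stays within budget. All numbers below are hypothetical, and `max_hard_path_latency` is an illustrative name.

```python
def max_hard_path_latency(
    avg_budget_ms: float, easy_fraction: float, rule_latency_ms: float
) -> float:
    """Largest latency the non-easy path can have while meeting the average budget.

    Solves  easy_fraction * rule_latency + (1 - easy_fraction) * t = avg_budget  for t.
    """
    if not 0.0 <= easy_fraction < 1.0:
        raise ValueError("easy_fraction must be in [0, 1)")
    return (avg_budget_ms - easy_fraction * rule_latency_ms) / (1.0 - easy_fraction)


# Hypothetical: 20 ms average budget, 40% of transactions resolved by a 1 ms rule.
t = max_hard_path_latency(avg_budget_ms=20.0, easy_fraction=0.4, rule_latency_ms=1.0)
print(f"hard-path latency allowance: {t:.1f} ms")
```

In this example the hard path can take roughly 32.7 ms instead of 20 ms, which leaves room for a larger model or richer features on the transactions that need them.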

Cybersecurity

High-volume threat detection systems (phishing, malware, intrusion detection) must scan massive streams of events. At that scale, teams must trade off the cost of analyzing every event against detection accuracy.

  • Problem: Every event runs full model inference, even low-risk traffic
  • With moco: Early-exit on high-confidence benign or malicious patterns
  • Impact: Increased coverage under fixed compute budget
  • Outcome: More threats evaluated per second → reduced risk for organizations
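The coverage claim follows from simple arithmetic: with a fixed compute budget, cutting the average compute per event raises the number of events that fit in that budget. The sketch below uses hypothetical numbers and assumes throughput is limited only by compute. `throughput_at_fixed_budget` is an illustrative name.

```python
def throughput_at_fixed_budget(
    baseline_events_per_s: float, compute_reduction: float
) -> float:
    """Events per second achievable with the same compute after a reduction.

    compute_reduction is the fraction of per-event compute saved on average.
    """
    if not 0.0 <= compute_reduction < 1.0:
        raise ValueError("compute_reduction must be in [0, 1)")
    return baseline_events_per_s / (1.0 - compute_reduction)


# Hypothetical: 10,000 events/s today, 30% less compute per event on average.
new_rate = throughput_at_fixed_budget(10_000.0, 0.30)
print(f"events per second at the same budget: {new_rate:,.0f}")
```

A 30% reduction in per-event compute yields roughly 43% more events per second, since 1 / 0.7 is about 1.43.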

Autonomous Vehicles

Perception systems must process a large volume of continuously generated sensor data under strict real-time constraints. Critical decisions rely on perception systems that constantly scan their environment for pedestrians, stoplights, and obstacles.

  • Problem: Full model inference on every frame increases latency
  • With moco: Early exit on easy frames (clear roads, static scenes)
  • Impact: Reduced compute per frame
  • Outcome: Lower energy consumption → longer battery life

NLP/Chatbot/Agentic Guardrails

Classifiers are commonplace in chatbots to run customer sentiment analysis, intent classification for proper routing, topic tagging, and guardrails that protect the system from out-of-bounds user queries.

  • Problem: Each of these classifiers runs fully for every single user query.
  • With moco: Some of these classifiers only need to run partially.
  • Impact: Reduction of cost and improved energy efficiency