Reduce compute per ML decision

Eliminate unnecessary inference

Reallocate compute capacity to:

  • Process more decisions, enabling higher throughput and better coverage
  • Reduce infrastructure usage, enabling lower cost at the same workload
  • Allocate more compute to complex decisions, enabling higher detection accuracy
  • Run more models in parallel, enabling more accurate systems

Shift decision-making to the edge

Pilot Program

Run a 30-day evaluation!

Book an evaluation

The Problem

Organizations must provision large GPU fleets to run classification at scale while meeting strict latency requirements.

With a standard trained ML model, every request runs through the full model, regardless of how easy it is to classify.

This wastes significant compute.

The Solution

Many inputs can be classified correctly long before the final layer of a model.

moco adds early-exit paths to existing classifiers so that easy inputs skip the remaining layers or avoid model execution altogether.
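To make the mechanism concrete, here is a minimal, generic early-exit wrapper in PyTorch. It illustrates the idea rather than moco's implementation: the single exit point, the module split, and the fixed confidence threshold are all assumptions for the sketch.

```python
import torch
import torch.nn as nn

class EarlyExitClassifier(nn.Module):
    """Generic early-exit sketch (not moco's implementation).

    Handles one input at a time for clarity; batched routing would
    need per-sample masking.
    """
    def __init__(self, front, rest, exit_head, final_head, threshold=0.9):
        super().__init__()
        self.front = front            # layers up to the exit point
        self.rest = rest              # remaining layers of the backbone
        self.exit_head = exit_head    # small head on intermediate features
        self.final_head = final_head  # original classification head
        self.threshold = threshold    # confidence required to exit early

    @torch.no_grad()
    def forward(self, x):
        h = self.front(x)
        early_logits = self.exit_head(h)
        confidence = early_logits.softmax(dim=-1).max().item()
        if confidence >= self.threshold:
            return early_logits                  # easy input: skip the rest
        return self.final_head(self.rest(h))     # hard input: full inference
```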

Cost savings for a 100-GPU cluster

moco cuts costs by making inference more computationally efficient in large-scale, high-throughput settings.

Typical Inference Cluster: 100 GPUs, $2.6M / year
With moco Optimization: 70 GPUs, $1.8M / year
↓ 30% GPU usage • ↓ $800K / year
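As a rough sanity check on these figures, assuming on-demand pricing of about $3 per GPU-hour: 100 GPUs × 8,760 hours/year × $3 ≈ $2.6M/year, while 70 GPUs ≈ $1.8M/year, so the ~$800K annual saving follows directly from the 30% reduction in GPUs.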

Product

API Endpoints

moco exposes a set of API endpoints that ML performance engineers can use to optimize ML models. A preview of the API docs is available.

Services

Given moco's early stage, I take on clients directly and use my algorithms to build cascaded systems for their workloads.

How It Works

As a post-training optimization step, moco analyzes a calibration dataset (e.g., the training set), considers a specific representation (either the raw inputs themselves or activations pulled from an intermediate layer), and identifies conditions: low-complexity rules that simplify the classification decision by splitting it into two easier problems. (A sketch of what this step can look like follows the input/output summary below.)

Input

  • Pre-trained model
  • Calibration dataset

Output

  • Optimized cascaded system
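Here is a self-contained sketch of what such a calibration step can look like, under stated assumptions: the low-complexity rule is a logistic regression (scikit-learn) fit on intermediate-layer activations, and the exit threshold is the lowest one that keeps accuracy on the exited subset above a target. moco's actual rule search is not documented here, so treat every name below as illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate_exit(features, labels, target_accuracy=0.99):
    """Fit a cheap rule on calibration data and pick an exit threshold.

    features: intermediate-layer activations (or raw inputs) on the
    calibration set, shape (n_samples, n_features).
    labels: ground-truth classes, shape (n_samples,).
    """
    labels = np.asarray(labels)
    rule = LogisticRegression(max_iter=1000).fit(features, labels)
    probs = rule.predict_proba(features)
    conf = probs.max(axis=1)
    preds = rule.classes_[probs.argmax(axis=1)]
    # Scan thresholds from least to most conservative and keep the first
    # (lowest) one whose exited subset still meets the accuracy target,
    # which maximizes the number of early exits.
    threshold = 1.1  # sentinel: no input exits early
    for t in np.linspace(0.5, 1.0, 51):
        exited = conf >= t
        if exited.any() and (preds[exited] == labels[exited]).mean() >= target_accuracy:
            threshold = t
            break
    return rule, threshold
```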

How moco differs from other ML system and model optimization techniques

Quantization/Pruning

moco can be applied in combination with quantized or pruned models. Quantization and pruning alike risk accuracy degradation.

Knowledge Distillation

Knowledge distillation happens at training time and carries significant development cost to produce a deployed model that meets both latency and accuracy targets.

Cascaded Systems

Cascaded systems are great! Why evaluate a complex ML model if a simple rule will suffice? moco scales this idea: it automatically generates these rules and assembles them into a routed system.
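The routing itself can be a few lines. This sketch assumes a `rule` and `threshold` produced by a calibration step like the one sketched above, and a `full_model` callable for the fallback path; none of these names are moco's API.

```python
import numpy as np

def cascaded_predict(features, raw_input, rule, threshold, full_model):
    """Route one input: try the cheap rule first, fall back to the model."""
    probs = rule.predict_proba(np.asarray(features).reshape(1, -1))[0]
    if probs.max() >= threshold:
        return rule.classes_[probs.argmax()]  # simple rule suffices: exit here
    return full_model(raw_input)              # otherwise pay for full inference
```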

Benchmarked Results

Task | Dataset | Base Model | Fine-Tuned Model | Compute Reduction | Accuracy Change
Image Classification | CIFAR-10 | ResNet-18 | HF model fine-tuned on CIFAR-10 | 34.6% | -0.3%
Text Classification | IMDB Reviews | BERT | Fine-tuned TinyBERT | 21.5% | +0.1%

Use Cases by Industry

Fraud Detection

Transaction risk monitoring systems make real-time decisions (within ~100 ms) to avoid business losses from fraudulent activity, but suffer immense operational costs due to false positives.

  • Problem: Real-time systems' accuracy is constrained by the average latency budget.
  • With moco: Create rules that identify transactions which can skip model execution entirely.
  • Impact: Less compute spent on easy transactions → more intelligence and better features where they're needed.
  • Outcome: Reduction in false positives

Credit Card Fraud Detection: Inference Acceleration

Delivered a 55% speed-up and a major FLOP reduction on a production-style fraud classifier.

Cybersecurity

High-volume threat detection systems (phishing, malware, intrusion detection) must scan massive streams of events, forcing a tradeoff between the cost of analyzing data at scale and detection accuracy.

  • Problem: Every event runs full model inference, even low-risk traffic
  • With moco: Early-exit on high-confidence benign or malicious patterns
  • Impact: Increased coverage under a fixed compute budget
  • Outcome: More threats evaluated per second → reduced risk for organizations

Autonomous Vehicles

Perception systems must process a large volume of continuously generated sensor data under strict real-time constraints. Critical decisions rely on perception systems that constantly scan the environment for pedestrians, stoplights, and obstacles.

  • Problem: Full model inference on every frame increases latency
  • With moco: Early exit on easy frames (clear roads, static scenes)
  • Impact: Reduced compute per frame
  • Outcome: Lower energy consumption → longer battery life

NLP/Chatbot/Agentic guardrails

Classifiers are commonplace in chatbots: customer sentiment analysis, intent classification for routing, topic tagging, and guardrails that protect the system from out-of-bounds user queries.

  • Problem: Each of these classifiers runs in full for every single user query.
  • With moco: Some of these classifiers only need to run partially.
  • Impact: Reduced cost and improved energy efficiency