Eliminate unnecessary inference to free up capacity in your existing infrastructure
Run a 30-day evaluation!
To meet strict latency requirements for at-scale classification, organizations must provision large GPU fleets.
With a trained ML model, every request runs through the full model — wasting significant compute, because many inputs can be classified correctly long before the final layer.
moco adds early-exit paths to existing classifiers, so easy inputs skip the remaining layers or avoid model execution altogether.
Example GPU savings
As a post-training optimization step, moco analyzes a calibration dataset (e.g., the training set), examines a chosen representation (either the raw inputs or activations pulled from an intermediate layer), and identifies regions of that space where only one class is present. When an input falls into such a region, the rule fires with high confidence and the rest of the model is skipped.
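The region-finding step above can be sketched as follows. This is a minimal illustration, not moco's actual implementation: it assumes a tabular feature representation and uses simple per-feature quantile bins as "regions"; the function name and thresholds are hypothetical.

```python
import numpy as np

def pure_interval_rules(features, labels, n_bins=32, min_support=20):
    """For each 1-D feature, find quantile bins of the calibration data
    that contain only a single class; each such bin becomes an
    early-exit rule (feature index, low, high, predicted class).
    Illustrative sketch -- not moco's real API."""
    rules = []
    for j in range(features.shape[1]):
        col = features[:, j]
        edges = np.quantile(col, np.linspace(0, 1, n_bins + 1))
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (col >= lo) & (col < hi)
            # Require enough calibration samples before trusting a bin
            if mask.sum() >= min_support:
                classes = np.unique(labels[mask])
                if len(classes) == 1:  # region is "pure": one class only
                    rules.append((j, float(lo), float(hi), int(classes[0])))
    return rules
```

In practice the representation would typically be an intermediate-layer embedding rather than raw input features, and the purity criterion would need a confidence margin to generalize beyond the calibration set.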
moco can be applied in combination with quantized or pruned models. Quantization and pruning alike risk accuracy degradation.
Knowledge distillation happens at training time and requires significant development effort to optimize the model for deployment while meeting both latency and accuracy targets.
Cascaded systems are great! Why evaluate a complex ML model when a simple rule will suffice? moco scales this idea: it automatically generates these rules and builds the routed system for you.
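The routing logic of such a cascade can be sketched in a few lines. This is an illustrative sketch of the general pattern, not moco's real interface; `cascaded_predict` and the rule tuple format are hypothetical.

```python
def cascaded_predict(x, rules, full_model):
    """Route one input: if any early-exit rule fires, return its class
    without running the model; otherwise fall back to the full model.
    Rules are (feature index, low, high, class) tuples.
    Illustrative sketch -- not moco's real API."""
    for feat_idx, lo, hi, cls in rules:
        if lo <= x[feat_idx] < hi:
            return cls, "rule"      # cheap path: model skipped entirely
    return full_model(x), "model"   # expensive path: full inference
```

The compute savings come from the fraction of traffic that resolves on the cheap path; the "model" tag makes it easy to measure that fraction in production.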
| Task | Dataset | Base Model | Fine-Tuned Model | Compute Reduction | Accuracy Change |
|---|---|---|---|---|---|
| Image Classification | CIFAR-10 | ResNet-18 | HF Model finetuned on CIFAR-10 | 34.6% | -0.3% |
| Text Classification | IMDB Reviews | BERT | Fine-Tuned TinyBERT | 21.5% | +0.1% |
Transaction risk monitoring systems make real-time decisions (within 100 ms) to prevent losses from fraudulent activity, yet incur immense operational costs from false positives.
High-volume threat detection systems (phishing, malware, intrusion detection) must scan massive streams of events. There is a cost tradeoff between analyzing data at scale and accuracy.
Perception systems must process a large volume of continuously generated sensor data under strict real-time constraints. Critical decisions rely on perception systems that constantly scan their environment for pedestrians, stoplights, and obstacles.
Classifiers are commonplace in chatbots, powering customer sentiment analysis, intent classification for routing, topic tagging, and guardrails that protect the system from out-of-bounds user queries.