# Mathematical optimization to reduce inference latency
## Problem
Classification models (models that make decisions, categorize data, recognize intent, and so on) must meet inference latency, throughput, and energy requirements and SLAs before they can be deployed in real-time embedded systems.
Engineers address these requirements iteratively and must ultimately make informed trade-offs between accuracy, cost, and latency.
## Solution
moco is a mathematical optimization library that analyzes a model's input data and predictions and derives rules from them. Those rules replace model inference over portions of the input space, avoiding computation for part of the data and producing lower-energy, lower-latency models. A sketch of the general idea follows.
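As a rough sketch of the general technique (not moco's actual API), the example below fits a shallow decision tree to a model's own predictions and uses it as a rule table: inputs the rules cover with high confidence skip model inference entirely. The dataset, models, 0.99 confidence threshold, and `predict` helper are all hypothetical choices for illustration.

```python
# Minimal sketch: derive rules from a model's predictions, then
# short-circuit inference wherever the rules are confident.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The "expensive" model whose inference latency we want to reduce.
model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000,
                      random_state=0).fit(X, y)

# Rule derivation stand-in: a shallow tree fit to the model's own
# predictions yields simple threshold rules over the input space.
rules = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, model.predict(X))

def predict(x, threshold=0.99):
    """Use a rule where it covers the input confidently; else run the model."""
    x = x.reshape(1, -1)
    proba = rules.predict_proba(x)[0]
    if proba.max() >= threshold:                  # rule-covered region:
        return int(rules.classes_[proba.argmax()])  # no model inference needed
    return int(model.predict(x)[0])               # uncovered region: full model

preds = np.array([predict(x) for x in X])
print("agreement with full model:", (preds == model.predict(X)).mean())
```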
## Use Cases by Industry

## Benchmarked Results
| Dataset | Latency Improvement (Sequential) | Latency Improvement (Raced) | Latency Improvement (Best) | FLOPs Reduction | Accuracy Change |
|---|---|---|---|---|---|
| MNIST 8x8 Optical Character Recognition | 50.8% | 14.3% | 50.8% | 84.3% | -0.0017 |
| Iris Flower Classification | 26.6% | 22.5% | 26.6% | 66.9% | 0 |
| Credit Card Fraud Detection | 11.7% | 58.5% | 58.5% | 31.9% | -3.5e-06 |
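The two latency columns suggest two deployment modes: a sequential mode that consults the derived rules before falling back to the model, and a raced mode that runs rules and model concurrently and returns whichever answer arrives first. The sketch below illustrates that reading only; it is not moco's implementation, and `rules` and `model` are assumed to be plain callables, with `rules` returning `None` when no rule covers the input.

```python
# Hypothetical illustration of the two deployment modes assumed above.
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def sequential(x, rules, model):
    """Try the cheap rule lookup first; run the model only on a miss."""
    hit = rules(x)
    return hit if hit is not None else model(x)

def raced(x, rules, model):
    """Run rules and model concurrently; return the first usable answer."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        rule_f = pool.submit(rules, x)
        model_f = pool.submit(model, x)
        done, _ = wait({rule_f, model_f}, return_when=FIRST_COMPLETED)
        if rule_f in done and rule_f.result() is not None:
            model_f.cancel()       # rule hit: the model result is not needed
            return rule_f.result()
        return model_f.result()    # rules missed, or the model won the race
```

Under this reading, sequential mode wins when the rules cover most inputs (e.g. MNIST 8x8 above), while raced mode wins when rule misses are common enough that waiting on the lookup before starting the model would dominate (e.g. fraud detection above).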