Computationally Efficient ML: Inference Optimization

Roughly 30% fewer inference FLOPs: reduce compute costs with no accuracy loss using a plug-and-play product.
Optimization library

Mathematical optimization to improve computational efficiency of classifiers

~30% fewer FLOPs • ~30% fewer GPUs needed • ~30% higher QPS • ~30% lower average latency

Problem

Serving high-throughput machine learning models requires a large number of GPUs, costing organizations upwards of $100,000 per year.

Solution

moco is a mathematical optimization library whose algorithms analyze input data, model embeddings, and weight matrices to optimize machine learning models, shrinking their computational load by reducing floating-point operations (FLOPs). The benefits include higher throughput, fewer GPUs required, meeting strict latency requirements more consistently (avoiding SLA violations), shifting workloads from the cloud to the device, a smaller energy footprint, longer battery life, and increased GPU headroom.
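moco's internal algorithms are not described here, but one standard way to cut FLOPs in a classifier's weight matrices, as a rough illustration of the idea, is low-rank factorization. This sketch (plain NumPy, with made-up layer sizes and rank) replaces one dense matrix-vector product with two skinny ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dense layer: y = W @ x, with W of shape (out, in).
out_dim, in_dim = 512, 1024
W = rng.standard_normal((out_dim, in_dim))
x = rng.standard_normal(in_dim)

# Truncated SVD: W ≈ A @ B with rank r (A absorbs the singular values).
r = 128
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]   # shape (out, r)
B = Vt[:r, :]          # shape (r, in)

# Two skinny matmuls replace one big one.
y_full = W @ x
y_low = A @ (B @ x)

# Matrix-vector product costs ~2*m*n FLOPs for an m-by-n matrix.
flops_full = 2 * out_dim * in_dim
flops_low = 2 * r * (out_dim + in_dim)
print(f"FLOPs reduction: {1 - flops_low / flops_full:.1%}")  # → FLOPs reduction: 62.5%
```

How much rank reduction a real model tolerates before accuracy degrades depends on the spectrum of its weight matrices; the benchmark numbers above suggest moco finds substantial reductions at negligible accuracy cost.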

Consulting

I work directly with teams to analyze their ML inference pipelines and optimize models to reduce compute usage and latency.

Pricing is performance-based: you pay 50% of the verified infrastructure savings.

Example: If optimization reduces compute costs by $100k per year, the consulting fee is $50k.
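The fee structure above reduces to a one-liner; this trivial sketch (the function name is mine, not part of any moco API) makes the arithmetic explicit:

```python
def consulting_fee(annual_savings: float) -> float:
    """Performance-based pricing: 50% of verified infrastructure savings."""
    return 0.5 * annual_savings

print(consulting_fee(100_000))  # → 50000.0
```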

Product

A software product based on the same optimization framework is currently in development. If you are interested in early access or piloting the system, please reach out.

Use Cases by Industry

Finance
Cybersecurity
  • Increase throughput for high-volume attacks (e.g., DDoS).
  • Reduce latency for fast-propagating attacks (e.g., malware).
Energy
  • Predictive maintenance for smart grid sensors.
  • Battery charge/discharge forecasting.
  • Fault detection.
Security
  • Real-time threat detection in images and live video streams.
Defense
  • Low-latency, edge-based threat detection for constrained systems (e.g., naval/subsurface platforms).

Benchmarked Results

Dataset                                   Latency Impr. (Seq)  Latency Impr. (Raced)  Latency Impr. (Best)  FLOPs Reduction  Accuracy Change
MNIST 8x8 Optical Character Recognition   50.8%                14.3%                  50.8%                 84.3%            -0.0017
Iris Flower Classification                26.6%                22.5%                  26.6%                 66.9%            0
Credit Card Fraud Detection               11.7%                58.5%                  58.5%                 31.9%            -3.5e-06
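The exact measurement protocol behind the Seq and Raced columns is not specified here, but latency improvements like these are typically measured by timing many repeated inferences and comparing medians. A minimal harness, assuming a dense baseline and a low-rank optimized variant (both stand-ins, not moco's actual models):

```python
import statistics
import time

import numpy as np

def median_latency_ms(fn, x, runs=200):
    """Median wall-clock latency of fn(x) over repeated runs, in ms."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(x)
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

rng = np.random.default_rng(1)
W = rng.standard_normal((512, 1024))          # baseline dense layer
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A, B = U[:, :64] * s[:64], Vt[:64, :]         # rank-64 factorization
x = rng.standard_normal(1024)

base = median_latency_ms(lambda v: W @ v, x)
opt = median_latency_ms(lambda v: A @ (B @ v), x)
print(f"latency improvement: {1 - opt / base:.1%}")
```

Wall-clock gains depend on hardware, batch size, and BLAS overheads, which is why the reported latency improvements do not track the FLOPs reductions one-to-one.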

Notebooks and Reports

Natural Language Processing
Tabular Models
Computer Vision
Audio
  • Client Deployment — Audio Classification Optimization

    Reduced GPU hours by 13% for an audio event detection pipeline (fine-tuned ResNet50, PyTorch), with no measurable accuracy degradation.

    Ideal for: call-center analytics, voice moderation, edge audio monitoring.