Pareto: The Run-Time Compiler for LLMs.
gzip for Intelligence. 30% fewer tokens. 48% lower cost.
Transformer compute scales quadratically with prompt length: attention is computed between every pair of tokens, so doubling length quadruples the work. API cost scales with every token you send. Pareto strips the filler before it reaches the model.
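The arithmetic behind that claim can be sketched in a few lines. The per-token price below is a placeholder, not any provider's real rate:

```python
# Illustrative arithmetic only: attention work grows with the square of
# prompt length, while API billing grows linearly with token count.

def attention_pairs(n_tokens: int) -> int:
    # Self-attention scores every token against every other token.
    return n_tokens * n_tokens

def prompt_cost(n_tokens: int, price_per_1k: float = 0.01) -> float:
    # Hypothetical flat per-token billing; real prices vary by model.
    return n_tokens / 1000 * price_per_1k

print(attention_pairs(1000))  # 1000000
print(attention_pairs(2000))  # 4000000 -- doubling length quadruples compute
print(prompt_cost(1000))      # baseline input cost
print(prompt_cost(700))       # 30% fewer tokens, 30% lower input cost
```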
Example prompt:

> review authentication middleware in Express server JWT verification logic in auth.ts check whether token expiration rejects expired sessions before they reach protected route handlers.
How It Works
Every request passes through five stages before the response reaches your application.
1. **Categorize** (4 categories). Prompt classified as RAG, agent, conversational, or system. Each category gets tailored compression thresholds.
2. **Prune** (30.4% avg compression). Saliency kernel scores every token and strips natural-language filler. Code, math, entities, and operators are locked.
3. **Select Model** (48.3% cost saved). Request scored for complexity and routed to the cheapest model that can handle it. Same quality, fraction of the cost.
4. **Validate Quality** (100% safety rate). Compressed prompt checked against quality thresholds. If degradation exceeds limits, compression is rolled back.
5. **Send & Stream** (~6.5ms overhead). Dense prompt sent to the chosen LLM. Response streamed back to the client with no additional processing.

Dynamic Pruning
Not all tokens are equal. The saliency kernel assigns a weight to every token and applies lock rules before pruning. Protected categories are never touched — only natural-language filler is removed.
- Named entities preserved
- Code blocks immune
- Operators & math locked
- JSON, SQL, markup kept
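The lock-then-prune flow can be sketched as below. This is a minimal illustration, not Pareto's actual kernel: the lock regexes, the stand-in saliency function, and the threshold are all assumptions.

```python
import re

# Hypothetical lock rules: tokens matching these patterns are never pruned.
LOCK_PATTERNS = [
    re.compile(r"[{}\[\]()<>=+\-*/%]"),  # operators, brackets, math
    re.compile(r"^\d"),                  # numerals and table values
    re.compile(r"^[A-Z]"),               # crude named-entity proxy
]

# Stand-in for the learned scorer: a fixed filler vocabulary gets low weight.
FILLER = {"really", "would", "appreciate", "it", "if", "you", "could", "please"}

def saliency(token: str) -> float:
    return 0.1 if token.lower() in FILLER else 0.9

def compress(text: str, threshold: float = 0.5) -> str:
    kept = []
    for tok in text.split():
        locked = any(p.search(tok) for p in LOCK_PATTERNS)
        if locked or saliency(tok) >= threshold:
            kept.append(tok)
    return " ".join(kept)

print(compress("I would really appreciate it if you could review the auth middleware"))
# -> "I review the auth middleware"  ("I" survives via the capital-letter lock)
```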
Routing Tiers
The router scores request complexity and picks the cheapest model that can handle it.
- **Cheap tier**: Lookups, formatting, short Q&A. Routed to a lightweight model.
- **Mid tier**: Summarization, analysis, RAG retrieval. Routed to a mid-range model.
- **Premium tier**: Reasoning, code generation, multi-step agents. Stays on premium.
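A router of this shape can be sketched as follows. The complexity features, keyword list, and cutoffs are illustrative assumptions, not Pareto's actual scorer:

```python
# Illustrative complexity router: score a request, then pick the cheapest
# tier whose cutoff covers that score.

TIERS = [
    (0.3, "cheap"),    # lookups, formatting, short Q&A
    (0.6, "mid"),      # summarization, analysis, RAG
    (1.0, "premium"),  # reasoning, codegen, multi-step agents
]

def complexity(prompt: str) -> float:
    # Longer prompts score higher, capped at 0.5.
    score = min(len(prompt.split()) / 200, 0.5)
    # Reasoning/codegen markers push the request up-tier.
    if any(k in prompt.lower() for k in ("step by step", "implement", "prove")):
        score += 0.6
    return min(score, 1.0)

def route(prompt: str) -> str:
    c = complexity(prompt)
    for cutoff, tier in TIERS:
        if c <= cutoff:
            return tier
    return "premium"

print(route("What is the capital of France?"))                       # cheap
print(route("Implement a step by step migration plan for the API"))  # premium
```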
The Scoring
Under the hood, a distilled BERT-based encoder with 110M parameters scores the semantic weight of every token. It runs on CPU only — roughly 10,000x smaller than the target LLM — so it adds negligible overhead. Tokens above the saliency threshold are kept; everything below is pruned.
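Once the encoder has produced per-token scores, the keep/prune rule reduces to a threshold filter. The scores below are hypothetical stand-ins for the encoder's output:

```python
# Threshold filter over per-token saliency scores. The scores here are
# invented stand-ins for what the 110M encoder would emit.
tokens = ["I", "would", "really", "appreciate", "review", "auth", "middleware"]
scores = [0.6, 0.1, 0.05, 0.1, 0.9, 0.95, 0.9]

THRESHOLD = 0.5
kept = [t for t, s in zip(tokens, scores) if s >= THRESHOLD]
print(" ".join(kept))  # -> "I review auth middleware"
```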
The Output
Filler tokens dilute attention across the context window. By removing noise before it reaches the model, Pareto reduces attention loss — and in 42.9% of cases, the compressed prompt actually produces a better answer than the original.
SALIENCY_SANDBOX
Use Cases
Extend conversation history without context limits
Compress prior turns in multi-turn chat so your model sees the full conversation without hitting context limits. Maintain coherence across long support threads while cutting input costs.
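One way to picture this use case, with a hypothetical `compress_turn` helper standing in for what the proxy does to older messages:

```python
# Sketch: compress every turn except the most recent one, so the model sees
# the full conversation in fewer tokens. The filler set is an assumption.

FILLER = {"really", "just", "basically", "actually", "very", "please"}

def compress_turn(text: str) -> str:
    return " ".join(t for t in text.split() if t.lower() not in FILLER)

history = [
    {"role": "user", "content": "Could you please just basically summarize the ticket?"},
    {"role": "assistant", "content": "It is really a very simple login bug."},
    {"role": "user", "content": "What is the fix?"},  # latest turn, left intact
]

compressed = [
    {**m, "content": compress_turn(m["content"])} for m in history[:-1]
] + history[-1:]

for m in compressed:
    print(m["role"], ":", m["content"])
# prior turns shrink; the latest question is untouched
```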
Simple Integration
1-Line Change
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.pareto.ai/v1",
    api_key="sk-..."
)
```

Point your existing SDK at our proxy. No other code changes needed.
Full API Round-Trip
Request:

```json
{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "I would really appreciate it if you could review the auth middleware..."
    }
  ]
}
```

Response:

```json
{
  "choices": [ ... ],
  "usage": {
    "prompt_tokens": 14,
    "original_tokens": 22,
    "tokens_saved": 8,
    "compression_ratio": 0.36
  }
}
```

Validated Results
Pareto evaluates real model outputs — not just token counts. Every compression decision is gated by a quality check. These numbers come from benchmark runs on real prompts and a 100-page SEC filing.
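The usage fields in the round-trip example above are self-consistent, and a quick check shows how the savings are derived:

```python
# Recompute the usage arithmetic from the round-trip example.
original_tokens = 22
prompt_tokens = 14

tokens_saved = original_tokens - prompt_tokens
compression_ratio = round(tokens_saved / original_tokens, 2)

print(tokens_saved)       # 8
print(compression_ratio)  # 0.36
```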
Before / After Examples
Same model, same temperature. Baseline arm vs. Pareto-optimized arm.
Verbose filler around a simple classification task was stripped. The compressed prompt removed noise that was diluting model attention — the optimized version produced a correct, concise answer where the original did not.
Routing Benchmark
Same prompt text across all arms. Quality and cost measured per arm. Adaptive routing selects the cheapest model that meets quality thresholds.
| ARM | AVG QUALITY | AVG COST |
|---|---|---|
| cheap_only | 0.464 | $0.011 |
| mid_only | 0.605 | $0.083 |
| premium_only | 0.745 | $0.216 |
| adaptive_routed | 0.746 | $0.112 |
Tested on a 100-Page SEC Filing
Pareto was benchmarked against the PepsiCo 2024 Annual Report (SEC EDGAR) — a real, publicly verifiable 100+ page financial document with dense tables, legal language, and cross-referenced disclosures.
Dense financial facts (specific numbers, table values) were correctly preserved — the quality gate blocked any cut that changed meaning. Redundant system instructions were compressed aggressively.
Let's start compressing.
Book with Us