Pareto: The Run-Time Compiler for LLMs.
gzip for Intelligence. 30% fewer tokens. 48% lower cost.
Transformer compute scales quadratically with prompt length: attention is computed between every pair of tokens, so doubling length quadruples the work. API cost scales with every token you send. Pareto strips the filler before it reaches the model.
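The arithmetic behind that claim can be sketched in a few lines. The per-token price below is a placeholder, not any provider's real rate:

```python
# Illustrative arithmetic only: attention work grows with the square of
# prompt length, while API billing grows linearly with token count.

def attention_pairs(n_tokens: int) -> int:
    # Self-attention scores every token against every other token.
    return n_tokens * n_tokens

def prompt_cost(n_tokens: int, price_per_1k: float = 0.01) -> float:
    # Hypothetical flat per-token billing; real prices vary by model.
    return n_tokens / 1000 * price_per_1k

print(attention_pairs(1000))  # 1000000
print(attention_pairs(2000))  # 4000000 -- doubling length quadruples compute
print(prompt_cost(1000))      # baseline input cost
print(prompt_cost(700))       # 30% fewer tokens, 30% lower input cost
```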
Example prompt:

> review authentication middleware in Express server JWT verification logic in auth.ts check whether token expiration rejects expired sessions before they reach protected route handlers.
How It Works
Every request passes through five stages before the response reaches your application.
1. **Categorize** (4 categories). Prompt classified as RAG, agent, conversational, or system. Each category gets tailored compression thresholds.
2. **Prune** (30.4% avg compression). Saliency kernel scores every token and strips natural-language filler. Code, math, entities, and operators are locked.
3. **Select Model** (48.3% cost saved). Request scored for complexity and routed to the cheapest model that can handle it. Same quality, fraction of the cost.
4. **Validate Quality** (100% safety rate). Compressed prompt checked against quality thresholds. If degradation exceeds limits, compression is rolled back.
5. **Send & Stream** (~6.5ms overhead). Dense prompt sent to the chosen LLM. Response streamed back to the client with no additional processing.

Dynamic Pruning
Not all tokens are equal. The saliency kernel assigns a weight to every token and applies lock rules before pruning. Protected categories are never touched — only natural-language filler is removed.
- Named entities preserved
- Code blocks immune
- Operators & math locked
- JSON, SQL, markup kept
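The lock-then-prune flow can be sketched as below. This is a minimal illustration, not Pareto's actual kernel: the lock regexes, the stand-in saliency function, and the threshold are all assumptions.

```python
import re

# Hypothetical lock rules: tokens matching these patterns are never pruned.
LOCK_PATTERNS = [
    re.compile(r"[{}\[\]()<>=+\-*/%]"),  # operators, brackets, math
    re.compile(r"^\d"),                  # numerals and table values
    re.compile(r"^[A-Z]"),               # crude named-entity proxy
]

# Stand-in for the learned scorer: a fixed filler vocabulary gets low weight.
FILLER = {"really", "would", "appreciate", "it", "if", "you", "could", "please"}

def saliency(token: str) -> float:
    return 0.1 if token.lower() in FILLER else 0.9

def compress(text: str, threshold: float = 0.5) -> str:
    kept = []
    for tok in text.split():
        locked = any(p.search(tok) for p in LOCK_PATTERNS)
        if locked or saliency(tok) >= threshold:
            kept.append(tok)
    return " ".join(kept)

print(compress("I would really appreciate it if you could review the auth middleware"))
# -> "I review the auth middleware"  ("I" survives via the capital-letter lock)
```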
Routing Tiers
The router scores request complexity and picks the cheapest model that can handle it.
- **Cheap tier**: Lookups, formatting, short Q&A. Routed to a lightweight model.
- **Mid tier**: Summarization, analysis, RAG retrieval. Routed to a mid-range model.
- **Premium tier**: Reasoning, code generation, multi-step agents. Stays on premium.
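A router of this shape can be sketched as follows. The complexity features, keyword list, and cutoffs are illustrative assumptions, not Pareto's actual scorer:

```python
# Illustrative complexity router: score a request, then pick the cheapest
# tier whose cutoff covers that score.

TIERS = [
    (0.3, "cheap"),    # lookups, formatting, short Q&A
    (0.6, "mid"),      # summarization, analysis, RAG
    (1.0, "premium"),  # reasoning, codegen, multi-step agents
]

def complexity(prompt: str) -> float:
    # Longer prompts score higher, capped at 0.5.
    score = min(len(prompt.split()) / 200, 0.5)
    # Reasoning/codegen markers push the request up-tier.
    if any(k in prompt.lower() for k in ("step by step", "implement", "prove")):
        score += 0.6
    return min(score, 1.0)

def route(prompt: str) -> str:
    c = complexity(prompt)
    for cutoff, tier in TIERS:
        if c <= cutoff:
            return tier
    return "premium"

print(route("What is the capital of France?"))                       # cheap
print(route("Implement a step by step migration plan for the API"))  # premium
```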
The Scoring
Under the hood, a distilled BERT-based encoder with 110M parameters scores the semantic weight of every token. It runs on CPU only — roughly 10,000x smaller than the target LLM — so it adds negligible overhead. Tokens above the saliency threshold are kept; everything below is pruned.
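Once the encoder has produced per-token scores, the keep/prune rule reduces to a threshold filter. The scores below are hypothetical stand-ins for the encoder's output:

```python
# Threshold filter over per-token saliency scores. The scores here are
# invented stand-ins for what the 110M encoder would emit.
tokens = ["I", "would", "really", "appreciate", "review", "auth", "middleware"]
scores = [0.6, 0.1, 0.05, 0.1, 0.9, 0.95, 0.9]

THRESHOLD = 0.5
kept = [t for t, s in zip(tokens, scores) if s >= THRESHOLD]
print(" ".join(kept))  # -> "I review auth middleware"
```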
The Output
Filler tokens dilute attention across the context window. By removing noise before it reaches the model, Pareto reduces attention loss — and in 42.9% of cases, the compressed prompt actually produces a better answer than the original.
SALIENCY_SANDBOX
Use Cases
Extend conversation history without context limits
Compress prior turns in multi-turn chat so your model sees the full conversation without hitting context limits. Maintain coherence across long support threads while cutting input costs.
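One way to picture this use case, with a hypothetical `compress_turn` helper standing in for what the proxy does to older messages:

```python
# Sketch: compress every turn except the most recent one, so the model sees
# the full conversation in fewer tokens. The filler set is an assumption.

FILLER = {"really", "just", "basically", "actually", "very", "please"}

def compress_turn(text: str) -> str:
    return " ".join(t for t in text.split() if t.lower() not in FILLER)

history = [
    {"role": "user", "content": "Could you please just basically summarize the ticket?"},
    {"role": "assistant", "content": "It is really a very simple login bug."},
    {"role": "user", "content": "What is the fix?"},  # latest turn, left intact
]

compressed = [
    {**m, "content": compress_turn(m["content"])} for m in history[:-1]
] + history[-1:]

for m in compressed:
    print(m["role"], ":", m["content"])
# prior turns shrink; the latest question is untouched
```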
Simple Integration
1-Line Change
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.pareto.ai/v1",
    api_key="sk-..."
)
```

Point your existing SDK at our proxy. No other code changes needed.
Full API Round-Trip
Request:

```json
{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "I would really appreciate it if you could review the auth middleware..."
    }
  ]
}
```

Response:

```json
{
  "choices": [ ... ],
  "usage": {
    "prompt_tokens": 14,
    "original_tokens": 22,
    "tokens_saved": 8,
    "compression_ratio": 0.36
  }
}
```

Validated Results
Pareto evaluates real model outputs — not just token counts. Every compression decision is gated by a quality check. These numbers come from benchmark runs on real prompts and a 100-page SEC filing.
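The usage fields in the round-trip example above are self-consistent, and a quick check shows how the savings are derived:

```python
# Recompute the usage arithmetic from the round-trip example.
original_tokens = 22
prompt_tokens = 14

tokens_saved = original_tokens - prompt_tokens
compression_ratio = round(tokens_saved / original_tokens, 2)

print(tokens_saved)       # 8
print(compression_ratio)  # 0.36
```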
Before / After Examples
Same model, same temperature. Baseline arm vs. Pareto-optimized arm.
Verbose filler around a simple classification task was stripped. The compressed prompt removed noise that was diluting model attention — the optimized version produced a correct, concise answer where the original did not.
Routing Benchmark
Same prompt text across all arms. Quality and cost measured per arm. Adaptive routing selects the cheapest model that meets quality thresholds.
| ARM | AVG QUALITY | AVG COST |
|---|---|---|
| cheap_only | 0.464 | $0.011 |
| mid_only | 0.605 | $0.083 |
| premium_only | 0.745 | $0.216 |
| adaptive_routed | 0.746 | $0.112 |
Tested on a 100-Page SEC Filing
Pareto was benchmarked against the PepsiCo 2024 Annual Report (SEC EDGAR) — a real, publicly verifiable 100+ page financial document with dense tables, legal language, and cross-referenced disclosures.
Dense financial facts (specific numbers, table values) were correctly preserved — the quality gate blocked any cut that changed meaning. Redundant system instructions were compressed aggressively.
Let's start compressing.
Book with Us