Pareto

The Run-Time Compiler for LLMs.

gzip for Intelligence. 30% fewer tokens. 48% lower cost.

Transformer attention scales quadratically with prompt length: every token attends to every other token, so doubling the prompt quadruples compute. And you pay for every input token you send. Pareto strips the filler before it reaches the model.

INPUT

OUTPUT

review authentication middleware in Express server JWT verification logic in auth.ts check whether token expiration rejects expired sessions before they reach protected route handlers.

COMPRESSION: ---
COST: ---
LATENCY: ---
Get in Touch

How It Works

Every request passes through five stages before the response reaches your application.

CLASSIFY

Categorize

Prompt classified as RAG, agent, conversational, or system. Each category gets tailored compression thresholds.

4 categories
COMPRESS

Prune

Saliency kernel scores every token and strips natural-language filler. Code, math, entities, and operators are locked.

30.4% avg compression
ROUTE

Select Model

Request scored for complexity and routed to the cheapest model that can handle it. Same quality, fraction of the cost.

48.3% cost saved
GUARD

Validate Quality

Compressed prompt checked against quality thresholds. If degradation exceeds limits, compression is rolled back.

100% safety rate
FORWARD

Send & Stream

Dense prompt sent to chosen LLM. Response streamed back to client with no additional processing.

~6.5ms overhead
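The five stages can be sketched as a single proxy pipeline. This is an illustrative sketch only: every function name, threshold, and heuristic below is a hypothetical stand-in, not Pareto's implementation.

```python
# Hypothetical stand-ins for Pareto's classify/compress/route/guard stages.
THRESHOLDS = {"rag": 0.35, "agent": 0.30, "conversational": 0.40, "system": 0.25}
MAX_DEGRADATION = 0.01
FILLER = {"could", "you", "please", "really", "just", "kindly"}

def classify(prompt):                        # 1. CLASSIFY: pick a category
    if "document" in prompt.lower():
        return "rag"
    if "tool" in prompt.lower():
        return "agent"
    return "conversational"

def compress(prompt, threshold):             # 2. COMPRESS: strip filler tokens
    return " ".join(w for w in prompt.split() if w.lower() not in FILLER)

def route(prompt):                           # 3. ROUTE: cheapest capable model
    return "lightweight" if len(prompt.split()) < 20 else "premium"

def quality_delta(original, compressed):     # stand-in for the real quality check
    return 0.0

def handle_request(prompt):
    category = classify(prompt)
    compressed = compress(prompt, THRESHOLDS[category])
    if quality_delta(prompt, compressed) > MAX_DEGRADATION:
        compressed = prompt                  # 4. GUARD: roll back on degradation
    model = route(compressed)
    return model, compressed                 # 5. FORWARD: send to the chosen LLM
```

The guard stage is the key design choice: compression is speculative, and the original prompt is always available to fall back on.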

Dynamic Pruning

Not all tokens are equal. The saliency kernel assigns a weight to every token and applies lock rules before pruning. Protected categories are never touched — only natural-language filler is removed.

Could you please write a function called calculateTax(income) that returns income * 0.35 where income > $50,000 for John Smith
NER_LOCK

Named entities preserved

CODE_LOCK

Code blocks immune

OP_LOCK

Operators & math locked

STRUCT_LOCK

JSON, SQL, markup kept

PROTECTED: immune to pruning
FILLER: removed
SEMANTIC_SIMILARITY: 0.9993
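The lock rules can be sketched as a pre-pass over tokens: anything matching a protected pattern is kept unconditionally, and only unlocked tokens compete against the saliency threshold. A minimal sketch, with toy regex locks and a filler-word scorer standing in for the real saliency kernel:

```python
import re

# Toy lock rules standing in for Pareto's NER/CODE/OP/STRUCT locks.
LOCK_PATTERNS = [
    re.compile(r"\w+\(\w*\)"),      # CODE_LOCK (toy): call-like identifiers
    re.compile(r"[*+\-/<>=]"),      # OP_LOCK: operators
    re.compile(r"\$?\d"),           # OP_LOCK: numbers and dollar amounts
    re.compile(r"^[A-Z][a-z]+$"),   # NER_LOCK (crude): capitalized words
]
FILLER = {"could", "you", "please", "me", "a", "that", "the", "for", "would", "really"}

def is_locked(tok):
    if tok.lower() in FILLER:
        return False                # filler is never locked, even if capitalized
    return any(p.search(tok) for p in LOCK_PATTERNS)

def saliency(tok):
    """Stand-in scorer: filler gets low weight, everything else high."""
    return 0.1 if tok.lower() in FILLER else 0.9

def prune(prompt, threshold=0.5):
    return " ".join(t for t in prompt.split() if is_locked(t) or saliency(t) >= threshold)
```

Run against the example above, the politeness filler drops while the function name, the math, the threshold amount, and the name survive.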

Routing Tiers

The router scores request complexity and picks the cheapest model that can handle it.

TIER_1 — SIMPLE

Lookups, formatting, short Q&A. Routed to a lightweight model.

"What's the status of order #4821?" → lightweight
COST: lowest
TIER_2 — MODERATE

Summarization, analysis, RAG retrieval. Routed to a mid-range model.

"Summarize Q3 earnings and flag risks" → mid-range
COST: balanced
TIER_3 — COMPLEX

Reasoning, code generation, multi-step agents. Stays on premium.

"Write a recursive parser for nested JSON" → premium
COST: premium
PREMIUM_ONLY: $0.216 avg
ADAPTIVE_ROUTED: $0.112 avg
QUALITY_DELTA: +0.001
COST_SAVED: 48.3%
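Tier selection reduces to scoring complexity and walking the tiers cheapest-first. A sketch with a hypothetical keyword scorer in place of the real complexity model; the tier ceilings are illustrative:

```python
# Illustrative tier ceilings; Pareto's actual scorer and thresholds differ.
TIERS = [
    (0.3, "lightweight"),   # TIER_1: lookups, formatting, short Q&A
    (0.6, "mid-range"),     # TIER_2: summarization, analysis, RAG
    (1.0, "premium"),       # TIER_3: reasoning, codegen, multi-step agents
]
COMPLEX_HINTS = ("write", "recursive", "implement", "refactor", "parser")
MODERATE_HINTS = ("summarize", "analyze", "compare", "flag")

def complexity(prompt):
    """Toy complexity score in [0, 1] based on keyword hints."""
    p = prompt.lower()
    if any(h in p for h in COMPLEX_HINTS):
        return 0.9
    if any(h in p for h in MODERATE_HINTS):
        return 0.5
    return 0.1

def route(prompt):
    score = complexity(prompt)
    for ceiling, model in TIERS:    # walk tiers cheapest-first
        if score <= ceiling:
            return model
```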

The Scoring

Under the hood, a distilled BERT-based encoder with 110M parameters scores the semantic weight of every token. It runs on CPU only — roughly 10,000x smaller than the target LLM — so it adds negligible overhead. Tokens above the saliency threshold are kept; everything below is pruned.

ARCHITECTURE: BERT encoder
PARAMETERS: 110M
COMPUTE: CPU-only
VS_TARGET_LLM: ~10,000x smaller

The Output

Filler tokens dilute attention across the context window. By removing noise before it reaches the model, Pareto reduces attention loss — and in 42.9% of cases, the compressed prompt actually produces a better answer than the original.

IMPROVED: 42.9%
EQUIVALENT: 57.1%
DEGRADED >1%: 0%

Use Cases

Customer Support

Extend conversation history without context limits

Compress prior turns in multi-turn chat so your model sees the full conversation without hitting context limits. Maintain coherence across long support threads while cutting input costs.

30% avg token reduction per thread
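The pattern for long threads is to compress older turns while leaving the latest turn and the system prompt untouched. A minimal sketch, assuming a chat history in OpenAI-style message dicts; `compress_fn` stands in for the compression stage:

```python
def compress_history(messages, compress_fn):
    """Compress every turn except the system prompt and the latest message.

    `messages` is a list of {"role", "content"} dicts; `compress_fn` is a
    hypothetical stand-in for the proxy's compression stage.
    """
    out = []
    for i, msg in enumerate(messages):
        if i < len(messages) - 1 and msg["role"] != "system":
            out.append({**msg, "content": compress_fn(msg["content"])})
        else:
            out.append(msg)   # keep system prompt and latest turn verbatim
    return out
```

Keeping the newest turn intact matters because it carries the actual question; older turns mostly supply context, where filler is cheap to drop.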

Simple Integration

1-Line Change

from openai import OpenAI

client = OpenAI(
    base_url="https://proxy.pareto.ai/v1",
    api_key="sk-...",
)

Point your existing SDK at our proxy. No other code changes needed.

Full API Round-Trip

POST proxy.pareto.ai/v1/chat/completions
{
  "model": "gpt-4",
  "messages": [
    {
      "role": "user",
      "content": "I would really appreciate it if you could review the auth middleware..."
    }
  ]
}
response
{
  "choices": [ ... ],
  "usage": {
    "prompt_tokens": 14,
    "original_tokens": 22,
    "tokens_saved": 8,
    "compression_ratio": 0.36
  }
}
BENCHMARK_SANDBOX

Validated Results

Pareto evaluates real model outputs — not just token counts. Every compression decision is gated by a quality check. These numbers come from benchmark runs on real prompts and a 100-page SEC filing.

AVG_COMPRESSION: 30.4% across benchmark suite
QUALITY_IMPROVED: 42.9% of prompts scored higher
QUALITY_SAFE_RATE: 100% (zero degraded >1%)
ROUTING_COST_SAVED: 48.3% vs premium-only

Before / After Examples

Same model, same temperature. Baseline arm vs. Pareto-optimized arm.

CATEGORY: Redundancy-heavy classification
PROMPT_TYPE: user
ORIG_TOKENS: 158
OPT_TOKENS: 107
COMPRESSION: 32.3%
ORIG_QUALITY: 0.000
OPT_QUALITY: 1.000
DELTA: +1.000

Verbose filler around a simple classification task was stripped. The compressed prompt removed noise that was diluting model attention — the optimized version produced a correct, concise answer where the original did not.

BASELINE OUTPUT
Category: Technical I've categorized this issue as Technical because it involves an error code (503)… [+ lengthy commentary]
OPTIMIZED OUTPUT
Technical

Routing Benchmark

Same prompt text across all arms. Quality and cost measured per arm. Adaptive routing selects the cheapest model that meets quality thresholds.

ARM               AVG QUALITY   AVG COST
cheap_only        0.464         $0.011
mid_only          0.605         $0.083
premium_only      0.745         $0.216
adaptive_routed   0.746         $0.112
QUALITY_VS_PREMIUM: +0.001
COST_VS_PREMIUM: −48.3%
QUALITY_VS_CHEAP: +0.282
TIER_ALIGNMENT: 85.7%
REAL_DOCUMENT_BENCHMARK

Tested on a 100-Page SEC Filing

Pareto was benchmarked against the PepsiCo 2024 Annual Report (SEC EDGAR) — a real, publicly verifiable 100+ page financial document with dense tables, legal language, and cross-referenced disclosures.

Dense financial facts (specific numbers, table values) were correctly preserved — the quality gate blocked any cut that changed meaning. Redundant system instructions were compressed aggressively.

QUALITY_SAFE_RATE: 100%
IMPROVED_QUALITY: 33.3%
DEGRADED >1%: 0%
SYSTEM_PROMPT EXAMPLE
ORIG_TOKENS: 2,783
OPT_TOKENS: 2,158
SAVED: 625 / call

Let's start compressing.

Book with Us