
Inference Strategy

How Flowmux transforms raw emails into a structured life ledger, fields natural language queries, and knows when to ask for help.

Draft v3 · March 2026 · Internal

The big picture

Flowmux's inference architecture is built on a core principle: run locally by default, fall back to frontier models only with explicit user consent and only when local inference cannot produce a satisfactory result. The system should be honest about what it can and cannot do, and the user should always know when their data leaves the device.

The architecture separates probabilistic work (LLM-powered classification, extraction, natural language generation) from deterministic work (SQL queries, aggregation, pattern detection, rule evaluation). Probabilistic components produce structured artifacts. Deterministic components operate on those artifacts reliably and cheaply. This separation keeps the hot path fast and predictable while concentrating the expensive, uncertain work in clearly bounded stages.

Design principle
The LLM is used to define rules (through conversation), extract data (from email parsing), and generate natural language outputs (summaries, briefings). But rule execution, query evaluation, and pattern detection are deterministic — compiled specifications evaluated against structured data. No LLM in the hot path for per-email processing.

The life ledger

At the heart of the system is the life ledger — a structured, queryable database of events extracted from the user's email. Every incoming email is transformed into one or more typed events with a common envelope and a category-specific payload.

Event structure

Every event shares a common envelope: email_id, timestamp, sender, account (which email account it came from), category, subcategory, and confidence (how sure the classifier was). The payload varies by subcategory.

{
  "email_id": "abc123",
  "timestamp": "2026-03-15T14:22:00+05:30",
  "account": "personal-gmail",
  "category": "financial",
  "subcategory": "bank_transaction_alert",
  "confidence": 0.97,
  "payload": {
    "institution": "HDFC Bank",
    "account_last_four": "4421",
    "account_type": "credit_card",
    "transaction_type": "debit",
    "amount": 2840.00,
    "currency": "INR",
    "merchant": "Swiggy",
    "merchant_category": "dining",
    "transaction_date": "2026-03-15",
    "reference_number": "TXN847291"
  }
}

The life ledger is an event stream. Raw emails are preserved in .eml storage (handled by mailmux); the life ledger is the structured, queryable index on top. This is the substrate that everything else operates on — the rule engine, the accumulator, the query system, and the pattern detector.

The two-model pipeline

Transforming raw email into life ledger events requires two distinct inference tasks, handled by two purpose-built small models.

Model 1: The classifier

Given raw email text and metadata, the classifier outputs a category and subcategory label. This is fundamentally a "pick one from N" task — small models handle it well after fine-tuning.

The taxonomy should be two-level: a broad category (financial, travel, commerce, subscription, social, notification) and a subcategory within each (e.g., financial → bank_transaction_alert, credit_card_statement, investment_update). The classifier predicts both. Extraction prompts are keyed on the subcategory, with fallback to the broad category if the subcategory is uncertain.

Training strategy
Use a frontier LLM (Claude, GPT-4) as a teacher model to label thousands of real emails, then fine-tune a small student model to replicate that labeling. Supplement with synthetic emails generated by the frontier model for underrepresented categories. The frontier model is used once during training, on data you control. The deployed classifier runs entirely locally. This is knowledge distillation with a strong privacy boundary.

Model 2: The extractor

Given the classified email plus a category-specific prompt template, the extractor produces structured JSON matching a predefined schema. Each subcategory has a corresponding schema — a credit card statement has issuer, total_due, due_date; a shipping notification has carrier, tracking_number, estimated_delivery.

Single model, not fifty

The extractor is a single model trained across all categories, not a per-subcategory model. Training and maintaining 30–50 separate fine-tuned models would be an operational nightmare — every base model upgrade, pipeline improvement, or new email type would require retraining individual models. Deployment is also a problem: loading 50 different model weights, even small ones, is a resource burden. And siloed models can't share knowledge — the skill of "find the dollar amount in this email" is transferable across bank statements, receipts, and subscription renewals, but separate models can't leverage that.

A single model works because the prompt template does the heavy lifting of specialization. The model learns a general skill — "given an email and a JSON schema, extract the fields" — and the prompt provides the specific schema for each subcategory. During fine-tuning, the model sees examples from all categories, learning the full diversity of email formats. At inference time, the prompt says "extract these specific fields in this specific format" and the model fills them in. The schema in the prompt is essentially a very strong constraint that focuses the model's attention.

If measurable quality differences emerge across domains that can't be solved by improving prompt templates, consider a small number of domain-adapted variants — perhaps one variant strong on financial emails and another strong on logistics — rather than per-subcategory models. Two or three variants, not fifty. The classifier's broad category determines which variant handles the extraction. But start with a single model and only split based on evidence.

A prompt library (schema registry) maps each subcategory to its extraction template and expected JSON schema. The prompt includes the schema definition, example output, and any category-specific extraction hints. The extractor model sees the prompt plus the raw email and produces the structured payload. The same distillation strategy applies: generate training pairs using a frontier model, fine-tune the small model to replicate.
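
As a rough sketch (assuming a Python implementation; the subcategory names, field lists, and template wording below are illustrative, not the final registry), the prompt library might look like this:

# Hypothetical prompt library: maps each subcategory to its expected fields
# and extraction hints. All names and fields below are illustrative.
SCHEMA_REGISTRY = {
    "bank_transaction_alert": {
        "required": ["institution", "transaction_type", "amount", "currency"],
        "optional": ["merchant", "merchant_category", "transaction_date",
                     "account_last_four", "reference_number"],
        "hints": "Amounts are numbers without currency symbols; dates are ISO-8601.",
    },
    "shipping_notification": {
        "required": ["carrier", "tracking_number"],
        "optional": ["estimated_delivery", "merchant", "order_id"],
        "hints": "The tracking number usually appears near the carrier name.",
    },
}

PROMPT_TEMPLATE = (
    "Extract the following fields from the email below and return valid JSON.\n"
    "Required: {required}\nOptional: {optional}\nHints: {hints}\n\n"
    "Email:\n{email_body}"
)

def build_extraction_prompt(subcategory: str, email_body: str) -> str:
    entry = SCHEMA_REGISTRY[subcategory]
    return PROMPT_TEMPLATE.format(
        required=", ".join(entry["required"]),
        optional=", ".join(entry["optional"]),
        hints=entry["hints"],
        email_body=email_body,
    )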

The hot path

1. Email arrives via IMAP (local). mailmux creates an event in Postgres and stores the raw .eml.
2. Classifier model runs (local). Outputs category, subcategory, and a confidence score.
3. Extraction prompt selected (local). The prompt library maps subcategory → schema + template.
4. Extractor model runs (local). Produces a structured JSON payload matching the schema.
5. Life ledger event written (local). The structured event is stored in Postgres; an embedding is generated for semantic search.
6. Rule engine evaluates (local). Deterministic: compiled rules are checked against the structured data, and triggered actions execute.

The entire hot path is local. No API calls, no data leaving the machine. The frontier model is only involved during the offline training/fine-tuning phase.
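
Wired together, the hot path is little more than a sequence of local calls. A minimal sketch, with the classifier, extractor, storage, and rule engine passed in as callables (their names and signatures here are assumptions, not a fixed API):

from typing import Callable

# Hypothetical glue for the hot path. Every call is local; classify() and
# extract() wrap the two small models, the rest is deterministic Postgres work.
def process_email(
    raw_email: str,
    metadata: dict,
    classify: Callable[[str, dict], dict],       # -> {category, subcategory, confidence}
    build_prompt: Callable[[str, str], str],     # subcategory + email -> prompt
    extract: Callable[[str], dict],              # prompt -> structured payload
    store_event: Callable[[dict], None],         # write event + embedding
    evaluate_rules: Callable[[dict], None],      # compiled rules, deterministic
) -> dict:
    label = classify(raw_email, metadata)
    prompt = build_prompt(label["subcategory"], raw_email)
    event = {
        "email_id": metadata["email_id"],
        "timestamp": metadata["timestamp"],
        "account": metadata["account"],
        "category": label["category"],
        "subcategory": label["subcategory"],
        "confidence": label["confidence"],
        "payload": extract(prompt),
    }
    store_event(event)
    evaluate_rules(event)
    return event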

Fallback on extraction failure

If the local extractor produces low-confidence output or fails schema validation, and the user has granted permission, the system can fall back to a frontier LLM for that specific email. The fallback sends the raw email text (with user consent) and receives the structured extraction. These fallback cases also become training data for improving the local model over time.

When classification fails

The classifier will always output a confidence score. The system's behavior depends on where that score falls across three zones.

High confidence (above ~0.85)

Trust the classification, select the corresponding extraction template, and run the extractor. This is the happy path — the vast majority of emails from known senders with recognizable formats will land here.

Medium confidence (~0.5 to ~0.85)

The classifier has a guess but isn't sure — perhaps a financial email from an unfamiliar institution, or an unusual format from a known sender. Two strategies apply here.

First, try and verify: run extraction with the top candidate classification and evaluate the output (see extraction quality assessment below). If extraction succeeds with good quality, the classification was probably right. If extraction fails or produces thin results, try the second-best classification candidate.

Second, generic extraction: run a catch-all prompt that says "extract whatever structured data you can find in this email" without assuming a specific schema. The output is less structured, but it captures the basics — sender, subject, any dates mentioned, any monetary amounts, any entities. Better than nothing, and it populates the life ledger with at least a searchable record.

Low confidence (below ~0.5)

The classifier genuinely doesn't know what it's looking at — a novel email type, or something that doesn't fit any category (a personal conversation, a joke forwarded by a friend). The system still runs generic extraction to capture basic metadata, but flags the email as "unclassified."

If the user has permitted frontier fallback, unclassified emails are good candidates for batch processing — periodically collect them and ask a frontier model to classify and extract. The results improve the user's life ledger retroactively and become training data for the local classifier. Over time, the unclassified bucket should shrink.
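
Pulled together, the three zones amount to a small routing function. A sketch under assumed thresholds and helper names (run_extraction, run_generic_extraction, and queue_for_frontier_batch are placeholders for the components described above):

HIGH_CONFIDENCE = 0.85   # illustrative thresholds, to be tuned on real data
LOW_CONFIDENCE = 0.50

def route_by_confidence(candidates: list[dict], raw_email: str, frontier_permitted: bool,
                        run_extraction, run_generic_extraction, queue_for_frontier_batch) -> dict:
    top = candidates[0]                          # classifier candidates, best first
    if top["confidence"] >= HIGH_CONFIDENCE:
        # Happy path: trust the label and run its extraction template.
        return run_extraction(top["subcategory"], raw_email)
    if top["confidence"] >= LOW_CONFIDENCE:
        # Try-and-verify: extract with the top candidate; if quality is poor,
        # try the runner-up before falling back to generic extraction.
        for candidate in candidates[:2]:
            result = run_extraction(candidate["subcategory"], raw_email)
            if result.get("grade") in ("A", "B"):
                return result
        return run_generic_extraction(raw_email)
    # Low confidence: capture basic metadata, mark unclassified, and queue
    # for frontier batch processing if the user has allowed it.
    result = run_generic_extraction(raw_email)
    result["unclassified"] = True
    if frontier_permitted:
        queue_for_frontier_batch(raw_email)
    return result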

Misclassification (high confidence, wrong answer)

The hardest case: a promotional email from a bank classified as a financial alert, or a marketing newsletter classified as an important notification. The confidence is high but the classification is wrong. This is caught primarily by the extraction quality assessment — if the extractor can't find the expected fields for a "financial alert" in what is actually a marketing email, the mismatch surfaces the error. Consistency checks between the classification and the extraction results are the main defense here.

Every email produces something
The life ledger should never have gaps. Even unclassified emails get basic metadata (sender, date, subject, "unclassified"). The ledger has varying levels of richness, with the system being transparent about its confidence in each record.

Extraction quality assessment

The system needs to know when extraction output is good, marginal, or broken. This requires both structural and semantic validation — multiple layers of checks, each catching different failure modes.

Layer 1: Structural validation (deterministic)

Every subcategory has a defined JSON schema with required fields, field types, and value constraints. After extraction, validate the output against the schema. Did the model produce valid JSON? Are all required fields present? Are the types correct (amount is a number, date is a valid date, currency is a recognized code)? Are the values within plausible ranges (a credit card bill of ₹0.003 or ₹9,999,999,999 is probably wrong)?

Define completeness tiers per schema. A credit card statement extraction might have required fields (total_due, due_date, card_last_four) and optional fields (minimum_due, statement_period, rewards_points). All required fields present and valid = full extraction. Required fields present but some optional fields missing = partial extraction (still useful). Required fields missing = failed extraction.
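
A minimal sketch of the structural check and the completeness tiers, assuming the required and optional field lists come from the schema registry; the example validators are illustrative:

import re

# Example per-field validators; real ones live alongside each schema.
VALIDATORS = {
    "total_due": lambda v: isinstance(v, (int, float)) and 0 < v < 10_000_000,
    "due_date": lambda v: bool(re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(v))),
    "card_last_four": lambda v: bool(re.fullmatch(r"\d{4}", str(v))),
}

def completeness_tier(payload: dict, required: list[str], optional: list[str],
                      validators: dict = VALIDATORS) -> str:
    def ok(field: str) -> bool:
        value = payload.get(field)
        if value is None:
            return False
        check = validators.get(field)
        return bool(check(value)) if check else True

    if not all(ok(f) for f in required):
        return "failed"            # required fields missing or invalid
    if all(ok(f) for f in optional):
        return "full"              # everything present and plausible
    return "partial"               # still useful, written with lower confidence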

Layer 2: Cross-field consistency (deterministic)

Catches subtler problems than schema validation alone. If transaction_type is "debit" but the amount is negative, something's off. If due_date is before statement_date, that's suspicious. If the merchant is "HDFC Bank" but the subcategory is "shipping_notification," the classifier or extractor went wrong somewhere. These are deterministic rules written per-schema — cheap to run, high signal.

Layer 3: Value verification against source (deterministic)

Catches hallucinated data — the model might confidently produce "amount": 1247.50 when the email actually says ₹12,475.00 (a decimal point error). Cross-reference extracted values against simple regex patterns in the raw email text. If the email contains the string "12,475" but the extracted amount is 1247.50, flag it. This doesn't require an LLM — it's pattern matching against the source text to verify that extracted values actually appear in the email.
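
A sketch of that check for monetary amounts; it is a heuristic that flags likely mismatches for review rather than proving correctness:

import re

def amount_appears_in_source(amount: float, email_text: str) -> bool:
    # Strip currency symbols and thousands separators so "₹12,475.00"
    # and "12475.00" compare equal.
    normalised = re.sub(r"[₹$€,\s]", "", email_text)
    candidates = {f"{amount:.2f}", f"{amount:g}"}
    if float(amount).is_integer():
        candidates.add(str(int(amount)))
    return any(c in normalised for c in candidates)

# amount_appears_in_source(12475.00, "Your card was charged ₹12,475.00")  -> True
# amount_appears_in_source(1247.50,  "Your card was charged ₹12,475.00")  -> False (flag it)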

Layer 4: Extraction confidence scoring (probabilistic)

During inference, derive confidence from the model's token probabilities (log-probs) during generation. Low token probability on a value suggests the model is uncertain about that specific extraction. This gives per-field confidence — you might trust the merchant name (high confidence) but not the amount (low confidence) from the same extraction. This requires model-level instrumentation but is powerful for identifying exactly which fields to trust.

Layer 5: Semantic coherence (probabilistic)

Does the extracted data make sense given the email's text? Compute the semantic similarity between the extracted payload (serialized to text) and the original email using the embedding model. If the extraction talks about a "flight to Mumbai" but the email is about a grocery delivery, the embeddings will be distant. This is a lightweight check since embeddings are already being computed for the life ledger's semantic search index.
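
A sketch of that check, assuming embed() wraps whatever local embedding model already powers the search index; the 0.45 threshold is a guess to be tuned:

import math

def coherence_score(payload: dict, email_text: str, embed) -> float:
    # Serialise the payload to flat text, embed both sides, and compare.
    serialised = " ".join(f"{k}: {v}" for k, v in sorted(payload.items()))
    a, b = embed(serialised), embed(email_text)
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

# Flag for review if coherence_score(payload, email_text, embed) < 0.45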

Quality grades

The assessment layers combine to produce an extraction grade:

A · Full extraction (record)
All required fields present, types valid, cross-field consistency passes, values verified against source text. Written to the life ledger with high confidence.

B · Partial extraction (record)
Some fields extracted well, others missing or uncertain. Written to the life ledger with field-level confidence scores. The system can still operate on partial data — if it has the amount and merchant but not the category, the accumulator can still track spending.

C · Failed extraction (fallback)
Required fields missing or values don't verify. Falls back to generic extraction (capture whatever is possible), flags for frontier model batch processing if permitted, and logs as training data for improving the local model.

D · Conflicted extraction (review)
The classifier and extractor disagree — e.g., classified as financial but extraction found travel fields. System re-extracts with alternative classifications. If the conflict persists, flags for review or frontier batch processing.

Fielding natural language queries

Once the life ledger exists, users will ask questions about it. These queries vary enormously in complexity, and the inference strategy must handle each tier appropriately.

Tier 1: Structured lookups

Questions like "How much did I spend on Swiggy last month?" or "When does my Figma subscription renew?" are SQL queries against the life ledger. The LLM's job is to translate the natural language question into the right query — text-to-SQL.

This doesn't require generating arbitrary SQL. Define a constrained query language — 15–20 query templates covering common patterns: sum of amount where merchant = X and date between Y and Z; most recent event where subcategory = X and field = Y; count of events grouped by category in the last N days. The local model maps natural language to one of these templates with the right parameters filled in. This is a classification + slot-filling task — exactly what small fine-tuned models excel at.

Key insight
The local model isn't reasoning about the data. It's routing — mapping a question to a query template. The actual computation happens in SQL, deterministically. The model only touches the question, never the data itself.
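
A sketch of what a few of those templates could look like against a Postgres-backed ledger; the table and column names (ledger_events, payload, timestamp) are assumptions about the schema, and the router model only emits a template name plus parameters:

# Illustrative query templates. The local router model never writes SQL;
# it picks one of these names and fills in the parameter slots.
QUERY_TEMPLATES = {
    "sum_by_merchant_and_period": """
        SELECT COALESCE(SUM((payload->>'amount')::numeric), 0) AS total
        FROM ledger_events
        WHERE payload->>'merchant' = %(merchant)s
          AND timestamp BETWEEN %(start)s AND %(end)s
    """,
    "latest_event_for_service": """
        SELECT payload
        FROM ledger_events
        WHERE subcategory = %(subcategory)s
          AND payload->>'service' = %(service)s
        ORDER BY timestamp DESC
        LIMIT 1
    """,
    "count_by_category_last_n_days": """
        SELECT category, COUNT(*) AS events
        FROM ledger_events
        WHERE timestamp > now() - make_interval(days => %(days)s)
        GROUP BY category
    """,
}

# Router output for "How much did I spend on Swiggy last month?":
# {"template": "sum_by_merchant_and_period",
#  "params": {"merchant": "Swiggy", "start": "2026-02-01", "end": "2026-02-28"}}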

Tier 2: Narration of results

Once a structured query returns results, those results need to be narrated in natural language. The input to the narration model is a structured result set plus the original question. The output is a conversational response in the secretary's voice.

This is a highly tractable fine-tuning task. The model isn't reasoning — it's narrating. You compute aggregations in code (totals, averages, max values, trends) and hand the model a pre-digested result like:

{
  "query": "Swiggy spending last 30 days",
  "total": 4380,
  "count": 14,
  "average": 312.86,
  "max": 680,
  "max_date": "2026-03-08",
  "merchant": "Swiggy",
  "period": "last 30 days"
}

The model turns this into: "I can see 14 Swiggy transactions in the last 30 days totalling ₹4,380. Your highest single order was ₹680 on March 8th. The average is about ₹310 per order." Every claim in that response is present in the input; the narration adds voice, not facts.

Training data is generated by distillation: feed structured query results to a frontier model with a persona prompt ("respond as a concise, warm personal secretary"), collect thousands of result→narration pairs, fine-tune the local model. You're training on style as much as correctness — the secretary's voice, tone, and personality get baked in.

Tier 3: Semantic search

Not all questions map cleanly to SQL. "That email about the thing from my bank last week" needs fuzzy matching. A hybrid retrieval approach handles this: structured filtering first (date ranges, categories, amounts — pure SQL), then semantic search over the filtered subset using embeddings.

Embedding models are small (under 500MB for models like nomic-embed-text or bge-small-en) and produce excellent results. The embedding index over the life ledger is tiny — even 100,000 events produce a trivially small vector index that runs in milliseconds on any hardware. The combination of structured filtering + semantic search gives precise results without heavy inference.
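
A sketch of the hybrid flow, assuming fetch_filtered() runs the SQL filter and returns events with their stored embeddings, and embed() wraps the local embedding model:

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(question: str, filters: dict, fetch_filtered, embed, top_k: int = 5) -> list[dict]:
    # Step 1: structured filtering (date range, category, sender) in plain SQL.
    candidates = fetch_filtered(**filters)           # [(event, stored_embedding), ...]
    # Step 2: semantic re-ranking of the filtered subset.
    q = embed(question)
    ranked = sorted(candidates, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [event for event, _ in ranked[:top_k]]

# hybrid_search("that email about the thing from my bank last week",
#               {"category": "financial", "since": "2026-03-08"},
#               fetch_filtered, embed)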

Tier 4: Complex reasoning and open-ended questions

Questions like "Am I spending more than I used to? What's changed?" or "What should I know before my Mumbai trip?" require synthesis, judgment, and narrative across many data points. This is where local models genuinely struggle and frontier models earn their keep.

The frontier fallback

When the user permits it, complex queries can be routed to a frontier model. The critical design decision: the frontier model never sees raw emails. It sees the life ledger — structured, pre-aggregated data that has already been extracted locally.

How it works

The local system does all the analytical heavy lifting first. It runs structured queries: total spending by month, spending by category, new recurring charges, price changes, trend lines. This produces a context bundle — a curated, pre-aggregated summary designed to give the frontier model what it needs to reason.

Monthly spending totals:
  Oct: ₹68,400 | Nov: ₹72,100 | Dec: ₹91,300
  Jan: ₹78,200 | Feb: ₹85,600 | Mar (partial): ₹62,400

Category trend (last 3 months vs prior 3 months):
  Dining: ₹8,200/mo → ₹12,400/mo (+51%)
  Online shopping: ₹15,100/mo → ₹22,800/mo (+51%)
  Subscriptions: ₹3,200/mo → ₹4,800/mo (+50%)
  Groceries: ₹6,800/mo → ₹7,100/mo (+4%, stable)

New recurring charges (last 90 days):
  Cursor Pro: $20/mo (first charge Jan 15)
  YouTube Premium family: ₹399/mo (first charge Feb 2)
  Notion: $10/mo (first charge Feb 18)

Subscription price changes:
  Netflix: ₹649/mo → ₹799/mo (changed Dec 1)

The frontier model receives the context bundle plus the user's question plus a system prompt establishing the secretary persona. Its job is to interpret and narrate — to look at these numbers and say what a thoughtful human business manager would say. It identifies signal versus noise, separates one-time spikes from trends, connects subscription additions to the overall increase, and makes practical suggestions.

The privacy distinction
The user is consenting to send structured summaries of their data, not their emails. The raw email might contain personal conversations, signatures, attachments. The context bundle is clean analytical data: merchants, amounts, dates, categories. This is a meaningful privacy boundary.

The context bundle is auditable

Before any data goes to a frontier model, the user should be able to see exactly what will be sent. Not a legal disclaimer — a literal preview: "To answer this question, I'd send the following summary to [provider]. No raw emails or message content will be included. Proceed?"

This builds trust and helps the user understand the life ledger as a thing that exists — a structured representation of their email life that's separate from the emails themselves. That mental model helps them reason about privacy and understand the system's capabilities.

Pattern detection

Proactive pattern detection — "your AWS bill has increased three months in a row" — doesn't need an LLM at all for the detection part. These are analytical queries run as scheduled batch jobs against the life ledger, entirely locally.

Detectable patterns (deterministic)

Monotonic increases in recurring charges over N months. Subscriptions where the associated service's emails have stopped arriving (paying for something you don't use). Seasonal spending variations. Merchants where average transaction amounts have changed significantly. Duplicate subscriptions (two streaming music services). Newsletters with zero opens over 90 days. Bills that arrive without a subsequent payment confirmation.

Each detected pattern gets packaged as a structured finding:

{
  "pattern": "increasing_recurring_charge",
  "entity": "AWS",
  "values": [680, 720, 847],
  "months": ["Jan", "Feb", "Mar"],
  "trend": "+24.6% over 3 months",
  "projected_next": "~$950-1000"
}
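
The detection itself is plain arithmetic over monthly aggregates. A sketch of one detector, assuming the monthly totals per merchant come from a simple GROUP BY query:

def detect_increasing_recurring(merchant: str, months: list[str],
                                monthly_totals: list[float], min_months: int = 3) -> dict | None:
    # Flag a merchant whose charge has risen every month for min_months months.
    if len(monthly_totals) < min_months:
        return None
    window = monthly_totals[-min_months:]
    if not all(later > earlier for earlier, later in zip(window, window[1:])):
        return None
    change = (window[-1] - window[0]) / window[0] * 100
    return {
        "pattern": "increasing_recurring_charge",
        "entity": merchant,
        "values": window,
        "months": months[-min_months:],
        "trend": f"+{change:.1f}% over {min_months} months",
    }

# detect_increasing_recurring("AWS", ["Jan", "Feb", "Mar"], [680, 720, 847])
# -> {..., "trend": "+24.6% over 3 months"}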

Narration and prioritization

The local narration model can generate templated notifications for individual patterns. For prioritizing across multiple detected patterns — deciding that the AWS increase is worth flagging but the 3% grocery increase is not — the frontier model (if permitted) acts as an editorial layer. It receives the batch of findings and decides which are worth surfacing and how to frame them.

Graceful degradation

The system must know when it's out of its depth. Generating a confident but wrong answer is worse than admitting uncertainty — it destroys the trust that the entire product is built on.

Confidence signals

Several signals indicate whether the local system can answer well. How many life ledger events matched the query? (Zero or very few = probably can't answer.) Did the text-to-query model produce a high-confidence template match, or did it fall back to a generic search? Is the question open-ended or specific?

A lightweight answerability classifier can be trained to look at the question plus retrieved context and predict whether the local model can produce a satisfactory response. This doesn't need to be perfect — it just needs to catch the cases where the system would otherwise confabulate.

Response self-evaluation

After the local model generates a response, a quick check: does it contain specific data from the retrieved events (numbers, dates, entity names), or is it vague and generic? A simple heuristic works — if the response doesn't reference concrete data from the query results, it's likely hand-waving. This is detectable in code without another model call.
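
A sketch of that heuristic; the hit threshold is a guess to be tuned against real transcripts:

import re

def response_is_grounded(response: str, query_result: dict, min_hits: int = 2) -> bool:
    # Collect the concrete values the response should be drawing on.
    concrete = set()
    for value in query_result.values():
        if isinstance(value, (int, float)):
            concrete.add(f"{value:g}")          # "4380"
            concrete.add(f"{value:,.0f}")       # "4,380"
        elif isinstance(value, str):
            concrete.add(value)
    # Strip currency symbols and separators so "₹4,380" matches "4380".
    flat = re.sub(r"[₹$,]", "", response)
    hits = sum(1 for v in concrete if re.sub(r"[₹$,]", "", v) in flat)
    return hits >= min_hits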

Honest framing

When the system detects low confidence, it doesn't just say "I don't know." It says what it can do and offers a path forward:

"I can see 3 emails related to your Mumbai trip — the IndiGo flight change, a hotel confirmation from Marriott, and a cab booking from Ola. But I'm not sure I have the full picture. Want me to list what I found, or would you like me to do a deeper analysis?"

Or, for a question that's genuinely beyond local capability:

"That's a broad question and I want to give you a good answer. I can tell you the facts — your food delivery spending, your dining-out transactions, the trend over the last 3 months. But if you want me to really analyze your spending patterns and give you advice, I'd need to use a more capable model which would involve sending some of your data to an external AI service. Want me to go ahead with what I can do locally, or would you prefer the fuller analysis?"

This is also an upgrade path
The user hits a wall, the system honestly explains why, and the option to enable frontier fallback is right there. Not as a sales pitch, but as a genuine offer: "I could help you better if you let me." A self-hosted user who has this experience repeatedly may decide the managed or hosted tier is worth it.

The feedback loop

User corrections are gold for improving the local models. When Priya says "that summary includes transfers to my own account — exclude those," she's providing a training signal. The correction tells you the extractor miscategorized a transfer as an expense, or the rule definition needs refinement.

Log corrections and use them in two ways: as immediate few-shot examples appended to extraction prompts (improving quality now), and as training data for periodic re-fine-tuning of the local models (improving quality permanently). Frontier fallback cases — where the local model failed and the frontier model succeeded — are also training data for closing the gap over time.

Hardware and hosting tiers

Hosted SaaS

Full control over hardware. GPU-backed inference. Can run larger models (quantized 70B-class) whose conversational quality approaches frontier models. Classification and extraction still use small models for speed. The premium experience.

Managed hosting

Commodity cloud VMs (4–8 vCPU, 16–32GB RAM). Quantized 7–8B models via llama.cpp or vLLM. Handles classification, extraction, structured queries, templated narration well. GPU tier available for users willing to invest more.

Self-hosted

User provides hardware. Minimum: 8GB RAM for classification and extraction. Recommended: 16GB for full conversational features. Optimal: GPU for best experience. Clear documentation, multiple quantization options, graceful scaling.

Across all tiers, the same principle applies: the baseline must be functional and honest. It answers the questions it can answer well, and gracefully declines or escalates the ones it can't. A correct, slightly formulaic response beats a hallucinated eloquent one.

Testing and quality assurance

Small local models are more prone to confident fabrication than frontier models — less world knowledge, fewer parameters to encode uncertainty. And unlike a general-purpose chatbot where the user can shrug off a bad answer, Flowmux makes claims about the user's own data. "You spent ₹4,380 on Swiggy" is either verifiably right or verifiably wrong. There is no room for hand-waving.

The structural advantage
Because Flowmux operates on structured data that it produced, every claim the narration model makes is checkable against the life ledger. The model says "14 transactions totalling ₹4,380"? Count the rows and sum the amounts. The model says "your highest order was ₹680 on March 8th"? Verify with a MAX query. The ground truth is always sitting right there in Postgres. The test harness exploits this property aggressively.

Classifier testing

The classifier's output is discrete and finite — a label, not free-form text. Build a held-out evaluation set of manually labeled emails (labeled by the frontier model during distillation, then human-verified for a subset). Measure precision, recall, and F1 per category and subcategory. Track confusion matrices to understand which categories get confused — if bank marketing emails keep getting classified as financial alerts, you know exactly where to add training data. Run this evaluation automatically on every model update and block deployment if metrics regress.

Distribution shift monitoring. Banks change email templates. New services appear. The user encounters institutions the model hasn't seen. Track classifier confidence distributions in production — if average confidence drops or the proportion of low-confidence classifications climbs, the model is encountering unfamiliar patterns and needs retraining.

Extractor testing

The extractor's output is structured JSON with many fields, each independently right or wrong. The test harness evaluates at multiple granularities.

Field-level accuracy. For each extraction, compare every extracted field against a ground-truth annotation. Track per-field accuracy across the entire eval set. The model might be excellent at extracting merchant names (98%) but struggle with transaction reference numbers (75%). That's actionable — add more training examples for the weak field, or lower its confidence threshold.

Record-level completeness. Is the overall extraction usable? Track what percentage of emails in each subcategory achieve full, partial, or failed extraction. If credit card statements are at 95% full extraction but insurance renewals are at 60%, you know where to invest.

Extraction stability. Run the same email through the extractor multiple times (with temperature > 0). Do you get the same answer? If the amount comes back as ₹12,475 in one run and ₹1,247.50 in another, that's a reliability problem even if one of them is correct. For high-stakes fields like monetary amounts, consider running extraction multiple times and only accepting the result if there's consensus.
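
For those high-stakes fields, consensus can be mechanical. A sketch, with run_extractor standing in for a call to the local extractor at temperature > 0:

from collections import Counter

def consensus_field(prompt: str, field: str, run_extractor, runs: int = 3):
    # Run extraction several times and accept a value only on a clear majority.
    values = []
    for _ in range(runs):
        payload = run_extractor(prompt)
        if payload.get(field) is not None:
            values.append(payload[field])
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    agreement = count / runs
    return (value, agreement) if agreement > 0.5 else (None, agreement)

# consensus_field(prompt, "amount", run_extractor) -> (12475.00, 1.0) or (None, 0.33)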

Source grounding. The value verification against source text we defined in the extraction quality assessment isn't just a production check — it's a testing metric. Run it across the entire eval set and measure what percentage of extracted values can be traced back to the source email. Any extraction that produces values not present in the source text is, by definition, a hallucination. This metric should be tracked and never allowed to regress.

Narration model testing

This is where the hallucination risk is highest. The model receives structured data and is supposed to describe it faithfully. The test harness verifies that every factual claim in the narrated response is grounded in the input data.

Automated faithfulness checking. For each test case, take the structured input (the query result JSON) and the model's natural language output. Extract all numerical claims from the output (using regex or a small parser), extract all entity names, and check each one against the input data. If the output says "₹4,380" and the input has "total": 4380, that's a match. If the output says "₹4,830" (a transposition), that's a factual error. If the output mentions "BigBasket" but the input doesn't contain that entity, that's a hallucination.
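
A sketch of the numeric side of that check; it allows a small tolerance so deliberate rounding ("about ₹310" for 312.86) isn't flagged, while transpositions and fabricated figures are:

import re

def _numbers(value) -> set[float]:
    # Recursively collect every number in the structured input.
    found: set[float] = set()
    if isinstance(value, (int, float)):
        found.add(float(value))
    elif isinstance(value, dict):
        for v in value.values():
            found |= _numbers(v)
    elif isinstance(value, (list, tuple)):
        for v in value:
            found |= _numbers(v)
    elif isinstance(value, str):
        for m in re.findall(r"\d+(?:\.\d+)?", value.replace(",", "")):
            found.add(float(m))
    return found

def unsupported_claims(structured_input: dict, response: str, tol: float = 0.02) -> list[float]:
    allowed = _numbers(structured_input)
    claims = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", response.replace(",", ""))]
    return [c for c in claims
            if not any(abs(c - a) <= tol * max(abs(a), 1.0) for a in allowed)]

# unsupported_claims(result_json, "totalling ₹4,830") -> [4830.0]  (transposed digits)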

Judge-model evaluation. Use a frontier model as an automated evaluator: give it the structured input and the narrated output and ask "does this response accurately represent the data? List any claims that aren't supported by the input." This judge-model pattern works well for automated quality evaluation at scale and catches subtle errors that regex-based checks miss.

The cardinal metric
For narration, the key metric is faithfulness — what percentage of factual claims in the output are verifiable from the input. This is more important than fluency or style. A boring but accurate response ("You had 14 Swiggy transactions totalling ₹4,380") is always preferable to an engaging but fabricated one.

Query routing testing

The text-to-SQL model's output is deterministic once the query template is selected. Build a test set of natural language questions paired with expected query templates and parameters. "How much did I spend on Swiggy last month?" should map to {template: "sum_by_merchant_and_period", merchant: "Swiggy", period: "last_month"}. Measure exact match accuracy on template selection and parameter extraction.

Paraphrase robustness. "How much did I spend on Swiggy last month?" and "What was my Swiggy spending in February?" and "Total Swiggy orders last month?" should all map to the same query. Build the test set with multiple phrasings of each intent and measure consistency. A model that handles the canonical phrasing but breaks on natural variations is not ready for production.
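
A sketch of how the routing test set could be organised, with each intent carrying its expected template plus several phrasings (the template and parameter names are assumptions):

ROUTING_CASES = [
    {
        "expected": {"template": "sum_by_merchant_and_period",
                     "params": {"merchant": "Swiggy", "period": "last_month"}},
        "phrasings": [
            "How much did I spend on Swiggy last month?",
            "What was my Swiggy spending in February?",
            "Total Swiggy orders last month?",
        ],
    },
    # ... one entry per intent, each with several paraphrases
]

def routing_accuracy(route, cases=ROUTING_CASES) -> float:
    # route() is the text-to-query model under test.
    total = hits = 0
    for case in cases:
        for phrasing in case["phrasings"]:
            total += 1
            if route(phrasing) == case["expected"]:
                hits += 1
    return hits / total if total else 0.0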

Graceful degradation testing

The answerability classifier needs both positive and negative test cases. Positive: simple lookups with clear life ledger matches (should answer). Negative: open-ended questions, questions about data that doesn't exist, questions requiring reasoning beyond local capability (should decline). The metric is a balance — high recall on "should decline" cases (don't generate garbage) without excessive false positives on "should answer" cases (don't refuse questions you can handle).

The three-layer test architecture

1. Offline evaluation suite (gate). Runs against every model update before deployment. Covers all component-level tests — classifier accuracy, extractor field-level accuracy, narration faithfulness, query routing exact match. A model that regresses on any metric doesn't ship.
2. Shadow mode / canary (observe). Runs new models alongside current production on real traffic (the user's actual emails) and compares outputs without serving the new model's results. Catches distribution shift problems that offline eval sets miss — real-world emails are messier than test data.
3. Production monitoring (ongoing). Tracks classifier confidence distributions, extraction completeness rates, narration faithfulness scores (via automated spot-checking), query routing success rates, and the rate of graceful degradation triggers. A spike in degradation triggers means something changed.

The overarching principle
Every probabilistic output should be checkable against a deterministic ground truth. The classifier's label can be checked against the extraction's success. The extraction's values can be checked against the source email text. The narration's claims can be checked against the structured data it was given. The query router's template can be checked by executing the query and seeing if it returns relevant results. At every stage, the system has the information it needs to catch its own mistakes. The test harness automates that self-checking at scale.

Summary of inference boundaries

Always local
Email classification, data extraction, life ledger population, structured queries, rule evaluation, pattern detection (analytics), embedding generation. These never leave the device regardless of tier.

Local with quality ceiling
Result narration (SQL → natural language), weekly summaries, individual pattern notifications. Local models produce functional B+ quality. Frontier models produce A+ quality if permitted. Both are acceptable.

Frontier-preferred
Cross-email correlation, open-ended analysis, multi-pattern prioritization, spending advice. Local models attempt with honest confidence assessment. Frontier models receive pre-aggregated context bundles (never raw emails) when permitted.
···

Compute locally. Reason locally. Know your limits. Ask for help honestly. Never ship the emails.