From eb0310fc2d816192c52cb3469ec9800bf169385f Mon Sep 17 00:00:00 2001
From: Yanxin Lu <ylu@meta.com>
Date: Wed, 4 Mar 2026 15:05:44 -0800
Subject: [PATCH] Compute confidence from decision history instead of LLM

---
 scripts/email_processor/README.md     | 117 +++++++++++++++-----------
 scripts/email_processor/classifier.py |   2 +-
 2 files changed, 71 insertions(+), 48 deletions(-)
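
The confidence rule that the README changes below describe under "Confidence Computation" can be sketched in a few lines of Python. This is a minimal illustration, not the code in `decision_store.py`: the function name `sketch_confidence`, the explicit `history` argument, the entry keys `sender_email`, `action`, and `tags`, and the sample address are all assumptions made for the example, as is the choice to treat an empty tag list as matching nothing.

```python
def sketch_confidence(sender_email, action, tags, history):
    """Hypothetical sketch of the README's confidence rule (0-100 scale)."""
    current = set(tags)
    matches = []
    for entry in history:
        # Step 1a: only decisions for the same sender address can match.
        if entry["sender_email"] != sender_email:
            continue
        past = set(entry["tags"])
        smaller = min(len(current), len(past))
        if smaller == 0:
            continue  # assumption: empty tag lists never match
        # Step 1b: require at least 50% overlap relative to the smaller tag set.
        if len(current & past) / smaller >= 0.5:
            matches.append(entry)

    # Step 5: with no matching history, fall back to 50 (below the threshold).
    if not matches:
        return 50

    # Step 2: percentage of matches that chose the suggested action.
    agreement = 100 * sum(e["action"] == action for e in matches) / len(matches)
    # Step 3: each match raises the ceiling by 10 points, up to 100.
    cap = min(10 * len(matches), 100)
    # Step 4: final confidence is the lower of the two.
    return min(agreement, cap)


# Made-up history: nine identical "delete" decisions for one sender.
sample_tags = ["promotion", "social", "notification"]
sample_history = [
    {"sender_email": "news@example.com", "action": "delete", "tags": sample_tags}
    for _ in range(9)
]
nine = sketch_confidence("news@example.com", "delete", sample_tags, sample_history)
eight = sketch_confidence("news@example.com", "delete", sample_tags, sample_history[:8])
```

Under these assumptions, `nine` is 90 and `eight` is 80, which is why the README puts the 85% threshold at nine consistent decisions.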

diff --git a/scripts/email_processor/README.md b/scripts/email_processor/README.md
index 2d80a23..d0a1e50 100644
--- a/scripts/email_processor/README.md
+++ b/scripts/email_processor/README.md
@@ -1,6 +1,6 @@
 # Email Processor
 
-Learning-based mailbox cleanup using Himalaya (IMAP) + Ollama (local LLM). Classifies emails, learns from your decisions over time, and gradually automates common actions.
+Learning-based mailbox cleanup using Himalaya (IMAP) + Ollama (local LLM). Classifies emails with category tags, computes confidence from decision history, and gradually automates common actions.
 
 ## Prerequisites
 
@@ -25,22 +25,22 @@ ollama pull kamekichi128/qwen3-4b-instruct-2507:latest
 
 ## How It Works
 
-The system has two phases: a **learning phase** where it builds up knowledge from your decisions, and a **steady state** where it handles most emails automatically.
+The system separates **classification** (what the LLM does) from **confidence** (computed from your decision history). The LLM picks an action and category tags. Confidence is computed by matching the email's signature `(sender_email, tags)` against past decisions.
 
-### Learning Phase (first ~20 decisions)
+### Early On (few decisions)
 
-The confidence threshold is automatically raised to 95%. Most emails get queued.
-
-1. **Cron runs `scan`.** For each unseen email, the classifier uses Qwen3's general knowledge (no history yet) to suggest an action. Most come back at 60-80% confidence — below the 95% threshold — so they get saved to `pending_emails.json` with the suggestion attached. A few obvious spam emails might hit 95%+ and get auto-deleted.
+1. **Cron runs `scan`.** For each email, the LLM suggests an action and assigns tags from a fixed taxonomy. Since there's no history yet, `compute_confidence` returns 50% (below the 85% threshold), so everything gets queued.
 
 2. **You run `review list`.** It prints what's pending:
    ```
      1. [msg_f1d43ea3]  Subject: New jobs matching your profile
-        From: LinkedIn    Suggested: delete (82%)
+        From: LinkedIn
+        Tags: [promotion, social, notification]
+        Suggested: delete (50%)
      2. [msg_60c56a87]  Subject: Your order shipped
-        From: Amazon      Suggested: archive (78%)
-     3. [msg_ebd24205]  Subject: Meeting tomorrow at 3pm
-        From: Coworker    Suggested: keep (70%)
+        From: Amazon
+        Tags: [shipping, confirmation, notification]
+        Suggested: archive (50%)
    ```
 
 3. **You act on them.** Either individually or in bulk:
@@ -49,28 +49,28 @@ The confidence threshold is automatically raised to 95%. Most emails get queued.
    ./email-processor.sh review 2 archive    # agree with suggestion
    ./email-processor.sh review accept       # accept all suggestions at once
    ```
-   Each command executes via himalaya, appends to `decision_history.json`, and marks the pending entry as done.
+   Each command executes via himalaya and appends the decision, along with the email's tags, to `decision_history.json`.
 
-4. **Next scan is smarter.** The classifier now has few-shot examples in the prompt:
-   ```
-   History for linkedin.com: delete 2 times
-   --- Past decisions ---
-   From: LinkedIn | Subject: New jobs matching your profile -> delete
-   From: Amazon | Subject: Your package delivered -> archive
-   ---
-   ```
-   Confidence scores climb. You keep reviewing. History grows.
+4. **Next scan is smarter.** Confidence grows as consistent history accumulates for each sender+tags signature.
 
-### Steady State (20+ decisions)
+### Steady State (9+ consistent decisions per sender)
 
-The threshold drops to the configured 75%. The classifier has rich context.
+- **Repeat senders** with consistent tag signatures reach 85%+ confidence and get auto-acted during `scan`. They never touch the pending queue.
+- **New or ambiguous senders** start at 50% and get queued.
+- **You occasionally run `review list`** to handle stragglers — each decision further builds history.
+- **`stats` shows your automation rate** climbing over time.
 
-- **Repeat senders** (LinkedIn, Amazon, Uber) get auto-acted at 85-95% confidence during `scan`. They never touch the pending queue.
-- **New or ambiguous senders** may fall below 75% and get queued.
-- **You occasionally run `review list`** to handle stragglers — each decision further improves future classifications.
-- **`stats` shows your automation rate** climbing: 60%, 70%, 80%+.
+### Confidence Computation
 
-The pending queue shrinks over time. It's not a backlog — it's an ever-narrowing set of emails the system hasn't learned to handle yet.
+Confidence is computed by `decision_store.compute_confidence(sender_email, action, tags)`:
+
+1. **Find matches**: past decisions with the same sender email AND at least 50% tag overlap, measured against the smaller tag set (`len(shared_tags) / min(len(current_tags), len(past_tags)) >= 0.5`).
+2. **Agreement**: the fraction of matches that chose the same action the LLM is suggesting.
+3. **Match-count cap**: each match adds 10 percentage points to the maximum confidence (1 match = max 10%, 5 = 50%, 10+ = 100%).
+4. **Final confidence** = `min(agreement, cap)`.
+5. **No matches** = 50% (below threshold, gets queued).
+
+This means reaching the 85% auto-action threshold requires at least 9 consistent decisions from the same sender with overlapping tags: 8 matches cap confidence at 80%, while 9 raise the cap to 90%.
 
 ## Usage
 
@@ -109,11 +109,32 @@ Or call Python directly: `python main.py scan --dry-run`
 | `mark_read` | Add `\Seen` flag, stays in inbox |
 | `label:<name>` | Move to named folder (created if needed) |
 
-## Auto-Action Criteria
+## Tag Taxonomy
 
-Scan auto-acts when the classifier's confidence meets the threshold. During the learning phase (fewer than `bootstrap_min_decisions` total decisions, default 20), a higher threshold of 95% is used automatically. Once enough history accumulates, the configured `confidence_threshold` (default 75%) takes over.
+The LLM assigns 3-5 tags from this fixed list to each email:
 
-This means on day one, only very obvious emails (spam, clear promotions) get auto-acted. As you review emails and build history, the system gradually handles more on its own.
+```
+receipt, invoice, payment, billing, shipping, delivery,
+promotion, discount, marketing, newsletter, notification,
+security, social, reminder, confirmation, update, alert,
+personal, account, subscription, travel
+```
+
+Tags serve one purpose: making signature matching work for confidence computation. They need to be specific enough to distinguish different email types from the same sender that you'd treat differently (e.g., `[account, security]` for a password reset vs `[promotion, marketing]` for a promo, both from the same service).
+
+### Refining the Tag Taxonomy
+
+The tag list should stay fixed and manually curated. Automatic expansion risks breaking history matching — if old entries use different tags than new ones, confidence computation degrades.
+
+To identify gaps, periodically review your decision history for cases where the same sender got inconsistent actions. Feed the history and current tags to the LLM and ask what new tag would distinguish them. For example:
+
+> "Here are my current tags: [list]. Here are history entries where sender X got different actions: [entries]. Suggest a new tag that would separate the email types that deserve different actions."
+
+Guidelines:
+- **Add tags** when a broad tag (like `notification`) is the only thing two different email types share, and you'd handle them differently.
+- **Don't rename tags** — old history entries would stop matching. If you must rename, keep both the old and new tag in the taxonomy.
+- **Don't delete tags** unless you're sure no important history depends on them for distinguishing email types. Old entries with deleted tags become slightly less useful but don't cause wrong matches.
+- **After significant taxonomy changes**, consider wiping `decision_history.json` and `pending_emails.json` and rebuilding from scratch, since old entries without the new tags won't contribute to confidence anyway.
 
 ## Configuration
 
@@ -129,8 +150,7 @@ This means on day one, only very obvious emails (spam, clear promotions) get aut
     "max_body_length": 1000
   },
   "automation": {
-    "confidence_threshold": 75,
-    "bootstrap_min_decisions": 20
+    "confidence_threshold": 85
   }
 }
 ```
@@ -140,8 +160,7 @@ This means on day one, only very obvious emails (spam, clear promotions) get aut
 | `ollama.host` | Ollama server URL. Default `http://localhost:11434`. |
 | `ollama.model` | Ollama model to use for classification. |
 | `rules.max_body_length` | Max characters of email body sent to the LLM. Longer bodies are truncated. Keeps prompt size and latency down. |
-| `automation.confidence_threshold` | Minimum confidence (0-100) for auto-action in steady state. Emails below this get queued for review. Lower = more automation, higher = more manual review. |
-| `automation.bootstrap_min_decisions` | Number of decisions needed before leaving the learning phase. During the learning phase, the threshold is raised to 95% regardless of `confidence_threshold`. Set to 0 to skip the learning phase entirely. |
+| `automation.confidence_threshold` | Minimum confidence (0-100) for auto-action. Emails below this get queued for review. At 85%, you need at least 9 consistent decisions from the same sender with overlapping tags before auto-action kicks in. |
 
 ## Testing
 
@@ -173,15 +192,16 @@ ollama list  # should show kamekichi128/qwen3-4b-instruct-2507:latest
 ```
 email_processor/
   main.py              # Entry point — scan/review/stats subcommands
-  classifier.py        # LLM prompt builder + response parser
-  decision_store.py    # Decision history storage + few-shot retrieval
+  classifier.py        # LLM prompt builder + response parser, tag taxonomy
+  decision_store.py    # Decision history, confidence computation, few-shot retrieval
   config.json          # Ollama + automation settings
   email-processor.sh   # Shell wrapper (activates venv, forwards args)
   data/
     pending_emails.json    # Queue of emails awaiting review
-    decision_history.json  # Past decisions (few-shot learning data)
+    decision_history.json  # Past decisions (few-shot learning + confidence data)
   logs/
     YYYY-MM-DD.log         # Daily processing logs
+    llm_YYYY-MM-DD.log     # Full LLM prompt/response logs
 ```
 
 ## Design Decisions
@@ -199,31 +219,34 @@ The tradeoff is a subprocess spawn per operation, but for email volumes (tens pe
 
 Every command takes its full input as arguments, acts, and exits. No `input()` calls, no interactive loops. This makes the system compatible with cron/OpenClaw and composable with other scripts. The pending queue on disk (`pending_emails.json`) is the shared state between scan and review invocations.
 
+### LLM classifies, history decides confidence
+
+The LLM produces an action and category tags but NOT a confidence score. Confidence is computed from decision history by matching email signatures `(sender_email, tags)` against past decisions. This avoids the problem of LLM-reported confidence clustering around 85-95% regardless of actual certainty, which makes threshold systems ineffective.
+
+The match-count cap (each match adds 10% to the maximum) replaces the old bootstrap/learning-phase logic. Confidence grows organically per-sender as history accumulates, rather than using a global gate.
+
 ### decision_history.json as the "database"
 
-`data/decision_history.json` is the only persistent state that matters for learning. It's a flat JSON array — every decision (user or auto) is appended as an entry. The classifier reads the whole file on each email to find relevant few-shot examples via relevance scoring.
-
-The pending queue (`pending_emails.json`) is transient — emails pass through it and get marked "done". Logs are for debugging. The decision history is what the system learns from.
+`data/decision_history.json` is the only persistent state that matters for learning. It's a flat JSON array — every decision (user or auto) is appended as an entry with tags. The classifier reads the whole file on each email to find relevant few-shot examples via relevance scoring, and `compute_confidence` scans it for matching signatures.
 
 A flat JSON file works fine for hundreds or low thousands of decisions. SQLite would make sense if the history grows past ~10k entries and the linear scan becomes noticeable, or if concurrent writes from multiple processes become necessary. Neither applies at current scale.
 
 ### Few-shot learning via relevance scoring
 
-Rather than sending the entire decision history to the LLM, `decision_store.get_relevant_examples()` scores each past decision against the current email using three signals:
-- Exact sender domain match (+3 points)
-- Recipient address match (+2 points)
+Rather than sending the entire decision history to the LLM, `decision_store.get_relevant_examples()` scores each past decision against the current email using two signals:
+- Exact sender email address match (+3 points)
 - Subject keyword overlap (+1 per shared word, stop-words excluded)
 
 The top 5 most relevant examples are injected into the prompt as few-shot demonstrations. This keeps the prompt small while giving the model the most useful context.
 
-### Conservative auto-action
+### Fixed tag taxonomy
 
-Auto-action uses a single confidence threshold with an adaptive learning phase. When the decision history has fewer than `bootstrap_min_decisions` (default 20) entries, the threshold is raised to 95% — only very obvious classifications get auto-acted. Once enough history accumulates, the configured `confidence_threshold` (default 75%) takes over. This lets the system start working from day one while being cautious until it has enough examples to learn from.
+Tags are defined in `classifier.py` as `TAG_TAXONOMY` — a manually curated list of 21 categories. The LLM must pick from this list (invalid tags are silently dropped). The taxonomy should stay fixed to keep history matching stable. See "Refining the Tag Taxonomy" above for when and how to update it.
 
 ### `keep` means unread
 
-The `keep` action is a deliberate no-op — it leaves the email unread in the inbox, meaning it needs human attention. This is distinct from `mark_read`, which dismisses low-priority emails without moving them. During scan, queued emails are marked as read to prevent re-processing, but that's a scan-level concern separate from the `keep` action itself.
+The `keep` action is a deliberate no-op — it leaves the email unread in the inbox, meaning it needs human attention. This is distinct from `mark_read`, which dismisses low-priority emails without moving them.
 
 ### Fail-safe classification
 
-If the LLM call fails (Ollama down, model not loaded, timeout), the classifier returns `action="keep"` with `confidence=0`. This guarantees the email gets queued for manual review rather than being auto-acted upon. The system never auto-trashes an email it couldn't classify.
+If the LLM call fails (Ollama down, model not loaded, timeout), the classifier returns `action="keep"` with empty tags. Empty tags produce 50% confidence (below threshold), so the email gets queued for manual review rather than being auto-acted upon. The system never auto-trashes an email it couldn't classify.
diff --git a/scripts/email_processor/classifier.py b/scripts/email_processor/classifier.py
index 41e58c8..8abde77 100644
--- a/scripts/email_processor/classifier.py
+++ b/scripts/email_processor/classifier.py
@@ -102,7 +102,7 @@ def _build_prompt(email_data, config):
     parts.append(
         "Respond in this exact format (nothing else):\n"
         "Action: [delete|archive|keep|mark_read|label:<name>]\n"
-        f"Tags: [comma-separated tags from: {tags_list}]\n"
+        f"Tags: [comma-separated tags from: {tags_list}] (at least 3, max 5)\n"
         "Summary: [one sentence summary of the email]\n"
         "Reason: [brief explanation for your classification]"
     )