Compare commits

...

2 Commits

Author SHA1 Message Date
Yanxin Lu
eb0310fc2d Compute confidence from decision history instead of LLM 2026-03-04 15:05:44 -08:00
Yanxin Lu
64e28b55d1 Compute confidence from decision history instead of LLM 2026-03-04 14:23:50 -08:00
5 changed files with 203 additions and 131 deletions

View File

@@ -1,6 +1,6 @@
# Email Processor
-Learning-based mailbox cleanup using Himalaya (IMAP) + Ollama (local LLM). Classifies emails, learns from your decisions over time, and gradually automates common actions.
+Learning-based mailbox cleanup using Himalaya (IMAP) + Ollama (local LLM). Classifies emails with category tags, computes confidence from decision history, and gradually automates common actions.
## Prerequisites
@@ -25,22 +25,22 @@ ollama pull kamekichi128/qwen3-4b-instruct-2507:latest
## How It Works
-The system has two phases: a **learning phase** where it builds up knowledge from your decisions, and a **steady state** where it handles most emails automatically.
+The system separates **classification** (what the LLM does) from **confidence** (computed from your decision history). The LLM picks an action and category tags. Confidence is computed by matching the email's signature `(sender_email, tags)` against past decisions.
-### Learning Phase (first ~20 decisions)
+### Early On (few decisions)
-The confidence threshold is automatically raised to 95%. Most emails get queued.
-1. **Cron runs `scan`.** For each unseen email, the classifier uses Qwen3's general knowledge (no history yet) to suggest an action. Most come back at 60-80% confidence — below the 95% threshold — so they get saved to `pending_emails.json` with the suggestion attached. A few obvious spam emails might hit 95%+ and get auto-deleted.
+1. **Cron runs `scan`.** For each email, the LLM suggests an action and assigns tags from a fixed taxonomy. Since there's no history yet, `compute_confidence` returns 50% (below the 85% threshold), so everything gets queued.
2. **You run `review list`.** It prints what's pending:
```
1. [msg_f1d43ea3] Subject: New jobs matching your profile
-From: LinkedIn Suggested: delete (82%)
+From: LinkedIn
+Tags: [promotion, social, notification]
+Suggested: delete (50%)
2. [msg_60c56a87] Subject: Your order shipped
-From: Amazon Suggested: archive (78%)
-3. [msg_ebd24205] Subject: Meeting tomorrow at 3pm
-From: Coworker Suggested: keep (70%)
+From: Amazon
+Tags: [shipping, confirmation, notification]
+Suggested: archive (50%)
```
3. **You act on them.** Either individually or in bulk:
@@ -49,28 +49,28 @@ The confidence threshold is automatically raised to 95%. Most emails get queued.
./email-processor.sh review 2 archive # agree with suggestion
./email-processor.sh review accept # accept all suggestions at once
```
-Each command executes via himalaya, appends to `decision_history.json`, and marks the pending entry as done.
+Each command executes via himalaya and appends to `decision_history.json` with tags.
-4. **Next scan is smarter.** The classifier now has few-shot examples in the prompt:
-```
-History for linkedin.com: delete 2 times
---- Past decisions ---
-From: LinkedIn | Subject: New jobs matching your profile -> delete
-From: Amazon | Subject: Your package delivered -> archive
----
-```
-Confidence scores climb. You keep reviewing. History grows.
+4. **Next scan is smarter.** Confidence grows as consistent history accumulates for each sender+tags signature.
-### Steady State (20+ decisions)
+### Steady State (10+ consistent decisions per sender)
-The threshold drops to the configured 75%. The classifier has rich context.
+- **Repeat senders** with consistent tag signatures reach 85%+ confidence and get auto-acted during `scan`. They never touch the pending queue.
+- **New or ambiguous senders** start at 50% and get queued.
+- **You occasionally run `review list`** to handle stragglers — each decision further builds history.
+- **`stats` shows your automation rate** climbing over time.
-- **Repeat senders** (LinkedIn, Amazon, Uber) get auto-acted at 85-95% confidence during `scan`. They never touch the pending queue.
-- **New or ambiguous senders** may fall below 75% and get queued.
-- **You occasionally run `review list`** to handle stragglers — each decision further improves future classifications.
-- **`stats` shows your automation rate** climbing: 60%, 70%, 80%+.
+### Confidence Computation
-The pending queue shrinks over time. It's not a backlog — it's an ever-narrowing set of emails the system hasn't learned to handle yet.
+Confidence is computed by `decision_store.compute_confidence(sender_email, action, tags)`:
+1. **Find matches** — past decisions with the same sender email AND at least 50% tag overlap (`shared_tags / min(current_tags, past_tags) >= 0.5`).
+2. **Agreement** — what fraction of matches chose the same action the LLM is suggesting.
+3. **Match-count cap** — each match adds 10% to the maximum confidence (1 match = max 10%, 5 = 50%, 10+ = 100%).
+4. **Final confidence** = `min(agreement, cap)`.
+5. **No matches** = 50% (below threshold, gets queued).
+This means reaching the 85% auto-action threshold requires at least 9 consistent decisions from the same sender with overlapping tags.
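The arithmetic above can be checked with a small sketch — a re-derivation of the documented formula, not the actual `decision_store` code:

```python
def confidence(matches: int, agreeing: int) -> int:
    """Sketch of the documented formula: min(agreement %, 10% per match)."""
    if matches == 0:
        return 50  # no history: neutral score, always queued
    agreement = round(agreeing / matches * 100)
    cap = min(matches * 10, 100)  # each match raises the ceiling by 10%
    return min(agreement, cap)

print(confidence(8, 8))   # 80 -- unanimous but capped, still queued at an 85% threshold
print(confidence(9, 9))   # 90 -- the 9th consistent decision crosses 85%
print(confidence(12, 6))  # 50 -- plenty of history, but only 50% agreement
```

Eight unanimous decisions cap out at 80%, which is why the ninth is the one that unlocks auto-action at the 85% threshold.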
## Usage
@@ -109,11 +109,32 @@ Or call Python directly: `python main.py scan --dry-run`
| `mark_read` | Add `\Seen` flag, stays in inbox |
| `label:<name>` | Move to named folder (created if needed) |
-## Auto-Action Criteria
+## Tag Taxonomy
-Scan auto-acts when the classifier's confidence meets the threshold. During the learning phase (fewer than `bootstrap_min_decisions` total decisions, default 20), a higher threshold of 95% is used automatically. Once enough history accumulates, the configured `confidence_threshold` (default 75%) takes over.
+The LLM assigns 3-5 tags from this fixed list to each email:
-This means on day one, only very obvious emails (spam, clear promotions) get auto-acted. As you review emails and build history, the system gradually handles more on its own.
+```
+receipt, invoice, payment, billing, shipping, delivery,
+promotion, discount, marketing, newsletter, notification,
+security, social, reminder, confirmation, update, alert,
+personal, account, subscription, travel
+```
+Tags serve one purpose: making signature matching work for confidence computation. They need to be specific enough to distinguish different email types from the same sender that you'd treat differently (e.g., `[account, security]` for a password reset vs `[promotion, marketing]` for a promo, both from the same service).
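As an illustration of the 50%-overlap rule (a sketch mirroring the matching described above, not the shipped code), the password-reset and promo examples produce disjoint signatures:

```python
def tags_match(current: list[str], past: list[str]) -> bool:
    """Documented rule: shared_tags / min(len(current), len(past)) >= 0.5."""
    if not current or not past:
        return False
    shared = len(set(current) & set(past))
    return shared / min(len(current), len(past)) >= 0.5

# Same service, different email types: no shared tags, histories stay separate
print(tags_match(["account", "security"], ["promotion", "marketing"]))  # False
# Two shipping notifications share 2 of 3 tags: they match
print(tags_match(["shipping", "delivery", "notification"],
                 ["shipping", "notification", "confirmation"]))  # True
```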
+### Refining the Tag Taxonomy
+The tag list should stay fixed and manually curated. Automatic expansion risks breaking history matching — if old entries use different tags than new ones, confidence computation degrades.
+To identify gaps, periodically review your decision history for cases where the same sender got inconsistent actions. Feed the history and current tags to the LLM and ask what new tag would distinguish them. For example:
+> "Here are my current tags: [list]. Here are history entries where sender X got different actions: [entries]. Suggest a new tag that would separate the email types that deserve different actions."
+Guidelines:
+- **Add tags** when a broad tag (like `notification`) is the only thing two different email types share, and you'd handle them differently.
+- **Don't rename tags** — old history entries would stop matching. If you must rename, keep both the old and new tag in the taxonomy.
+- **Don't delete tags** unless you're sure no important history depends on them for distinguishing email types. Old entries with deleted tags become slightly less useful but don't cause wrong matches.
+- **After significant taxonomy changes**, consider wiping `decision_history.json` and `pending_emails.json` and rebuilding from scratch, since old entries without the new tags won't contribute to confidence anyway.
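The "same sender, inconsistent actions" audit can be sketched in a few lines over the history entries (field names follow the `record_decision()` layout; the helper name is hypothetical):

```python
from collections import defaultdict

def inconsistent_senders(history: list[dict]) -> dict[str, list[str]]:
    """Report senders whose history entries received more than one distinct action."""
    actions = defaultdict(set)
    for entry in history:
        actions[entry.get("sender", "")].add(entry.get("action", ""))
    return {s: sorted(a) for s, a in actions.items() if len(a) > 1}

# Hypothetical entries shaped like decision_history.json records
history = [
    {"sender": "noreply@service.com", "action": "delete"},
    {"sender": "noreply@service.com", "action": "keep"},
    {"sender": "orders@amazon.com", "action": "archive"},
]
print(inconsistent_senders(history))  # {'noreply@service.com': ['delete', 'keep']}
```

Senders flagged here are the ones worth feeding to the LLM prompt above.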
## Configuration
@@ -129,8 +150,7 @@ This means on day one, only very obvious emails (spam, clear promotions) get aut
"max_body_length": 1000
},
"automation": {
"confidence_threshold": 75,
"bootstrap_min_decisions": 20
"confidence_threshold": 85
}
}
```
@@ -140,8 +160,7 @@ This means on day one, only very obvious emails (spam, clear promotions) get aut
| `ollama.host` | Ollama server URL. Default `http://localhost:11434`. |
| `ollama.model` | Ollama model to use for classification. |
| `rules.max_body_length` | Max characters of email body sent to the LLM. Longer bodies are truncated. Keeps prompt size and latency down. |
-| `automation.confidence_threshold` | Minimum confidence (0-100) for auto-action in steady state. Emails below this get queued for review. Lower = more automation, higher = more manual review. |
-| `automation.bootstrap_min_decisions` | Number of decisions needed before leaving the learning phase. During the learning phase, the threshold is raised to 95% regardless of `confidence_threshold`. Set to 0 to skip the learning phase entirely. |
+| `automation.confidence_threshold` | Minimum confidence (0-100) for auto-action. Emails below this get queued for review. At 85%, you need at least 9 consistent decisions from the same sender with overlapping tags before auto-action kicks in. |
## Testing
@@ -173,15 +192,16 @@ ollama list # should show kamekichi128/qwen3-4b-instruct-2507:latest
```
email_processor/
main.py # Entry point — scan/review/stats subcommands
-classifier.py # LLM prompt builder + response parser
-decision_store.py # Decision history storage + few-shot retrieval
+classifier.py # LLM prompt builder + response parser, tag taxonomy
+decision_store.py # Decision history, confidence computation, few-shot retrieval
config.json # Ollama + automation settings
email-processor.sh # Shell wrapper (activates venv, forwards args)
data/
pending_emails.json # Queue of emails awaiting review
-decision_history.json # Past decisions (few-shot learning data)
+decision_history.json # Past decisions (few-shot learning + confidence data)
logs/
YYYY-MM-DD.log # Daily processing logs
llm_YYYY-MM-DD.log # Full LLM prompt/response logs
```
## Design Decisions
@@ -199,31 +219,34 @@ The tradeoff is a subprocess spawn per operation, but for email volumes (tens pe
Every command takes its full input as arguments, acts, and exits. No `input()` calls, no interactive loops. This makes the system compatible with cron/OpenClaw and composable with other scripts. The pending queue on disk (`pending_emails.json`) is the shared state between scan and review invocations.
+### LLM classifies, history decides confidence
+The LLM produces an action and category tags but NOT a confidence score. Confidence is computed from decision history by matching email signatures `(sender_email, tags)` against past decisions. This avoids the problem of LLMs clustering confidence around 85-95% regardless of actual certainty, making threshold systems ineffective.
+The match-count cap (each match adds 10% to the maximum) replaces the old bootstrap/learning-phase logic. Confidence grows organically per-sender as history accumulates, rather than using a global gate.
### decision_history.json as the "database"
-`data/decision_history.json` is the only persistent state that matters for learning. It's a flat JSON array — every decision (user or auto) is appended as an entry. The classifier reads the whole file on each email to find relevant few-shot examples via relevance scoring.
The pending queue (`pending_emails.json`) is transient — emails pass through it and get marked "done". Logs are for debugging. The decision history is what the system learns from.
+`data/decision_history.json` is the only persistent state that matters for learning. It's a flat JSON array — every decision (user or auto) is appended as an entry with tags. The classifier reads the whole file on each email to find relevant few-shot examples via relevance scoring, and `compute_confidence` scans it for matching signatures.
A flat JSON file works fine for hundreds or low thousands of decisions. SQLite would make sense if the history grows past ~10k entries and the linear scan becomes noticeable, or if concurrent writes from multiple processes become necessary. Neither applies at current scale.
### Few-shot learning via relevance scoring
-Rather than sending the entire decision history to the LLM, `decision_store.get_relevant_examples()` scores each past decision against the current email using three signals:
-- Exact sender domain match (+3 points)
-- Recipient address match (+2 points)
+Rather than sending the entire decision history to the LLM, `decision_store.get_relevant_examples()` scores each past decision against the current email using two signals:
+- Exact sender email address match (+3 points)
- Subject keyword overlap (+1 per shared word, stop-words excluded)
The top 5 most relevant examples are injected into the prompt as few-shot demonstrations. This keeps the prompt small while giving the model the most useful context.
-### Conservative auto-action
+### Fixed tag taxonomy
-Auto-action uses a single confidence threshold with an adaptive learning phase. When the decision history has fewer than `bootstrap_min_decisions` (default 20) entries, the threshold is raised to 95% — only very obvious classifications get auto-acted. Once enough history accumulates, the configured `confidence_threshold` (default 75%) takes over. This lets the system start working from day one while being cautious until it has enough examples to learn from.
+Tags are defined in `classifier.py` as `TAG_TAXONOMY` — a manually curated list of 21 categories. The LLM must pick from this list (invalid tags are silently dropped). The taxonomy should stay fixed to keep history matching stable. See "Refining the Tag Taxonomy" above for when and how to update it.
### `keep` means unread
-The `keep` action is a deliberate no-op — it leaves the email unread in the inbox, meaning it needs human attention. This is distinct from `mark_read`, which dismisses low-priority emails without moving them. During scan, queued emails are marked as read to prevent re-processing, but that's a scan-level concern separate from the `keep` action itself.
+The `keep` action is a deliberate no-op — it leaves the email unread in the inbox, meaning it needs human attention. This is distinct from `mark_read`, which dismisses low-priority emails without moving them.
### Fail-safe classification
-If the LLM call fails (Ollama down, model not loaded, timeout), the classifier returns `action="keep"` with `confidence=0`. This guarantees the email gets queued for manual review rather than being auto-acted upon. The system never auto-trashes an email it couldn't classify.
+If the LLM call fails (Ollama down, model not loaded, timeout), the classifier returns `action="keep"` with empty tags. Empty tags produce 50% confidence (below threshold), so the email gets queued for manual review rather than being auto-acted upon. The system never auto-trashes an email it couldn't classify.

View File

@@ -5,7 +5,10 @@ Classifier - LLM-based email classification with learning.
This module builds a rich prompt for the local Ollama model (Qwen3) that
includes few-shot examples from past user decisions, per-sender statistics,
and a list of known labels. The model returns a structured response with
-an action, confidence score, summary, and reason.
+an action, category tags, summary, and reason.
+Confidence is NOT produced by the LLM — it is computed externally from
+decision history by decision_store.compute_confidence().
The prompt structure:
1. System instructions (action definitions)
@@ -13,7 +16,7 @@ The prompt structure:
3. Sender statistics ("linkedin.com: deleted 8 times, kept 2 times")
4. Few-shot examples (top 5 most relevant past decisions)
5. The email to classify (subject, sender, recipient, body preview)
-6. Output format specification
+6. Output format specification (action, tags, summary, reason)
"""
import time
@@ -24,6 +27,15 @@ import decision_store
LOGS_DIR = Path(__file__).parent / "logs"
+TAG_TAXONOMY = [
+"receipt", "invoice", "payment", "billing",
+"shipping", "delivery",
+"promotion", "discount", "marketing", "newsletter",
+"notification", "security", "social",
+"reminder", "confirmation", "update", "alert",
+"personal", "account", "subscription", "travel",
+]
def _build_prompt(email_data, config):
"""Assemble the full classification prompt with learning context.
@@ -36,8 +48,8 @@ def _build_prompt(email_data, config):
# Gather learning context from decision history
examples = decision_store.get_relevant_examples(email_data, n=10)
-sender_domain = decision_store._extract_domain(email_data.get("sender", ""))
-sender_stats = decision_store.get_sender_stats(sender_domain) if sender_domain else {}
+sender_email = decision_store._extract_email_address(email_data.get("sender", ""))
+sender_stats = decision_store.get_sender_stats(sender_email) if sender_email else {}
known_labels = decision_store.get_known_labels()
# /no_think disables Qwen3's chain-of-thought, giving faster + shorter output
@@ -63,7 +75,7 @@ def _build_prompt(email_data, config):
stats_str = ", ".join(
f"{action} {count} times" for action, count in sender_stats.items()
)
parts.append(f"\nHistory for {sender_domain}: {stats_str}\n")
parts.append(f"\nHistory for {sender_email}: {stats_str}\n")
# Section 4: Few-shot examples (top 5 most relevant past decisions)
if examples:
@@ -86,10 +98,11 @@ def _build_prompt(email_data, config):
)
# Section 6: Required output format
tags_list = ", ".join(TAG_TAXONOMY)
parts.append(
"Respond in this exact format (nothing else):\n"
"Action: [delete|archive|keep|mark_read|label:<name>]\n"
"Confidence: [0-100]\n"
f"Tags: [comma-separated tags from: {tags_list}] (at least 3, max 5)\n"
"Summary: [one sentence summary of the email]\n"
"Reason: [brief explanation for your classification]"
)
@@ -97,18 +110,19 @@ def _build_prompt(email_data, config):
return "\n".join(parts)
-def _log_llm(prompt, output, email_data, action, confidence, duration):
+def _log_llm(prompt, output, email_data, action, tags, duration):
"""Log the full LLM prompt and response to logs/llm_YYYY-MM-DD.log."""
LOGS_DIR.mkdir(exist_ok=True)
log_file = LOGS_DIR / f"llm_{datetime.now().strftime('%Y-%m-%d')}.log"
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
subject = email_data.get("subject", "(No Subject)")[:60]
sender = email_data.get("sender", "(Unknown)")[:60]
tags_str = ", ".join(tags)
with open(log_file, "a", encoding="utf-8") as f:
f.write(f"{'=' * 70}\n")
f.write(f"[{timestamp}] {subject}\n")
f.write(f"From: {sender} | Result: {action} @ {confidence}% | {duration:.1f}s\n")
f.write(f"From: {sender} | Result: {action} tags=[{tags_str}] | {duration:.1f}s\n")
f.write(f"{'-' * 70}\n")
f.write(f"PROMPT:\n{prompt}\n")
f.write(f"{'-' * 70}\n")
@@ -121,17 +135,19 @@ def _parse_response(output):
Expected format (one per line):
Action: delete
-Confidence: 92
+Tags: promotion, marketing, newsletter
Summary: Promotional offer from retailer
Reason: Clearly a marketing email with discount offer
-Falls back to safe defaults (keep, 50% confidence) on parse failure.
+Falls back to safe defaults (keep, empty tags) on parse failure.
"""
action = "keep"
-confidence = 50
+tags = []
summary = "No summary"
reason = "Unknown"
+valid_tags = set(TAG_TAXONOMY)
for line in output.strip().split("\n"):
line = line.strip()
if line.startswith("Action:"):
@@ -139,25 +155,26 @@ def _parse_response(output):
valid_actions = {"delete", "archive", "keep", "mark_read"}
if raw_action in valid_actions or raw_action.startswith("label:"):
action = raw_action
elif line.startswith("Confidence:"):
try:
confidence = int(line.replace("Confidence:", "").strip().rstrip("%"))
confidence = max(0, min(100, confidence)) # clamp to 0-100
except ValueError:
confidence = 50
elif line.startswith("Tags:"):
raw_tags = line.replace("Tags:", "").strip()
tags = [
t.strip().lower()
for t in raw_tags.split(",")
if t.strip().lower() in valid_tags
]
elif line.startswith("Summary:"):
summary = line.replace("Summary:", "").strip()[:200]
elif line.startswith("Reason:"):
reason = line.replace("Reason:", "").strip()
-return action, confidence, summary, reason
+return action, tags, summary, reason
def classify_email(email_data, config):
"""Classify an email using the local LLM with few-shot learning context.
Connects to Ollama, sends the assembled prompt, and parses the response.
On any error, falls back to "keep" with 0% confidence so the email
On any error, falls back to "keep" with empty tags so the email
gets queued for manual review rather than auto-acted upon.
Args:
@@ -165,7 +182,7 @@ def classify_email(email_data, config):
config: full config dict (needs ollama.model and rules.max_body_length).
Returns:
-Tuple of (action, confidence, summary, reason, duration_seconds).
+Tuple of (action, tags, summary, reason, duration_seconds).
"""
import ollama
@@ -177,15 +194,15 @@ def classify_email(email_data, config):
# Low temperature for consistent classification
response = ollama.generate(model=model, prompt=prompt, options={"temperature": 0.1})
output = response["response"]
-action, confidence, summary, reason = _parse_response(output)
+action, tags, summary, reason = _parse_response(output)
except Exception as e:
# On failure, default to "keep" with 0 confidence -> always queued
# On failure, default to "keep" with empty tags -> always queued
output = f"ERROR: {e}"
action = "keep"
-confidence = 0
+tags = []
summary = "Classification failed"
reason = f"error - {str(e)[:100]}"
duration = time.time() - start_time
-_log_llm(prompt, output, email_data, action, confidence, duration)
-return action, confidence, summary, reason, duration
+_log_llm(prompt, output, email_data, action, tags, duration)
+return action, tags, summary, reason, duration

View File

@@ -8,7 +8,6 @@
"check_unseen_only": true
},
"automation": {
"confidence_threshold": 75,
"bootstrap_min_decisions": 30
"confidence_threshold": 85
}
}

View File

@@ -71,7 +71,7 @@ def _extract_email_address(sender):
# Public API
# ---------------------------------------------------------------------------
def record_decision(email_data, action, source="user"):
def record_decision(email_data, action, source="user", tags=None):
"""Append a decision to the history file.
Args:
@@ -79,6 +79,7 @@ def record_decision(email_data, action, source="user"):
action: one of "delete", "archive", "keep", "mark_read",
or "label:<name>".
source: "user" (manual review) or "auto" (high-confidence).
+tags: list of category tags from the classifier taxonomy.
"""
history = _load_history()
entry = {
@@ -90,6 +91,7 @@ def record_decision(email_data, action, source="user"):
"summary": email_data.get("summary", ""),
"action": action,
"source": source,
"tags": tags or [],
}
history.append(entry)
_save_history(history)
@@ -99,10 +101,9 @@ def record_decision(email_data, action, source="user"):
def get_relevant_examples(email_data, n=10):
"""Find the N most relevant past decisions for a given email.
-Relevance is scored by three signals:
-- Exact sender domain match: +3 points
-- Recipient string match: +2 points
-- Subject keyword overlap: +1 point per shared word
+Relevance is scored by two signals:
+- Exact sender email address match: +3 points
+- Subject keyword overlap: +1 point per shared word
Only entries with score > 0 are considered. Results are returned
sorted by descending relevance.
@@ -111,8 +112,7 @@ def get_relevant_examples(email_data, n=10):
if not history:
return []
-target_domain = _extract_domain(email_data.get("sender", ""))
-target_recipient = email_data.get("recipient", "").lower()
+target_email = _extract_email_address(email_data.get("sender", ""))
target_words = (
set(re.findall(r"\w+", email_data.get("subject", "").lower())) - _STOP_WORDS
)
@@ -121,15 +121,11 @@ def get_relevant_examples(email_data, n=10):
for entry in history:
score = 0
-# Signal 1: sender domain match
-if target_domain and entry.get("sender_domain", "") == target_domain:
+# Signal 1: sender email match
+if target_email and _extract_email_address(entry.get("sender", "")) == target_email:
score += 3
-# Signal 2: recipient substring match
-if target_recipient and target_recipient in entry.get("recipient", "").lower():
-score += 2
-# Signal 3: subject keyword overlap
+# Signal 2: subject keyword overlap
entry_words = (
set(re.findall(r"\w+", entry.get("subject", "").lower())) - _STOP_WORDS
)
@@ -142,27 +138,64 @@ def get_relevant_examples(email_data, n=10):
return [entry for _, entry in scored[:n]]
-def get_sender_stats(sender_domain):
-"""Get action distribution for a sender domain.
+def get_sender_stats(sender_email):
+"""Get action distribution for a sender email address.
Returns a dict like {"delete": 5, "keep": 2, "archive": 1}.
"""
history = _load_history()
actions = Counter()
for entry in history:
if entry.get("sender_domain", "") == sender_domain:
if _extract_email_address(entry.get("sender", "")) == sender_email:
actions[entry["action"]] += 1
return dict(actions)
-def get_sender_history_count(sender_domain):
-"""Count total past decisions for a sender domain.
+def compute_confidence(sender_email, action, tags):
+"""Compute confidence from decision history by matching email signatures.
-Used by the scan command to decide whether there is enough history
-to trust auto-actions for this sender.
+A "signature" is (sender_email, tags). Past decisions match if they have
+the same sender email AND at least 50% tag overlap with the current email.
+Confidence is based on two factors:
+1. Agreement: what fraction of matching decisions chose the same action.
+2. Match-count cap: limits confidence until enough history exists
+(1 match -> max 10%, 5 matches -> 50%, 10+ -> 100%).
+Returns an integer 0-100.
"""
history = _load_history()
-return sum(1 for e in history if e.get("sender_domain", "") == sender_domain)
+if not history or not tags:
+return 50
+# Find past decisions with same sender and sufficient tag overlap
+matches = []
+for entry in history:
+entry_email = _extract_email_address(entry.get("sender", ""))
+if entry_email != sender_email:
+continue
+entry_tags = entry.get("tags", [])
+if not entry_tags:
+continue
+shared = len(set(tags) & set(entry_tags))
+min_len = min(len(tags), len(entry_tags))
+if min_len > 0 and shared / min_len >= 0.5:
+matches.append(entry)
+if not matches:
+return 50
+# Agreement: fraction of matches with the same action
+matching_action = sum(1 for m in matches if m["action"] == action)
+total = len(matches)
+agreement = round(matching_action / total * 100)
+# Cap by match count: each match adds 10% to the cap
+cap = min(total * 10, 100)
+return min(agreement, cap)
def get_known_labels():
@@ -194,13 +227,13 @@ def get_all_stats():
by_action = Counter(e["action"] for e in history)
by_source = Counter(e["source"] for e in history)
-# Top 10 sender domains by decision count
-domain_counts = Counter(e.get("sender_domain", "") for e in history)
-top_domains = domain_counts.most_common(10)
+# Top 10 sender addresses by decision count
+sender_counts = Counter(_extract_email_address(e.get("sender", "")) for e in history)
+top_senders = sender_counts.most_common(10)
return {
"total": total,
"by_action": dict(by_action),
"by_source": dict(by_source),
"top_domains": top_domains,
"top_senders": top_senders,
}

View File

@@ -232,11 +232,11 @@ def save_pending(pending):
json.dump(pending, f, indent=2, ensure_ascii=False)
-def add_to_pending(email_data, summary, reason, action_suggestion, confidence):
+def add_to_pending(email_data, summary, reason, action_suggestion, confidence, tags=None):
"""Add an email to the pending queue for manual review.
-Stores the classifier's suggestion and confidence alongside the
-email metadata so the user can see what the model thought.
+Stores the classifier's suggestion, computed confidence, and tags
+alongside the email metadata so the user can see what the model thought.
"""
pending = load_pending()
@@ -254,6 +254,7 @@ def add_to_pending(email_data, summary, reason, action_suggestion, confidence):
"reason": reason,
"suggested_action": action_suggestion,
"confidence": confidence,
"tags": tags or [],
"email_date": email_data.get("date", ""),
"status": "pending",
"found_at": datetime.now().isoformat(),
@@ -283,10 +284,10 @@ def log_result(log_file, email_data, action, detail, duration=None):
def cmd_scan(config, recent=None, dry_run=False):
"""Fetch emails, classify each one, then auto-act or queue.
-Auto-action is based on a single confidence threshold. When the
-decision history has fewer than 20 entries, a higher threshold (95%)
-is used to be conservative during the learning phase. Once enough
-history accumulates, the configured threshold takes over.
+Confidence is computed from decision history by matching the email's
+signature (sender_email, tags) against past decisions. New/unknown
+senders start at 50% (queued). Confidence grows as consistent history
+accumulates.
Args:
config: full config dict.
@@ -302,17 +303,7 @@ def cmd_scan(config, recent=None, dry_run=False):
# Load automation threshold
automation = config.get("automation", {})
-configured_threshold = automation.get("confidence_threshold", 75)
-# Adaptive threshold: be conservative when history is thin
-stats = decision_store.get_all_stats()
-total_decisions = stats["total"] if stats else 0
-bootstrap_min = automation.get("bootstrap_min_decisions", 20)
-if total_decisions < bootstrap_min:
-confidence_threshold = 95
-print(f"Learning phase ({total_decisions}/{bootstrap_min} decisions) — threshold: 95%\n")
-else:
-confidence_threshold = configured_threshold
+confidence_threshold = automation.get("confidence_threshold", 75)
# Fetch envelopes via himalaya
if recent:
@@ -354,12 +345,18 @@ def cmd_scan(config, recent=None, dry_run=False):
email_data = build_email_data(envelope, body, config)
print(f"{email_data['subject'][:55]}")
-# Run the LLM classifier (includes few-shot examples from history)
-action, confidence, summary, reason, duration = classifier.classify_email(
+# Run the LLM classifier (returns tags instead of confidence)
+action, tags, summary, reason, duration = classifier.classify_email(
email_data, config
)
+# Compute confidence from decision history
+sender_email = decision_store._extract_email_address(email_data.get("sender", ""))
+confidence = decision_store.compute_confidence(sender_email, action, tags)
+tags_str = ", ".join(tags) if tags else "(none)"
print(f" -> {action} (confidence: {confidence}%, {duration:.1f}s)")
print(f" tags: [{tags_str}]")
print(f" {reason[:80]}")
# Auto-act if confidence meets threshold
@@ -379,7 +376,7 @@ def cmd_scan(config, recent=None, dry_run=False):
success = execute_action(eid, action)
if success:
decision_store.record_decision(
{**email_data, "summary": summary}, action, source="auto"
{**email_data, "summary": summary}, action, source="auto", tags=tags
)
log_result(log_file, email_data, f"AUTO:{action}", reason, duration)
print(f" ** AUTO-executed: {action}")
@@ -388,11 +385,11 @@ def cmd_scan(config, recent=None, dry_run=False):
# Himalaya action failed — fall back to queuing
log_result(log_file, email_data, "AUTO_FAILED", reason, duration)
print(f" !! Auto-action failed, queuing instead")
-add_to_pending(email_data, summary, reason, action, confidence)
+add_to_pending(email_data, summary, reason, action, confidence, tags)
queued += 1
else:
# Not enough confidence or history — queue for manual review
-add_to_pending(email_data, summary, reason, action, confidence)
+add_to_pending(email_data, summary, reason, action, confidence, tags)
log_result(log_file, email_data, f"QUEUED:{action}@{confidence}%", reason, duration)
print(f" -> Queued (confidence {confidence}% < {confidence_threshold}%)")
queued += 1
@@ -440,11 +437,14 @@ def cmd_review_list():
for i, (msg_id, data) in enumerate(sorted_items, 1):
suggested = data.get("suggested_action", "?")
conf = data.get("confidence", "?")
tags = data.get("tags", [])
tags_str = ", ".join(tags) if tags else "(none)"
print(f"\n {i}. [{msg_id}]")
print(f" Subject: {data.get('subject', 'N/A')[:55]}")
print(f" From: {data.get('sender', 'N/A')[:55]}")
print(f" To: {data.get('recipient', 'N/A')[:40]}")
print(f" Summary: {data.get('summary', 'N/A')[:70]}")
print(f" Tags: [{tags_str}]")
print(f" Suggested: {suggested} ({conf}% confidence)")
print(f"\n{'=' * 60}")
@@ -496,7 +496,7 @@ def cmd_review_act(selector, action):
success = execute_action(eid, action)
if success:
# Record decision for future learning
decision_store.record_decision(data, action, source="user")
decision_store.record_decision(data, action, source="user", tags=data.get("tags", []))
# Mark as done in pending queue
pending = load_pending()
@@ -540,7 +540,7 @@ def cmd_review_accept():
success = execute_action(eid, action)
if success:
decision_store.record_decision(data, action, source="user")
decision_store.record_decision(data, action, source="user", tags=data.get("tags", []))
pending = load_pending()
pending[msg_id]["status"] = "done"
@@ -616,14 +616,14 @@ def cmd_stats():
for action, count in sorted(stats["by_action"].items(), key=lambda x: -x[1]):
print(f" {action}: {count}")
-# Top sender domains with per-domain action counts
-print(f"\nTop sender domains:")
-for domain, count in stats["top_domains"]:
-domain_stats = decision_store.get_sender_stats(domain)
+# Top sender addresses with per-sender action counts
+print(f"\nTop senders:")
+for sender, count in stats["top_senders"]:
+sender_stats = decision_store.get_sender_stats(sender)
detail = ", ".join(
f"{a}:{c}" for a, c in sorted(domain_stats.items(), key=lambda x: -x[1])
f"{a}:{c}" for a, c in sorted(sender_stats.items(), key=lambda x: -x[1])
)
print(f" {domain}: {count} ({detail})")
print(f" {sender}: {count} ({detail})")
# Custom labels
labels = decision_store.get_known_labels()