From 81bc42075fc8c5f46b53ebdf1abb1b52ac511feb Mon Sep 17 00:00:00 2001
From: Yanxin Lu <ylu@meta.com>
Date: Wed, 4 Mar 2026 15:16:39 -0800
Subject: [PATCH] steps for tag refinement

---
 scripts/email_processor/README.md | 72 +++++++++++++++++++++++++++----
 1 file changed, 64 insertions(+), 8 deletions(-)

diff --git a/scripts/email_processor/README.md b/scripts/email_processor/README.md
index d0a1e50..9cc5f42 100644
--- a/scripts/email_processor/README.md
+++ b/scripts/email_processor/README.md
@@ -124,17 +124,73 @@ Tags serve one purpose: making signature matching work for confidence computatio
 
 ### Refining the Tag Taxonomy
 
-The tag list should stay fixed and manually curated. Automatic expansion risks breaking history matching — if old entries use different tags than new ones, confidence computation degrades.
+The tag list should stay fixed and manually curated. Automatic expansion risks breaking history matching: if old entries use different tags than new ones, confidence computation degrades. Refinement is a periodic, manual process. Run it when you notice confidence problems in the logs (e.g., the same sender getting inconsistent actions, or emails being queued that should have been acted on automatically).
 
-To identify gaps, periodically review your decision history for cases where the same sender got inconsistent actions. Feed the history and current tags to the LLM and ask what new tag would distinguish them. For example:
+#### When to refine
 
-> "Here are my current tags: [list]. Here are history entries where sender X got different actions: [entries]. Suggest a new tag that would separate the email types that deserve different actions."
+Run this process when any of these are true:
+- A sender you've reviewed 10+ times is still getting queued (confidence stuck below 85%).
+- The same sender has a mix of actions in history (e.g., some deleted, some kept) and you suspect the tags aren't distinguishing the email types.
+- You're seeing tags in the logs that feel too vague for the emails they describe.
 
-Guidelines:
-- **Add tags** when a broad tag (like `notification`) is the only thing two different email types share, and you'd handle them differently.
-- **Don't rename tags** — old history entries would stop matching. If you must rename, keep both the old and new tag in the taxonomy.
-- **Don't delete tags** unless you're sure no important history depends on them for distinguishing email types. Old entries with deleted tags become slightly less useful but don't cause wrong matches.
-- **After significant taxonomy changes**, consider wiping `decision_history.json` and `pending_emails.json` and rebuilding from scratch, since old entries without the new tags won't contribute to confidence anyway.
+#### Step-by-step process
+
+**Step 1: Find senders with inconsistent actions.**
+
+Load `data/decision_history.json` and group entries by sender email address. For each sender, check whether more than one distinct action was taken. Senders with mixed actions are the candidates: the tag taxonomy may not be specific enough to separate their email types.
+
+Example: sender `noreply@example.com` has 8 entries with action `delete` and 4 entries with action `keep`. That's a split worth investigating.
+
+**Step 2: For each candidate sender, examine the entries.**
+
+Look at the subject lines, summaries, and current tags of the entries that got different actions. Identify the pattern: what makes the "delete" emails different from the "keep" emails?
+
+Example:
+- Deleted emails: subjects like "50% off sale", "Weekly deals" → tags: `[promotion, notification, newsletter]`
+- Kept emails: subjects like "Your password was changed", "New login from Chrome" → tags: `[security, notification, update]`
+
+The shared tag `notification` is causing these to match as the same signature, dragging confidence down.
+
+**Step 3: Determine if a new tag would fix it.**
+
+Ask: is there a category that applies to one group but not the other? In the example above, an `account` tag would distinguish password/login emails from promotional emails. Before proposing it, check whether the tag already exists in `TAG_TAXONOMY` in `classifier.py`; the LLM may simply not be applying it consistently.
+
+If the tag already exists: the problem is LLM consistency, not the taxonomy. Consider adjusting the prompt or few-shot examples.
+
+If the tag doesn't exist: propose a new tag.
+
+**Step 4: Validate the proposed tag.**
+
+Before adding, check that the new tag:
+- Is **distinct** from existing tags, not a synonym (e.g., don't add `promo` when `promotion` exists).
+- Is **broadly useful**: it should apply to emails from multiple senders, not just the one you're debugging.
+- Doesn't **overlap** with an existing tag in a way that would confuse the LLM (e.g., adding `order` when `receipt` and `shipping` already cover those cases).
+
+**Step 5: Add the tag to `TAG_TAXONOMY` in `classifier.py`.**
+
+Add the new tag to the `TAG_TAXONOMY` list in `classifier.py:30-38`, keeping the list organized by category. The LLM prompt picks up the updated list automatically on the next scan.
+
+**Step 6: Decide whether to wipe history.**
+
+- If you added 1-2 tags: **don't wipe**. Old entries without the new tags will gradually be outweighed by new entries that have them. The 50% overlap threshold is forgiving enough that old entries still contribute during the transition.
+- If you added 3+ tags or the new tags fundamentally change how common email types would be tagged: **wipe** `data/decision_history.json` and `data/pending_emails.json`. Old entries without the new tags are dead weight: they won't match new entries and won't contribute to confidence.
+
+**Step 7: Verify with a dry run.**
+
+```bash
+./email-processor.sh scan --recent 7 --dry-run
+```
+
+Check the logs for the affected senders:
+- Are the new tags being assigned?
+- Are different email types from the same sender now getting different tag sets?
+- If history was preserved, is confidence trending correctly?
+
+#### Rules
+
+- **Only add tags, never rename.** Renaming `billing` to `finance` means old entries with `billing` never match new entries with `finance`. If you must rename, keep both in the taxonomy.
+- **Avoid deleting tags.** Old entries with deleted tags become slightly less useful (fewer matching tags) but don't cause incorrect matches. Only delete a tag if it's actively causing confusion (e.g., the LLM uses it inconsistently and it's hurting overlap calculations).
+- **Keep the taxonomy small.** More tags give the LLM more choices, which leads to more inconsistent tagging. The taxonomy should have the minimum number of tags needed to distinguish email types that deserve different actions. 20-30 tags is a reasonable range.
 
 ## Configuration