Yanxin Lu 361e983b0f Stable review indices and deduplicate tag taxonomy

Review items now get a stable scan_index assigned during scan, so
sequential review commands don't target wrong emails after earlier
items are resolved. Indices reset on each new scan.

Deduplicate tag taxonomy from 21 to 14 tags: drop invoice/payment
(covered by billing), delivery (covered by shipping), discount/marketing
(covered by promotion), and generic notification/update tags.

2026-03-05 15:02:49 -08:00

Email Processor

Learning-based mailbox cleanup using Himalaya (IMAP) + Ollama (local LLM). Classifies emails with category tags, computes confidence from decision history, and gradually automates common actions.

Prerequisites

uv — Python package manager, handles venv and dependencies automatically.
Himalaya — CLI email client, handles IMAP connection and auth.
Ollama — local LLM server.

# Install uv (macOS)
brew install uv

# Install himalaya (macOS)
brew install himalaya

# Configure himalaya for your IMAP account (first time only), then verify the setup
himalaya account list  # should show your account after setup

# Install Ollama, start the server, and pull the model
brew install ollama
brew services start ollama   # or run `ollama serve` in a separate terminal
ollama pull kamekichi128/qwen3-4b-instruct-2507:latest

How It Works

The system separates classification (what the LLM does) from confidence (computed from your decision history). The LLM picks an action and category tags. Confidence is computed by matching the email's signature (sender_email, tags) against past decisions.

Early On (few decisions)

Cron runs scan. For each email, the LLM suggests an action and assigns tags from a fixed taxonomy. Since there's no history yet, compute_confidence returns 50% (below the 85% threshold), so everything gets queued.

You run review list. It prints what's pending. Item numbers are stable within a scan cycle — they don't shift when earlier items are resolved:

  1. [msg_f1d43ea3]  Subject: New jobs matching your profile
     From: LinkedIn
     Tags: [promotion, social, notification]
     Suggested: delete (50%)
  2. [msg_60c56a87]  Subject: Your order shipped
     From: Amazon
     Tags: [shipping, confirmation, notification]
     Suggested: archive (50%)

You act on them, either individually or in bulk. Numbers stay stable, so after deleting item 1, item 2 is still item 2:

./email-processor.sh review 1 delete     # agree with suggestion
./email-processor.sh review 2 archive    # still #2, not renumbered
./email-processor.sh review accept       # accept all suggestions at once

Each command executes via himalaya and appends to decision_history.json with tags.

Next scan is smarter. Confidence grows as consistent history accumulates for each sender+tags signature.

Steady State (10+ consistent decisions per sender)

Repeat senders with consistent tag signatures reach 85%+ confidence and get auto-acted during scan. They never touch the pending queue.
New or ambiguous senders start at 50% and get queued.
You occasionally run review list to handle stragglers — each decision further builds history.
stats shows your automation rate climbing over time.

Confidence Computation

Confidence is computed by decision_store.compute_confidence(sender_email, action, tags):

Find matches: past decisions with the same sender email AND at least 50% tag overlap, i.e. (number of shared tags) / min(number of current tags, number of past tags) >= 0.5.
Agreement: the fraction of matches that chose the same action the LLM is suggesting.
Match-count cap: each match adds 10% to the maximum confidence (1 match = max 10%, 5 = 50%, 10+ = 100%).
Final confidence = min(agreement, cap).
No matches = 50% (below threshold, gets queued).

This means reaching the 85% auto-action threshold requires at least 9 consistent decisions from the same sender with overlapping tags.
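The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in decision_store.py: the function name `compute_confidence_sketch`, the explicit `history` argument, and the entry keys `sender_email`, `action`, and `tags` are assumptions made for the example.

```python
def tag_overlap(current_tags, past_tags):
    """Shared tags divided by the size of the smaller tag set."""
    current, past = set(current_tags), set(past_tags)
    if not current or not past:
        return 0.0
    return len(current & past) / min(len(current), len(past))


def compute_confidence_sketch(history, sender_email, action, tags):
    """Return confidence as an integer percentage (0-100).

    `history` is assumed to be a list of dicts with the keys
    "sender_email", "action", and "tags" (hypothetical field names).
    """
    matches = [
        d for d in history
        if d["sender_email"] == sender_email
        and tag_overlap(tags, d["tags"]) >= 0.5
    ]
    if not matches:
        return 50  # no history for this signature: stays below the threshold

    # Share of matching decisions that chose the suggested action.
    agreement = 100 * sum(d["action"] == action for d in matches) / len(matches)
    # Each match raises the ceiling by 10 points, up to 100.
    cap = min(100, 10 * len(matches))
    return int(min(agreement, cap))
```

Under these assumptions, eight unanimous matches yield 80 and nine yield 90, which is why nine is the smallest count that clears an 85% threshold.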

Usage

All commands are non-interactive — they take arguments, act, and exit. Compatible with cron/OpenClaw.

# Make the entry script executable (first time)
chmod +x email-processor.sh

# --- Scan ---
./email-processor.sh scan                         # classify unseen emails
./email-processor.sh scan --recent 30             # classify last 30 days
./email-processor.sh scan --dry-run               # classify only, no changes
./email-processor.sh scan --recent 7 --dry-run    # combine both

# --- Review ---
./email-processor.sh review list                  # show pending queue
./email-processor.sh review 1 delete              # delete item #1
./email-processor.sh review 3 archive             # #3 is still #3 even after #1 was deleted
./email-processor.sh review msg_f1d43ea3 archive  # archive by ID
./email-processor.sh review all delete            # delete all pending
./email-processor.sh review accept                # accept all suggestions

# --- Other ---
./email-processor.sh stats                        # show decision history

Or call Python directly: python main.py scan --dry-run

Actions

| Action | Effect |
| --- | --- |
| `delete` | Move to Trash (`himalaya message delete`) |
| `archive` | Move to Archive folder |
| `keep` | Leave unread in inbox (no changes) |
| `mark_read` | Add `\Seen` flag, stays in inbox |
| `label:<name>` | Move to named folder (created if needed) |

Tag Taxonomy

The LLM assigns 3-5 tags from this fixed list to each email:

receipt, billing, shipping,
promotion, newsletter, security, social,
reminder, confirmation, alert,
personal, account, subscription, travel

Tags serve one purpose: making signature matching work for confidence computation. They need to be specific enough to distinguish different email types from the same sender that you'd treat differently (e.g., [account, security] for a password reset vs [promotion] for a promo, both from the same service).

Refining the Tag Taxonomy

The tag list should stay fixed and manually curated. Automatic expansion risks breaking history matching — if old entries use different tags than new ones, confidence computation degrades. Refinement is a periodic, manual process — run it when you notice confidence problems in the logs (e.g., same sender getting inconsistent actions, or emails being queued that should be auto-acted).

When to refine

Run this process when any of these are true:

A sender you've reviewed 10+ times is still getting queued (confidence stuck below 85%).
The same sender has a mix of actions in history (e.g., some deleted, some kept) and you suspect the tags aren't distinguishing the email types.
You're seeing tags in the logs that feel too vague for the emails they describe.

Step-by-step process

Step 1: Find senders with inconsistent actions.

Load data/decision_history.json and group entries by sender email address. For each sender, check if multiple different actions were taken. These are the candidates — the tag taxonomy may not be specific enough to separate their email types.

Example: sender noreply@example.com has 8 entries with action delete and 4 entries with action keep. That's a split worth investigating.
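Step 1 can be done by hand, but a short script makes the split easy to spot. The sketch below assumes each history entry is a dict with `sender_email` and `action` keys; those field names and the helper `find_split_senders` are hypothetical, so adjust them to match the actual file.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path


def find_split_senders(entries):
    """Map each sender that has more than one distinct action to its action counts."""
    actions_by_sender = defaultdict(Counter)
    for entry in entries:
        # "sender_email" and "action" are assumed field names.
        actions_by_sender[entry["sender_email"]][entry["action"]] += 1
    return {
        sender: dict(counts)
        for sender, counts in actions_by_sender.items()
        if len(counts) > 1
    }


if __name__ == "__main__":
    path = Path("data/decision_history.json")
    if path.exists():
        entries = json.loads(path.read_text())
        for sender, counts in find_split_senders(entries).items():
            print(sender, counts)
```

For the example sender above, the result would contain an entry like `{"delete": 8, "keep": 4}`, which marks it as a candidate for Step 2.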

Step 2: For each candidate sender, examine the entries.

Look at the subject lines, summaries, and current tags of the entries that got different actions. Identify the pattern — what makes the "delete" emails different from the "keep" emails?

Example:

Deleted emails: subjects like "50% off sale", "Weekly deals" → tags: [promotion, notification, newsletter]
Kept emails: subjects like "Your password was changed", "New login from Chrome" → tags: [security, notification, update]

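The shared tag notification is what links these two signatures. Whenever the overlap reaches the 50% threshold (for example, when one of the emails ends up with only two valid tags), they match as the same signature and drag confidence down.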

Step 3: Determine if a new tag would fix it.

Ask: is there a category that applies to one group but not the other? In the example above, an account tag would distinguish password/login emails from promotional emails. Check if the tag already exists in TAG_TAXONOMY in classifier.py — it might just be that the LLM isn't using it consistently.

If the tag already exists: the problem is LLM consistency, not the taxonomy. Consider adjusting the prompt or few-shot examples.

If the tag doesn't exist: propose a new tag.

Step 4: Validate the proposed tag.

Before adding, check that the new tag:

Is distinct from existing tags (not a synonym — e.g., don't add promo when promotion exists).
Is broadly useful — it should apply to emails from multiple senders, not just the one you're debugging.
Doesn't overlap with an existing tag in a way that would confuse the LLM (e.g., adding order when receipt and shipping already cover those cases).

Step 5: Add the tag to TAG_TAXONOMY in classifier.py.

Add the new tag to the TAG_TAXONOMY list in classifier.py:30-38. Keep the list organized by category. The LLM prompt automatically picks up the updated list on the next scan.

Step 6: Decide whether to wipe history.

If you added 1-2 tags: don't wipe. Old entries without the new tags will gradually be outweighed by new entries that have them. The 50% overlap threshold is forgiving enough that old entries still contribute during the transition.
If you added 3+ tags or the new tags fundamentally change how common email types would be tagged: wipe data/decision_history.json and data/pending_emails.json. Old entries without the new tags are dead weight — they won't match new entries and won't contribute to confidence.

Step 7: Verify with a dry run.

./email-processor.sh scan --recent 7 --dry-run

Check the logs for the affected senders:

Are the new tags being assigned?
Are different email types from the same sender now getting different tag sets?
If history was preserved, is confidence trending correctly?

Rules

Only add tags, never rename. Renaming billing to finance means old entries with billing never match new entries with finance. If you must rename, keep both in the taxonomy.
Avoid deleting tags. Old entries with deleted tags become slightly less useful (fewer matching tags) but don't cause incorrect matches. Only delete a tag if it's actively causing confusion (e.g., the LLM uses it inconsistently and it's hurting overlap calculations).
Keep the taxonomy small. More tags = more choices for the LLM = more inconsistency. The taxonomy should have the minimum number of tags needed to distinguish email types that deserve different actions. The current list has 14 tags; treat 20-30 as an upper bound rather than a target.

Configuration

config.json — only Ollama and automation settings. IMAP auth is managed by himalaya's own config.

{
  "ollama": {
    "host": "http://localhost:11434",
    "model": "kamekichi128/qwen3-4b-instruct-2507:latest"
  },
  "rules": {
    "max_body_length": 1000
  },
  "automation": {
    "confidence_threshold": 85
  }
}

| Key | Description |
| --- | --- |
| `ollama.host` | Ollama server URL. Default `http://localhost:11434`. |
| `ollama.model` | Ollama model to use for classification. |
| `rules.max_body_length` | Max characters of email body sent to the LLM. Longer bodies are truncated. Keeps prompt size and latency down. |
| `automation.confidence_threshold` | Minimum confidence (0-100) for auto-action. Emails below this get queued for review. At 85%, you need at least 9 consistent decisions from the same sender with overlapping tags before auto-action kicks in. |
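As a rough illustration of how these keys might be consumed, the sketch below merges a user config over defaults and applies the two numeric settings. The helper names (`merge_config`, `truncate_body`, `should_auto_act`) are hypothetical, and every default except `ollama.host` is simply copied from the sample config above rather than documented behavior.

```python
import copy

# Values other than ollama.host are taken from the sample config.json, as an assumption.
DEFAULTS = {
    "ollama": {
        "host": "http://localhost:11434",
        "model": "kamekichi128/qwen3-4b-instruct-2507:latest",
    },
    "rules": {"max_body_length": 1000},
    "automation": {"confidence_threshold": 85},
}


def merge_config(user_config):
    """Overlay user settings on the defaults, section by section."""
    merged = copy.deepcopy(DEFAULTS)
    for section, values in user_config.items():
        merged.setdefault(section, {}).update(values)
    return merged


def truncate_body(body, config):
    """Cut the email body to the configured number of characters."""
    return body[: config["rules"]["max_body_length"]]


def should_auto_act(confidence, config):
    """Auto-act only at or above the threshold; anything lower is queued."""
    return confidence >= config["automation"]["confidence_threshold"]
```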

Testing

# 1. Verify himalaya can reach your mailbox
himalaya envelope list --page-size 3

# 2. Verify Ollama is running with the model
ollama list  # should show kamekichi128/qwen3-4b-instruct-2507:latest

# 3. Dry run — classify recent emails without touching anything
./email-processor.sh scan --recent 7 --dry-run

# 4. Live run — classify and act (auto-act or queue)
./email-processor.sh scan --recent 7

# 5. Check what got queued
./email-processor.sh review list

# 6. Act on a queued email to seed decision history
./email-processor.sh review 1 delete

# 7. Check that the decision was recorded
./email-processor.sh stats

File Structure

email_processor/
  main.py              # Entry point — scan/review/stats subcommands
  classifier.py        # LLM prompt builder + response parser, tag taxonomy
  decision_store.py    # Decision history, confidence computation, few-shot retrieval
  config.json          # Ollama + automation settings
  email-processor.sh   # Shell wrapper (activates venv, forwards args)
  data/
    pending_emails.json    # Queue of emails awaiting review
    decision_history.json  # Past decisions (few-shot learning + confidence data)
  logs/
    YYYY-MM-DD.log         # Daily processing logs
    llm_YYYY-MM-DD.log     # Full LLM prompt/response logs

Design Decisions

Himalaya instead of raw IMAP

All IMAP operations go through the himalaya CLI via subprocess calls. This means:

No IMAP credentials stored in config.json — himalaya manages its own auth.
No connection management, reconnect logic, or SSL setup in Python.
Each action is a single himalaya command (e.g., himalaya message delete 42).

The tradeoff is a subprocess spawn per operation, but for email volumes (tens per run, not thousands) this is negligible.
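A subprocess wrapper for this pattern can be very small. The sketch below is an assumption about the shape of such a wrapper, not the project's code: `build_delete_command` and `run_himalaya` are hypothetical names, and only the `himalaya message delete <id>` form mentioned above is used.

```python
import subprocess


def build_delete_command(message_id):
    """Argument list for `himalaya message delete <id>`."""
    return ["himalaya", "message", "delete", str(message_id)]


def run_himalaya(args, timeout=30):
    """Run one himalaya command; return True only if it exits with status 0."""
    try:
        result = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Binary missing or the call hung: report failure instead of raising.
        return False
    return result.returncode == 0
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with folder names or IDs.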

Non-interactive design

Every command takes its full input as arguments, acts, and exits. No input() calls, no interactive loops. This makes the system compatible with cron/OpenClaw and composable with other scripts. The pending queue on disk (pending_emails.json) is the shared state between scan and review invocations.

LLM classifies, history decides confidence

The LLM produces an action and category tags but NOT a confidence score. Confidence is computed from decision history by matching email signatures (sender_email, tags) against past decisions. This avoids the problem of LLMs clustering confidence around 85-95% regardless of actual certainty, making threshold systems ineffective.

The match-count cap (each match adds 10% to the maximum) replaces the old bootstrap/learning-phase logic. Confidence grows organically per-sender as history accumulates, rather than using a global gate.

decision_history.json as the "database"

data/decision_history.json is the only persistent state that matters for learning. It's a flat JSON array — every decision (user or auto) is appended as an entry with tags. The classifier reads the whole file on each email to find relevant few-shot examples via relevance scoring, and compute_confidence scans it for matching signatures.

A flat JSON file works fine for hundreds or low thousands of decisions. SQLite would make sense if the history grows past ~10k entries and the linear scan becomes noticeable, or if concurrent writes from multiple processes become necessary. Neither applies at current scale.
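For reference, appending to a flat JSON array is a read-modify-write cycle. The sketch below shows one way to do it, with the write going through a temporary file so an interrupted run does not leave a half-written history; that safeguard and the name `append_decision` are assumptions for illustration, not a description of decision_store.py.

```python
import json
from pathlib import Path


def append_decision(path, entry):
    """Append one decision to the JSON array at `path`; return the new length."""
    path = Path(path)
    history = json.loads(path.read_text()) if path.exists() else []
    history.append(entry)

    # Write to a sibling temp file, then swap it into place.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(history, indent=2))
    tmp.replace(path)
    return len(history)
```

Because the whole file is rewritten on every append, cost grows linearly with history size, which matches the scale limits described above.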

Few-shot learning via relevance scoring

Rather than sending the entire decision history to the LLM, decision_store.get_relevant_examples() scores each past decision against the current email using two signals:

Exact sender email address match (+3 points)
Subject keyword overlap (+1 per shared word, stop-words excluded)

The top 5 most relevant examples are injected into the prompt as few-shot demonstrations. This keeps the prompt small while giving the model the most useful context.
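The two scoring signals can be sketched as follows. The stop-word list, the field names `sender_email` and `subject`, the helper names, and the choice to drop zero-score entries are all assumptions for illustration; the real get_relevant_examples() may differ.

```python
import re

# A small illustrative stop-word list; the real one is not shown in this README.
STOP_WORDS = {"the", "a", "an", "your", "you", "to", "of", "for", "and", "in", "on", "is"}


def subject_words(subject):
    """Lowercased subject words with stop-words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", subject.lower()) if w not in STOP_WORDS}


def relevance(email, past):
    """+3 for an exact sender match, +1 per shared subject word."""
    score = 3 if email["sender_email"] == past["sender_email"] else 0
    score += len(subject_words(email["subject"]) & subject_words(past["subject"]))
    return score


def top_examples(email, history, k=5):
    """Return up to k past decisions, highest relevance first."""
    scored = [(relevance(email, d), d) for d in history]
    scored = [pair for pair in scored if pair[0] > 0]  # assumption: skip irrelevant entries
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]
```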

Fixed tag taxonomy

Tags are defined in classifier.py as TAG_TAXONOMY, a manually curated list of 14 categories. The LLM must pick from this list (invalid tags are silently dropped). The taxonomy should stay fixed to keep history matching stable. See "Refining the Tag Taxonomy" above for when and how to update it.
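Dropping invalid tags amounts to a filter against the list. The sketch below reproduces the tags listed under "Tag Taxonomy"; the function name `filter_tags`, the lowercasing and whitespace stripping, and the de-duplication are assumptions added for the example.

```python
TAG_TAXONOMY = [
    "receipt", "billing", "shipping",
    "promotion", "newsletter", "security", "social",
    "reminder", "confirmation", "alert",
    "personal", "account", "subscription", "travel",
]


def filter_tags(raw_tags):
    """Keep only taxonomy tags, in first-seen order, without duplicates."""
    allowed = set(TAG_TAXONOMY)
    kept = []
    for tag in raw_tags:
        normalized = str(tag).strip().lower()  # normalization is an assumption
        if normalized in allowed and normalized not in kept:
            kept.append(normalized)
    return kept
```

A tag outside the list, such as a leftover `notification`, simply disappears from the result, so the stored signature may end up with fewer tags than the LLM returned.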

`keep` means unread

The keep action is a deliberate no-op — it leaves the email unread in the inbox, meaning it needs human attention. This is distinct from mark_read, which dismisses low-priority emails without moving them.

Stable item numbers during review

Each pending item gets a scan_index assigned sequentially during scan. These numbers are stable within a scan cycle — resolving item 1 doesn't renumber item 2 to 1. This matters when an agent (like OpenClaw) issues multiple review <n> <action> commands in sequence: without stable indices, the queue renumbers after each action, causing later commands to target the wrong emails. Indices reset to 1 on each new scan (done items from the previous cycle are cleared at scan start).
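The idea can be sketched as follows: resolved items are marked rather than removed, so the remaining indices never shift. The `status` and `id` fields and both function names are hypothetical; only `scan_index` comes from the description above.

```python
def assign_scan_indices(pending):
    """Number items 1..n at scan time and mark them as pending."""
    for index, item in enumerate(pending, start=1):
        item["scan_index"] = index
        item["status"] = "pending"  # assumed field; "done" items stay in place
    return pending


def resolve(pending, selector, action):
    """Resolve one pending item by scan_index or by message ID.

    Returns the resolved item, or None if nothing pending matches.
    """
    for item in pending:
        if item["status"] != "pending":
            continue
        if str(item["scan_index"]) == str(selector) or item["id"] == selector:
            item["status"] = "done"
            item["action"] = action
            return item
    return None
```

With this layout, `resolve(queue, 1, "delete")` followed by `resolve(queue, 2, "archive")` still reaches the item that was listed as number 2.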

Fail-safe classification

If the LLM call fails (Ollama down, model not loaded, timeout), the classifier returns action="keep" with empty tags. Empty tags produce 50% confidence (below threshold), so the email gets queued for manual review rather than being auto-acted upon. The system never auto-trashes an email it couldn't classify.
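A minimal sketch of that fallback, assuming the LLM call is wrapped in a callable that returns an `(action, tags)` pair; the name `classify_safely` and that return shape are hypothetical.

```python
def classify_safely(classify, email):
    """Call the classifier; on any failure fall back to ("keep", [])."""
    try:
        action, tags = classify(email)
    except Exception:
        # Ollama down, model not loaded, timeout, malformed response, etc.
        return "keep", []
    return action, tags
```

Because the fallback carries no tags, it cannot match any history entry, so the email lands in the review queue at 50% confidence.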
