
# Email Processor

Learning-based mailbox cleanup using Himalaya (IMAP) + Ollama (local LLM). Classifies emails with category tags, computes confidence from decision history, and gradually automates common actions.

## Prerequisites

- **uv** — Python package manager, handles venv and dependencies automatically.
- **Himalaya** — CLI email client, handles IMAP connection and auth.
- **Ollama** — local LLM server.

```bash
# Install uv (macOS)
brew install uv

# Install himalaya (macOS)
brew install himalaya

# Configure himalaya for your IMAP account (first time only), then verify
himalaya account list  # should show your account after setup

# Install and start Ollama, then pull the model
brew install ollama
brew services start ollama   # or run `ollama serve` in a separate terminal
ollama pull kamekichi128/qwen3-4b-instruct-2507:latest
```

## How It Works

The system separates **classification** (what the LLM does) from **confidence** (computed from your decision history). The LLM picks an action and category tags. Confidence is computed by matching the email's signature `(sender_email, tags)` against past decisions.

### Early On (few decisions)

1. **Cron runs `scan`.** For each email, the LLM suggests an action and assigns tags from a fixed taxonomy. Since there's no history yet, `compute_confidence` returns 50% (below the 85% threshold), so everything gets queued.

2. **You run `review list`.** It prints what's pending, identified by envelope ID (himalaya's IMAP UID):
   ```
     [42] msg_f1d43ea3
        Subject: New jobs matching your profile
        From: LinkedIn
        Tags: [promotion, social, newsletter]
        Suggested: delete (50%)
     [43] msg_60c56a87
        Subject: Your order shipped
        From: Amazon
        Tags: [shipping, confirmation, receipt]
        Suggested: archive (50%)
   ```

3. **You act on them.** Either individually or in bulk, using the envelope ID:
   ```bash
   ./email-processor.sh review 42 delete             # agree with suggestion
   ./email-processor.sh review 43 archive            # archive by envelope ID
   ./email-processor.sh review accept                # accept all suggestions at once
   ```
   Each command executes via himalaya and appends to `decision_history.json` with tags.

4. **Next scan is smarter.** Confidence grows as consistent history accumulates for each sender+tags signature.

### Steady State (10+ consistent decisions per sender)

- **Repeat senders** with consistent tag signatures reach 85%+ confidence and get auto-acted during `scan`. They never touch the pending queue.
- **New or ambiguous senders** start at 50% and get queued.
- **You occasionally run `review list`** to handle stragglers — each decision further builds history.
- **`digest` gives a quick glance** at what was processed recently — subject lines grouped by action, with `[auto]`/`[user]` markers.
- **`stats` shows your automation rate** climbing over time.

### Confidence Computation

Confidence is computed by `decision_store.compute_confidence(sender_email, action, tags)`:

1. **Find matches.** Past decisions with the same sender email AND at least 50% tag overlap (`len(shared_tags) / min(len(current_tags), len(past_tags)) >= 0.5`).
2. **Agreement.** The fraction of matches that chose the same action the LLM is suggesting.
3. **Match-count cap.** Each match adds 10% to the maximum confidence (1 match = max 10%, 5 = 50%, 10+ = 100%).
4. **Final confidence** = `min(agreement, cap)`.
5. **No matches** = 50% (below threshold, gets queued).

This means reaching the 85% auto-action threshold requires at least 9 matching decisions from the same sender with overlapping tags (a 90% cap), and at least 85% of those matches must agree with the suggested action. Note that the first match for a signature lowers confidence from the 50% no-history default to at most 10%; it then climbs as history accumulates.
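
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `decision_store.py`: it takes the history as an explicit list argument, and the entry field names (`sender_email`, `action`, `tags`) are assumptions about the JSON layout.

```python
from typing import Iterable


def tag_overlap(current: Iterable[str], past: Iterable[str]) -> float:
    """Shared tags divided by the size of the smaller tag set."""
    a, b = set(current), set(past)
    if not a or not b:
        return 0.0  # an empty tag set never matches anything
    return len(a & b) / min(len(a), len(b))


def compute_confidence(history, sender_email, action, tags) -> float:
    """Return confidence in [0.0, 1.0] for the suggested action."""
    # 1. Matches: same sender and at least 50% tag overlap.
    matches = [
        d for d in history
        if d["sender_email"] == sender_email
        and tag_overlap(tags, d["tags"]) >= 0.5
    ]
    if not matches:
        return 0.5  # 5. No history: below threshold, gets queued.
    # 2. Agreement: fraction of matches that chose the suggested action.
    agreement = sum(d["action"] == action for d in matches) / len(matches)
    # 3. Cap: each match adds 10% to the maximum, up to 100%.
    cap = min(1.0, len(matches) / 10)
    # 4. Final confidence.
    return min(agreement, cap)
```

With nine matching `delete` decisions this returns 0.9 for a suggested `delete`, which is the first point at which the 85% threshold can be crossed.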

## Usage

All commands are non-interactive — they take arguments, act, and exit. Compatible with cron/OpenClaw.

```bash
# Make the entry script executable (first time)
chmod +x email-processor.sh

# --- Scan ---
./email-processor.sh scan                         # classify unseen emails
./email-processor.sh scan --recent 30             # classify last 30 days
./email-processor.sh scan --dry-run               # classify only, no changes
./email-processor.sh scan --recent 7 --dry-run    # combine both

# --- Review ---
./email-processor.sh review list                  # show pending queue
./email-processor.sh review 42 delete              # delete envelope 42
./email-processor.sh review 43 archive             # archive envelope 43
./email-processor.sh review msg_f1d43ea3 archive   # archive by msg_id
./email-processor.sh review all delete             # delete all pending
./email-processor.sh review accept                 # accept all suggestions

# --- Digest ---
./email-processor.sh digest                       # today's processed emails
./email-processor.sh digest --recent 3            # last 3 days

# --- Other ---
./email-processor.sh stats                        # show decision history
```

Or call Python directly: `python main.py scan --dry-run`

## Actions

| Action | Effect |
|---|---|
| `delete` | Move to Trash (`himalaya message delete`) |
| `archive` | Move to Archive folder |
| `keep` | Leave unread in inbox (no changes) |
| `mark_read` | Add `\Seen` flag, stays in inbox |
| `label:<name>` | Move to named folder (created if needed) |
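
Because `label:<name>` carries a folder name, an action string has to be split before it can be executed. The sketch below shows one way to validate and parse the action values from the table; `parse_action` is a hypothetical helper, not a function known to exist in `main.py`.

```python
SIMPLE_ACTIONS = {"delete", "archive", "keep", "mark_read"}


def parse_action(action: str):
    """Split an action string into (kind, folder).

    Returns e.g. ("delete", None) or ("label", "Receipts").
    Raises ValueError for anything not listed in the Actions table.
    """
    if action in SIMPLE_ACTIONS:
        return action, None
    prefix = "label:"
    if action.startswith(prefix) and len(action) > len(prefix):
        return "label", action[len(prefix):]
    raise ValueError(f"unknown action: {action!r}")
```

Rejecting unknown strings up front keeps a malformed LLM response or a typo on the command line from reaching the mailbox.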

## Tag Taxonomy

The LLM assigns 3-5 tags from this fixed list to each email:

```
receipt, billing, shipping,
promotion, newsletter, security, social,
reminder, confirmation, alert,
personal, account, subscription, travel
```

Tags serve one purpose: making signature matching work for confidence computation. They need to be specific enough to distinguish different email types from the same sender that you'd treat differently (e.g., `[account, security]` for a password reset vs `[promotion]` for a promo, both from the same service).

### Refining the Tag Taxonomy

The tag list should stay fixed and manually curated. Automatic expansion risks breaking history matching — if old entries use different tags than new ones, confidence computation degrades. Refinement is a periodic, manual process — run it when you notice confidence problems in the logs (e.g., same sender getting inconsistent actions, or emails being queued that should be auto-acted).

#### When to refine

Run this process when any of these are true:
- A sender you've reviewed 10+ times is still getting queued (confidence stuck below 85%).
- The same sender has a mix of actions in history (e.g., some deleted, some kept) and you suspect the tags aren't distinguishing the email types.
- You're seeing tags in the logs that feel too vague for the emails they describe.

#### Step-by-step process

**Step 1: Find senders with inconsistent actions.**

Load `data/decision_history.json` and group entries by sender email address. For each sender, check if multiple different actions were taken. These are the candidates — the tag taxonomy may not be specific enough to separate their email types.

Example: sender `noreply@example.com` has 8 entries with action `delete` and 4 entries with action `keep`. That's a split worth investigating.
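
A short script can do this grouping for you. The sketch below assumes each history entry is a JSON object with `sender_email` and `action` fields; adjust the keys if the actual file uses different names.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path


def load_history(path="data/decision_history.json"):
    """Read the flat JSON array of past decisions."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def senders_with_mixed_actions(history):
    """Map each sender that received more than one distinct action
    to a count of those actions, e.g. {"delete": 8, "keep": 4}."""
    by_sender = defaultdict(Counter)
    for entry in history:
        by_sender[entry["sender_email"]][entry["action"]] += 1
    return {s: dict(c) for s, c in by_sender.items() if len(c) > 1}


# Example usage from the email_processor/ directory:
#   for sender, counts in senders_with_mixed_actions(load_history()).items():
#       print(sender, counts)
```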

**Step 2: For each candidate sender, examine the entries.**

Look at the subject lines, summaries, and current tags of the entries that got different actions. Identify the pattern — what makes the "delete" emails different from the "keep" emails?

Example:
- Deleted emails: subjects like "50% off sale", "Weekly deals" → tags: `[promotion, account, newsletter]`
- Kept emails: subjects like "Your password was changed", "New login from Chrome" → tags: `[security, account, alert]`

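Both groups share the tag `account`. With the tag sets shown, the overlap is 1/3, just under the 50% match threshold, but as soon as the LLM assigns a second tag in common, the two email types match as the same signature and drag confidence down.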

**Step 3: Determine if a new tag would fix it.**

Ask: is there a category that applies to one group but not the other? In the example above, the LLM is assigning `account` to both promotional and security emails from the same service. Check if the problem is LLM consistency (the tag exists but the model uses it too broadly) or a missing tag (no existing tag can distinguish the two types).

If the tag exists but is overused: the problem is LLM consistency, not the taxonomy. Consider adjusting the prompt or few-shot examples.

If the tag doesn't exist: propose a new tag.

**Step 4: Validate the proposed tag.**

Before adding, check that the new tag:
- Is **distinct** from existing tags (not a synonym — e.g., don't add `promo` when `promotion` exists).
- Is **broadly useful** — it should apply to emails from multiple senders, not just the one you're debugging.
- Doesn't **overlap** with an existing tag in a way that would confuse the LLM (e.g., adding `order` when `receipt` and `shipping` already cover those cases).

**Step 5: Add the tag to `TAG_TAXONOMY` in `classifier.py`.**

Add the new tag to the `TAG_TAXONOMY` list in `classifier.py:30-37`. Keep the list organized by category. The LLM prompt automatically picks up the updated list on the next scan.

**Step 6: Decide whether to wipe history.**

- If you added 1-2 tags: **don't wipe**. Old entries without the new tags will gradually be outweighed by new entries that have them. The 50% overlap threshold is forgiving enough that old entries still contribute during the transition.
- If you added 3+ tags or the new tags fundamentally change how common email types would be tagged: **wipe** `data/decision_history.json` and `data/pending_emails.json`. Old entries without the new tags are dead weight — they won't match new entries and won't contribute to confidence.

**Step 7: Verify with a dry run.**

```bash
./email-processor.sh scan --recent 7 --dry-run
```

Check the logs for the affected senders:
- Are the new tags being assigned?
- Are different email types from the same sender now getting different tag sets?
- If history was preserved, is confidence trending correctly?

#### Rules

- **Only add tags, never rename.** Renaming `billing` to `finance` means old entries with `billing` never match new entries with `finance`. If you must rename, keep both in the taxonomy.
- **Avoid deleting tags.** Old entries with deleted tags become slightly less useful (fewer matching tags) but don't cause incorrect matches. Only delete a tag if it's actively causing confusion (e.g., the LLM uses it inconsistently and it's hurting overlap calculations).
- **Keep the taxonomy small.** More tags = more choices for the LLM = more inconsistency. The taxonomy should have the minimum number of tags needed to distinguish email types that deserve different actions. 10-20 tags is a reasonable range.

## Configuration

`config.json` — only Ollama and automation settings. IMAP auth is managed by himalaya's own config.

```json
{
  "ollama": {
    "host": "http://localhost:11434",
    "model": "kamekichi128/qwen3-4b-instruct-2507:latest"
  },
  "rules": {
    "max_body_length": 1000
  },
  "automation": {
    "confidence_threshold": 85
  }
}
```

| Key | Description |
|---|---|
| `ollama.host` | Ollama server URL. Default `http://localhost:11434`. |
| `ollama.model` | Ollama model to use for classification. |
| `rules.max_body_length` | Max characters of email body sent to the LLM. Longer bodies are truncated. Keeps prompt size and latency down. |
| `automation.confidence_threshold` | Minimum confidence (0-100) for auto-action. Emails below this get queued for review. At 85%, you need at least 9 consistent decisions from the same sender with overlapping tags before auto-action kicks in. |
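
As an illustration of how these settings might be consumed, here is a sketch that merges `config.json` over defaults and applies the threshold and body limit. Only the `ollama.host` default is stated above; treating the other sample values as defaults, and the helper names themselves, are assumptions.

```python
import json
from pathlib import Path

# Assumed defaults, copied from the sample config above.
DEFAULTS = {
    "ollama": {
        "host": "http://localhost:11434",
        "model": "kamekichi128/qwen3-4b-instruct-2507:latest",
    },
    "rules": {"max_body_length": 1000},
    "automation": {"confidence_threshold": 85},
}


def merge_config(user: dict) -> dict:
    """Overlay user settings on a copy of DEFAULTS, section by section."""
    merged = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in user.items():
        merged.setdefault(section, {}).update(values)
    return merged


def load_config(path="config.json") -> dict:
    p = Path(path)
    user = json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}
    return merge_config(user)


def should_auto_act(confidence: float, config: dict) -> bool:
    """confidence is 0.0-1.0; the configured threshold is 0-100."""
    return confidence * 100 >= config["automation"]["confidence_threshold"]


def truncate_body(body: str, config: dict) -> str:
    """Cut the email body to the configured maximum length."""
    return body[: config["rules"]["max_body_length"]]
```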

## Testing

```bash
# 1. Verify himalaya can reach your mailbox
himalaya envelope list --page-size 3

# 2. Verify Ollama is running with the model
ollama list  # should show kamekichi128/qwen3-4b-instruct-2507:latest

# 3. Dry run — classify recent emails without touching anything
./email-processor.sh scan --recent 7 --dry-run

# 4. Live run — classify and act (auto-act or queue)
./email-processor.sh scan --recent 7

# 5. Check what got queued
./email-processor.sh review list

# 6. Act on a queued email to seed decision history
./email-processor.sh review 42 delete

# 7. Check that the decision was recorded
./email-processor.sh stats

# 8. Quick glance at what was processed today
./email-processor.sh digest
```

## File Structure

```
email_processor/
  main.py              # Entry point — scan/review/stats/digest subcommands
  classifier.py        # LLM prompt builder + response parser, tag taxonomy
  decision_store.py    # Decision history, confidence computation, few-shot retrieval
  config.json          # Ollama + automation settings
  email-processor.sh   # Shell wrapper (activates venv, forwards args)
  data/
    pending_emails.json    # Queue of emails awaiting review
    decision_history.json  # Past decisions (few-shot learning + confidence data)
  logs/
    YYYY-MM-DD.log         # Daily processing logs
    llm_YYYY-MM-DD.log     # Full LLM prompt/response logs
```

## Design Decisions

### Himalaya instead of raw IMAP

All IMAP operations go through the `himalaya` CLI via subprocess calls. This means:
- No IMAP credentials stored in config.json — himalaya manages its own auth.
- No connection management, reconnect logic, or SSL setup in Python.
- Each action is a single himalaya command (e.g., `himalaya message delete 42`).

The tradeoff is a subprocess spawn per operation, but for email volumes (tens per run, not thousands) this is negligible.
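
A subprocess wrapper along these lines is all that is needed. This is a sketch only: the function names are hypothetical, and the `runner` parameter exists so the wrapper can be exercised without a real mailbox. The only himalaya command used is the `message delete` form shown above.

```python
import subprocess


def delete_command(envelope_id: int) -> list[str]:
    """Build the argv for deleting one envelope."""
    return ["himalaya", "message", "delete", str(envelope_id)]


def run_himalaya(argv: list[str], runner=subprocess.run) -> bool:
    """Run one himalaya command and report whether it succeeded.

    `runner` defaults to subprocess.run; pass a stub in tests.
    """
    result = runner(argv, capture_output=True, text=True)
    return result.returncode == 0


# Example: run_himalaya(delete_command(42))
```

Passing the command as a list rather than a shell string avoids quoting problems with folder names and IDs.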

### Non-interactive design

Every command takes its full input as arguments, acts, and exits. No `input()` calls, no interactive loops. This makes the system compatible with cron/OpenClaw and composable with other scripts. The pending queue on disk (`pending_emails.json`) is the shared state between scan and review invocations.
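
One way to get this shape is `argparse` subcommands. The sketch below mirrors the commands listed under Usage; it is an assumption about how `main.py` could be organized, not a copy of it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    scan = sub.add_parser("scan", help="classify emails")
    scan.add_argument("--recent", type=int, metavar="DAYS")
    scan.add_argument("--dry-run", action="store_true")

    review = sub.add_parser("review", help="act on the pending queue")
    # selector: "list", "accept", "all", an envelope ID, or a msg_id
    review.add_argument("selector")
    # action is omitted for "list" and "accept"
    review.add_argument("action", nargs="?")

    digest = sub.add_parser("digest", help="recently processed emails")
    digest.add_argument("--recent", type=int, metavar="DAYS")

    sub.add_parser("stats", help="decision history summary")
    return parser
```

Since every input arrives through `parse_args`, nothing ever waits on stdin, which is what makes the commands safe to run from cron.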

### LLM classifies, history decides confidence

The LLM produces an action and category tags but NOT a confidence score. Confidence is computed from decision history by matching email signatures `(sender_email, tags)` against past decisions. This avoids a common problem with LLM self-reported confidence: scores tend to cluster around 85-95% regardless of actual certainty, which makes a fixed threshold ineffective.

The match-count cap (each match adds 10% to the maximum) replaces the old bootstrap/learning-phase logic. Confidence grows organically per-sender as history accumulates, rather than using a global gate.

### decision_history.json as the "database"

`data/decision_history.json` is the only persistent state that matters for learning. It's a flat JSON array: every decision (user or auto) is appended as an entry with tags. The classifier reads the whole file for each email to find relevant few-shot examples via relevance scoring, and `compute_confidence` scans it for matching signatures.

A flat JSON file works fine for hundreds or low thousands of decisions. SQLite would make sense if the history grows past ~10k entries and the linear scan becomes noticeable, or if concurrent writes from multiple processes become necessary. Neither applies at current scale.
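
Appending to a flat JSON array means rewriting the whole file. The sketch below shows one cautious way to do that; the write-to-temp-then-rename step is a suggestion rather than something the current code is known to do, and `append_decision` is a hypothetical name.

```python
import json
import os
import tempfile
from pathlib import Path


def append_decision(path, entry: dict) -> None:
    """Append one decision to the JSON array stored at `path`."""
    path = Path(path)
    history = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    history.append(entry)
    # Write to a temp file in the same directory, then rename over the
    # original, so a crash mid-write cannot leave a truncated history.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(history, f, indent=2)
    os.replace(tmp, path)
```

This protects against partial writes but not against two processes appending at once, which is the case where SQLite would start to pay off.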

### Few-shot learning via relevance scoring

Rather than sending the entire decision history to the LLM, `decision_store.get_relevant_examples()` scores each past decision against the current email using two signals:
- Exact sender email address match (+3 points)
- Subject keyword overlap (+1 per shared word, stop-words excluded)

The top 5 most relevant examples are injected into the prompt as few-shot demonstrations. This keeps the prompt small while giving the model the most useful context.
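
The scoring can be sketched as follows. The stop-word list, the `subject` field name, and the choice to drop zero-score entries are all illustrative assumptions; only the +3 / +1 weights and the top-5 cutoff come from the description above.

```python
import re

# Illustrative stop-word list; the real one may differ.
STOP_WORDS = {"the", "a", "an", "your", "you", "for", "to", "of",
              "and", "in", "on", "is", "from", "re", "fwd"}


def keywords(subject: str) -> set:
    """Lowercased subject words with stop-words removed."""
    words = re.findall(r"[a-z0-9]+", subject.lower())
    return {w for w in words if w not in STOP_WORDS}


def relevance(entry: dict, sender_email: str, subject: str) -> int:
    score = 3 if entry["sender_email"] == sender_email else 0
    score += len(keywords(subject) & keywords(entry["subject"]))
    return score


def get_relevant_examples(history, sender_email, subject, limit=5):
    """Return up to `limit` past decisions, most relevant first."""
    scored = [(relevance(e, sender_email, subject), e) for e in history]
    scored = [(s, e) for s, e in scored if s > 0]  # assumption: skip irrelevant
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:limit]]
```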

### Fixed tag taxonomy

Tags are defined in `classifier.py` as `TAG_TAXONOMY` — a manually curated list of 14 categories. The LLM must pick from this list (invalid tags are silently dropped). The taxonomy should stay fixed to keep history matching stable. See "Refining the Tag Taxonomy" above for when and how to update it.
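
Dropping invalid tags amounts to a filter against the list. A minimal sketch, assuming the tags arrive as a list of strings in the parsed LLM response; the lowercasing and de-duplication are extra normalization added here for illustration, and `clean_tags` is a hypothetical name.

```python
TAG_TAXONOMY = [
    "receipt", "billing", "shipping",
    "promotion", "newsletter", "security", "social",
    "reminder", "confirmation", "alert",
    "personal", "account", "subscription", "travel",
]


def clean_tags(raw_tags) -> list:
    """Keep only taxonomy tags, in order, without duplicates."""
    cleaned = []
    for tag in raw_tags:
        tag = str(tag).strip().lower()
        if tag in TAG_TAXONOMY and tag not in cleaned:
            cleaned.append(tag)
    return cleaned
```

If every tag is dropped, the result is an empty list, which cannot match any history and so falls back to the 50% default.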

### `keep` means unread

The `keep` action is a deliberate no-op — it leaves the email unread in the inbox, meaning it needs human attention. This is distinct from `mark_read`, which dismisses low-priority emails without moving them.

### Envelope IDs

Emails are identified by their envelope ID, which is himalaya's IMAP UID — a stable, unique identifier assigned by the mail server. UIDs don't shift when other messages are deleted or moved, so the same envelope ID always refers to the same email. Review commands use envelope IDs directly (e.g., `review 93 delete`). The `msg_id` hash (e.g., `msg_f1d43ea3`) is an internal key for the pending queue and can also be used as a selector.
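
Resolving a review selector against the pending queue could look like the sketch below. The `envelope_id` and `msg_id` field names, and the helper itself, are assumptions about the layout of `pending_emails.json`.

```python
def find_pending(pending: list, selector: str) -> list:
    """Return the pending entries a review selector refers to.

    selector is "all", a numeric envelope ID such as "42",
    or an internal key such as "msg_f1d43ea3".
    """
    if selector == "all":
        return list(pending)
    if selector.isdigit():
        return [p for p in pending if p["envelope_id"] == int(selector)]
    return [p for p in pending if p["msg_id"] == selector]
```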

### Fail-safe classification

If the LLM call fails (Ollama down, model not loaded, timeout), the classifier returns `action="keep"` with empty tags. Empty tags cannot match any past decision, so confidence falls back to the no-match default of 50% (below threshold), and the email gets queued for manual review rather than being auto-acted upon. The system never auto-trashes an email it couldn't classify.
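
The fallback can be expressed as a thin wrapper around whatever function talks to Ollama. In this sketch `classify` is a stand-in callable returning `(action, tags)`; the real classifier's interface may differ.

```python
def classify_safely(classify, email):
    """Call `classify(email)`; on any failure fall back to keep + no tags.

    `classify` is assumed to return an (action, tags) tuple and to raise
    on connection errors, timeouts, or unparseable responses.
    """
    try:
        action, tags = classify(email)
    except Exception:
        # Fail safe: never guess a destructive action.
        return "keep", []
    return action, tags
```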