# Email Processor

Learning-based mailbox cleanup using Himalaya (IMAP) + Ollama (local LLM). Classifies emails with category tags, computes confidence from decision history, and gradually automates common actions.

## Prerequisites

- **uv** — Python package manager, handles venv and dependencies automatically.
- **Himalaya** — CLI email client, handles IMAP connection and auth.
- **Ollama** — local LLM server.

```bash
# Install uv (macOS)
brew install uv

# Install himalaya (macOS)
brew install himalaya

# Configure himalaya for your IMAP account (first time only)
himalaya account list   # should show your account after setup

# Install and start Ollama, pull the model
brew install ollama
ollama pull kamekichi128/qwen3-4b-instruct-2507:latest
```

## How It Works

The system separates **classification** (what the LLM does) from **confidence** (computed from your decision history). The LLM picks an action and category tags. Confidence is computed by matching the email's signature `(sender_email, tags)` against past decisions.

### Early On (few decisions)

1. **Cron runs `scan`.** For each email, the LLM suggests an action and assigns tags from a fixed taxonomy. Since there's no history yet, `compute_confidence` returns 50% (below the 85% threshold), so everything gets queued.

2. **You run `review list`.** It prints what's pending:
   ```
   1. [msg_f1d43ea3] Subject: New jobs matching your profile
      From: LinkedIn
      Tags: [promotion, social, notification]
      Suggested: delete (50%)
   2. [msg_60c56a87] Subject: Your order shipped
      From: Amazon
      Tags: [shipping, confirmation, notification]
      Suggested: archive (50%)
   ```

3. **You act on them.** Either individually or in bulk:
   ```bash
   ./email-processor.sh review 1 delete    # agree with suggestion
   ./email-processor.sh review 2 archive   # agree with suggestion
   ./email-processor.sh review accept      # accept all suggestions at once
   ```
   Each command executes via himalaya and appends to `decision_history.json` with tags.

4. **Next scan is smarter.** Confidence grows as consistent history accumulates for each sender+tags signature.

### Steady State (10+ consistent decisions per sender)

- **Repeat senders** with consistent tag signatures reach 85%+ confidence and get auto-acted during `scan`. They never touch the pending queue.
- **New or ambiguous senders** start at 50% and get queued.
- **You occasionally run `review list`** to handle stragglers — each decision further builds history.
- **`stats` shows your automation rate** climbing over time.

### Confidence Computation

Confidence is computed by `decision_store.compute_confidence(sender_email, action, tags)`:

1. **Find matches** — past decisions with the same sender email AND at least 50% tag overlap, where overlap is the number of shared tags divided by the size of the smaller tag set (`len(shared_tags) / min(len(current_tags), len(past_tags)) >= 0.5`).
2. **Agreement** — the fraction of matches that chose the same action the LLM is suggesting.
3. **Match-count cap** — each match raises the maximum confidence by 10 percentage points (1 match caps at 10%, 5 at 50%, 10 or more at 100%).
4. **Final confidence** = `min(agreement, cap)`.
5. **No matches** = 50% (below threshold, gets queued).

Eight matches cap confidence at 80%, so reaching the 85% auto-action threshold requires at least 9 matching decisions from the same sender with overlapping tags, and at least 85% of them must agree with the suggested action.

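
The five rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `decision_store.py`: it assumes history entries are dicts with `sender_email`, `action`, and `tags` keys, it takes the history as an explicit fourth argument, and `tag_overlap` is a hypothetical helper name.

```python
def tag_overlap(current, past):
    """Shared tags divided by the size of the smaller tag set."""
    cur, old = set(current), set(past)
    if not cur or not old:
        return 0.0
    return len(cur & old) / min(len(cur), len(old))


def compute_confidence(sender_email, action, tags, history):
    """Return a confidence in [0, 1] following rules 1-5.

    `history` is a list of dicts; the key names used here are assumptions.
    """
    # Rule 1: same sender and at least 50% tag overlap.
    matches = [
        entry for entry in history
        if entry["sender_email"] == sender_email
        and tag_overlap(tags, entry["tags"]) >= 0.5
    ]
    if not matches:
        return 0.5  # Rule 5: no matches, stays below the threshold.

    # Rule 2: fraction of matches that chose the suggested action.
    agreement = sum(e["action"] == action for e in matches) / len(matches)
    # Rule 3: each match raises the ceiling by 10 points, up to 100%.
    cap = min(len(matches) / 10, 1.0)
    # Rule 4: the lower of the two wins.
    return min(agreement, cap)
```

With nine matching `delete` decisions for one sender, this sketch returns 0.9 for a suggested `delete`. With eight it returns 0.8, so the email would still be queued at the 85% threshold.
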
## Usage

All commands are non-interactive — they take arguments, act, and exit. Compatible with cron/OpenClaw.

```bash
# Make the entry script executable (first time)
chmod +x email-processor.sh

# --- Scan ---
./email-processor.sh scan                          # classify unseen emails
./email-processor.sh scan --recent 30              # classify last 30 days
./email-processor.sh scan --dry-run                # classify only, no changes
./email-processor.sh scan --recent 7 --dry-run     # combine both

# --- Review ---
./email-processor.sh review list                   # show pending queue
./email-processor.sh review 1 delete               # delete email #1
./email-processor.sh review msg_f1d43ea3 archive   # archive by ID
./email-processor.sh review all delete             # delete all pending
./email-processor.sh review accept                 # accept all suggestions

# --- Other ---
./email-processor.sh stats                         # show decision history
```

Or call Python directly: `python main.py scan --dry-run`

## Actions

| Action | Effect |
|---|---|
| `delete` | Move to Trash (`himalaya message delete`) |
| `archive` | Move to Archive folder |
| `keep` | Leave unread in inbox (no changes) |
| `mark_read` | Add `\Seen` flag, stays in inbox |
| `label:<name>` | Move to named folder (created if needed) |

## Tag Taxonomy

The LLM assigns 3-5 tags from this fixed list to each email:

```
receipt, invoice, payment, billing, shipping, delivery,
promotion, discount, marketing, newsletter, notification,
security, social, reminder, confirmation, update, alert,
personal, account, subscription, travel
```

Tags serve one purpose: making signature matching work for confidence computation. They need to be specific enough to distinguish different email types from the same sender that you'd treat differently (e.g., `[account, security]` for a password reset vs `[promotion, marketing]` for a promo, both from the same service).

### Refining the Tag Taxonomy

The tag list should stay fixed and manually curated. Automatic expansion risks breaking history matching — if old entries use different tags than new ones, confidence computation degrades. Refinement is a periodic, manual process — run it when you notice confidence problems in the logs (e.g., same sender getting inconsistent actions, or emails being queued that should be auto-acted).

#### When to refine

Run this process when any of these are true:
- A sender you've reviewed 10+ times is still getting queued (confidence stuck below 85%).
- The same sender has a mix of actions in history (e.g., some deleted, some kept) and you suspect the tags aren't distinguishing the email types.
- You're seeing tags in the logs that feel too vague for the emails they describe.

#### Step-by-step process

**Step 1: Find senders with inconsistent actions.**

Load `data/decision_history.json` and group entries by sender email address. For each sender, check if multiple different actions were taken. These are the candidates — the tag taxonomy may not be specific enough to separate their email types.

Example: sender `noreply@example.com` has 8 entries with action `delete` and 4 entries with action `keep`. That's a split worth investigating.

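
Step 1 can be scripted. The sketch below assumes each history entry is a dict with `sender_email` and `action` keys; those field names and the function name `find_split_senders` are assumptions, so adjust them to match the actual file.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path


def find_split_senders(history):
    """Map each sender with more than one distinct action to its action counts."""
    actions_by_sender = defaultdict(Counter)
    for entry in history:
        # "sender_email" and "action" are assumed field names.
        actions_by_sender[entry["sender_email"]][entry["action"]] += 1
    return {
        sender: dict(counts)
        for sender, counts in actions_by_sender.items()
        if len(counts) > 1
    }


if __name__ == "__main__":
    path = Path("data/decision_history.json")
    if path.exists():
        history = json.loads(path.read_text(encoding="utf-8"))
        for sender, counts in find_split_senders(history).items():
            print(sender, counts)
```

For the example above, the result would contain `noreply@example.com` mapped to `{"delete": 8, "keep": 4}`. Senders with a single action are left out.
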
**Step 2: For each candidate sender, examine the entries.**

Look at the subject lines, summaries, and current tags of the entries that got different actions. Identify the pattern — what makes the "delete" emails different from the "keep" emails?

Example:
- Deleted emails: subjects like "50% off sale", "Weekly deals" → tags: `[promotion, notification, update]`
- Kept emails: subjects like "Your password was changed", "New login from Chrome" → tags: `[security, notification, update]`

The two tag sets share `notification` and `update`, an overlap of 2/3, which clears the 50% cutoff. Both groups therefore match as the same signature, and the mixed actions drag confidence down.

**Step 3: Determine if a new tag would fix it.**

Ask: is there a category that applies to one group but not the other? In the example above, an `account` tag would distinguish password/login emails from promotional emails. Check if the tag already exists in `TAG_TAXONOMY` in `classifier.py` — it might just be that the LLM isn't using it consistently.

If the tag already exists: the problem is LLM consistency, not the taxonomy. Consider adjusting the prompt or few-shot examples.

If the tag doesn't exist: propose a new tag.

**Step 4: Validate the proposed tag.**

Before adding, check that the new tag:
- Is **distinct** from existing tags (not a synonym — e.g., don't add `promo` when `promotion` exists).
- Is **broadly useful** — it should apply to emails from multiple senders, not just the one you're debugging.
- Doesn't **overlap** with an existing tag in a way that would confuse the LLM (e.g., adding `order` when `receipt` and `shipping` already cover those cases).

**Step 5: Add the tag to `TAG_TAXONOMY` in `classifier.py`.**

Add the new tag to the `TAG_TAXONOMY` list in `classifier.py:30-38`. Keep the list organized by category. The LLM prompt automatically picks up the updated list on the next scan.

**Step 6: Decide whether to wipe history.**

- If you added 1-2 tags: **don't wipe**. Old entries without the new tags will gradually be outweighed by new entries that have them. The 50% overlap threshold is forgiving enough that old entries still contribute during the transition.
- If you added 3+ tags or the new tags fundamentally change how common email types would be tagged: **wipe** `data/decision_history.json` and `data/pending_emails.json`. Old entries without the new tags are dead weight — they won't match new entries and won't contribute to confidence.

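
A quick worked example of why the number of new tags matters, using the overlap rule from the Confidence Computation section. The tags `login`, `device`, and `identity` are hypothetical additions invented for this illustration, and `tag_overlap` is a hypothetical helper name.

```python
def tag_overlap(current, past):
    """Shared tags divided by the size of the smaller tag set."""
    cur, old = set(current), set(past)
    if not cur or not old:
        return 0.0
    return len(cur & old) / min(len(cur), len(old))


# An entry recorded before the taxonomy change.
old_entry = ["security", "notification", "update"]

# After adding one tag, a similar email still shares two of three tags.
one_new_tag = ["security", "notification", "login"]

# After adding three tags, the same kind of email may share none.
three_new_tags = ["login", "device", "identity"]

print(tag_overlap(one_new_tag, old_entry) >= 0.5)     # True: old entry still matches
print(tag_overlap(three_new_tags, old_entry) >= 0.5)  # False: old entry is dead weight
```
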
**Step 7: Verify with a dry run.**

```bash
./email-processor.sh scan --recent 7 --dry-run
```

Check the logs for the affected senders:
- Are the new tags being assigned?
- Are different email types from the same sender now getting different tag sets?
- If history was preserved, is confidence trending correctly?

#### Rules

- **Only add tags, never rename.** Renaming `billing` to `finance` means old entries with `billing` never match new entries with `finance`. If you must rename, keep both in the taxonomy.
- **Avoid deleting tags.** Old entries with deleted tags become slightly less useful (fewer matching tags) but don't cause incorrect matches. Only delete a tag if it's actively causing confusion (e.g., the LLM uses it inconsistently and it's hurting overlap calculations).
- **Keep the taxonomy small.** More tags = more choices for the LLM = more inconsistency. The taxonomy should have the minimum number of tags needed to distinguish email types that deserve different actions. 20-30 tags is a reasonable range.

## Configuration

`config.json` holds only the Ollama, prompt-size, and automation settings. IMAP auth is managed by himalaya's own config.

```json
{
  "ollama": {
    "host": "http://localhost:11434",
    "model": "kamekichi128/qwen3-4b-instruct-2507:latest"
  },
  "rules": {
    "max_body_length": 1000
  },
  "automation": {
    "confidence_threshold": 85
  }
}
```

| Key | Description |
|---|---|
| `ollama.host` | Ollama server URL. Default `http://localhost:11434`. |
| `ollama.model` | Ollama model to use for classification. |
| `rules.max_body_length` | Max characters of email body sent to the LLM. Longer bodies are truncated. Keeps prompt size and latency down. |
| `automation.confidence_threshold` | Minimum confidence (0-100) for auto-action. Emails below this get queued for review. At 85%, you need at least 9 consistent decisions from the same sender with overlapping tags before auto-action kicks in. |

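
As a rough illustration of how these keys might be consumed, the sketch below parses the config, truncates a body, and compares a confidence value against the threshold. The function names are hypothetical, and only the `ollama.host` default is documented above; the other fallback values are simply copied from the sample file as an assumption.

```python
import json

# Fallbacks used when a key is missing. Only the host default is documented;
# the other two values mirror the sample config and are assumptions.
DEFAULTS = {
    "ollama": {"host": "http://localhost:11434"},
    "rules": {"max_body_length": 1000},
    "automation": {"confidence_threshold": 85},
}


def load_config(text):
    """Parse config JSON text and fill in missing sections or keys from DEFAULTS."""
    loaded = json.loads(text)
    return {
        section: {**defaults, **loaded.get(section, {})}
        for section, defaults in DEFAULTS.items()
    }


def truncate_body(body, config):
    """Cut the email body to at most rules.max_body_length characters."""
    return body[: config["rules"]["max_body_length"]]


def should_auto_act(confidence_pct, config):
    """True when confidence (0-100) meets the auto-action threshold."""
    return confidence_pct >= config["automation"]["confidence_threshold"]
```

Raising `confidence_threshold` to 90 in this sketch means a 90% confidence still auto-acts, while anything lower is queued.
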
## Testing

```bash
# 1. Verify himalaya can reach your mailbox
himalaya envelope list --page-size 3

# 2. Verify Ollama is running with the model
ollama list   # should show kamekichi128/qwen3-4b-instruct-2507:latest

# 3. Dry run — classify recent emails without touching anything
./email-processor.sh scan --recent 7 --dry-run

# 4. Live run — classify and act (auto-act or queue)
./email-processor.sh scan --recent 7

# 5. Check what got queued
./email-processor.sh review list

# 6. Act on a queued email to seed decision history
./email-processor.sh review 1 delete

# 7. Check that the decision was recorded
./email-processor.sh stats
```

## File Structure

```
email_processor/
  main.py                   # Entry point — scan/review/stats subcommands
  classifier.py             # LLM prompt builder + response parser, tag taxonomy
  decision_store.py         # Decision history, confidence computation, few-shot retrieval
  config.json               # Ollama + automation settings
  email-processor.sh        # Shell wrapper (activates venv, forwards args)
  data/
    pending_emails.json     # Queue of emails awaiting review
    decision_history.json   # Past decisions (few-shot learning + confidence data)
  logs/
    YYYY-MM-DD.log          # Daily processing logs
    llm_YYYY-MM-DD.log      # Full LLM prompt/response logs
```

## Design Decisions

### Himalaya instead of raw IMAP

All IMAP operations go through the `himalaya` CLI via subprocess calls. This means:
- No IMAP credentials stored in config.json — himalaya manages its own auth.
- No connection management, reconnect logic, or SSL setup in Python.
- Each action is a single himalaya command (e.g., `himalaya message delete 42`).

The tradeoff is a subprocess spawn per operation, but for email volumes (tens per run, not thousands) this is negligible.

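
A minimal sketch of what one such subprocess call could look like. The helper names `himalaya_command` and `run_himalaya` and the 60-second timeout are assumptions for illustration, not the project's actual wrapper.

```python
import subprocess


def himalaya_command(*args):
    """Build the argv list for one himalaya invocation."""
    return ["himalaya", *args]


def run_himalaya(*args, timeout=60):
    """Run one himalaya command and return its stdout.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the command takes too long.
    """
    result = subprocess.run(
        himalaya_command(*args),
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout
```

Deleting message 42 would then be `run_himalaya("message", "delete", "42")`. Passing the arguments as a list avoids shell quoting issues with folder names.
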
### Non-interactive design

Every command takes its full input as arguments, acts, and exits. No `input()` calls, no interactive loops. This makes the system compatible with cron/OpenClaw and composable with other scripts. The pending queue on disk (`pending_emails.json`) is the shared state between scan and review invocations.

### LLM classifies, history decides confidence

The LLM produces an action and category tags but NOT a confidence score. Confidence is computed from decision history by matching email signatures `(sender_email, tags)` against past decisions. This avoids the problem of LLMs clustering their self-reported confidence around 85-95% regardless of actual certainty, which makes threshold systems ineffective.

The match-count cap (each match adds 10% to the maximum) replaces the old bootstrap/learning-phase logic. Confidence grows organically per-sender as history accumulates, rather than using a global gate.

### decision_history.json as the "database"

`data/decision_history.json` is the only persistent state that matters for learning. It's a flat JSON array — every decision (user or auto) is appended as an entry with tags. The classifier reads the whole file on each email to find relevant few-shot examples via relevance scoring, and `compute_confidence` scans it for matching signatures.

A flat JSON file works fine for hundreds or low thousands of decisions. SQLite would make sense if the history grows past ~10k entries and the linear scan becomes noticeable, or if concurrent writes from multiple processes become necessary. Neither applies at current scale.

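
The read-everything, append, write-everything pattern for a flat JSON array can be sketched as follows. The function names and the entry fields in the usage note are hypothetical; the real entry schema is not documented here.

```python
import json
from pathlib import Path


def load_history(path):
    """Return the full decision list, or an empty list if the file is missing."""
    path = Path(path)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def append_decision(path, entry):
    """Append one decision and rewrite the whole file; returns the new length."""
    history = load_history(path)
    history.append(entry)
    Path(path).write_text(json.dumps(history, indent=2), encoding="utf-8")
    return len(history)
```

A call might look like `append_decision("data/decision_history.json", {"sender_email": "noreply@example.com", "action": "delete", "tags": ["promotion"]})`. Because the whole file is rewritten each time, this sketch is not safe for concurrent writers, which is the case where the text above suggests SQLite.
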
### Few-shot learning via relevance scoring

Rather than sending the entire decision history to the LLM, `decision_store.get_relevant_examples()` scores each past decision against the current email using two signals:
- Exact sender email address match (+3 points)
- Subject keyword overlap (+1 per shared word, stop-words excluded)

The top 5 most relevant examples are injected into the prompt as few-shot demonstrations. This keeps the prompt small while giving the model the most useful context.

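
The scoring described above could look roughly like this. It is a sketch under several assumptions: entries are dicts with `sender_email` and `subject` keys, the stop-word list is a small made-up sample, the function signature is invented, and dropping zero-score entries is a choice made here rather than documented behavior.

```python
import re

# A tiny illustrative stop-word list; the real list is not documented.
STOP_WORDS = {"a", "an", "and", "for", "from", "in", "of", "on", "the", "to", "your"}


def subject_words(subject):
    """Lowercased words of a subject line with stop-words removed."""
    words = re.findall(r"[a-z0-9]+", subject.lower())
    return {word for word in words if word not in STOP_WORDS}


def relevance(entry, sender_email, subject):
    """+3 for an exact sender match, +1 per shared subject word."""
    score = 3 if entry["sender_email"] == sender_email else 0
    score += len(subject_words(entry["subject"]) & subject_words(subject))
    return score


def get_relevant_examples(history, sender_email, subject, limit=5):
    """Return up to `limit` past decisions, highest score first."""
    scored = [(relevance(entry, sender_email, subject), entry) for entry in history]
    scored = [pair for pair in scored if pair[0] > 0]  # assumption: skip irrelevant entries
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in file order
    return [entry for _, entry in scored[:limit]]
```
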
### Fixed tag taxonomy

Tags are defined in `classifier.py` as `TAG_TAXONOMY` — a manually curated list of 21 categories. The LLM must pick from this list (invalid tags are silently dropped). The taxonomy should stay fixed to keep history matching stable. See "Refining the Tag Taxonomy" above for when and how to update it.

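
Silently dropping invalid tags can be as simple as a membership filter. In the sketch below the list contents come from the Tag Taxonomy section, while the function name `filter_tags`, the lowercasing, and the de-duplication are assumptions added for illustration.

```python
TAG_TAXONOMY = [
    "receipt", "invoice", "payment", "billing", "shipping", "delivery",
    "promotion", "discount", "marketing", "newsletter", "notification",
    "security", "social", "reminder", "confirmation", "update", "alert",
    "personal", "account", "subscription", "travel",
]


def filter_tags(raw_tags):
    """Keep only taxonomy tags, normalized to lowercase, in first-seen order."""
    allowed = set(TAG_TAXONOMY)
    kept = []
    for tag in raw_tags:
        cleaned = tag.strip().lower()
        # Unknown tags are dropped without raising; duplicates are skipped.
        if cleaned in allowed and cleaned not in kept:
            kept.append(cleaned)
    return kept
```

For example, `filter_tags(["Promotion", "promo", "newsletter"])` keeps `promotion` and `newsletter` and drops the unknown `promo`.
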
### `keep` means unread

The `keep` action is a deliberate no-op — it leaves the email unread in the inbox, meaning it needs human attention. This is distinct from `mark_read`, which dismisses low-priority emails without moving them.

### Fail-safe classification

If the LLM call fails (Ollama down, model not loaded, timeout), the classifier returns `action="keep"` with empty tags. Empty tags produce 50% confidence (below threshold), so the email gets queued for manual review rather than being auto-acted upon. The system never auto-trashes an email it couldn't classify.
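
The fallback can be pictured as a thin wrapper around the LLM call. This is a sketch only: `classify_safely` and the `call_llm` callable are hypothetical names, and the result is assumed to be a dict with `action` and `tags` keys.

```python
def classify_safely(email, call_llm):
    """Return the LLM's classification, or a safe default if anything fails.

    `call_llm` is any callable that takes the email and returns a dict with
    "action" and "tags"; it stands in for the real Ollama request.
    """
    try:
        result = call_llm(email)
        return {"action": result["action"], "tags": list(result["tags"])}
    except Exception:
        # Connection errors, timeouts, and malformed responses all end up here.
        # "keep" with no tags yields 50% confidence, so the email gets queued.
        return {"action": "keep", "tags": []}
```

Catching `Exception` broadly is deliberate in this sketch, since any failure should lead to manual review rather than an automatic action.
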
|