# Email Processor

Learning-based mailbox cleanup using Himalaya (IMAP) + Ollama (local LLM). Classifies emails, learns from your decisions over time, and gradually automates common actions.

## Prerequisites

- **Himalaya** — CLI email client, handles IMAP connection and auth.
- **Ollama** — local LLM server.
- **Python 3.8+**

```bash
# Install himalaya (macOS)
brew install himalaya

# Configure himalaya for your IMAP account first (see himalaya's docs), then confirm it is listed
himalaya account list  # should show your account after setup

# Install and start Ollama, pull the model
brew install ollama
brew services start ollama  # or run `ollama serve` in a separate terminal
ollama pull kamekichi128/qwen3-4b-instruct-2507:latest

# Set up Python venv and install dependencies
cd scripts/email_processor
python3 -m venv venv
source venv/bin/activate
pip install ollama
```

## How It Works

The system has two phases: a **learning phase** where it builds up knowledge from your decisions, and a **steady state** where it handles most emails automatically.

### Learning Phase (first ~20 decisions)

The confidence threshold is automatically raised to 95%. Most emails get queued.

1. **Cron runs `scan`.** For each unseen email, the classifier uses Qwen3's general knowledge (no history yet) to suggest an action. Most come back at 60-80% confidence — below the 95% threshold — so they get saved to `pending_emails.json` with the suggestion attached. A few obvious spam emails might hit 95%+ and get auto-deleted.

2. **You run `review list`.** It prints what's pending:
   ```
     1. [msg_f1d43ea3]  Subject: New jobs matching your profile
        From: LinkedIn    Suggested: delete (82%)
     2. [msg_60c56a87]  Subject: Your order shipped
        From: Amazon      Suggested: archive (78%)
     3. [msg_ebd24205]  Subject: Meeting tomorrow at 3pm
        From: Coworker    Suggested: keep (70%)
   ```

3. **You act on them.** Either individually or in bulk:
   ```bash
   ./email-processor.sh review 1 delete     # agree with suggestion
   ./email-processor.sh review 2 archive    # agree with suggestion
   ./email-processor.sh review accept       # accept all suggestions at once
   ```
   Each command executes the action via himalaya, appends the decision to `data/decision_history.json`, and marks the pending entry as done.

4. **Next scan is smarter.** The classifier now has few-shot examples in the prompt:
   ```
   History for linkedin.com: delete 2 times
   --- Past decisions ---
   From: LinkedIn | Subject: New jobs matching your profile -> delete
   From: Amazon | Subject: Your package delivered -> archive
   ---
   ```
   Confidence scores climb. You keep reviewing. History grows.
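
As a rough illustration of step 4, the few-shot section of the prompt could be assembled along these lines. This is a sketch, not the project's actual prompt builder: the function name `format_history_block` and the dict keys (`sender`, `domain`, `subject`, `action`) are assumptions made for the example.

```python
from collections import Counter


def format_history_block(sender_domain, examples):
    """Render past decisions as the few-shot section of the prompt.

    `examples` is a list of dicts with hypothetical keys:
    sender, domain, subject, action.
    """
    # Count how often each action was chosen for this sender's domain.
    counts = Counter(e["action"] for e in examples if e["domain"] == sender_domain)
    lines = []
    if counts:
        summary = ", ".join(f"{action} {n} times" for action, n in counts.most_common())
        lines.append(f"History for {sender_domain}: {summary}")
    lines.append("--- Past decisions ---")
    for e in examples:
        lines.append(f"From: {e['sender']} | Subject: {e['subject']} -> {e['action']}")
    lines.append("---")
    return "\n".join(lines)
```

The returned string would be spliced into the classification prompt ahead of the email being classified.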

### Steady State (20+ decisions)

The threshold drops to the configured 75%. The classifier has rich context.

- **Repeat senders** (LinkedIn, Amazon, Uber) get auto-acted at 85-95% confidence during `scan`. They never touch the pending queue.
- **New or ambiguous senders** may fall below 75% and get queued.
- **You occasionally run `review list`** to handle stragglers — each decision further improves future classifications.
- **`stats` shows your automation rate** climbing: 60%, 70%, 80%+.

The pending queue shrinks over time. It's not a backlog — it's an ever-narrowing set of emails the system hasn't learned to handle yet.
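
How `stats` computes the automation rate is not spelled out here. One plausible definition is the share of recorded decisions that were made automatically. The sketch below assumes each history entry carries a hypothetical `source` field set to `"auto"` or `"user"`.

```python
def automation_rate(history):
    """Percentage (0-100) of decisions that were made automatically.

    Assumes each entry has a hypothetical "source" key: "auto" or "user".
    """
    if not history:
        return 0.0
    auto = sum(1 for d in history if d.get("source") == "auto")
    return 100.0 * auto / len(history)
```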

## Usage

All commands are non-interactive — they take arguments, act, and exit. Compatible with cron/OpenClaw.

```bash
# Make the entry script executable (first time)
chmod +x email-processor.sh

# --- Scan ---
./email-processor.sh scan                         # classify unseen emails
./email-processor.sh scan --recent 30             # classify last 30 days
./email-processor.sh scan --dry-run               # classify only, no changes
./email-processor.sh scan --recent 7 --dry-run    # combine both

# --- Review ---
./email-processor.sh review list                  # show pending queue
./email-processor.sh review 1 delete              # delete email #1
./email-processor.sh review msg_f1d43ea3 archive  # archive by ID
./email-processor.sh review all delete            # delete all pending
./email-processor.sh review accept                # accept all suggestions

# --- Other ---
./email-processor.sh stats                        # show decision history and automation rate
```

Or, with the venv activated, call Python directly: `python main.py scan --dry-run`

## Actions

| Action | Effect |
|---|---|
| `delete` | Move to Trash (`himalaya message delete`) |
| `archive` | Move to Archive folder |
| `keep` | Leave unread in inbox (no changes) |
| `mark_read` | Add `\Seen` flag, stays in inbox |
| `label:<name>` | Move to named folder (created if needed) |

## Auto-Action Criteria

Scan auto-acts when the classifier's confidence meets the threshold. During the learning phase (fewer than `bootstrap_min_decisions` total decisions, default 20), a higher threshold of 95% is used automatically. Once enough history accumulates, the configured `confidence_threshold` (default 75%) takes over.

This means on day one, only very obvious emails (spam, clear promotions) get auto-acted. As you review emails and build history, the system gradually handles more on its own.
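
The threshold rule can be written in a few lines. This is a minimal sketch of the logic described above; the function names are illustrative, and it assumes "meets the threshold" means greater than or equal.

```python
BOOTSTRAP_THRESHOLD = 95  # threshold used during the learning phase


def effective_threshold(num_decisions, confidence_threshold=75, bootstrap_min_decisions=20):
    """Return the confidence (0-100) required for auto-action right now."""
    if num_decisions < bootstrap_min_decisions:
        return BOOTSTRAP_THRESHOLD
    return confidence_threshold


def should_auto_act(confidence, num_decisions, **settings):
    """True if the classifier's confidence meets the current threshold."""
    return confidence >= effective_threshold(num_decisions, **settings)
```

For example, an 82% suggestion is queued while the history holds 5 decisions, but auto-acted once it holds 25.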

## Configuration

`config.json` holds the Ollama, body-truncation (`rules`), and automation settings. IMAP auth is managed by himalaya's own config.

```json
{
  "ollama": {
    "host": "http://localhost:11434",
    "model": "kamekichi128/qwen3-4b-instruct-2507:latest"
  },
  "rules": {
    "max_body_length": 1000
  },
  "automation": {
    "confidence_threshold": 75,
    "bootstrap_min_decisions": 20
  }
}
```

| Key | Description |
|---|---|
| `ollama.host` | Ollama server URL. Default `http://localhost:11434`. |
| `ollama.model` | Ollama model to use for classification. |
| `rules.max_body_length` | Max characters of email body sent to the LLM. Longer bodies are truncated. Keeps prompt size and latency down. |
| `automation.confidence_threshold` | Minimum confidence (0-100) for auto-action in steady state. Emails below this get queued for review. Lower = more automation, higher = more manual review. |
| `automation.bootstrap_min_decisions` | Number of decisions needed before leaving the learning phase. During the learning phase, the threshold is raised to 95% regardless of `confidence_threshold`. Set to 0 to skip the learning phase entirely. |
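
A sketch of how these settings might be loaded and applied. The helper names (`merge_config`, `load_config`, `truncate_body`) are hypothetical, and falling back to defaults for missing keys is an assumption; the default values simply mirror the example config above.

```python
import copy
import json

# Defaults mirror the example config.json above.
DEFAULT_CONFIG = {
    "ollama": {
        "host": "http://localhost:11434",
        "model": "kamekichi128/qwen3-4b-instruct-2507:latest",
    },
    "rules": {"max_body_length": 1000},
    "automation": {"confidence_threshold": 75, "bootstrap_min_decisions": 20},
}


def merge_config(user_config):
    """Overlay user settings on the defaults, one section at a time."""
    merged = copy.deepcopy(DEFAULT_CONFIG)
    for section, values in user_config.items():
        merged.setdefault(section, {}).update(values)
    return merged


def load_config(path="config.json"):
    """Read config.json and fill in any missing keys from the defaults."""
    with open(path, encoding="utf-8") as f:
        return merge_config(json.load(f))


def truncate_body(body, config):
    """Cut the email body to rules.max_body_length characters."""
    return body[: config["rules"]["max_body_length"]]
```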

## Testing

```bash
# 1. Verify himalaya can reach your mailbox
himalaya envelope list --page-size 3

# 2. Verify Ollama is running with the model
ollama list  # should show kamekichi128/qwen3-4b-instruct-2507:latest

# 3. Dry run — classify recent emails without touching anything
./email-processor.sh scan --recent 7 --dry-run

# 4. Live run — classify and act (auto-act or queue)
./email-processor.sh scan --recent 7

# 5. Check what got queued
./email-processor.sh review list

# 6. Act on a queued email to seed decision history
./email-processor.sh review 1 delete

# 7. Check that the decision was recorded
./email-processor.sh stats
```

## File Structure

```
email_processor/
  main.py              # Entry point — scan/review/stats subcommands
  classifier.py        # LLM prompt builder + response parser
  decision_store.py    # Decision history storage + few-shot retrieval
  config.json          # Ollama + automation settings
  email-processor.sh   # Shell wrapper (activates venv, forwards args)
  data/
    pending_emails.json    # Queue of emails awaiting review
    decision_history.json  # Past decisions (few-shot learning data)
  logs/
    YYYY-MM-DD.log         # Daily processing logs
```

## Design Decisions

### Himalaya instead of raw IMAP

All IMAP operations go through the `himalaya` CLI via subprocess calls. This means:
- No IMAP credentials stored in config.json — himalaya manages its own auth.
- No connection management, reconnect logic, or SSL setup in Python.
- Each action is a single himalaya command (e.g., `himalaya message delete 42`).

The tradeoff is a subprocess spawn per operation, but for email volumes (tens per run, not thousands) this is negligible.
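
A thin wrapper around the CLI might look like the sketch below. The helper names are hypothetical, and only the `himalaya message delete <id>` form shown above is used. The `runner` parameter is an addition for this example so the wrapper can be exercised without a real mailbox.

```python
import subprocess


def run_himalaya(*args, runner=subprocess.run):
    """Run one himalaya command and return its stdout.

    Raises subprocess.CalledProcessError if himalaya exits non-zero.
    """
    result = runner(["himalaya", *args], capture_output=True, text=True, check=True)
    return result.stdout


def delete_message(message_id, runner=subprocess.run):
    # Equivalent to running: himalaya message delete <message_id>
    return run_himalaya("message", "delete", str(message_id), runner=runner)
```

Because `check=True` turns a failed command into an exception, the caller can decide whether to leave the email in the pending queue when an action does not go through.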

### Non-interactive design

Every command takes its full input as arguments, acts, and exits. No `input()` calls, no interactive loops. This makes the system compatible with cron/OpenClaw and composable with other scripts. The pending queue on disk (`pending_emails.json`) is the shared state between scan and review invocations.
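
One way to get this shape is `argparse` subcommands. The parser below is a sketch that mirrors the commands listed under Usage; it is not necessarily how `main.py` is written, and the argument names `target` and `action` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the scan / review / stats subcommands."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    scan = sub.add_parser("scan")
    scan.add_argument("--recent", type=int, metavar="DAYS")
    scan.add_argument("--dry-run", action="store_true")

    review = sub.add_parser("review")
    # target: "list", "accept", "all", a queue number, or a msg_ ID
    review.add_argument("target")
    # action: only needed when acting on one or all pending emails
    review.add_argument("action", nargs="?")

    sub.add_parser("stats")
    return parser
```

Everything the command needs arrives in `sys.argv`, so nothing ever blocks waiting for a prompt under cron.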

### decision_history.json as the "database"

`data/decision_history.json` is the only persistent state that matters for learning. It's a flat JSON array: every decision (user or auto) is appended as an entry. For each email, the classifier reads the whole file and selects relevant few-shot examples by relevance scoring.

The pending queue (`pending_emails.json`) is transient — emails pass through it and get marked "done". Logs are for debugging. The decision history is what the system learns from.

A flat JSON file works fine for hundreds or low thousands of decisions. SQLite would make sense if the history grows past ~10k entries and the linear scan becomes noticeable, or if concurrent writes from multiple processes become necessary. Neither applies at current scale.
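
For reference, appending to a flat JSON array can be as small as the sketch below. The function names are illustrative. Writing to a temporary file and swapping it in with `os.replace` is an extra precaution added for this example, not something the project is known to do.

```python
import json
import os
import tempfile


def load_history(path):
    """Return the list of past decisions, or [] if the file doesn't exist yet."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def append_decision(path, decision):
    """Append one decision and rewrite the file; returns the new entry count."""
    history = load_history(path)
    history.append(decision)
    # Write to a temp file in the same directory, then swap it in, so a crash
    # mid-write cannot leave a truncated history file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(history, f, indent=2)
    os.replace(tmp_path, path)
    return len(history)
```

Each append rewrites the whole file, which is the linear cost mentioned above and the point at which SQLite would start to pay off.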

### Few-shot learning via relevance scoring

Rather than sending the entire decision history to the LLM, `decision_store.get_relevant_examples()` scores each past decision against the current email using three signals:
- Exact sender domain match (+3 points)
- Recipient address match (+2 points)
- Subject keyword overlap (+1 per shared word, stop-words excluded)

The top 5 most relevant examples are injected into the prompt as few-shot demonstrations. This keeps the prompt small while giving the model the most useful context.
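
The scoring can be sketched as follows. This is a stand-in for illustration, not the actual `decision_store` code: the dict keys (`sender`, `recipient`, `subject`), the stop-word list, and the choice to drop zero-score entries are all assumptions.

```python
import re

# Illustrative stop-word list; the real one may differ.
STOP_WORDS = {"a", "an", "and", "at", "for", "from", "in", "of", "on", "the", "to", "your", "re", "fwd"}


def _keywords(subject):
    """Lowercased subject words with stop-words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", subject.lower()) if w not in STOP_WORDS}


def _domain(address):
    """Everything after the last '@', lowercased."""
    return address.rsplit("@", 1)[-1].lower()


def score(email, past):
    """Relevance of one past decision to the current email."""
    points = 0
    if _domain(email["sender"]) == _domain(past["sender"]):
        points += 3  # exact sender domain match
    if email["recipient"].lower() == past["recipient"].lower():
        points += 2  # recipient address match
    # +1 per shared subject keyword
    points += len(_keywords(email["subject"]) & _keywords(past["subject"]))
    return points


def rank_examples(email, history, limit=5):
    """Return up to `limit` past decisions, highest score first."""
    scored = [(score(email, d), d) for d in history]
    scored = [(s, d) for s, d in scored if s > 0]  # assumption: skip unrelated entries
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:limit]]
```

Python's sort is stable, so decisions with equal scores keep their order from the history file.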

### Conservative auto-action

Auto-action uses a single confidence threshold with an adaptive learning phase. When the decision history has fewer than `bootstrap_min_decisions` (default 20) entries, the threshold is raised to 95% — only very obvious classifications get auto-acted. Once enough history accumulates, the configured `confidence_threshold` (default 75%) takes over. This lets the system start working from day one while being cautious until it has enough examples to learn from.

### `keep` means unread

The `keep` action is a deliberate no-op — it leaves the email unread in the inbox, meaning it needs human attention. This is distinct from `mark_read`, which dismisses low-priority emails without moving them. During scan, queued emails are marked as read to prevent re-processing, but that's a scan-level concern separate from the `keep` action itself.

### Fail-safe classification

If the LLM call fails (Ollama down, model not loaded, timeout), the classifier returns `action="keep"` with `confidence=0`. This guarantees the email gets queued for manual review rather than being auto-acted upon. The system never auto-trashes an email it couldn't classify.
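
The fallback amounts to a try/except around the LLM call. In this sketch `classify` is any callable that returns a dict with `action` and `confidence`; the wrapper name, the extra `error` field, and treating a malformed response the same as a failed call are assumptions.

```python
def classify_safely(classify, email):
    """Call `classify(email)`; on any failure fall back to keep with confidence 0."""
    try:
        result = classify(email)
        return {"action": result["action"], "confidence": int(result["confidence"])}
    except Exception as exc:  # Ollama down, timeout, unparsable response, ...
        # Confidence 0 is below every threshold, so the email is always queued.
        return {"action": "keep", "confidence": 0, "error": str(exc)}
```

Since a confidence of 0 can never meet a threshold of 75 or 95, a failed classification always lands in the pending queue.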