Learning-based mailbox cleanup using Himalaya (IMAP) + Ollama (local LLM). Classifies emails with category tags, computes confidence from decision history, and gradually automates common actions.

Prerequisites

uv — Python package manager, handles venv and dependencies automatically.
Himalaya — CLI email client, handles IMAP connection and auth.
Ollama — local LLM server.

# Install uv (macOS)
brew install uv

# Install himalaya (macOS)
brew install himalaya

# Configure himalaya for your IMAP account (first time only; see himalaya's docs),
# then confirm the account is listed
himalaya account list

# Install Ollama, start the server, and pull the model
brew install ollama
brew services start ollama   # or run `ollama serve` in a separate terminal
ollama pull kamekichi128/qwen3-4b-instruct-2507:latest

How It Works

The system separates classification (what the LLM does) from confidence (computed from your decision history). The LLM picks an action and category tags. Confidence is computed by matching the email's signature (sender_email, tags) against past decisions.

Early On (few decisions)

Cron runs scan. For each email, the LLM suggests an action and assigns tags from a fixed taxonomy. Since there's no history yet, compute_confidence returns 50% (below the 85% threshold), so everything gets queued.

You run review list. It prints what's pending:

  1. [msg_f1d43ea3]  Subject: New jobs matching your profile
     From: LinkedIn
     Tags: [promotion, social, notification]
     Suggested: delete (50%)
  2. [msg_60c56a87]  Subject: Your order shipped
     From: Amazon
     Tags: [shipping, confirmation, notification]
     Suggested: archive (50%)

You act on them. Either individually or in bulk:

./email-processor.sh review 1 delete     # agree with suggestion
./email-processor.sh review 2 archive    # agree with suggestion
./email-processor.sh review accept       # accept all suggestions at once

Each command executes the action via himalaya and appends the decision, with its tags, to decision_history.json.

Later scans draw on this history. Confidence for each sender+tags signature rises as consistent decisions accumulate, because every matching decision raises the confidence cap by 10 points.

Steady State (9+ consistent decisions per sender)

Repeat senders with consistent tag signatures reach 85%+ confidence and get auto-acted during scan. They never touch the pending queue.
New or ambiguous senders start at 50% and get queued.
You occasionally run review list to handle stragglers — each decision further builds history.
stats shows your automation rate climbing over time.

Confidence Computation

Confidence is computed by decision_store.compute_confidence(sender_email, action, tags):

Find matches: past decisions with the same sender email AND at least 50% tag overlap, where overlap = (number of shared tags) / min(number of current tags, number of past tags).
Agreement: the fraction of matches that chose the same action the LLM is suggesting.
Match-count cap: each match adds 10% to the maximum confidence (1 match = max 10%, 5 = 50%, 10+ = 100%).
Final confidence = min(agreement, cap).
No matches = 50% (below threshold, gets queued).

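This means reaching the 85% auto-action threshold requires at least 9 matching decisions from the same sender with overlapping tags: 8 matches cap confidence at 80%, while 9 raise the cap to 90%. Agreement among those matches must also be at least 85%.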
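The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in decision_store.py: it takes the history as an explicit list argument, and the entry fields `sender_email`, `action`, and `tags` are assumed names.

```python
def tag_overlap(current_tags, past_tags):
    """Shared tags divided by the size of the smaller tag set."""
    current, past = set(current_tags), set(past_tags)
    if not current or not past:
        return 0.0  # empty tag sets never match anything
    return len(current & past) / min(len(current), len(past))


def compute_confidence(history, sender_email, action, tags):
    """Return a confidence percentage (0-100) for the suggested action."""
    # Step 1: same sender and at least 50% tag overlap.
    matches = [
        d for d in history
        if d["sender_email"] == sender_email
        and tag_overlap(tags, d["tags"]) >= 0.5
    ]
    if not matches:
        return 50  # no history: below the 85% threshold, so the email is queued

    # Step 2: share of matches that chose the suggested action.
    agreeing = sum(1 for d in matches if d["action"] == action)
    agreement = 100 * agreeing / len(matches)

    # Step 3: each match adds 10 points to the maximum, up to 100.
    cap = min(100, 10 * len(matches))

    # Step 4: the final value is the smaller of the two.
    return min(agreement, cap)
```

With nine identical past "delete" decisions for a sender, suggesting "delete" again yields 90, which clears an 85 threshold; with eight it yields 80 and the email is queued.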

Usage

All commands are non-interactive — they take arguments, act, and exit. Compatible with cron/OpenClaw.

# Make the entry script executable (first time)
chmod +x email-processor.sh

# --- Scan ---
./email-processor.sh scan                         # classify unseen emails
./email-processor.sh scan --recent 30             # classify last 30 days
./email-processor.sh scan --dry-run               # classify only, no changes
./email-processor.sh scan --recent 7 --dry-run    # combine both

# --- Review ---
./email-processor.sh review list                  # show pending queue
./email-processor.sh review 1 delete              # delete email #1
./email-processor.sh review msg_f1d43ea3 archive  # archive by ID
./email-processor.sh review all delete            # delete all pending
./email-processor.sh review accept                # accept all suggestions

# --- Other ---
./email-processor.sh stats                        # show decision history

Or call Python directly: python main.py scan --dry-run

Actions

Action	Effect
`delete`	Move to Trash (`himalaya message delete`)
`archive`	Move to Archive folder
`keep`	Leave unread in inbox (no changes)
`mark_read`	Add `\Seen` flag, stays in inbox
`label:<name>`	Move to named folder (created if needed)
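Since `label:<name>` carries a folder name inside the action string, the action has to be split before it can be executed. A possible parser is sketched below; `parse_action` and `SIMPLE_ACTIONS` are hypothetical names and may not match what main.py does.

```python
# Actions from the table above that take no argument.
SIMPLE_ACTIONS = {"delete", "archive", "keep", "mark_read"}


def parse_action(action):
    """Split an action string into (kind, folder).

    folder is None for every action except label:<name>.
    """
    if action in SIMPLE_ACTIONS:
        return action, None
    if action.startswith("label:"):
        name = action[len("label:"):].strip()
        if name:
            return "label", name
    # Reject anything else rather than guessing what was meant.
    raise ValueError(f"unknown action: {action!r}")
```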

Tag Taxonomy

The LLM assigns 3-5 tags from this fixed list to each email:

receipt, invoice, payment, billing, shipping, delivery,
promotion, discount, marketing, newsletter, notification,
security, social, reminder, confirmation, update, alert,
personal, account, subscription, travel

Tags serve one purpose: making signature matching work for confidence computation. They need to be specific enough to distinguish different email types from the same sender that you'd treat differently (e.g., [account, security] for a password reset vs [promotion, marketing] for a promo, both from the same service).

Refining the Tag Taxonomy

The tag list should stay fixed and manually curated. Automatic expansion risks breaking history matching — if old entries use different tags than new ones, confidence computation degrades.

To identify gaps, periodically review your decision history for cases where the same sender got inconsistent actions. Feed the history and current tags to the LLM and ask what new tag would distinguish them. For example:

"Here are my current tags: [list]. Here are history entries where sender X got different actions: [entries]. Suggest a new tag that would separate the email types that deserve different actions."

Guidelines:

Add tags when a broad tag (like notification) is the only thing two different email types share, and you'd handle them differently.
Don't rename tags — old history entries would stop matching. If you must rename, keep both the old and new tag in the taxonomy.
Don't delete tags unless you're sure no important history depends on them for distinguishing email types. Old entries with deleted tags become slightly less useful but don't cause wrong matches.
After significant taxonomy changes, consider wiping decision_history.json and pending_emails.json and rebuilding from scratch, since old entries without the new tags won't contribute to confidence anyway.
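The periodic review for taxonomy gaps could start from a small script that lists senders whose history contains more than one action. The sketch below assumes each history entry has `sender_email` and `action` fields; the function name is hypothetical.

```python
from collections import defaultdict


def find_inconsistent_senders(history):
    """Map each sender with more than one distinct past action to those actions."""
    actions_by_sender = defaultdict(set)
    for decision in history:
        actions_by_sender[decision["sender_email"]].add(decision["action"])
    return {
        sender: sorted(actions)
        for sender, actions in actions_by_sender.items()
        if len(actions) > 1
    }
```

The entries for the senders it returns are the ones worth pasting into the prompt shown earlier.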

Configuration

config.json — only Ollama and automation settings. IMAP auth is managed by himalaya's own config.

{
  "ollama": {
    "host": "http://localhost:11434",
    "model": "kamekichi128/qwen3-4b-instruct-2507:latest"
  },
  "rules": {
    "max_body_length": 1000
  },
  "automation": {
    "confidence_threshold": 85
  }
}

Key	Description
`ollama.host`	Ollama server URL. Default `http://localhost:11434`.
`ollama.model`	Ollama model to use for classification.
`rules.max_body_length`	Max characters of email body sent to the LLM. Longer bodies are truncated. Keeps prompt size and latency down.
`automation.confidence_threshold`	Minimum confidence (0-100) for auto-action. Emails below this get queued for review. At 85%, you need at least 9 consistent decisions from the same sender with overlapping tags before auto-action kicks in.
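To make the settings concrete, here is a rough sketch of how they might be consumed. The helper names are hypothetical; only the keys and the default host come from the table above.

```python
import json


def load_config(raw_json):
    """Parse config.json text and fill in the documented default host."""
    config = json.loads(raw_json)
    config.setdefault("ollama", {}).setdefault("host", "http://localhost:11434")
    return config


def truncate_body(body, max_length):
    """Keep at most max_length characters of the email body for the prompt."""
    return body[:max_length]


def should_auto_act(confidence, threshold):
    """Auto-act only when confidence reaches the threshold; otherwise queue."""
    return confidence >= threshold
```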

Testing

# 1. Verify himalaya can reach your mailbox
himalaya envelope list --page-size 3

# 2. Verify Ollama is running with the model
ollama list  # should show kamekichi128/qwen3-4b-instruct-2507:latest

# 3. Dry run — classify recent emails without touching anything
./email-processor.sh scan --recent 7 --dry-run

# 4. Live run — classify and act (auto-act or queue)
./email-processor.sh scan --recent 7

# 5. Check what got queued
./email-processor.sh review list

# 6. Act on a queued email to seed decision history
./email-processor.sh review 1 delete

# 7. Check that the decision was recorded
./email-processor.sh stats

File Structure

email_processor/
  main.py              # Entry point — scan/review/stats subcommands
  classifier.py        # LLM prompt builder + response parser, tag taxonomy
  decision_store.py    # Decision history, confidence computation, few-shot retrieval
  config.json          # Ollama + automation settings
  email-processor.sh   # Shell wrapper (activates venv, forwards args)
  data/
    pending_emails.json    # Queue of emails awaiting review
    decision_history.json  # Past decisions (few-shot learning + confidence data)
  logs/
    YYYY-MM-DD.log         # Daily processing logs
    llm_YYYY-MM-DD.log     # Full LLM prompt/response logs

Design Decisions

Himalaya instead of raw IMAP

All IMAP operations go through the himalaya CLI via subprocess calls. This means:

No IMAP credentials stored in config.json — himalaya manages its own auth.
No connection management, reconnect logic, or SSL setup in Python.
Each action is a single himalaya command (e.g., himalaya message delete 42).

The tradeoff is a subprocess spawn per operation, but for email volumes (tens per run, not thousands) this is negligible.
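A subprocess wrapper for this can stay very small. The sketch below is illustrative only: `run_himalaya` and `delete_message` are hypothetical names, and the `runner` parameter exists purely so the wrapper can be exercised without a mailbox. The only himalaya command it builds is the `message delete` form quoted above.

```python
import subprocess


def run_himalaya(args, runner=subprocess.run):
    """Run one himalaya command and return its stdout.

    check=True makes a non-zero exit raise CalledProcessError,
    so a failed IMAP operation is never silently ignored.
    """
    result = runner(
        ["himalaya", *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def delete_message(message_id, runner=subprocess.run):
    """Equivalent to running: himalaya message delete <id>"""
    return run_himalaya(["message", "delete", str(message_id)], runner=runner)
```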

Non-interactive design

Every command takes its full input as arguments, acts, and exits. No input() calls, no interactive loops. This makes the system compatible with cron/OpenClaw and composable with other scripts. The pending queue on disk (pending_emails.json) is the shared state between scan and review invocations.
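One way to get this shape is a plain argparse parser with one subcommand per verb. This is a sketch of the command surface listed under Usage, not a copy of main.py; the argument names `target` and `action` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the scan / review / stats subcommands."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    scan = sub.add_parser("scan", help="classify emails")
    scan.add_argument("--recent", type=int, metavar="DAYS",
                      help="classify the last DAYS days instead of unseen mail")
    scan.add_argument("--dry-run", action="store_true",
                      help="classify only, make no changes")

    review = sub.add_parser("review", help="act on the pending queue")
    # target is one of: list, accept, all, a queue number, or a message ID
    review.add_argument("target")
    # action is omitted for "list" and "accept"
    review.add_argument("action", nargs="?")

    sub.add_parser("stats", help="show decision history")
    return parser
```

Because everything arrives through `sys.argv`, the same invocation works from a shell, a cron entry, or another script.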

LLM classifies, history decides confidence

The LLM produces an action and category tags but NOT a confidence score. Confidence is computed from decision history by matching email signatures (sender_email, tags) against past decisions. This avoids a common problem: LLMs tend to cluster self-reported confidence around 85-95% regardless of actual certainty, which makes a threshold ineffective.

The match-count cap (each match adds 10% to the maximum) replaces the old bootstrap/learning-phase logic. Confidence grows organically per-sender as history accumulates, rather than using a global gate.

decision_history.json as the "database"

data/decision_history.json is the only persistent state that matters for learning. It's a flat JSON array: every decision (user or auto) is appended as an entry with its tags. For each email, the classifier reads the whole file to find relevant few-shot examples via relevance scoring, and compute_confidence scans it for matching signatures.

A flat JSON file works fine for hundreds or low thousands of decisions. SQLite would make sense if the history grows past ~10k entries and the linear scan becomes noticeable, or if concurrent writes from multiple processes become necessary. Neither applies at current scale.
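Reading and appending to a flat JSON array takes only a few lines, which is part of the appeal. The sketch below assumes a single writer, as described above; the function names are hypothetical and the entry layout is left to the caller.

```python
import json
from pathlib import Path


def load_history(path):
    """Return the list of past decisions, or an empty list if the file is missing."""
    path = Path(path)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def append_decision(path, decision):
    """Append one decision and rewrite the whole file; return the new entry count."""
    history = load_history(path)
    history.append(decision)
    Path(path).write_text(json.dumps(history, indent=2), encoding="utf-8")
    return len(history)
```

The whole file is rewritten on every append, which is the linear cost that would eventually motivate a move to SQLite.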

Few-shot learning via relevance scoring

Rather than sending the entire decision history to the LLM, decision_store.get_relevant_examples() scores each past decision against the current email using two signals:

Exact sender email address match (+3 points)
Subject keyword overlap (+1 per shared word, stop-words excluded)

The top 5 most relevant examples are injected into the prompt as few-shot demonstrations. This keeps the prompt small while giving the model the most useful context.
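The scoring described above might look roughly like this. It is a sketch under several assumptions: the stop-word list is a small stand-in, entries are assumed to carry `sender_email` and `subject` fields, history is passed in explicitly, and dropping zero-score entries is a choice made here rather than documented behavior.

```python
import re

# Illustrative stop-word list; the real one is not shown in this README.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "for", "on",
              "your", "you", "is", "re", "fwd"}


def subject_keywords(subject):
    """Lowercased words from a subject line, minus stop-words."""
    words = re.findall(r"[a-z0-9]+", subject.lower())
    return {w for w in words if w not in STOP_WORDS}


def score_example(email, past):
    """+3 for an exact sender match, +1 per shared subject keyword."""
    score = 3 if email["sender_email"] == past["sender_email"] else 0
    shared = subject_keywords(email["subject"]) & subject_keywords(past["subject"])
    return score + len(shared)


def get_relevant_examples(history, email, limit=5):
    """Return up to `limit` past decisions, highest score first."""
    scored = [(score_example(email, d), d) for d in history]
    scored = [pair for pair in scored if pair[0] > 0]  # skip unrelated entries
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:limit]]
```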

Fixed tag taxonomy

Tags are defined in classifier.py as TAG_TAXONOMY — a manually curated list of 21 categories. The LLM must pick from this list (invalid tags are silently dropped). The taxonomy should stay fixed to keep history matching stable. See "Refining the Tag Taxonomy" above for when and how to update it.
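Dropping invalid tags can be as simple as a membership check against the list. In the sketch below the tuple mirrors the taxonomy listed earlier; `filter_tags` is a hypothetical name, and the lowercasing and de-duplication are assumptions rather than documented behavior.

```python
TAG_TAXONOMY = (
    "receipt", "invoice", "payment", "billing", "shipping", "delivery",
    "promotion", "discount", "marketing", "newsletter", "notification",
    "security", "social", "reminder", "confirmation", "update", "alert",
    "personal", "account", "subscription", "travel",
)


def filter_tags(raw_tags):
    """Keep only tags from the taxonomy, preserving order and dropping repeats."""
    allowed = set(TAG_TAXONOMY)
    kept = []
    for tag in raw_tags:
        tag = tag.strip().lower()
        if tag in allowed and tag not in kept:
            kept.append(tag)
    return kept
```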

`keep` means unread

The keep action is a deliberate no-op — it leaves the email unread in the inbox, meaning it needs human attention. This is distinct from mark_read, which dismisses low-priority emails without moving them.

Fail-safe classification

If the LLM call fails (Ollama down, model not loaded, timeout), the classifier returns action="keep" with empty tags. Empty tags produce 50% confidence (below threshold), so the email gets queued for manual review rather than being auto-acted upon. The system never auto-trashes an email it couldn't classify.
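The fallback can be expressed as a thin wrapper around whatever function calls the LLM. The names below are hypothetical, and the result is assumed to be a dict with `action` and `tags` keys.

```python
def classify_safely(classify, email):
    """Call the classifier; on any failure fall back to keep with no tags."""
    try:
        return classify(email)
    except Exception:
        # Deliberately broad: connection errors, timeouts, and unparseable
        # responses should all end with the email queued for manual review.
        return {"action": "keep", "tags": []}
```

Because the fallback carries no tags, it cannot match any past decision, so confidence stays at 50% and the email goes to the pending queue.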

README.md

Email Processor