# RSS News Digest

Fetches articles from RSS/Atom feeds, downloads full article content, stores everything in SQLite with URL-based deduplication, and outputs a JSON digest to stdout. When Ollama is configured, each article is automatically summarized during fetch and the summary is stored in the database.

HTTP requests use automatic retries with exponential backoff for transient errors (429/5xx), browser-like headers to avoid 403 blocks, and rate limiting between content fetches.

## Setup

Install [uv](https://docs.astral.sh/uv/getting-started/installation/) (a fast Python package manager):

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

`run.sh` handles Python and dependency installation automatically — no manual venv or `pip install` needed.

For AI-powered article summaries, install [Ollama](https://ollama.com) and pull a model:

```bash
ollama pull kamekichi128/qwen3-4b-instruct-2507
```
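
Summarization presumably goes through Ollama's local HTTP API. As a rough sketch (the script's actual call may differ), this is the shape of a non-streaming `/api/generate` request built from the `config.json` fields:

```python
import json

# Ollama's default local endpoint; an assumption about how the script connects.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_summary_request(model: str, prompt: str, article_text: str) -> dict:
    """Assemble a one-shot (non-streaming) generation payload."""
    return {
        "model": model,
        "prompt": f"{prompt}\n\n{article_text}",
        "stream": False,  # ask for a single JSON response, not a token stream
    }

payload = build_summary_request(
    "kamekichi128/qwen3-4b-instruct-2507",
    "Summarize the following news article in 2-3 concise sentences:",
    "Article text goes here...",
)
body = json.dumps(payload)  # what would be POSTed to OLLAMA_URL
```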

## Configuration

Edit `config.json` to add feeds and adjust settings:

```json
{
  "settings": {
    "hours_lookback": 24,
    "retention_days": 30,
    "max_articles_per_feed": 10
  },
  "ollama": {
    "model": "kamekichi128/qwen3-4b-instruct-2507",
    "prompt": "Summarize the following news article in 2-3 concise sentences:"
  },
  "feeds": [
    {
      "url": "https://feeds.bbci.co.uk/news/rss.xml",
      "name": "BBC News",
      "category": "World",
      "enabled": true
    }
  ]
}
```

- **hours_lookback** — only include articles published within this many hours
- **retention_days** — auto-delete articles older than this from the database
- **max_articles_per_feed** — limit how many articles are saved per feed per run (0 = unlimited)
- **ollama.model** — Ollama model name for article summaries (generated during fetch)
- **ollama.prompt** — prompt sent to the model for each article
- Removing the `ollama` key disables summarization; articles are still fetched normally
- **feeds[].enabled** — set to `false` to skip a feed without removing it
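
How the script consumes these settings can be sketched roughly as follows (a guess at the logic, using the field names above; the embedded JSON stands in for reading `config.json`):

```python
import json

# Stand-in for json.load(open("config.json")); note there is no "ollama" key here.
config = json.loads("""
{
  "settings": {"hours_lookback": 24, "retention_days": 30, "max_articles_per_feed": 10},
  "feeds": [
    {"url": "https://feeds.bbci.co.uk/news/rss.xml", "name": "BBC News", "category": "World", "enabled": true},
    {"url": "https://example.com/feed.xml", "name": "Paused Feed", "category": "Test", "enabled": false}
  ]
}
""")

# feeds[].enabled — skip a feed without deleting its entry.
active_feeds = [f for f in config["feeds"] if f.get("enabled", True)]

# Removing the "ollama" key disables summarization entirely.
summarize = "ollama" in config

print([f["name"] for f in active_feeds], summarize)  # ['BBC News'] False
```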

## Usage

```bash
# Fetch feeds and print recent articles (default fields: id, title, url, published_date, fetched_date, feed_name)
./run.sh

# Verbose mode (logs to stderr)
./run.sh -v

# Override lookback window to 48 hours
./run.sh --hours 48

# Include more fields in output
./run.sh -f id,title,url,description,published_date

# Include article summaries in output (requires ollama config)
./run.sh -f id,title,url,summary

# Include full article content in output
./run.sh -f id,title,url,content

# Review stored articles without fetching new ones
./run.sh --no-fetch

# Only purge old articles (no fetching or output)
./run.sh --purge-only -v

# Smoke test: full pipeline (fetch 1 article from first enabled feed + summarize)
./run.sh --test

# Test feed fetching only (1 article from first enabled feed)
./run.sh --test feed

# Test Ollama summarization only (uses hardcoded articles)
./run.sh --test summary

# Custom config and database paths
./run.sh -c my_config.json -d my_news.db
```

## Examples

### Fetch feeds and list articles

```
$ ./run.sh -v
2026-02-21 22:41:44 DEBUG Ollama summarization enabled (model: qwen3)
2026-02-21 22:41:44 INFO Feed 'BBC News': 35 new articles (of 36 within lookback)
2026-02-21 22:43:54 INFO Feed 'NY Times': 10 new articles (of 10 within lookback)
2026-02-21 22:44:57 INFO Feed 'Hacker News': 10 new articles (of 10 within lookback)
2026-02-21 22:46:08 INFO Feed '联合早报 中国': 10 new articles (of 10 within lookback)
2026-02-21 22:47:23 INFO Feed '澎湃新闻 热点': 10 new articles (of 10 within lookback)
2026-02-21 22:48:49 INFO Feed '36氪 热榜': 8 new articles (of 8 within lookback)
2026-02-21 22:48:59 INFO Total new articles saved: 83
[
  {"id": 1, "title": "Iran students stage first large anti-government protests since deadly crackdown", "url": "https://www.bbc.com/news/articles/..."},
  {"id": 2, "title": "Trump says he will increase his new global tariffs to 15%", "url": "https://www.bbc.com/news/articles/..."},
  ...
]
```

### Review stored articles without fetching

```
$ ./run.sh --no-fetch -f id,title,feed_name
[
  {"id": 1, "title": "Iran students stage first large anti-government protests...", "feed_name": "BBC News"},
  {"id": 79, "title": "韩国二次电池技术水平被中国反超", "feed_name": "联合早报 中国"},
  {"id": 117, "title": "忍无可忍,Ilya宫斗奥特曼!微软CTO爆内幕...", "feed_name": "36氪 热榜"},
  ...
]
```

## Output

Default output (JSON array to stdout):

```json
[
  {"id": 1, "title": "Article Title", "url": "https://example.com/article"},
  {"id": 2, "title": "Another Article", "url": "https://example.com/other"}
]
```

With summary field (`-f id,title,url,summary`):

```json
[
  {"id": 1, "title": "Article Title", "url": "https://example.com/article", "summary": "AI-generated summary..."},
  {"id": 2, "title": "Another Article", "url": "https://example.com/other", "summary": "AI-generated summary..."}
]
```

## Database

Articles are stored in `news_digest.db` (SQLite) in the current directory by default. The database is created automatically on first run. Articles older than `retention_days` are purged at the start of each fetch run. Duplicate URLs are ignored via a UNIQUE constraint.
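
The dedup-and-retention mechanics can be sketched with a minimal schema (the real table has more columns; the column names here are assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT NOT NULL UNIQUE,              -- duplicate URLs are rejected here
        title TEXT,
        fetched_date TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# INSERT OR IGNORE turns a re-fetch of a known URL into a no-op.
for _ in range(2):
    con.execute(
        "INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)",
        ("https://example.com/article", "Example"),
    )

# Purge rows older than retention_days at the start of each run.
retention_days = 30
con.execute(
    "DELETE FROM articles WHERE fetched_date < datetime('now', ?)",
    (f"-{retention_days} days",),
)

count = con.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(count)  # 1 — the second insert was ignored, nothing was old enough to purge
```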

Each article stores metadata from the RSS feed (title, description, published date, author, etc.) plus the full article content fetched from the article URL. Content is extracted as plain text using BeautifulSoup.

HTTP requests are made through a shared `requests.Session` with:
- **Automatic retries** (up to 3 attempts with exponential backoff) for 429/5xx errors
- **Browser-like headers** (User-Agent, Accept) to reduce 403 rejections
- **Rate limiting** between content fetches (when Ollama is configured, the summarization call itself provides the delay; otherwise a 1-second sleep is used)
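
That session setup corresponds roughly to the standard `requests` + `urllib3` retry recipe. A sketch — the exact header strings and backoff numbers here are assumptions, not the script's values:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
session.headers.update({
    # Placeholder browser-like headers; the real values may differ.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:140.0) Gecko/20100101 Firefox/140.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
})

retry = Retry(
    total=3,                                     # retry up to 3 times on failure
    backoff_factor=1,                            # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # only transient statuses
)
for scheme in ("http://", "https://"):
    session.mount(scheme, HTTPAdapter(max_retries=retry))
```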

Some sites (e.g. paywalled or bot-protected) may still return errors — in those cases the content field is left empty and the RSS description is used as a fallback for summaries.

## Design notes

- **Articles without dates are included by default.** `is_within_lookback` returns `True` when an article has no published date, and the query uses `OR published_date IS NULL`. This is intentional — silently dropping articles just because the feed omits a date would be worse than including them. If you only want dated articles, filter on `published_date` in the output.
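
The described behavior amounts to something like this (a sketch, not the script's exact code):

```python
from datetime import datetime, timedelta, timezone

def is_within_lookback(published, hours_lookback):
    if published is None:
        return True  # undated articles are deliberately kept
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours_lookback)
    return published >= cutoff

# And on the query side (sketch):
#   SELECT ... WHERE published_date >= :cutoff OR published_date IS NULL

assert is_within_lookback(None, 24)                        # no date: kept
assert is_within_lookback(datetime.now(timezone.utc), 24)  # fresh: kept
assert not is_within_lookback(
    datetime.now(timezone.utc) - timedelta(hours=48), 24)  # stale: dropped
```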

- **`generate_summary` accepts both `description` and `content`.** The `description` parameter is not redundant — `body = content or description` uses the RSS description as a fallback when `fetch_content()` fails and returns `None`. This ensures articles still get summarized even when the full page can't be fetched.
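
The fallback reads roughly like this (a sketch; in the real function the chosen `body` is sent to Ollama rather than returned):

```python
def generate_summary(description, content):
    # content is None when fetch_content() failed; fall back to the RSS description.
    body = content or description
    if not body:
        return None  # nothing at all to summarize
    return body      # sketch: the real code sends this text to Ollama

assert generate_summary("rss description", "full page text") == "full page text"
assert generate_summary("rss description", None) == "rss description"
assert generate_summary("", None) is None
```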

- **`fetch_content` uses a chained conditional expression for element selection.** The expression `article if article else soup.body if soup.body else soup` picks the most specific container available and reads top-to-bottom as a priority list; since `x if x else y` evaluates the same as `x or y`, it is equivalent to `article or soup.body or soup`.
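
Isolated from BeautifulSoup, the selection reduces to plain truthiness (here `None` stands in for `soup.find("article")` matching nothing):

```python
def pick_container(article, body, soup):
    # Most specific container wins; `x if x else y` behaves like `x or y`.
    return article if article else body if body else soup

assert pick_container("article", "body", "soup") == "article"  # <article> found
assert pick_container(None, "body", "soup") == "body"          # fall back to <body>
assert pick_container(None, None, "soup") == "soup"            # whole document
```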