youlu-openclaw-workspace/scripts/news_digest
2026-02-21 22:49:33 -08:00

RSS News Digest

Fetches articles from RSS/Atom feeds, downloads full article content, stores everything in SQLite with URL-based deduplication, and outputs a JSON digest to stdout. Optionally generates per-article summaries via a local Ollama model.

Setup

pip install -r requirements.txt

For AI-powered article summaries, install Ollama and pull a model:

ollama pull qwen3

Configuration

Edit config.json to add feeds and adjust settings:

{
  "settings": {
    "hours_lookback": 24,
    "retention_days": 30
  },
  "ollama": {
    "model": "qwen3",
    "prompt": "Summarize the following news article in 2-3 concise sentences:"
  },
  "feeds": [
    {
      "url": "https://feeds.bbci.co.uk/news/rss.xml",
      "name": "BBC News",
      "category": "World",
      "enabled": true
    }
  ]
}
  • hours_lookback — only include articles published within this many hours
  • retention_days — auto-delete articles older than this from the database
  • ollama.model — Ollama model name for digest summaries
  • ollama.prompt — system prompt sent to the model
  • feeds[].enabled — set to false to skip a feed without removing it
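
As an illustration of how these settings might be consumed, the sketch below parses a configuration shaped like the example above, keeps only enabled feeds, and computes the lookback cutoff. The helper names `enabled_feeds` and `lookback_cutoff`, the second "Paused Feed" entry, and the choice to treat a missing `enabled` key as true are assumptions for this example, not taken from main.py.

```python
import json
from datetime import datetime, timedelta, timezone

# Inline copy of the example configuration; a real run would read config.json.
# The second feed is a made-up entry used only to show the "enabled" switch.
CONFIG_TEXT = """
{
  "settings": {"hours_lookback": 24, "retention_days": 30},
  "feeds": [
    {"url": "https://feeds.bbci.co.uk/news/rss.xml", "name": "BBC News",
     "category": "World", "enabled": true},
    {"url": "https://example.com/feed.xml", "name": "Paused Feed",
     "category": "Misc", "enabled": false}
  ]
}
"""


def enabled_feeds(config):
    """Return feeds whose "enabled" flag is true (assumed default: true)."""
    return [feed for feed in config.get("feeds", []) if feed.get("enabled", True)]


def lookback_cutoff(config, now):
    """Return the earliest publication time still inside the lookback window."""
    hours = config["settings"]["hours_lookback"]
    return now - timedelta(hours=hours)


config = json.loads(CONFIG_TEXT)
now = datetime(2026, 2, 21, 22, 0, tzinfo=timezone.utc)

print([feed["name"] for feed in enabled_feeds(config)])
print(lookback_cutoff(config, now).isoformat())
```

With `hours_lookback` set to 24, an article is kept only if its publication time is at or after the printed cutoff.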

Usage

# Fetch feeds and print recent articles (default fields: id, title, url)
python main.py

# Verbose mode (logs to stderr)
python main.py -v

# Override lookback window to 48 hours
python main.py --hours 48

# Include more fields in output
python main.py -f id,title,url,description,published_date

# Include full article content in output
python main.py -f id,title,url,content

# Review stored articles without fetching new ones
python main.py --no-fetch

# Only purge old articles (no fetching or output)
python main.py --purge-only -v

# Generate AI summaries for specific article IDs
python main.py --digest 1,3,7

# Custom config and database paths
python main.py -c my_config.json -d my_news.db
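
The flags above could be declared with `argparse` roughly as follows. This is a sketch, not the actual parser in main.py: the long option names `--verbose`, `--fields`, `--config`, and `--database` are hypothetical (only the short forms appear above), and `parse_id_list` is an illustrative helper for the comma-separated `--digest` argument.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (long names partly assumed)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-v", "--verbose", action="store_true")   # long name assumed
    parser.add_argument("--hours", type=int, default=None)        # overrides hours_lookback
    parser.add_argument("-f", "--fields", default="id,title,url")  # long name assumed
    parser.add_argument("--no-fetch", action="store_true")
    parser.add_argument("--purge-only", action="store_true")
    parser.add_argument("--digest", default=None)                 # e.g. "1,3,7"
    parser.add_argument("-c", "--config", default="config.json")  # long name assumed
    parser.add_argument("-d", "--database", default="news_digest.db")  # long name assumed
    return parser


def parse_id_list(text):
    """Turn a comma-separated string such as "1,3,7" into a list of ints."""
    return [int(part) for part in text.split(",") if part.strip()]


args = build_parser().parse_args(
    ["--no-fetch", "-f", "id,title,feed_name", "--hours", "48"]
)
print(args.no_fetch, args.fields.split(","), args.hours)
print(parse_id_list("1,3,7"))
```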

Examples

Fetch feeds and list articles

$ python main.py -v
2026-02-21 22:41:44 INFO Feed 'BBC News': 35 new articles (of 36 within lookback)
2026-02-21 22:41:54 INFO Feed 'NY Times': 20 new articles (of 20 within lookback)
2026-02-21 22:41:57 INFO Feed 'Hacker News': 20 new articles (of 20 within lookback)
2026-02-21 22:42:08 INFO Feed '联合早报 中国': 24 new articles (of 24 within lookback)
2026-02-21 22:42:23 INFO Feed '澎湃新闻 热点': 15 new articles (of 15 within lookback)
2026-02-21 22:42:49 INFO Feed '36氪 热榜': 8 new articles (of 8 within lookback)
2026-02-21 22:42:59 INFO Total new articles saved: 122
[
  {"id": 1, "title": "Iran students stage first large anti-government protests since deadly crackdown", "url": "https://www.bbc.com/news/articles/..."},
  {"id": 2, "title": "Trump says he will increase his new global tariffs to 15%", "url": "https://www.bbc.com/news/articles/..."},
  ...
]

Review stored articles without fetching

$ python main.py --no-fetch -f id,title,feed_name
[
  {"id": 1, "title": "Iran students stage first large anti-government protests...", "feed_name": "BBC News"},
  {"id": 79, "title": "韩国二次电池技术水平被中国反超", "feed_name": "联合早报 中国"},
  {"id": 117, "title": "忍无可忍Ilya宫斗奥特曼微软CTO爆内幕...", "feed_name": "36氪 热榜"},
  ...
]

Generate AI summaries for specific articles

$ python main.py --digest 1,79,117
[
  {
    "id": 1,
    "title": "Iran students stage first large anti-government protests since deadly crackdown",
    "url": "https://www.bbc.com/news/articles/...",
    "summary": "Iranian students have held large-scale anti-government protests across multiple cities, marking the first such demonstrations since a deadly crackdown last month. Protesters chanted anti-regime slogans, with clashes reported between demonstrators and government supporters."
  },
  {
    "id": 79,
    "title": "韩国二次电池技术水平被中国反超",
    "url": "https://www.zaobao.com/news/china/story20260222-8614291",
    "summary": "韩国官方报告指出中国在二次电池技术领域已反超韩国2024年技术水平评估显示中国领先韩国0.2年,且中国追赶美国的势头明显。"
  },
  {
    "id": 117,
    "title": "忍无可忍Ilya宫斗奥特曼微软CTO爆内幕全因嫉妒下属太优秀",
    "url": "https://www.36kr.com/p/3693861726826112",
    "summary": "该文章描述了OpenAI内部的权力斗争事件称微软CTO披露首席科学家Ilya因嫉妒下属取得突破联合董事会罢免了CEO奥特曼引发高管集体离职。"
  }
]

Output

Default output (JSON array to stdout):

[
  {"id": 1, "title": "Article Title", "url": "https://example.com/article"},
  {"id": 2, "title": "Another Article", "url": "https://example.com/other"}
]

Digest mode (--digest):

[
  {"id": 1, "title": "Article Title", "url": "https://example.com/article", "summary": "AI-generated summary..."},
  {"id": 3, "title": "Another Article", "url": "https://example.com/other", "summary": "AI-generated summary..."}
]
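
Field selection with `-f` can be pictured as projecting each stored article onto the requested keys before serializing. The sketch below is one way to do that. The `project` helper and the sample row are illustrative, and filling absent fields with `null` is an assumption. The sample output earlier shows CJK titles unescaped, which is consistent with `ensure_ascii=False`, though the tool's actual serialization settings are not documented here.

```python
import json

DEFAULT_FIELDS = ("id", "title", "url")


def project(articles, fields=DEFAULT_FIELDS):
    """Keep only the requested keys of each article; absent keys become None."""
    return [{name: article.get(name) for name in fields} for article in articles]


# A made-up stored row carrying more columns than the default output shows.
rows = [
    {
        "id": 79,
        "title": "韩国二次电池技术水平被中国反超",
        "url": "https://example.com/article",
        "feed_name": "联合早报 中国",
    }
]

# ensure_ascii=False keeps non-ASCII characters readable in the JSON output.
print(json.dumps(project(rows), ensure_ascii=False))
print(json.dumps(project(rows, ("id", "feed_name")), ensure_ascii=False))
```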

Database

Articles are stored in news_digest.db (SQLite) in the current directory by default. The database is created automatically on first run. Articles older than retention_days are purged at the start of each fetch run. Duplicate URLs are ignored via a UNIQUE constraint.
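
The deduplication and purge behaviour described above can be sketched with the standard `sqlite3` module. The table layout, the `fetched_at` column, and the helper names are assumptions for illustration, and the real schema stores more metadata. Whether the tool measures article age from fetch time or from publication date is also not stated, so this example uses fetch time.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical minimal schema; the UNIQUE constraint on url enforces deduplication.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url TEXT NOT NULL UNIQUE,
    title TEXT,
    fetched_at TEXT NOT NULL
)
"""


def save_article(conn, url, title, fetched_at):
    """Insert an article; return True if stored, False if the URL already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title, fetched_at) VALUES (?, ?, ?)",
        (url, title, fetched_at.isoformat()),
    )
    return cur.rowcount == 1


def purge_old(conn, retention_days, now):
    """Delete articles older than retention_days; return how many were removed."""
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    cur = conn.execute("DELETE FROM articles WHERE fetched_at < ?", (cutoff,))
    return cur.rowcount


now = datetime(2026, 2, 21, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")  # the tool itself uses news_digest.db on disk
conn.execute(SCHEMA)

first = save_article(conn, "https://example.com/a", "A", now)
duplicate = save_article(conn, "https://example.com/a", "A again", now)
save_article(conn, "https://example.com/old", "Old", now - timedelta(days=45))

purged = purge_old(conn, 30, now)
remaining = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(first, duplicate, purged, remaining)
```

Comparing ISO 8601 strings works here because every timestamp is written in the same UTC format. A schema that stores mixed formats would need a real date comparison.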

Each article stores metadata from the RSS feed (title, description, published date, author, etc.) plus the full article content fetched from the article URL. Content is extracted as plain text using BeautifulSoup. Some sites (for example, paywalled or bot-protected ones) may return HTTP 403 errors. In those cases the content field is left empty, and the RSS description is used instead as the input for summaries.
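
The fallback rule can be expressed as a small selection function. The sketch below also shows how the chosen text might be combined with the `ollama.model` and `ollama.prompt` settings into a request body of the shape accepted by Ollama's `/api/generate` endpoint. The helper names are hypothetical, and whether main.py talks to Ollama over HTTP or through a client library is not stated, so nothing is actually sent here.

```python
import json


def summary_source(article):
    """Prefer full content; fall back to the RSS description when it is empty."""
    content = (article.get("content") or "").strip()
    if content:
        return content
    return (article.get("description") or "").strip()


def build_ollama_request(model, system_prompt, article):
    """Build a non-streaming request body (assumed integration, not sent)."""
    return {
        "model": model,
        "system": system_prompt,
        "prompt": summary_source(article),
        "stream": False,
    }


# Made-up article whose fetch failed (for example with a 403), so content is empty.
blocked = {"id": 2, "content": "", "description": "Short RSS description."}
fetched = {"id": 1, "content": "Full article text.", "description": "Teaser."}

payload = build_ollama_request(
    "qwen3",
    "Summarize the following news article in 2-3 concise sentences:",
    blocked,
)
print(summary_source(fetched))
print(json.dumps(payload, indent=2))
```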