diff --git a/MEMORY.md b/MEMORY.md
index a614a6d..2714a9b 100644
--- a/MEMORY.md
+++ b/MEMORY.md
@@ -54,23 +54,6 @@ _这份文件记录持续性项目和重要状态,跨会话保留。_
 
 ---
 
-### 4. 每日新闻摘要 (News Digest)
-**状态**: 已激活,运行中
-**创建**: 2026-02-22
-**配置**:
-- 目录: `~/.openclaw/workspace/scripts/news_digest/`
-- 主脚本: `send_digest.py`
-- Cron ID: `cc95dcf3-0d6c-44f4-90af-76b836ca0c87`
-- 时间: 每天早上 5:00 AM (PST)
-- **运行模式**: **isolated subagent** (thinking=low, timeout=5min)
-- 配置: `config.json`
-
-**功能**:
-- RSS 抓取 → Ollama 摘要 → 邮件发送
-- 发送方: youlu@luyanxin.com → lu@luyx.org
-
----
-
 ## 📝 重要规则
 
 ### 邮件发送规则(v2)
@@ -104,9 +87,8 @@ _这份文件记录持续性项目和重要状态,跨会话保留。_
 | 普拉提监控 | `~/.openclaw/workspace/scripts/ucla_pilates_monitor.py` |
 | 待办提醒 | `~/.openclaw/workspace/scripts/reminder_check.py` |
 | 邮件处理器 | `~/.openclaw/workspace/scripts/email_processor/` |
-| 新闻摘要 | `~/.openclaw/workspace/scripts/news_digest/` |
 | 待办列表 | `~/.openclaw/workspace/reminders/active.md` |
 
 ---
 
-_最后更新: 2026-02-22_
+_最后更新: 2026-02-25_
diff --git a/logs/email_checks.log b/logs/email_checks.log
index f27355d..6533bad 100644
--- a/logs/email_checks.log
+++ b/logs/email_checks.log
@@ -48,3 +48,19 @@
 2026-02-25T05:03:14-08:00
 2026-02-25T06:03:13-08:00
 2026-02-25T07:03:11-08:00
+2026-02-25T08:03:23-08:00
+2026-02-25T09:03:00-08:00 Email check start
+2026-02-25T10:03:27-08:00
+2026-02-25T11:03:33-08:00
+2026-02-25T12:03:17-08:00
+2026-02-25T13:03:19-08:00
+2026-02-25T14:03:11-08:00
+2026-02-25T15:03:10-08:00
+2026-02-25T16:03:13-08:00
+2026-02-25T17:03:13-08:00
+2026-02-25T18:03:14-08:00
+2026-02-25T19:03:11-08:00
+2026-02-25T20:03:16-08:00
+2026-02-25T21:03:11-08:00
+2026-02-25T22:03:14-08:00
+2026-02-25T23:04:40-08:00
diff --git a/reminders/active.md b/reminders/active.md
index 9e65859..d4a3278 100644
--- a/reminders/active.md
+++ b/reminders/active.md
@@ -7,6 +7,7 @@
 | 给过敏医生打电话 | 2026-02-14 | 高 | done | 问集群过敏针吃的三个药物 |
 | 给tilles打电话 | 2026-02-17 | 中 | done | 把horizon检测结果发给她,并调整B超时间,跟进治疗进度 |
 | 跟进iui保险报销 | 2026-02-21 | 中 | pending | 确认iui(人工授精)费用保险报销进度,避免过期 |
+| 短杖贴上胶布 | 2026-02-28 | 低 | pending | 周五提醒:用胶布包裹短杖把手,防滑/保护 |
 | 周五打电话给progyny问iui报销 | 2026-02-27 | 中 | pending | 询问iui报销相关事宜 |
 | 跟进CVS药物报销 | 2026-02-21 | 中 | done | 确认CVS买的药物报销是否到账,核对金额 |
 | 打电话给progyny问Pacific Bill | 2026-02-20 | 中 | done | 周五前询问Pacific Bill相关事宜 |
diff --git a/scripts/news_digest/README.md b/scripts/news_digest/README.md
deleted file mode 100644
index 1fb1c91..0000000
--- a/scripts/news_digest/README.md
+++ /dev/null
@@ -1,171 +0,0 @@
-# RSS News Digest
-
-Fetches articles from RSS/Atom feeds, downloads full article content, stores everything in SQLite with URL-based deduplication, and outputs a JSON digest to stdout. When Ollama is configured, each article is automatically summarized during fetch (truncated at the last sentence boundary within 100 words) and the summary is stored in the database.
-
-HTTP requests use automatic retries with exponential backoff for transient errors (429/5xx), browser-like headers to avoid 403 blocks, and rate limiting between content fetches.
-
-## Setup
-
-Install [uv](https://docs.astral.sh/uv/getting-started/installation/) (a fast Python package manager):
-
-```bash
-curl -LsSf https://astral.sh/uv/install.sh | sh
-```
-
-`run.sh` handles Python and dependency installation automatically — no manual venv or `pip install` needed.
-
-For AI-powered article summaries, install [Ollama](https://ollama.com) and pull a model:
-
-```bash
-ollama pull kamekichi128/qwen3-4b-instruct-2507
-```
-
-## Configuration
-
-Edit `config.json` to add feeds and adjust settings:
-
-```json
-{
-  "settings": {
-    "hours_lookback": 24,
-    "retention_days": 30,
-    "max_articles_per_feed": 10
-  },
-  "ollama": {
-    "model": "kamekichi128/qwen3-4b-instruct-2507",
-    "prompt": "Summarize the following news article in 2-3 concise sentences (around 50 words):"
-  },
-  "feeds": [
-    {
-      "url": "https://feeds.bbci.co.uk/news/rss.xml",
-      "name": "BBC News",
-      "category": "World",
-      "enabled": true
-    }
-  ]
-}
-```
-
-- **hours_lookback** — only include articles published within this many hours
-- **retention_days** — auto-delete articles older than this from the database
-- **max_articles_per_feed** — limit how many articles are saved per feed per run (0 = unlimited)
-- **ollama.model** — Ollama model name for article summaries (generated during fetch, truncated to 100 words at the last sentence boundary)
-- **ollama.prompt** — prompt sent to the model for each article
-- Removing the `ollama` key disables summarization; articles are still fetched normally
-- **feeds[].enabled** — set to `false` to skip a feed without removing it
-
-## Usage
-
-```bash
-# Fetch feeds and print recent articles (default fields: id, title, url, published_date, fetched_date, feed_name)
-./run.sh
-
-# Verbose mode (logs to stderr)
-./run.sh -v
-
-# Override lookback window to 48 hours
-./run.sh --hours 48
-
-# Include more fields in output
-./run.sh -f id,title,url,description,published_date
-
-# Include article summaries in output (requires ollama config)
-./run.sh -f id,title,url,summary
-
-# Include full article content in output
-./run.sh -f id,title,url,content
-
-# Review stored articles without fetching new ones
-./run.sh --no-fetch
-
-# Only purge old articles (no fetching or output)
-./run.sh --purge-only -v
-
-# Smoke test: full pipeline (fetch 1 article from first enabled feed + summarize)
-./run.sh --test
-
-# Test feed fetching only (1 article from first enabled feed)
-./run.sh --test feed
-
-# Test Ollama summarization only (uses hardcoded articles)
-./run.sh --test summary
-
-# Custom config and database paths
-./run.sh -c my_config.json -d my_news.db
-```
-
-## Examples
-
-### Fetch feeds and list articles
-
-```
-$ ./run.sh -v
-2026-02-21 22:41:44 DEBUG Ollama summarization enabled (model: qwen3)
-2026-02-21 22:41:44 INFO Feed 'BBC News': 35 new articles (of 36 within lookback)
-2026-02-21 22:43:54 INFO Feed 'NY Times': 10 new articles (of 10 within lookback)
-2026-02-21 22:44:57 INFO Feed 'Hacker News': 10 new articles (of 10 within lookback)
-2026-02-21 22:46:08 INFO Feed '联合早报 中国': 10 new articles (of 10 within lookback)
-2026-02-21 22:47:23 INFO Feed '澎湃新闻 热点': 10 new articles (of 10 within lookback)
-2026-02-21 22:48:49 INFO Feed '36氪 热榜': 8 new articles (of 8 within lookback)
-2026-02-21 22:48:59 INFO Total new articles saved: 83
-[
-  {"id": 1, "title": "Iran students stage first large anti-government protests since deadly crackdown", "url": "https://www.bbc.com/news/articles/..."},
-  {"id": 2, "title": "Trump says he will increase his new global tariffs to 15%", "url": "https://www.bbc.com/news/articles/..."},
-  ...
-]
-```
-
-### Review stored articles without fetching
-
-```
-$ ./run.sh --no-fetch -f id,title,feed_name
-[
-  {"id": 1, "title": "Iran students stage first large anti-government protests...", "feed_name": "BBC News"},
-  {"id": 79, "title": "韩国二次电池技术水平被中国反超", "feed_name": "联合早报 中国"},
-  {"id": 117, "title": "忍无可忍,Ilya宫斗奥特曼!微软CTO爆内幕...", "feed_name": "36氪 热榜"},
-  ...
-]
-```
-
-## Output
-
-Default output (JSON array to stdout):
-
-```json
-[
-  {"id": 1, "title": "Article Title", "url": "https://example.com/article"},
-  {"id": 2, "title": "Another Article", "url": "https://example.com/other"}
-]
-```
-
-With summary field (`-f id,title,url,summary`):
-
-```json
-[
-  {"id": 1, "title": "Article Title", "url": "https://example.com/article", "summary": "AI-generated summary..."},
-  {"id": 2, "title": "Another Article", "url": "https://example.com/other", "summary": "AI-generated summary..."}
-]
-```
-
-## Database
-
-Articles are stored in `news_digest.db` (SQLite) in the current directory by default. The database is created automatically on first run. Articles older than `retention_days` are purged at the start of each fetch run. Duplicate URLs are ignored via a UNIQUE constraint.
-
-Logs are written to `news_digest.log`. The log file is cleared at the start of each fetch cycle and appended to during the run.
-
-Each article stores metadata from the RSS feed (title, description, published date, author, etc.) plus the full article content fetched from the article URL. Content is extracted as plain text using BeautifulSoup.
-
-HTTP requests are made through a shared `requests.Session` with:
-- **Automatic retries** (up to 3 attempts with exponential backoff) for 429/5xx errors
-- **Browser-like headers** (User-Agent, Accept) to reduce 403 rejections
-- **Rate limiting** (Ollama latency between fetches when configured; 1-second fallback otherwise)
-
-Some sites (e.g. paywalled or bot-protected) may still return errors — in those cases the content field is left empty and the RSS description is used as a fallback for summaries.
-
-## Design notes
-
-- **Articles without dates are included by default.** `is_within_lookback` returns `True` when an article has no published date, and the query uses `OR published_date IS NULL`. This is intentional — silently dropping articles just because the feed omits a date would be worse than including them. If you only want dated articles, filter on `published_date` in the output.
-
-- **`generate_summary` accepts both `description` and `content`.** The `description` parameter is not redundant — `body = content or description` uses the RSS description as a fallback when `fetch_content()` fails and returns `None`. This ensures articles still get summarized even when the full page can't be fetched.
-
-- **`fetch_content` uses a chained ternary for element selection.** The expression `article if article else soup.body if soup.body else soup` picks the most specific container available. This is a common Python pattern and reads top-to-bottom as a priority list.
diff --git a/scripts/news_digest/config.json b/scripts/news_digest/config.json
deleted file mode 100644
index ea53808..0000000
--- a/scripts/news_digest/config.json
+++ /dev/null
@@ -1,43 +0,0 @@
-{
-  "settings": {
-    "hours_lookback": 24,
-    "retention_days": 14,
-    "max_articles_per_feed": 6
-  },
-  "ollama": {
-    "model": "kamekichi128/qwen3-4b-instruct-2507:latest",
-    "prompt": "Summarize the following news article in 2-3 concise sentences (around 50 words):"
-  },
-  "feeds": [
-    {
-      "url": "https://hnrss.org/frontpage",
-      "name": "Hacker News",
-      "category": "Tech",
-      "enabled": true
-    },
-    {
-      "url": "https://rsshub.isrss.com/zaobao/realtime/china",
-      "name": "联合早报 中国",
-      "category": "China",
-      "enabled": true
-    },
-    {
-      "url": "https://rsshub.isrss.com/thepaper/featured",
-      "name": "澎湃新闻 热点",
-      "category": "China",
-      "enabled": true
-    },
-    {
-      "url": "https://rsshub.isrss.com/reuters/world",
-      "name": "Reuters",
-      "category": "World",
-      "enabled": true
-    },
-    {
-      "url": "https://rsshub.isrss.com/readhub",
-      "name": "ReadHub",
-      "category": "Tech",
-      "enabled": true
-    }
-  ]
-}
diff --git a/scripts/news_digest/main.py b/scripts/news_digest/main.py
deleted file mode 100644
index 27e80db..0000000
--- a/scripts/news_digest/main.py
+++ /dev/null
@@ -1,499 +0,0 @@
-#!/usr/bin/env python3
-"""RSS News Digest — fetch feeds, store articles with full content in SQLite, and summarize via Ollama during fetch.
-
-Recommended: run via ./run.sh, which uses `uv` to handle dependencies
-automatically (no manual venv or pip install needed).
-
-When an `ollama` key is present in config.json, each newly fetched article is
-automatically summarized and the result is stored in the database. Summaries
-are truncated at the last sentence boundary within 100 words to keep them
-concise. Ollama latency provides natural rate limiting between HTTP requests;
-when Ollama is not configured, a 1-second sleep is used instead.
-
-The log file (news_digest.log) is cleared at the start of each fetch cycle
-and appended to during the run via run.sh.
-
-Uses a requests.Session with automatic retries and browser-like headers to
-handle transient HTTP errors (429/5xx). A configurable per-feed article cap
-helps avoid overwhelming upstream servers.
-
-Use ``--test`` to smoke-test feed fetching and/or Ollama summarization without
-writing to the database.
-"""
-
-import argparse
-import json
-import logging
-import sqlite3
-import sys
-import time
-from datetime import datetime, timedelta, timezone
-from pathlib import Path
-from time import mktime
-
-import feedparser
-import requests
-from bs4 import BeautifulSoup
-from requests.adapters import HTTPAdapter
-from urllib3.util.retry import Retry
-
-logger = logging.getLogger("news_digest")
-
-# Hardcoded test articles for --test summary (one English, one Chinese)
-_TEST_ARTICLES = [
-    {
-        "title": "Global Semiconductor Shortage Eases as New Factories Come Online",
-        "content": (
-            "The global chip shortage that disrupted industries from automotive to "
-            "consumer electronics is finally showing signs of relief. Major semiconductor "
-            "manufacturers including TSMC, Samsung, and Intel have begun production at new "
-            "fabrication plants in Arizona, Texas, and Japan. Industry analysts project that "
-            "global chip capacity will increase by 15% over the next 18 months, potentially "
-            "leading to a supply surplus in certain categories. The shift has already begun "
-            "to impact pricing, with memory chip costs dropping 12% in the last quarter."
-        ),
-    },
-    {
-        "title": "中国新能源汽车出口量首次突破年度600万辆大关",
-        "content": (
-            "据中国汽车工业协会最新数据,2025年中国新能源汽车出口量首次突破600万辆,"
-            "同比增长38%。比亚迪、上汽、蔚来等品牌在东南亚、欧洲和南美市场持续扩张。"
-            "分析人士指出,中国在电池技术和供应链方面的优势使其产品在全球市场具有较强"
-            "竞争力,但欧盟加征的反补贴关税可能对未来增长构成挑战。"
-        ),
-    },
-]
-
-
-def _build_session() -> requests.Session:
-    """Create a requests session with automatic retries and browser-like headers."""
-    session = requests.Session()
-    retry = Retry(
-        total=3,
-        backoff_factor=1,  # 1s, 2s, 4s between retries
-        status_forcelist=[429, 500, 502, 503, 504],
-        respect_retry_after_header=True,
-    )
-    adapter = HTTPAdapter(max_retries=retry)
-    session.mount("http://", adapter)
-    session.mount("https://", adapter)
-    session.headers.update({
-        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
-                      "AppleWebKit/537.36 (KHTML, like Gecko) "
-                      "Chrome/131.0.0.0 Safari/537.36",
-        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
-        "Accept-Language": "en-US,en;q=0.9",
-    })
-    return session
-
-
-_session = _build_session()
-
-
-def load_config(path: str) -> dict:
-    with open(path, encoding="utf-8") as f:
-        return json.load(f)
-
-
-def init_db(db_path: str) -> sqlite3.Connection:
-    conn = sqlite3.connect(db_path)
-    conn.row_factory = sqlite3.Row
-    conn.execute("""
-        CREATE TABLE IF NOT EXISTS articles (
-            id INTEGER PRIMARY KEY AUTOINCREMENT,
-            url TEXT UNIQUE NOT NULL,
-            title TEXT,
-            description TEXT,
-            content TEXT,
-            summary TEXT,
-            published_date TEXT,
-            fetched_date TEXT NOT NULL,
-            feed_name TEXT,
-            feed_url TEXT,
-            category TEXT,
-            author TEXT
-        )
-    """)
-    conn.commit()
-    return conn
-
-
-def parse_article_date(entry) -> datetime | None:
-    for attr in ("published_parsed", "updated_parsed"):
-        parsed = getattr(entry, attr, None)
-        if parsed:
-            return datetime.fromtimestamp(mktime(parsed), tz=timezone.utc)
-    return None
-
-
-def is_within_lookback(dt: datetime | None, hours: int) -> bool:
-    if dt is None:
-        return True
-    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
-    return dt >= cutoff
-
-
-def fetch_feed(url: str) -> list[dict]:
-    try:
-        resp = _session.get(url, timeout=30)
-        resp.raise_for_status()
-        raw = resp.content
-    except requests.RequestException as e:
-        logger.warning("Failed to fetch %s: %s", url, e)
-        return []
-
-    feed = feedparser.parse(raw)
-    if feed.bozo and not feed.entries:
-        logger.warning("Feed parse error for %s: %s", url, feed.bozo_exception)
-        return []
-    return feed.entries
-
-
-def fetch_content(url: str) -> str | None:
-    try:
-        resp = _session.get(url, timeout=15)
-        resp.raise_for_status()
-        html = resp.content
-    except requests.RequestException as e:
-        logger.warning("Failed to fetch content from %s: %s", url, e)
-        return None
-
-    soup = BeautifulSoup(html, "html.parser")
-
-    # Remove non-content elements
-    for tag in soup.find_all(["script", "style", "nav", "header", "footer", "aside", "form"]):
-        tag.decompose()
-
-    # Try common article content containers first
-    article = (
-        soup.find("article")
-        or soup.find(attrs={"role": "main"})
-        or soup.find("main")
-        or soup.find(class_=lambda c: c and ("article" in c or "content" in c or "post" in c))
-    )
-
-    target = article if article else soup.body if soup.body else soup
-    text = target.get_text(separator="\n", strip=True)
-
-    # Collapse excessive blank lines
-    lines = [line for line in text.splitlines() if line.strip()]
-    return "\n".join(lines) if lines else None
-
-
-def save_articles(conn: sqlite3.Connection, articles: list[dict]) -> list[str]:
-    """Insert articles, return list of URLs that were newly inserted."""
-    new_urls = []
-    now = datetime.now(timezone.utc).isoformat()
-    for a in articles:
-        try:
-            conn.execute(
-                """INSERT OR IGNORE INTO articles
-                   (url, title, description, published_date, fetched_date,
-                    feed_name, feed_url, category, author)
-                   VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
-                (
-                    a["url"],
-                    a.get("title"),
-                    a.get("description"),
-                    a.get("published_date"),
-                    now,
-                    a.get("feed_name"),
-                    a.get("feed_url"),
-                    a.get("category"),
-                    a.get("author"),
-                ),
-            )
-            if conn.execute("SELECT changes()").fetchone()[0] > 0:
-                new_urls.append(a["url"])
-        except sqlite3.Error as e:
-            logger.warning("DB insert error for %s: %s", a.get("url"), e)
-    conn.commit()
-    return new_urls
-
-
-def purge_old_articles(conn: sqlite3.Connection, days: int) -> int:
-    cutoff = (datetime.now(timezone.utc) - timedelta(days=days)).isoformat()
-    conn.execute("DELETE FROM articles WHERE fetched_date < ?", (cutoff,))
-    deleted = conn.execute("SELECT changes()").fetchone()[0]
-    conn.commit()
-    return deleted
-
-
-def get_recent_articles(conn: sqlite3.Connection, hours: int) -> list[dict]:
-    cutoff = (datetime.now(timezone.utc) - timedelta(hours=hours)).isoformat()
-    rows = conn.execute(
-        "SELECT * FROM articles WHERE published_date >= ? OR published_date IS NULL ORDER BY id",
-        (cutoff,),
-    ).fetchall()
-    return [dict(r) for r in rows]
-
-
-def _truncate_summary(text: str, max_words: int = 100) -> str:
-    """Truncate summary at the last sentence boundary within max_words."""
-    words = text.split()
-    if len(words) <= max_words:
-        return text
-    truncated = " ".join(words[:max_words])
-    # Find the last sentence-ending punctuation
-    last_period = -1
-    for ch in (".", "。", "!", "!", "?", "?"):
-        idx = truncated.rfind(ch)
-        if idx > last_period:
-            last_period = idx
-    if last_period > 0:
-        return truncated[: last_period + 1]
-    return truncated + "..."
-
-
-def generate_summary(title: str, description: str | None, content: str | None, model: str, prompt: str) -> str | None:
-    try:
-        import ollama as ollama_lib
-    except ImportError:
-        logger.warning("ollama package not installed; skipping summary")
-        return None
-
-    body = content or description
-    article_text = f"Title: {title}"
-    if body:
-        article_text += f"\n\n{body}"
-    user_message = f"{prompt}\n\n{article_text}"
-
-    try:
-        response = ollama_lib.chat(
-            model=model,
-            messages=[{"role": "user", "content": user_message}],
-        )
-        summary = response["message"]["content"]
-        return _truncate_summary(summary)
-    except Exception as e:
-        logger.warning("Ollama error for '%s': %s", title, e)
-        return None
-
-
-def _run_test(mode: str, config: dict) -> None:
-    """Run smoke tests for feed fetching, summarization, or both.
-
-    All JSON results go to stdout; status messages go to stderr.
-    """
-    if mode not in ("", "feed", "summary"):
-        print(f"Unknown test mode: {mode!r} (use 'feed', 'summary', or omit)", file=sys.stderr)
-        sys.exit(1)
-
-    feed_article = None  # may be populated by feed test for use in full mode
-
-    # --- Feed test ---
-    if mode in ("", "feed"):
-        print("=== Feed test ===", file=sys.stderr)
-        feeds = config.get("feeds", [])
-        enabled = [f for f in feeds if f.get("enabled", True)]
-        if not enabled:
-            print("FAIL: no enabled feeds in config", file=sys.stderr)
-            sys.exit(1)
-
-        feed_cfg = enabled[0]
-        url = feed_cfg["url"]
-        name = feed_cfg.get("name", url)
-        print(f"Fetching feed: {name} ({url})", file=sys.stderr)
-
-        entries = fetch_feed(url)
-        if not entries:
-            print("FAIL: no entries returned from feed", file=sys.stderr)
-            sys.exit(1)
-
-        entry = entries[0]
-        link = entry.get("link", "")
-        title = entry.get("title", "")
-        print(f"Fetching content: {link}", file=sys.stderr)
-        content = fetch_content(link) if link else None
-
-        result = {
-            "feed": name,
-            "title": title,
-            "url": link,
-            "content_length": len(content) if content else 0,
-        }
-        print(json.dumps(result, ensure_ascii=False, indent=2))
-
-        if content:
-            print("PASS: feed fetch", file=sys.stderr)
-            feed_article = {"title": title, "content": content}
-        else:
-            print("FAIL: could not fetch article content", file=sys.stderr)
-            if mode == "feed":
-                sys.exit(1)
-
-    # --- Summary test ---
-    if mode in ("", "summary"):
-        print("=== Summary test ===", file=sys.stderr)
-        ollama_cfg = config.get("ollama")
-        if not ollama_cfg:
-            print("FAIL: no 'ollama' key in config", file=sys.stderr)
-            sys.exit(1)
-
-        model = ollama_cfg.get("model", "kamekichi128/qwen3-4b-instruct-2507")
-        prompt = ollama_cfg.get("prompt", "Summarize the following news article in 2-3 concise sentences (around 50 words):")
-
-        # Build test inputs: hardcoded articles + fetched article (full mode only)
-        articles = list(_TEST_ARTICLES)
-        if feed_article:
-            articles.append(feed_article)
-
-        all_ok = True
-        for article in articles:
-            print(f"Summarizing: {article['title']}", file=sys.stderr)
-            summary = generate_summary(article["title"], None, article["content"], model, prompt)
-            result = {"title": article["title"], "summary": summary}
-            print(json.dumps(result, ensure_ascii=False, indent=2))
-            if not summary:
-                all_ok = False
-
-        if all_ok:
-            print("PASS: summary", file=sys.stderr)
-        else:
-            print("FAIL: one or more summaries failed", file=sys.stderr)
-            sys.exit(1)
-
-
-def main():
-    parser = argparse.ArgumentParser(description="RSS News Digest")
-    parser.add_argument("-c", "--config", default="config.json", help="Config file path")
-    parser.add_argument("-d", "--database", default="news_digest.db", help="SQLite database path")
-    parser.add_argument("--hours", type=int, help="Override lookback hours")
-    parser.add_argument("-f", "--fields", default="id,title,url,published_date,fetched_date,feed_name", help="Comma-separated output fields")
-    parser.add_argument("--purge-only", action="store_true", help="Only purge old articles")
-    parser.add_argument("--no-fetch", action="store_true", help="Skip fetching feeds, only query stored articles")
-    parser.add_argument("-v", "--verbose", action="store_true", help="Debug logging to stderr")
-    parser.add_argument("--test", nargs="?", const="", metavar="MODE",
-                        help="Smoke test: 'feed', 'summary', or omit for full pipeline")
-    args = parser.parse_args()
-
-    logging.basicConfig(
-        level=logging.DEBUG if args.verbose else logging.WARNING,
-        format="%(asctime)s %(levelname)s %(message)s",
-        stream=sys.stderr,
-    )
-
-    config_path = Path(args.config)
-    if not config_path.exists():
-        logger.error("Config file not found: %s", config_path)
-        sys.exit(1)
-
-    config = load_config(str(config_path))
-
-    # Handle --test before any DB operations
-    if args.test is not None:
-        _run_test(args.test, config)
-        return
-
-    settings = config.get("settings", {})
-    hours_lookback = args.hours or settings.get("hours_lookback", 24)
-    retention_days = settings.get("retention_days", 30)
-    max_per_feed = settings.get("max_articles_per_feed", 0)
-
-    conn = init_db(args.database)
-    try:
-        # Purge old articles
-        deleted = purge_old_articles(conn, retention_days)
-        if deleted:
-            logger.info("Purged %d articles older than %d days", deleted, retention_days)
-
-        if args.purge_only:
-            logger.info("Purge-only mode; exiting")
-            return
-
-        # Fetch feeds
-        if not args.no_fetch:
-            # Clear log file before starting a new fetch cycle
-            log_file = Path(__file__).resolve().parent / "news_digest.log"
-            log_file.write_text("")
-
-            feeds = config.get("feeds", [])
-            total_new = 0
-
-            # Read ollama config once for summarization during fetch
-            ollama_cfg = config.get("ollama")
-            if ollama_cfg:
-                ollama_model = ollama_cfg.get("model", "kamekichi128/qwen3-4b-instruct-2507")
-                ollama_prompt = ollama_cfg.get("prompt", "Summarize the following news article in 2-3 concise sentences (around 50 words):")
-                logger.debug("Ollama summarization enabled (model: %s)", ollama_model)
-            else:
-                ollama_model = ollama_prompt = None
-                logger.debug("Ollama not configured; skipping summarization")
-
-            for feed_cfg in feeds:
-                if not feed_cfg.get("enabled", True):
-                    logger.debug("Skipping disabled feed: %s", feed_cfg.get("name"))
-                    continue
-
-                url = feed_cfg["url"]
-                logger.debug("Fetching feed: %s (%s)", feed_cfg.get("name", url), url)
-                entries = fetch_feed(url)
-                logger.debug("Got %d entries from %s", len(entries), feed_cfg.get("name", url))
-
-                articles = []
-                for entry in entries:
-                    pub_date = parse_article_date(entry)
-                    if not is_within_lookback(pub_date, hours_lookback):
-                        continue
-
-                    link = entry.get("link", "")
-                    if not link:
-                        continue
-
-                    articles.append({
-                        "url": link,
-                        "title": entry.get("title"),
-                        "description": entry.get("summary"),
-                        "published_date": pub_date.isoformat() if pub_date else None,
-                        "feed_name": feed_cfg.get("name"),
-                        "feed_url": url,
-                        "category": feed_cfg.get("category"),
-                        "author": entry.get("author"),
-                    })
-
-                # Cap articles per feed to avoid flooding the DB and downstream fetches
-                if max_per_feed > 0:
-                    articles = articles[:max_per_feed]
-
-                new_urls = save_articles(conn, articles)
-                total_new += len(new_urls)
-                logger.info("Feed '%s': %d new articles (of %d within lookback)",
-                            feed_cfg.get("name", url), len(new_urls), len(articles))
-
-                # Fetch full content and optionally summarize newly inserted articles
-                for i, article_url in enumerate(new_urls):
-                    if i > 0 and not ollama_cfg:
-                        time.sleep(1)  # rate limit when Ollama isn't providing natural delay
-                    logger.debug("Fetching content: %s", article_url)
-                    content = fetch_content(article_url)
-                    summary = None
-                    if ollama_cfg:
-                        row = conn.execute(
-                            "SELECT title, description FROM articles WHERE url = ?", (article_url,)
-                        ).fetchone()
-                        if row:
-                            summary = generate_summary(row["title"], row["description"], content, ollama_model, ollama_prompt)
-                            if summary:
-                                logger.debug("Generated summary for %s", article_url)
-                            else:
-                                if i > 0:
-                                    time.sleep(1)  # fallback rate limit on summary failure
-                    conn.execute(
-                        "UPDATE articles SET content = ?, summary = ? WHERE url = ?",
-                        (content, summary, article_url),
-                    )
-                    conn.commit()
-
-            logger.info("Total new articles saved: %d", total_new)
-
-        # Output recent articles
-        recent = get_recent_articles(conn, hours_lookback)
-        fields = [f.strip() for f in args.fields.split(",")]
-        output = [{k: article[k] for k in fields if k in article} for article in recent]
-        print(json.dumps(output, ensure_ascii=False, indent=2))
-    finally:
-        conn.close()
-
-
-if __name__ == "__main__":
-    main()
diff --git a/scripts/news_digest/requirements.txt b/scripts/news_digest/requirements.txt
deleted file mode 100644
index ddba4d7..0000000
--- a/scripts/news_digest/requirements.txt
+++ /dev/null
@@ -1,4 +0,0 @@
-feedparser>=6.0.0
-beautifulsoup4>=4.12.0
-requests>=2.31.0
-ollama>=0.4.0
diff --git a/scripts/news_digest/run.sh b/scripts/news_digest/run.sh
deleted file mode 100755
index 159d85f..0000000
--- a/scripts/news_digest/run.sh
+++ /dev/null
@@ -1,5 +0,0 @@
-#!/usr/bin/env bash
-set -euo pipefail
-SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
-LOG_FILE="$SCRIPT_DIR/news_digest.log"
-uv run --python 3.12 --with-requirements "$SCRIPT_DIR/requirements.txt" "$SCRIPT_DIR/main.py" -v "$@" 2>>"$LOG_FILE"
diff --git a/scripts/news_digest/send_digest.py b/scripts/news_digest/send_digest.py
deleted file mode 100755
index b33854f..0000000
--- a/scripts/news_digest/send_digest.py
+++ /dev/null
@@ -1,175 +0,0 @@
-#!/usr/bin/env python3
-"""
-Daily News Digest - Fetch news and send email summary
-Runs news_digest, formats output, sends via himalaya
-"""
-
-import json
-import subprocess
-import sys
-from datetime import datetime
-from pathlib import Path
-
-# Config
-SCRIPT_DIR = Path(__file__).resolve().parent
-DB_PATH = SCRIPT_DIR / "news_digest.db"
-CONFIG_PATH = SCRIPT_DIR / "config.json"
-
-# Email config
-EMAIL_TO = "lu@luyx.org"
-EMAIL_FROM = "youlu@luyanxin.com"
-
-def run_news_digest():
-    """Run news_digest and get articles with summaries"""
-    # Fetch new articles using run.sh (handles uv dependencies)
-    fetch_cmd = [
-        str(SCRIPT_DIR / "run.sh"),
-    ]
-
-    # Run fetch (this saves articles to DB and generates summaries)
-    try:
-        result = subprocess.run(fetch_cmd, capture_output=True, text=True, cwd=SCRIPT_DIR, timeout=300)
-    except subprocess.TimeoutExpired:
-        print("Fetch timed out after 300s", file=sys.stderr)
-        sys.exit(1)
-    if result.returncode != 0:
-        print(f"Fetch failed: {result.stderr}", file=sys.stderr)
-        return None
-
-    # Now query with summary field
-    query_cmd = [
-        str(SCRIPT_DIR / "run.sh"),
-        "--no-fetch",
-        "-f", "id,title,url,summary,feed_name,category"
-    ]
-
-    try:
-        result = subprocess.run(query_cmd, capture_output=True, text=True, cwd=SCRIPT_DIR, timeout=30)
-    except subprocess.TimeoutExpired:
-        print("Query timed out after 30s", file=sys.stderr)
-        sys.exit(1)
-    if result.returncode != 0:
-        print(f"Query failed: {result.stderr}", file=sys.stderr)
-        return None
-
-    # Parse JSON output
-    try:
-        articles = json.loads(result.stdout)
-        return articles
-    except json.JSONDecodeError as e:
-        print(f"Failed to parse JSON: {e}", file=sys.stderr)
-        print(f"Raw output: {result.stdout[:500]}", file=sys.stderr)
-        return []
-
-def format_email(articles):
-    """Format articles into email body"""
-    if not articles:
-        return None
-
-    today = datetime.now().strftime('%Y-%m-%d')
-
-    lines = [
-        f"📰 每日新闻摘要 ({today})",
-        "=" * 50,
-        ""
-    ]
-
-    # Group by category
-    by_category = {}
-    for a in articles:
-        cat = a.get('category', 'Other')
-        if cat not in by_category:
-            by_category[cat] = []
-        by_category[cat].append(a)
-
-    # Sort categories
-    category_order = ['Tech', 'China', 'World', 'Other']
-    ordered_cats = category_order + [c for c in by_category if c not in category_order]
-
-    for cat in ordered_cats:
-        if cat not in by_category:
-            continue
-
-        cat_name = {
-            'Tech': '💻 科技',
-            'China': '🇨🇳 国内',
-            'World': '🌍 国际',
-            'Other': '📄 其他'
-        }.get(cat, cat)
-
-        lines.append(f"\n{cat_name}")
-        lines.append("-" * 30)
-
-        for article in by_category[cat]:
-            title = article.get('title', '无标题')
-            url = article.get('url', '')
-            summary = article.get('summary', '')
-            source = article.get('feed_name', '未知来源')
-
-            lines.append(f"\n• {title}")
-            if summary:
-                # Clean up summary (remove excessive newlines, preserve content)
-                summary_clean = summary.replace('\n', ' ').strip()
-                lines.append(f" {summary_clean}")
-            lines.append(f" 来源: {source} | {url}")
-
-    lines.append("\n" + "=" * 50)
-    lines.append("由 RSS News Digest 自动生成")
-
-    return '\n'.join(lines)
-
-def send_email(subject, body):
-    """Send email via himalaya"""
-    # Construct raw email message with headers
-    message = f"""From: {EMAIL_FROM}
-To: {EMAIL_TO}
-Subject: {subject}
-Content-Type: text/plain; charset=utf-8
-
-{body}"""
-
-    cmd = [
-        "himalaya",
-        "message", "send",
-        "-a", "Youlu"  # Account name from config.toml
-    ]
-
-    try:
-        result = subprocess.run(cmd, capture_output=True, text=True, input=message, timeout=30)
-    except subprocess.TimeoutExpired:
-        return False, "himalaya timed out after 30s"
-    return result.returncode == 0, result.stderr
-
-def main():
-    # Fetch and get articles
-    articles = run_news_digest()
-
-    if articles is None:
-        # Fetch failed, don't send email
-        sys.exit(1)
-
-    if not articles:
-        # No new articles, send a "no news" email or skip
-        # Let's skip sending email if no articles
-        print("No articles to send")
-        return
-
-    # Format email
-    body = format_email(articles)
-    if not body:
-        print("Failed to format email")
-        sys.exit(1)
-
-    # Send email
-    today = datetime.now().strftime('%Y-%m-%d')
-    subject = f"📰 每日新闻摘要 - {today}"
-
-    success, error = send_email(subject, body)
-    if success:
-        print(f"Email sent successfully: {len(articles)} articles")
-    else:
-        print(f"Failed to send email: {error}", file=sys.stderr)
-        sys.exit(1)
-
-if __name__ == "__main__":
-    main()