100 word limit

Yanxin Lu
2026-02-22 14:29:01 -08:00
parent ef92820954
commit d4a4768e18
2 changed files with 30 additions and 6 deletions


@@ -1,6 +1,6 @@
# RSS News Digest
- Fetches articles from RSS/Atom feeds, downloads full article content, stores everything in SQLite with URL-based deduplication, and outputs a JSON digest to stdout. When Ollama is configured, each article is automatically summarized during fetch and the summary is stored in the database.
+ Fetches articles from RSS/Atom feeds, downloads full article content, stores everything in SQLite with URL-based deduplication, and outputs a JSON digest to stdout. When Ollama is configured, each article is automatically summarized during fetch (truncated at the last sentence boundary within 100 words) and the summary is stored in the database.
HTTP requests use automatic retries with exponential backoff for transient errors (429/5xx), browser-like headers to avoid 403 blocks, and rate limiting between content fetches.
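The "truncated at the last sentence boundary within 100 words" behavior this commit documents could be implemented along these lines — a minimal sketch, not the project's actual code; the helper name and regex are assumptions:

```python
import re

def truncate_summary(text: str, max_words: int = 100) -> str:
    """Clip text to max_words, then back up to the last full sentence."""
    words = text.split()
    if len(words) <= max_words:
        return text
    clipped = " ".join(words[:max_words])
    # Keep the longest prefix that ends in sentence-final punctuation.
    match = re.search(r"(?s)^.*[.!?]", clipped)
    return match.group(0) if match else clipped
```

If no sentence boundary falls inside the first 100 words, this sketch falls back to a hard word cut rather than returning nothing.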
@@ -49,7 +49,7 @@ Edit `config.json` to add feeds and adjust settings:
- **hours_lookback** — only include articles published within this many hours
- **retention_days** — auto-delete articles older than this from the database
- **max_articles_per_feed** — limit how many articles are saved per feed per run (0 = unlimited)
- - **ollama.model** — Ollama model name for article summaries (generated during fetch)
+ - **ollama.model** — Ollama model name for article summaries (generated during fetch, truncated to 100 words at the last sentence boundary)
- **ollama.prompt** — prompt sent to the model for each article
- Removing the `ollama` key disables summarization; articles are still fetched normally
- **feeds[].enabled** — set to `false` to skip a feed without removing it
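A minimal `config.json` illustrating the options above might look like the following — the feed URL, model name, and values are placeholders, not taken from the repository:

```json
{
  "hours_lookback": 24,
  "retention_days": 30,
  "max_articles_per_feed": 0,
  "ollama": {
    "model": "llama3",
    "prompt": "Summarize this article in under 100 words."
  },
  "feeds": [
    {"url": "https://example.com/feed.xml", "enabled": true}
  ]
}
```

Deleting the `"ollama"` object entirely would disable summarization while leaving fetching intact, per the note above.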
@@ -151,6 +151,8 @@ With summary field (`-f id,title,url,summary`):
Articles are stored in `news_digest.db` (SQLite) in the current directory by default. The database is created automatically on first run. Articles older than `retention_days` are purged at the start of each fetch run. Duplicate URLs are ignored via a UNIQUE constraint.
Logs are written to `news_digest.log`. The log file is cleared at the start of each fetch cycle and appended to during the run.
Each article stores metadata from the RSS feed (title, description, published date, author, etc.) plus the full article content fetched from the article URL. Content is extracted as plain text using BeautifulSoup.
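The URL-based deduplication via a UNIQUE constraint can be sketched with `sqlite3` — table and column names here are illustrative, not the tool's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real tool uses news_digest.db

conn.execute(
    "CREATE TABLE IF NOT EXISTS articles ("
    "  id INTEGER PRIMARY KEY,"
    "  url TEXT UNIQUE,"  # UNIQUE constraint enforces URL-based dedup
    "  title TEXT)"
)

def save_article(url: str, title: str) -> bool:
    """Insert an article; return False if the URL was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)",
        (url, title),
    )
    conn.commit()
    return cur.rowcount == 1
```

`INSERT OR IGNORE` lets the database silently drop duplicate URLs, so the fetch loop never needs to check for an existing row first.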
HTTP requests are made through a shared `requests.Session` with:
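The list following that line is cut off in this view, but a shared session with the behavior described earlier (exponential-backoff retries on 429/5xx, browser-like headers) could be built along these lines — the retry count, backoff factor, and header values are assumptions:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    session = requests.Session()
    retry = Retry(
        total=3,             # assumed retry count
        backoff_factor=1.0,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Browser-like headers to avoid naive 403 blocks
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
        "Accept": "text/html,application/xhtml+xml",
    })
    return session
```

Mounting one adapter for both schemes keeps retry policy and headers consistent across feed fetches and article downloads.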