100 word limit

Yanxin Lu
2026-02-22 14:29:01 -08:00
parent ef92820954
commit d4a4768e18
2 changed files with 30 additions and 6 deletions


@@ -1,6 +1,6 @@
 # RSS News Digest
-Fetches articles from RSS/Atom feeds, downloads full article content, stores everything in SQLite with URL-based deduplication, and outputs a JSON digest to stdout. When Ollama is configured, each article is automatically summarized during fetch and the summary is stored in the database.
+Fetches articles from RSS/Atom feeds, downloads full article content, stores everything in SQLite with URL-based deduplication, and outputs a JSON digest to stdout. When Ollama is configured, each article is automatically summarized during fetch (truncated at the last sentence boundary within 100 words) and the summary is stored in the database.
 HTTP requests use automatic retries with exponential backoff for transient errors (429/5xx), browser-like headers to avoid 403 blocks, and rate limiting between content fetches.
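The retry-and-backoff behavior described above can be sketched with `urllib3`'s `Retry` helper mounted on a `requests.Session`. The function name, retry counts, and User-Agent string below are illustrative, not the script's actual values:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session() -> requests.Session:
    """Build a Session with retries and browser-like headers (illustrative)."""
    session = requests.Session()
    retry = Retry(
        total=3,                                     # up to 3 retries per request
        backoff_factor=1.0,                          # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # transient errors worth retrying
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    # Browser-like headers reduce the chance of 403 blocks from some sites
    session.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})
    return session
```

Mounting the adapter on both schemes makes every request through the session retry transparently, so callers need no retry logic of their own.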
@@ -49,7 +49,7 @@ Edit `config.json` to add feeds and adjust settings:
 - **hours_lookback** — only include articles published within this many hours
 - **retention_days** — auto-delete articles older than this from the database
 - **max_articles_per_feed** — limit how many articles are saved per feed per run (0 = unlimited)
-- **ollama.model** — Ollama model name for article summaries (generated during fetch)
+- **ollama.model** — Ollama model name for article summaries (generated during fetch, truncated to 100 words at the last sentence boundary)
 - **ollama.prompt** — prompt sent to the model for each article
 - Removing the `ollama` key disables summarization; articles are still fetched normally
 - **feeds[].enabled** — set to `false` to skip a feed without removing it
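Taken together, a `config.json` using the options above might look like this; the feed URL, model name, and prompt are placeholders, not values from the repository:

```json
{
  "hours_lookback": 24,
  "retention_days": 30,
  "max_articles_per_feed": 0,
  "ollama": {
    "model": "llama3.2",
    "prompt": "Summarize this article in a few sentences."
  },
  "feeds": [
    {"url": "https://example.com/feed.xml", "enabled": true}
  ]
}
```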
@@ -151,6 +151,8 @@ With summary field (`-f id,title,url,summary`):
Articles are stored in `news_digest.db` (SQLite) in the current directory by default. The database is created automatically on first run. Articles older than `retention_days` are purged at the start of each fetch run. Duplicate URLs are ignored via a UNIQUE constraint.
Logs are written to `news_digest.log`. The log file is cleared at the start of each fetch cycle and appended to during the run.
Each article stores metadata from the RSS feed (title, description, published date, author, etc.) plus the full article content fetched from the article URL. Content is extracted as plain text using BeautifulSoup.
HTTP requests are made through a shared `requests.Session` with:
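The URL-based deduplication described above can be sketched with a UNIQUE constraint and `INSERT OR IGNORE`; the table and column names here are illustrative, not necessarily the script's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the script uses news_digest.db on disk
conn.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL UNIQUE,   -- duplicate URLs violate this constraint
        title TEXT
    )
""")
# INSERT OR IGNORE silently skips rows whose URL already exists
conn.execute("INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)",
             ("https://example.com/a", "First"))
conn.execute("INSERT OR IGNORE INTO articles (url, title) VALUES (?, ?)",
             ("https://example.com/a", "Duplicate"))
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(count)  # 1 — the duplicate row was ignored
```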


@@ -5,9 +5,13 @@ Recommended: run via ./run.sh, which uses `uv` to handle dependencies
 automatically (no manual venv or pip install needed).
 When an `ollama` key is present in config.json, each newly fetched article is
-automatically summarized and the result is stored in the database. Ollama
-latency provides natural rate limiting between HTTP requests; when Ollama is
-not configured, a 1-second sleep is used instead.
+automatically summarized and the result is stored in the database. Summaries
+are truncated at the last sentence boundary within 100 words to keep them
+concise. Ollama latency provides natural rate limiting between HTTP requests;
+when Ollama is not configured, a 1-second sleep is used instead.
 The log file (news_digest.log) is cleared at the start of each fetch cycle
 and appended to during the run via run.sh.
 Uses a requests.Session with automatic retries and browser-like headers to
 handle transient HTTP errors (429/5xx). A configurable per-feed article cap
@@ -224,6 +228,23 @@ def get_recent_articles(conn: sqlite3.Connection, hours: int) -> list[dict]:
     return [dict(r) for r in rows]
+def _truncate_summary(text: str, max_words: int = 100) -> str:
+    """Truncate summary at the last sentence boundary within max_words."""
+    words = text.split()
+    if len(words) <= max_words:
+        return text
+    truncated = " ".join(words[:max_words])
+    # Find the last sentence-ending punctuation
+    last_period = -1
+    for ch in (".", "。", "!", "！", "?", "？"):
+        idx = truncated.rfind(ch)
+        if idx > last_period:
+            last_period = idx
+    if last_period > 0:
+        return truncated[: last_period + 1]
+    return truncated + "..."
 def generate_summary(title: str, description: str | None, content: str | None, model: str, prompt: str) -> str | None:
     try:
         import ollama as ollama_lib
@@ -242,7 +263,8 @@ def generate_summary(title: str, description: str | None, content: str | None, m
             model=model,
             messages=[{"role": "user", "content": user_message}],
         )
-        return response["message"]["content"]
+        summary = response["message"]["content"]
+        return _truncate_summary(summary)
     except Exception as e:
         logger.warning("Ollama error for '%s': %s", title, e)
         return None