comment from youlu

2026-02-22 11:03:00 -08:00
parent 20ee5c2211
commit c0d0a32c8c
2 changed files with 103 additions and 96 deletions
--- a/scripts/news_digest/README.md
+++ b/scripts/news_digest/README.md
@@ -159,3 +159,11 @@ HTTP requests are made through a shared `requests.Session` with:
 - **Rate limiting** (Ollama latency between fetches when configured; 1-second fallback otherwise)

 Some sites (e.g. paywalled or bot-protected) may still return errors — in those cases the content field is left empty and the RSS description is used as a fallback for summaries.
+
+## Design notes
+
+- **Articles without dates are included by default.** `is_within_lookback` returns `True` when an article has no published date, and the query uses `OR published_date IS NULL`. This is intentional: silently dropping articles because the feed omits a date would be worse than including them. To keep only dated articles, filter on `published_date` in the output.
+
+- **`generate_summary` accepts both `description` and `content`.** The `description` parameter is not redundant: `body = content or description` falls back to the RSS description whenever `content` is falsy, for example when `fetch_content()` fails and returns `None`. This ensures articles still get summarized even when the full page can't be fetched.
+
+- **`fetch_content` uses a chained ternary for element selection.** The expression `article if article else soup.body if soup.body else soup` picks the most specific container available: `article` if it is truthy, otherwise `soup.body`, otherwise the whole `soup`. Chained conditional expressions are evaluated left to right, so the expression reads as a priority list.
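
The three design notes above can be illustrated with a minimal, standard-library-only sketch. This is not the script's actual code: the signature of `is_within_lookback` (including the `days` and `now` parameters) is assumed, and `choose_body` and `pick_container` are hypothetical helpers that only mirror the shape of the expressions quoted in the notes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_within_lookback(published: Optional[datetime], days: int,
                       now: Optional[datetime] = None) -> bool:
    """Assumed signature: undated articles are kept, dated ones must be recent."""
    if published is None:
        # The feed omitted a date: include the article rather than drop it.
        return True
    now = now or datetime.now(timezone.utc)
    return now - published <= timedelta(days=days)


def choose_body(content: Optional[str], description: Optional[str]) -> Optional[str]:
    """Hypothetical helper mirroring `body = content or description`."""
    # Any falsy content (None or an empty string) falls back to the description.
    return content or description


def pick_container(article, body, soup):
    """Hypothetical helper with the same shape as the chained ternary."""
    # Evaluated left to right: the first truthy candidate wins.
    return article if article else body if body else soup
```

In this sketch, `choose_body("", "desc")` and `choose_body(None, "desc")` both yield the description, which matches the README's statement that an empty content field falls back to the RSS description.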