vault backup: 2026-02-11 11:14:35
80
notes/martial_arts/CONVERSION_PLAN.md
Normal file
@@ -0,0 +1,80 @@
# Convert Martial Arts PDF Notes to Markdown

> **Slash command**: This workflow is saved as a Claude Code skill at `.claude/commands/convert-pdfs.md`. In any Claude Code session, run `/project:convert-pdfs` to execute it automatically.

Convert all handwritten martial arts training note PDFs in `notes/martial_arts/` (including subdirectories) into structured markdown files for the Obsidian vault.

## Pipeline

1. **Discover PDFs**: Glob `notes/martial_arts/**/*.pdf` recursively. Skip any PDF that already has a matching `.md` file in the same directory, so the run is safe to restart after a crash.

2. **Convert PDF to PNG**: For each unconverted PDF, create a unique temp directory and run:
   ```bash
   mkdir -p /tmp/pdf_pages/<basename>
   pdftoppm -png -r 200 <input.pdf> /tmp/pdf_pages/<basename>/page
   ```
   Requires `poppler` (`brew install poppler`).

3. **Launch Task subagents**: For each PDF, launch a `general-purpose` Task subagent in the background. Each subagent:
   - Reads the PNG page images visually
   - Transcribes the handwritten content (a mix of English and Chinese)
   - Writes a `.md` file in the same directory as the source PDF

   Use background subagents to process multiple PDFs in parallel (batches of ~5). Each subagent starts with a fresh context, so image data does not accumulate toward the 30MB API request limit.

4. **Verify**: After all subagents complete, confirm every PDF has a matching `.md` file.
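
The non-agent parts of the pipeline (discovery, page rendering, batching, and the final check) can be sketched in Python as follows. This is a minimal sketch, not the skill's actual implementation: the function names are hypothetical, and only the paths, the `pdftoppm` flags, and the batch size of 5 come from the steps above. It also assumes PDF stems are unique across subdirectories, since the temp directory is keyed on the basename alone.

```python
import subprocess
from pathlib import Path

ROOT = Path("notes/martial_arts")   # source directory from the plan
TMP_ROOT = Path("/tmp/pdf_pages")   # temp root from the plan
BATCH_SIZE = 5                      # "batches of ~5"


def find_unconverted(root: Path) -> list[Path]:
    """Return PDFs under root that have no sibling .md with the same stem."""
    return sorted(
        pdf for pdf in root.rglob("*.pdf")
        if not pdf.with_suffix(".md").exists()
    )


def pdftoppm_command(pdf: Path, tmp_root: Path = TMP_ROOT) -> tuple[Path, list[str]]:
    """Build the per-PDF temp directory and the pdftoppm argument list."""
    out_dir = tmp_root / pdf.stem
    cmd = ["pdftoppm", "-png", "-r", "200", str(pdf), str(out_dir / "page")]
    return out_dir, cmd


def render_pages(pdf: Path) -> list[Path]:
    """Render one PDF to PNG pages (requires poppler on PATH)."""
    out_dir, cmd = pdftoppm_command(pdf)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)
    return sorted(out_dir.glob("page*.png"))


def batches(items: list, size: int = BATCH_SIZE) -> list[list]:
    """Split items into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Because `find_unconverted` checks the filesystem each time it is called, running it again after the subagents finish doubles as the verification step: an empty result means every PDF has a matching `.md` file.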

## Markdown Format

All generated `.md` files must include YAML frontmatter matching `templates/武术笔记.md`:

```markdown
---
类型: 笔记
tags:
  - 笔记
  - 武术
日期: <date from filename, e.g. 2024-08-06>
老师: <instructor name from filename>
武术: <martial art name>
---

# [Title]

**日期**: MM.DD

## 1. [Section Title]

a. [detail]
b. [detail]
```

### Filename pattern
Filenames follow: `<art>-<YYYY.MM.DD>-<instructor>.pdf`
- Extract date, instructor, and art from the filename. The filename uses dots in the date, while the `日期` frontmatter example uses hyphens, so `2024.08.06` becomes `2024-08-06`.
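
One way to pull the three fields out of a filename is a single regular expression. The sketch below is illustrative only: `parse_filename` and `NoteInfo` are hypothetical names, and the instructor name in the usage example is made up. It assumes the date is always zero-padded as `YYYY.MM.DD` and that the art name may itself contain spaces (as in `Muay Thai`).

```python
import re
from datetime import date
from typing import NamedTuple

# <art>-<YYYY.MM.DD>-<instructor>.pdf
FILENAME_RE = re.compile(
    r"(?P<art>.+)-(?P<y>\d{4})\.(?P<m>\d{2})\.(?P<d>\d{2})-(?P<instructor>.+)\.pdf"
)


class NoteInfo(NamedTuple):
    art: str
    date: date
    instructor: str


def parse_filename(name: str) -> NoteInfo:
    """Split a note filename into art, date, and instructor."""
    m = FILENAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"filename does not match the expected pattern: {name!r}")
    return NoteInfo(
        art=m["art"],
        date=date(int(m["y"]), int(m["m"]), int(m["d"])),
        instructor=m["instructor"],
    )
```

For example, `parse_filename("FMA-2024.08.06-Rey.pdf").date.isoformat()` gives `"2024-08-06"`, the hyphenated form used in the frontmatter.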

### Title conventions by art
- **FMA/Silat/SEAMA**: `# [Art] — [Instructor] 师傅`
- **八极拳 (Bajiquan)**: `# 八极拳 Lesson [NNN]` (identify lesson number from the handwritten notes if possible)
- **劈挂拳 (Piguaquan)**: `# 劈挂拳 — [Instructor] 师傅`
- **Other** (MMA, Muay Thai, Lethwei, etc.): `# [Art] — [Instructor] 师傅`

### 武术 field values
Use these canonical names: `FMA`, `Silat`, `SEAMA`, `八极拳`, `劈挂拳`, `MMA`, `Muay Thai`, `Lethwei`
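
The title rules and the frontmatter can be pre-filled before a subagent is launched. The following is a rough sketch under a few assumptions that are not in the plan: the function names are hypothetical, a non-canonical art name is rejected with an error, and when the 八极拳 lesson number is not yet known the `[NNN]` placeholder is left in place for the subagent to fill in from the notes.

```python
from typing import Optional

CANONICAL_ARTS = (
    "FMA", "Silat", "SEAMA", "八极拳", "劈挂拳", "MMA", "Muay Thai", "Lethwei",
)


def build_title(art: str, instructor: str, lesson: Optional[str] = None) -> str:
    """Apply the per-art title convention."""
    if art == "八极拳":
        # Assumption: keep the placeholder until the lesson number is read
        # from the handwritten notes.
        return f"# 八极拳 Lesson {lesson if lesson is not None else '[NNN]'}"
    return f"# {art} — {instructor} 师傅"


def build_frontmatter(art: str, iso_date: str, instructor: str) -> str:
    """Render the YAML frontmatter block from the template above."""
    if art not in CANONICAL_ARTS:
        # Assumption: treat unknown art names as an error rather than guess.
        raise ValueError(f"not a canonical 武术 value: {art!r}")
    lines = [
        "---",
        "类型: 笔记",
        "tags:",
        "  - 笔记",
        "  - 武术",
        f"日期: {iso_date}",
        f"老师: {instructor}",
        f"武术: {art}",
        "---",
    ]
    return "\n".join(lines) + "\n"
```

Keeping the canonical names in one tuple means a misspelled art name fails early instead of producing an inconsistent `武术` value in the vault.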

## Subagent Prompt Template

When launching each Task subagent, provide:
- The list of PNG file paths to read visually
- The output `.md` file path
- The pre-filled YAML frontmatter
- The title to use
- An example of a completed conversion for reference
- Instructions to transcribe faithfully, preserving both English and Chinese as written
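
As an illustration, the six items above could be assembled into one prompt string like this. The wording of the prompt and the function name `build_subagent_prompt` are assumptions made for the sketch; the plan specifies only what the prompt must contain, not how it is phrased.

```python
def build_subagent_prompt(
    png_paths: list[str],
    output_md: str,
    frontmatter: str,
    title: str,
    example_md: str,
) -> str:
    """Combine the required inputs into a single prompt for one subagent."""
    pages = "\n".join(f"- {p}" for p in png_paths)
    return (
        "Read these PNG page images visually, in order:\n"
        f"{pages}\n\n"
        f"Write the result to: {output_md}\n\n"
        "Start the file with exactly this YAML frontmatter:\n"
        f"{frontmatter}\n"
        f"Use this title: {title}\n\n"
        "Example of a completed conversion, for reference:\n"
        f"{example_md}\n\n"
        "Transcribe faithfully, preserving both English and Chinese "
        "as written."
    )
```

Building the prompt from explicit arguments keeps each subagent self-contained: it receives only the pages for its own PDF, which fits the goal of giving every subagent a fresh, small context.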

## Previous Issues & Lessons Learned

- **30MB API limit**: The first attempt used a Python/Quartz script at 3x resolution. Processing multiple PDFs sequentially in one conversation accumulated image data until it exceeded the limit. Fix: use Task subagents (each gets fresh context) and `pdftoppm` at 200 DPI.
- **Use unique temp dirs**: When parallelizing, use `/tmp/pdf_pages/<basename>/` per PDF instead of a shared `/tmp/pdf_pages/` to avoid collisions.
- **Resilience**: Skipping PDFs that already have a matching `.md` file makes the process safe to restart after crashes.
- **Discovery is dynamic**: Always glob for PDFs at runtime; do not hardcode file lists.