Add research catalog CLI and training plan

2025-12-30 16:59:22 -05:00
parent 5b600a4a11
commit f37ad164bc
12 changed files with 586 additions and 2 deletions
--- a/docs/PDF_WORKFLOW.md
+++ b/docs/PDF_WORKFLOW.md
@@ -0,0 +1,34 @@
+# PDF Workflow
+
+Goal: keep research PDFs in a known place, catalog them, and open them fast.
+
+## Defaults
+- Research root: `~/Documents/Research`
+- Catalog output: `~/src/context/index/research_catalog.json`
+
+## Commands
+```sh
+python -m afs_scawful research catalog
+python -m afs_scawful research list
+python -m afs_scawful research show 2512-20957v2-XXXXXXXX
+python -m afs_scawful research open 2512-20957v2-XXXXXXXX --open
+```
+
+## Overrides
+- `AFS_RESEARCH_ROOT=/path/to/Research`
+- `AFS_RESEARCH_CATALOG=/path/to/research_catalog.json`
+- Optional config: `research_paths.toml` in `~/.config/afs/afs_scawful/` or
+  `~/.config/afs/plugins/afs_scawful/config/`
+
+Example `research_paths.toml`:
+```toml
+[paths]
+research_root = "~/Documents/Research"
+research_catalog = "~/src/context/index/research_catalog.json"
+```
+
+## Notes
+- Abstract excerpts are auto-extracted from the first pages of each PDF; verify them against the source before quoting.
+- `--open` uses the OS default PDF viewer (Preview on macOS).
+- For richer metadata extraction, install the optional dependency:
+  `pip install -e '.[research]'`
--- a/docs/STATUS.md
+++ b/docs/STATUS.md
@@ -1,7 +1,7 @@
 # STATUS

 Stage: Prototype
-Now: config helpers; dataset registry builder; resource indexer; training sample model; validator base + initial validators; doc-section generator; pytest coverage.
+Now: config helpers; dataset registry builder; resource indexer; training sample model; validator base + initial validators; doc-section generator; research catalog CLI + PDF workflow docs; pytest coverage.
 Not yet: more generators; training runner; dataset QA reports.
 Next: add generator QA summary + manifest; wire generator outputs into AFS Studio.
 Issues: no training runtime yet.
--- a/docs/TRAINING_PLAN.md
+++ b/docs/TRAINING_PLAN.md
@@ -0,0 +1,48 @@
+# Training Plan (AFS Scawful)
+
+Scope: local-only training data pipelines and evaluation for AFS workflows.
+Research-only. See `../afs/docs/RESEARCH_SOURCES.md` for citations.
+
+## Goals
+- Keep datasets reproducible, small, and auditable.
+- Prioritize agentic filesystem primitives before model training complexity.
+- Use evaluation loops to avoid training on noise.
+
+## Phase 0 — Inventory + Research Catalog (now)
+- Use `python -m afs_scawful research catalog` to index `~/Documents/Research`.
+- Keep the catalog JSON in `~/src/context/index/research_catalog.json`.
+- Verify metadata/abstract excerpts before quoting. [R1]
+
+## Phase 1 — Dataset QA (near-term)
+- Expand dataset registry with QA summaries (counts, schema drift, invalid rows).
+- Define a minimal JSON schema for training samples.
+- Track provenance per dataset and per generator. [R1]
+
+## Phase 2 — Task Design (near-term)
+- Start with repo-level navigation tasks that assume a small tool surface. [R3]
+- Keep tasks focused on file discovery, symbol lookup, and context assembly.
+- Use small, deterministic datasets to validate task framing before scaling.
+
+## Phase 3 — Context Packaging (mid-term)
+- Treat training samples as explicit context pipelines with clear state and error
+  propagation. [R4]
+- Build a minimal "context transcript" format (inputs, tool calls, outputs).
+
+## Phase 4 — Evaluation (mid-term)
+- Add human+agent evaluation metrics to avoid overfitting to synthetic tasks. [R7]
+- Include tone-variant prompts as a controlled ablation (optional). [R6]
+
+## Phase 5 — Efficiency References (later)
+- Use MoE efficiency papers only when scaling becomes a bottleneck. [R5]
+
+## Unknown / needs verification
+- Which tasks best reflect AFS workflows (agentic filesystem vs orchestration).
+- Whether RL is needed or supervised data is sufficient for early stages.
+
+## Citations
+- [R1] `../afs/docs/RESEARCH_SOURCES.md`
+- [R3] `../afs/docs/RESEARCH_SOURCES.md`
+- [R4] `../afs/docs/RESEARCH_SOURCES.md`
+- [R5] `../afs/docs/RESEARCH_SOURCES.md`
+- [R6] `../afs/docs/RESEARCH_SOURCES.md`
+- [R7] `../afs/docs/RESEARCH_SOURCES.md`
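Phase 3 of `docs/TRAINING_PLAN.md` calls for a minimal "context transcript" format covering inputs, tool calls, and outputs, with clear error propagation. One possible shape is sketched below. The class and field names (`ToolCall`, `ContextTranscript`, `task_id`, `error`) are hypothetical and do not come from the repository.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    """One tool invocation inside a transcript (field names are hypothetical)."""
    tool: str
    arguments: dict
    output: str
    error: Optional[str] = None  # set when the call failed, so errors stay visible


@dataclass
class ContextTranscript:
    """Inputs, tool calls, and final outputs for one training sample."""
    task_id: str
    inputs: dict
    tool_calls: list = field(default_factory=list)
    outputs: dict = field(default_factory=dict)

    def failed_calls(self):
        """Return the tool calls that recorded an error."""
        return [call for call in self.tool_calls if call.error is not None]

    def to_json(self) -> str:
        # sort_keys keeps the serialized form deterministic, which helps diffing.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ContextTranscript":
        data = json.loads(text)
        data["tool_calls"] = [ToolCall(**call) for call in data["tool_calls"]]
        return cls(**data)
```

Keeping the error on each call rather than on the transcript as a whole means a sample with a failed lookup can still be stored and later filtered or audited, which fits the plan's goal of small, auditable datasets.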
--- a/docs/TRAINING_ROADMAP.md
+++ b/docs/TRAINING_ROADMAP.md
@@ -5,6 +5,7 @@ Scope: AFS Scawful training data pipelines and monitoring. Research-only.
 ## Committed (exists now)
 - Dataset registry indexing (local)
 - Resource indexing (local)
+- Research PDF catalog (local)
 - Plugin config loader for training paths/resources
 - Validator base + initial validators (ASM/C++/KG/ASAR)
 - Generator base + doc-section generator