QiIngest Pipeline
Purpose
QiIngest Pipeline is the future local-first ingestion workflow for documents and document-like artifacts. This page is architecture and planning only. It does not claim that the full pipeline already exists in this repo.
Goal
Build a controlled local-first path that can:
- watch selected folders
- hash files
- detect duplicates
- import documents into Paperless
- extract text and OCR results
- classify documents by content and context
- propose cleaned filenames
- propose metadata, tags, correspondents, and document types
- update a file inventory
- embed document text into a vector store such as Qdrant
- optionally map entities and relationships into Neo4j
- create an approval queue for Cody
- update routing rules based on Cody-approved corrections
- improve routing accuracy over time
Current Repo Reality
What Exists
- Paperless is already treated in repo docs as the first real ingestion target.
docs/10_qicore/50_operations/20_qiserver/_index.mddocuments the Paperless 10-document test rule before bulk import.docs/10_qicore/50_operations/30_dev_history/2026-05-17_open_loop_reset_paperless_ingestion_runbook.mdoutlines a controlled staging and manifest concept for Paperless.docs/30_qiarchive/10_ingestion/,20_extraction/,40_embeddings/, and50_graphs/contain placeholder planning folders.
What Does Not Exist In This Repo
- no committed folder watcher for QiIngest
- no committed duplicate-detector for QiIngest
- no committed Paperless staging script for QiIngest
- no committed OCR or text extraction worker for QiIngest
- no committed classifier or filename-normalizer for QiIngest
- no committed inventory database or inventory writer for QiIngest
- no committed Qdrant writer for QiIngest
- no committed Neo4j mapper for QiIngest
- no committed approval queue implementation for QiIngest
Important Non-Match To Avoid
src/pages/api/hash.jsis a Homepage config hash endpoint, not a file-ingest hashing pipeline.
Proposed Pipeline Stages
Stage 1: Watch And Register
Input: - approved local folders - manual drop zones - optionally exported document batches
Actions: - detect new or changed files - compute stable hash values - record source path, timestamps, size, and file type - check duplicate history before deeper processing
Output: - inventory registration record - staging decision: new, duplicate, skip, manual review
Stage 2: Paperless Intake
Actions: - stage safe copies into Paperless consume flow - keep originals untouched until verified - follow the 10-document maximum proof step before bulk import
Output: - Paperless import receipt - initial document identity linkage back to the local inventory
Stage 3: OCR And Text Extraction
Actions: - capture OCR text from scans - extract text from native digital documents when available - preserve raw extracted text plus cleaned text
Output: - raw text - cleaned text - extraction status
Stage 4: Classification And Naming Suggestions
Actions: - classify by content, source context, and historical routing patterns - propose cleaned filenames - propose tags, correspondents, document types, and routing targets
Output: - suggestion set, not silent auto-rename
Stage 5: Inventory And Derived Layers
Actions: - update the file inventory with canonical facts - write embeddings to Qdrant or another approved vector store - optionally create Neo4j entity and relationship projections
Rule: - inventory facts stay canonical - vectors and graph remain derived
Stage 6: Approval Queue And Learning Loop
Actions: - queue proposed corrections for Cody review - record Cody-approved edits - update routing rules from approved corrections only - keep audit trail of changed rules and their source decisions
Suggested Data Artifacts
| Artifact | Purpose |
|---|---|
| file inventory record | canonical record of source file and ingest state |
| content hash | duplicate detection and lineage |
| Paperless receipt | proof of import or failure |
| OCR text | raw extraction output |
| cleaned text | downstream classification/search input |
| suggestion payload | proposed filename, tags, document type, correspondent, route |
| approval decision record | Cody review outcome |
| routing rule update receipt | record of what learning rule changed and why |
Verified Facts
- The repo already documents Paperless as the first real ingest lane.
- The repo already documents vector and graph layers as downstream/derived concepts.
- No full QiIngest implementation scripts are committed here today.
- Existing related tooling in this repo is limited to documentation audit and Homepage config-map generation.
Assumptions
- Local-first means the first detection, hashing, staging, and approval loop should start from Cody-controlled folders or drop zones.
- Paperless should be the first durable document-ingest system before vector or graph expansion.
- Qdrant and Neo4j should remain optional downstream layers, not required first milestones.
Unknowns
- Which local folders should be watched first.
- Where the inventory should live.
- Which approval queue surface Cody wants to use.
- Whether filename cleanup should happen before or after Paperless import.
- Whether routing rules should stay file-based, database-backed, or both.
Needs Cody Confirmation
- Initial watch-folder set.
- Preferred approval queue surface.
- Preferred inventory location.
- Whether Paperless should stay the first mandatory destination for all document-like material or only a subset.
Repo-Only Content
- Exact source references to current runbook docs and placeholder archive folders.
- Warning that
src/pages/api/hash.jsis unrelated to QiIngest. - Local implementation gap inventory.
Wiki.js-Ready Content
- Stage-by-stage ingest overview.
- Manual-approval rule.
- Canonical-versus-derived rule.
- Paperless-first validation rule.
Future Automation Candidates
- folder watcher
- duplicate detector
- Paperless staging script
- OCR/extraction worker
- classifier and metadata suggester
- inventory writer
- Qdrant writer
- optional Neo4j mapper
- approval queue UI or CLI
- correction-to-routing-rule updater