Metadata

Metadata Philosophy

Metadata is not decoration. Metadata is the minimum structure required for the system to understand what something is, where it belongs, how it relates, and whether it can be trusted.

Metadata must be:

  • Incremental: built up over the pipeline, not all at once
  • Auditable: every change is traceable
  • Structured: stored in canonical fields, not free-text blobs
  • Attached to canonical identity: orphaned metadata is useless
  • Preserved through movement and transformation
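The incremental and auditable properties above can be illustrated with a minimal sketch. The `MetadataRecord` class, its `set_field` method, and the shape of the audit entries are hypothetical names chosen for illustration; they are not taken from the actual system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class MetadataRecord:
    """Hypothetical record: fields accumulate stage by stage and every change is logged."""

    canonical_id: str
    fields: dict[str, Any] = field(default_factory=dict)
    audit_log: list[dict[str, Any]] = field(default_factory=list)

    def set_field(self, name: str, value: Any, changed_by: str) -> None:
        # Log the old and new values before writing, so each change stays traceable.
        self.audit_log.append({
            "field": name,
            "old": self.fields.get(name),
            "new": value,
            "changed_by": changed_by,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.fields[name] = value


# Each pipeline stage adds only what it knows (stage names are illustrative).
record = MetadataRecord(canonical_id="example-id")
record.set_field("mime_type", "application/pdf", changed_by="parser")
record.set_field("document_type", "contract", changed_by="classifier")
```

Keeping the log on the record that carries `canonical_id` is one way to ensure the audit trail never becomes orphaned from identity.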

Minimum Metadata Classes

Identity Metadata

| Field | Description |
|---|---|
| canonical_id | UUID or ULID; the machine truth |
| domain_prefix | Namespace grouping (e.g. BBR4821) |
| short_code | Q + 6 hex digits; human-visible code |
| checksum | SHA-256 fingerprint |
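As a rough sketch, the identity fields could be produced as follows. The function names are hypothetical. Deriving `short_code` from random bytes is an assumption, since the text only specifies its shape (Q plus six hex digits), and a UUID is used here although a ULID would be equally valid.

```python
import hashlib
import secrets
import uuid


def sha256_checksum(data: bytes) -> str:
    """Return the SHA-256 fingerprint of the content as a hex string."""
    return hashlib.sha256(data).hexdigest()


def new_short_code() -> str:
    # Assumption: six random hex digits; only the "Q + 6 hex" shape comes from the text.
    return "Q" + secrets.token_hex(3)


def new_identity(data: bytes, domain_prefix: str) -> dict[str, str]:
    """Build the identity metadata for a newly registered item."""
    return {
        "canonical_id": str(uuid.uuid4()),  # a ULID would also satisfy the field
        "domain_prefix": domain_prefix,
        "short_code": new_short_code(),
        "checksum": sha256_checksum(data),
    }


identity = new_identity(b"hello", "BBR4821")
```

A random six-digit code can collide, so a real registry would presumably check uniqueness before assigning it.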

Source Metadata

| Field | Description |
|---|---|
| source_type | How it entered (watcher, upload, sync) |
| source_path | Where it came from |
| origin_event | What triggered ingest |
| original_filename | Pre-normalization name |
| imported_by | User or system that imported |
| ingest_timestamp | When registration occurred |

Structural Metadata

| Field | Description |
|---|---|
| mime_type | Detected MIME type |
| extension | Lowercase file extension |
| chunk_count | Number of chunks generated |
| page_count | Pages if applicable |
| parser_method | Which parser was used |
| extraction_method | How text was obtained |
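The two simplest structural fields can be sketched with the standard library. This is only an illustration: `basic_structural_metadata` is a hypothetical helper, `mimetypes.guess_type` infers the type from the filename rather than inspecting content (the real pipeline may detect it differently), and storing the extension without its leading dot is an assumption.

```python
import mimetypes
from pathlib import Path
from typing import Optional


def basic_structural_metadata(filename: str) -> dict[str, Optional[str]]:
    """Derive mime_type and extension from a filename alone."""
    path = Path(filename)
    # Filename-based guess only; returns None for the type when it is unknown.
    mime_type, _encoding = mimetypes.guess_type(path.name)
    return {
        "mime_type": mime_type,
        # Assumption: lowercase and without the leading dot.
        "extension": path.suffix.lower().lstrip("."),
    }


meta = basic_structural_metadata("Report.PDF")
```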

Semantic Metadata

| Field | Description |
|---|---|
| document_type | Inferred type (tax return, contract, etc.) |
| inferred_entities | Extracted entities |
| tax_year | If applicable |
| matter_or_case | Associated case if applicable |
| tags | Classification tags |
| confidence | AI classification confidence score |

Lifecycle Metadata

| Field | Description |
|---|---|
| status | Current pipeline state |
| review_state | Awaiting, reviewed, confirmed |
| route_state | Suggested, confirmed, overridden |
| storage_path | Current physical location |
| created_at | Creation timestamp |
| updated_at | Last modified timestamp |

Automation Gate

Minimum metadata required before any automation proceeds: canonical_id, domain_prefix, short_code, source_path, ingest_timestamp, mime_type, status.

If these fields are absent, the pipeline must halt and flag the record as incomplete.
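A minimal sketch of such a gate, assuming records are plain dictionaries: the required field list comes from the text above, while the function names, the exception type, and the choice to treat `None` and empty strings as absent are illustrative assumptions.

```python
REQUIRED_FIELDS = (
    "canonical_id",
    "domain_prefix",
    "short_code",
    "source_path",
    "ingest_timestamp",
    "mime_type",
    "status",
)


class IncompleteRecordError(Exception):
    """Hypothetical error raised when a record fails the automation gate."""

    def __init__(self, missing: list[str]) -> None:
        super().__init__(f"record incomplete, missing: {', '.join(missing)}")
        self.missing = missing


def missing_required_fields(record: dict) -> list[str]:
    # Assumption: None and empty strings count as absent.
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


def automation_gate(record: dict) -> None:
    """Halt (by raising) unless every required field is present."""
    missing = missing_required_fields(record)
    if missing:
        raise IncompleteRecordError(missing)
```

A caller would run `automation_gate(record)` before each automated step and, on `IncompleteRecordError`, use the `missing` attribute to flag the record as incomplete.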

Content Metadata Profile (Front Matter)

For flat files, notes, and unstructured assets, an optional front matter block may be applied. The specification is strictly governed by standards/content_metadata_profile.yaml.

CRITICAL DOCTRINE: Front matter is Supportive Metadata. It is NEVER the system of record.

  • Canonical identity comes from QiArchive.
  • Schema placement comes from the Pipeline.
  • Node topology comes from the Graph.

A document cannot change its database relationships purely by altering its front matter tags.

Common supportive fields defined in the profile include:

  • status: Content-level draft/review state.
  • sensitivity / classification: Governed by registry/sensitivity_classification.yaml.
  • realm_label: An optional UI/UX workspace filter (e.g., personal, work, legal) governed by registry/workspace_realms.yaml. Realms do not dictate physical placement, tenant isolation, or schema mapping.
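The supportive-only rule can be sketched as a merge that never lets front matter touch canonical fields. The names below are hypothetical, the front matter is assumed to be parsed into a dictionary already, and the allowed set lists only the fields named above; the authoritative list lives in standards/content_metadata_profile.yaml.

```python
# Assumption: only the supportive fields named in this section.
SUPPORTIVE_FIELDS = {"status", "sensitivity", "classification", "realm_label"}


def apply_front_matter(record: dict, front_matter: dict) -> dict:
    """Return a copy of record with supportive front matter kept under its own key."""
    supportive = {
        key: value
        for key, value in front_matter.items()
        if key in SUPPORTIVE_FIELDS
    }
    updated = dict(record)
    # Stored separately so the content-level status never overwrites the pipeline status.
    updated["front_matter"] = supportive
    return updated


record = {"canonical_id": "abc", "status": "registered"}
front_matter = {"canonical_id": "other", "status": "draft", "realm_label": "work"}
merged = apply_front_matter(record, front_matter)
```

Here the `canonical_id` in the front matter is ignored, and the draft status is recorded alongside the pipeline status instead of replacing it.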