Corpus seeding transforms a repository or documentation set into structured cognitive memory. Unlike RAG (retrieval-augmented generation), which fetches raw text at query time, corpus seeding produces actual beliefs, episodes, notes, and values — giving the agent genuine understanding rather than just search results.

Overview

The pipeline has two stages:
  1. Ingest — chunk source files into semantically meaningful raw entries
  2. Process — iteratively promote raw entries through the memory hierarchy until convergence
Source Files → Raw Entries → Episodes/Notes → Beliefs → Values
                 (seed)        (process exhaust)

Ingesting a Corpus

Source Code

kernle -s my-agent seed repo ./path/to/repo
The repo ingestor uses AST-based chunking for Python (extracting functions and classes as discrete units) and paragraph-based chunking for other languages. Each chunk becomes a raw entry tagged with file path, chunk type, and semantic name. Options:
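The ingestor's internals aren't shown here, but the AST-based strategy can be sketched with Python's standard `ast` module (function and dict-key names below are illustrative, not kernle's actual API):

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Sketch of AST-based chunking: each top-level function or class
    becomes one chunk; remaining module-level code is collected separately."""
    tree = ast.parse(source)
    chunks, module_level = [], []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "type": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                "text": ast.get_source_segment(source, node),
            })
        else:
            module_level.append(ast.get_source_segment(source, node))
    if module_level:
        # Module-level statements (imports, constants) form one final chunk.
        chunks.append({"type": "module", "name": "<module>",
                       "text": "\n".join(module_level)})
    return chunks
```

Each returned chunk carries the semantic name that ends up on the raw entry's tags.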
| Flag | Default | Description |
|---|---|---|
| `--extensions`, `-e` | `py,js,ts,jsx,tsx,go,rs,java,rb,c,cpp,h,hpp,cs,swift,kt,scala,sh,bash,zsh` | File extensions to include |
| `--exclude`, `-x` | (none) | fnmatch patterns to exclude (e.g., `*.test.*,vendor/*`) |
| `--max-chunk-size` | `2000` | Maximum characters per chunk |
| `--dry-run`, `-n` | off | Preview without creating entries |
| `--json`, `-j` | off | Output as JSON |
Example with filters:
kernle -s my-agent seed repo ./myproject \
  --extensions py,md \
  --exclude "tests/*,docs/archive/*" \
  --max-chunk-size 3000

Documentation

kernle -s my-agent seed docs ./path/to/docs
The docs ingestor supports Markdown (heading-based chunking), reStructuredText, plain text (paragraph-based), and PDF (page-based chunking via pdfminer.six or PyPDF2). Options:
| Flag | Default | Description |
|---|---|---|
| `--extensions`, `-e` | `md,txt,rst,pdf` | File extensions to include |
| `--max-chunk-size` | `2000` | Maximum characters per chunk |
| `--dry-run`, `-n` | off | Preview without creating entries |
| `--json`, `-j` | off | Output as JSON |

Chunking Strategies

| File Type | Strategy | Details |
|---|---|---|
| Python | AST-based | Extracts top-level functions and classes; module-level code collected separately |
| Markdown/RST | Heading-based | Splits on heading boundaries; preamble before first heading preserved |
| PDF | Page-based | Extracts text per page; large pages fall back to paragraph chunking |
| All others | Paragraph-based | Splits on double-newlines; merges until max chunk size |
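The paragraph-based fallback in the last row is the simplest strategy to illustrate. A minimal sketch (greedy merging, on the assumption that the size cap applies to character count as the `--max-chunk-size` flag suggests):

```python
def chunk_paragraphs(text: str, max_chunk_size: int = 2000) -> list[str]:
    """Split on blank lines, then greedily merge adjacent paragraphs
    until adding another would exceed the size cap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chunk_size:
            current = candidate          # still fits: keep merging
        else:
            if current:
                chunks.append(current)   # flush the full chunk
            current = para               # start a new one
    if current:
        chunks.append(current)
    return chunks
```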

Deduplication

Chunks are content-hash deduplicated. Re-running seed repo or seed docs on the same corpus skips already-ingested chunks, making it safe to re-run after adding new files.
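The idempotent re-run behavior follows directly from hashing chunk content. A minimal sketch (SHA-256 and the normalization step are assumptions; kernle's actual hash may differ):

```python
import hashlib

def content_hash(chunk_text: str) -> str:
    """Stable hash of a chunk's normalized text."""
    return hashlib.sha256(chunk_text.strip().encode("utf-8")).hexdigest()

def ingest(chunks: list[str], seen: set[str]) -> list[str]:
    """Ingest only chunks whose hash hasn't been seen; re-runs skip duplicates."""
    new = []
    for chunk in chunks:
        h = content_hash(chunk)
        if h not in seen:
            seen.add(h)
            new.append(chunk)
    return new
```

Running `ingest` twice over the same corpus ingests nothing the second time, which is why adding new files and re-running is safe.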

Check Ingestion Status

kernle -s my-agent seed status
Returns counts of corpus raw entries (total, repo, docs).

Processing to Exhaustion

After seeding, process the raw entries through the memory pipeline:
kernle -s my-agent process exhaust
This runs the processing pipeline in iterative cycles with escalating intensity:
| Cycles | Intensity | Transitions |
|---|---|---|
| 1–3 | Light | raw → episode, raw → note |
| 4–6 | Medium | + episode → belief, episode → goal, episode → relationship, episode → drive |
| 7+ | Heavy | + belief → value |
Processing continues until convergence (2 consecutive cycles with 0 new promotions) or the maximum cycle count is reached.
Processing requires an inference model (e.g., an Anthropic API key must be configured). Without inference, only basic note extraction and deduplication occur.
Options:
| Flag | Default | Description |
|---|---|---|
| `--max-cycles` | `20` | Maximum processing cycles (safety cap) |
| `--no-auto-promote` | off | Create suggestions instead of directly promoting |
| `--dry-run`, `-n` | off | Preview what would run |
| `--json`, `-j` | off | Output as JSON |
Safety: A checkpoint (pre-exhaust) is automatically created before the first cycle.
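The convergence rule described above (stop after 2 consecutive cycles with 0 new promotions, or at the `--max-cycles` cap) can be sketched as a simple loop; `run_cycle` stands in for kernle's per-cycle pipeline and is a hypothetical callable here:

```python
def process_exhaust(run_cycle, max_cycles: int = 20) -> int:
    """Run cycles until convergence or the safety cap; return the cycle
    count at which processing stopped. run_cycle(n) returns the number
    of promotions made in cycle n."""
    zero_streak = 0
    for cycle in range(1, max_cycles + 1):
        promoted = run_cycle(cycle)
        zero_streak = zero_streak + 1 if promoted == 0 else 0
        if zero_streak >= 2:
            return cycle  # converged: two quiet cycles in a row
    return max_cycles     # safety cap reached
```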

Complete Example

# Create a fresh stack
kernle -s project-expert status

# Seed from repo and docs
kernle -s project-expert seed repo ./my-project --extensions py,ts
kernle -s project-expert seed docs ./my-project/docs

# Check what was ingested
kernle -s project-expert seed status

# Process everything
export ANTHROPIC_API_KEY=sk-...
kernle -s project-expert process exhaust

# Inspect the result
kernle -s project-expert status
kernle -s project-expert anxiety

Inspecting Results

After seeding and processing, use the dev dashboard to visually inspect the memory stack:
python dev/dashboard.py --stack project-expert
Or use CLI commands:
# Memory counts
kernle -s project-expert status

# List raw entries
kernle -s project-expert raw list --limit 20

# List beliefs
kernle -s project-expert belief list

# Check anxiety (pipeline health)
kernle -s project-expert anxiety

Provenance

All promoted memories maintain full provenance chains:
Raw Entry (corpus chunk) → Episode/Note → Belief → Value
Every memory records derived_from references back to its source. The dev dashboard shows provenance chains for any memory — click a row and check the “Derived from” and “Derived memories (children)” sections.
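Conceptually, following `derived_from` references back to the corpus chunk is a linked-list walk. A sketch under the assumption that each memory points at one parent (the flat dict here is illustrative; kernle stores these references on the memories themselves):

```python
def provenance_chain(memory_id: str, derived_from: dict) -> list[str]:
    """Walk derived_from links from a memory back to its corpus chunk.
    A memory whose parent is None is an original raw entry."""
    chain = [memory_id]
    while derived_from.get(chain[-1]) is not None:
        chain.append(derived_from[chain[-1]])
    return chain
```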

Directory Exclusions

The following directories are always excluded from corpus ingestion: .git, __pycache__, node_modules, .venv, venv, .tox, .mypy_cache, .pytest_cache, .ruff_cache, dist, build, .eggs, *.egg-info