Corpus seeding transforms a repository or documentation set into structured cognitive memory. Unlike RAG (retrieval-augmented generation), which fetches raw text at query time, corpus seeding produces actual beliefs, episodes, notes, and values — giving the agent genuine understanding rather than just search results.
## Overview
The pipeline has two stages:
- Ingest — chunk source files into semantically meaningful raw entries
- Process — iteratively promote raw entries through the memory hierarchy until convergence
Source Files → (seed) → Raw Entries → (process exhaust) → Episodes/Notes → Beliefs → Values
## Ingesting a Corpus
### Source Code
kernle -s my-agent seed repo ./path/to/repo
The repo ingestor uses AST-based chunking for Python (extracting functions and classes as discrete units) and paragraph-based chunking for other languages. Each chunk becomes a raw entry tagged with file path, chunk type, and semantic name.
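The AST-based strategy can be sketched in a few lines with Python's standard `ast` module. This is a hypothetical illustration (the function name and chunk dictionary shape are assumptions, not the ingestor's actual API): each top-level function or class becomes one chunk, carrying its semantic name.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function/class.

    Hypothetical sketch; the real ingestor's API and chunk schema may differ.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+)
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({"type": type(node).__name__, "name": node.name, "text": text})
    return chunks
```

Module-level code (anything outside the extracted definitions) would be collected into its own chunk in the real pipeline.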
Options:
| Flag | Default | Description |
| --- | --- | --- |
| `--extensions, -e` | `py,js,ts,jsx,tsx,go,rs,java,rb,c,cpp,h,hpp,cs,swift,kt,scala,sh,bash,zsh` | File extensions to include |
| `--exclude, -x` | (none) | fnmatch patterns to exclude (e.g., `*.test.*,vendor/*`) |
| `--max-chunk-size` | `2000` | Maximum characters per chunk |
| `--dry-run, -n` | off | Preview without creating entries |
| `--json, -j` | off | Output as JSON |
Example with filters:
kernle -s my-agent seed repo ./myproject \
--extensions py,md \
--exclude "tests/*,docs/archive/*" \
--max-chunk-size 3000
### Documentation
kernle -s my-agent seed docs ./path/to/docs
The docs ingestor supports Markdown (heading-based chunking), reStructuredText, plain text (paragraph-based), and PDF (page-based chunking via pdfminer.six or PyPDF2).
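Heading-based chunking for Markdown is straightforward to sketch: start a new chunk at each heading line and keep any preamble before the first heading as its own chunk. This is an assumed implementation for illustration, not the ingestor's actual code.

```python
import re


def chunk_markdown(text: str) -> list[str]:
    """Split Markdown into chunks at heading boundaries (sketch).

    Any preamble before the first heading becomes its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        # An ATX heading (1-6 '#' then whitespace) starts a new chunk
        if re.match(r"#{1,6}\s", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```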
Options:
| Flag | Default | Description |
| --- | --- | --- |
| `--extensions, -e` | `md,txt,rst,pdf` | File extensions to include |
| `--max-chunk-size` | `2000` | Maximum characters per chunk |
| `--dry-run, -n` | off | Preview without creating entries |
| `--json, -j` | off | Output as JSON |
### Chunking Strategies
| File Type | Strategy | Details |
| --- | --- | --- |
| Python | AST-based | Extracts top-level functions and classes; module-level code collected separately |
| Markdown/RST | Heading-based | Splits on heading boundaries; preamble before first heading preserved |
| PDF | Page-based | Extracts text per page; large pages fall back to paragraph chunking |
| All others | Paragraph-based | Splits on double-newlines; merges until max chunk size |
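The fallback paragraph strategy (split on double newlines, then merge until the size cap) can be sketched as follows. The function name is hypothetical; only the split-and-merge behavior described in the table is assumed.

```python
def chunk_paragraphs(text: str, max_chunk_size: int = 2000) -> list[str]:
    """Split on blank lines, then greedily merge paragraphs until adding
    the next one would exceed max_chunk_size. Hypothetical sketch."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for p in paras:
        # +2 accounts for the "\n\n" separator reinserted when merging
        if current and len(current) + len(p) + 2 > max_chunk_size:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```

Note that a single paragraph longer than `max_chunk_size` still becomes one chunk under this sketch; a real implementation might split it further.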
### Deduplication
Chunks are content-hash deduplicated. Re-running seed repo or seed docs on the same corpus skips already-ingested chunks, making it safe to re-run after adding new files.
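Content-hash deduplication amounts to hashing each chunk and skipping hashes already seen. A minimal sketch, assuming SHA-256 over whitespace-stripped text (the actual hash function and normalization are not specified here):

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Content hash for a chunk (sketch; the real hash may differ)."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def dedupe(chunks: list[str], seen: set[str]) -> list[str]:
    """Return only chunks whose hash is not already in `seen`,
    updating `seen` in place — re-running is then a no-op for old chunks."""
    fresh = []
    for c in chunks:
        h = chunk_hash(c)
        if h not in seen:
            seen.add(h)
            fresh.append(c)
    return fresh
```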
### Check Ingestion Status
kernle -s my-agent seed status
Returns counts of corpus raw entries (total, repo, docs).
## Processing to Exhaustion
After seeding, process the raw entries through the memory pipeline:
kernle -s my-agent process exhaust
This runs the processing pipeline in iterative cycles with escalating intensity:
| Cycles | Intensity | Transitions |
| --- | --- | --- |
| 1-3 | Light | raw → episode, raw → note |
| 4-6 | Medium | + episode → belief, episode → goal, episode → relationship, episode → drive |
| 7+ | Heavy | + belief → value |
Processing continues until convergence (2 consecutive cycles with 0 new promotions) or the maximum cycle count is reached.
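The control flow above (escalating intensity, convergence after two idle cycles, hard cap) can be sketched as a simple loop. Everything here is illustrative: `run_cycle` stands in for one pass of the real pipeline and is assumed to return the number of promotions it made.

```python
def process_exhaust(run_cycle, max_cycles: int = 20) -> int:
    """Run cycles with escalating intensity until convergence or the cap.

    Convergence = two consecutive cycles with zero promotions.
    Returns the number of cycles actually run. Hypothetical sketch.
    """
    idle = 0
    for cycle in range(1, max_cycles + 1):
        # Cycles 1-3 light, 4-6 medium, 7+ heavy (per the table above)
        intensity = "light" if cycle <= 3 else "medium" if cycle <= 6 else "heavy"
        promoted = run_cycle(intensity)
        idle = idle + 1 if promoted == 0 else 0
        if idle >= 2:
            return cycle
    return max_cycles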
Processing requires an inference model (e.g., Anthropic API key configured). Without inference, only basic note extraction and deduplication occur.
Options:
| Flag | Default | Description |
| --- | --- | --- |
| `--max-cycles` | `20` | Maximum processing cycles (safety cap) |
| `--no-auto-promote` | off | Create suggestions instead of directly promoting |
| `--dry-run, -n` | off | Preview what would run |
| `--json, -j` | off | Output as JSON |
Safety: a `pre-exhaust` checkpoint is automatically created before the first cycle.
## Complete Example
# Create a fresh stack
kernle -s project-expert status
# Seed from repo and docs
kernle -s project-expert seed repo ./my-project --extensions py,ts
kernle -s project-expert seed docs ./my-project/docs
# Check what was ingested
kernle -s project-expert seed status
# Process everything
export ANTHROPIC_API_KEY=sk-...
kernle -s project-expert process exhaust
# Inspect the result
kernle -s project-expert status
kernle -s project-expert anxiety
## Inspecting Results
After seeding and processing, use the dev dashboard to visually inspect the memory stack:
python dev/dashboard.py --stack project-expert
Or use CLI commands:
# Memory counts
kernle -s project-expert status
# List raw entries
kernle -s project-expert raw list --limit 20
# List beliefs
kernle -s project-expert belief list
# Check anxiety (pipeline health)
kernle -s project-expert anxiety
## Provenance
All promoted memories maintain full provenance chains:
Raw Entry (corpus chunk) → Episode/Note → Belief → Value
Every memory records `derived_from` references back to its source. The dev dashboard shows provenance chains for any memory — click a row and check the "Derived from" and "Derived memories (children)" sections.
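Walking a provenance chain is just following `derived_from` links until you reach the original corpus chunk. A minimal sketch, assuming a single parent per memory (real memories may have several) and modeling the links as a plain dict for illustration:

```python
def provenance_chain(memory_id: str, derived_from: dict) -> list:
    """Walk derived_from links from a memory back to its corpus chunk.

    `derived_from` maps each memory id to its parent id (None = root).
    Sketch only; the real store may allow multiple parents per memory.
    """
    chain = [memory_id]
    while (parent := derived_from.get(chain[-1])) is not None:
        chain.append(parent)
    return chain
```

For example, a value promoted from a belief that came from an episode over a raw chunk yields a four-link chain ending at the chunk.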
## Directory Exclusions
The following directories are always excluded from corpus ingestion:
`.git`, `__pycache__`, `node_modules`, `.venv`, `venv`, `.tox`, `.mypy_cache`, `.pytest_cache`, `.ruff_cache`, `dist`, `build`, `.eggs`, `*.egg-info`