Corpus seeding transforms a repository or documentation set into structured cognitive memory. Unlike RAG (retrieval-augmented generation), which fetches raw text at query time, corpus seeding produces actual beliefs, episodes, notes, and values — giving the agent genuine understanding rather than just search results.

Overview

The pipeline has two stages:
  1. Ingest — chunk source files into semantically meaningful raw entries
  2. Process — iteratively promote raw entries through the memory hierarchy until convergence
Source Files → Raw Entries → Episodes/Notes → Beliefs → Values
                 (seed)        (process exhaust)

Ingesting a Corpus

Source Code

kernle -s my-agent seed repo ./path/to/repo
The repo ingestor uses AST-based chunking for Python (extracting functions and classes as discrete units) and paragraph-based chunking for other languages. Each chunk becomes a raw entry tagged with file path, chunk type, and semantic name. Options:
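The ingestor's internals aren't shown here, but the AST-based strategy can be sketched with Python's standard `ast` module (function and dict-key names below are illustrative, not kernle's actual API):

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Sketch of AST-based chunking: each top-level function or class
    becomes one chunk; remaining module-level code is collected separately."""
    tree = ast.parse(source)
    chunks, module_level = [], []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "type": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                "text": ast.get_source_segment(source, node),
            })
        else:
            module_level.append(ast.get_source_segment(source, node))
    if module_level:
        # Module-level statements (imports, constants) form one final chunk.
        chunks.append({"type": "module", "name": "<module>",
                       "text": "\n".join(module_level)})
    return chunks
```

Each returned chunk carries the semantic name that ends up on the raw entry's tags.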
| Flag | Default | Description |
|---|---|---|
| `--extensions`, `-e` | `py,js,ts,jsx,tsx,go,rs,java,rb,c,cpp,h,hpp,cs,swift,kt,scala,sh,bash,zsh` | File extensions to include |
| `--exclude`, `-x` | (none) | fnmatch patterns to exclude (e.g., `*.test.*,vendor/*`) |
| `--max-chunk-size` | `2000` | Maximum characters per chunk |
| `--dry-run`, `-n` | off | Preview without creating entries |
| `--json`, `-j` | off | Output as JSON |
Example with filters:
kernle -s my-agent seed repo ./myproject \
  --extensions py,md \
  --exclude "tests/*,docs/archive/*" \
  --max-chunk-size 3000

Documentation

kernle -s my-agent seed docs ./path/to/docs
The docs ingestor supports Markdown (heading-based chunking), reStructuredText, plain text (paragraph-based), and PDF (page-based chunking via pdfminer.six or PyPDF2). Options:
| Flag | Default | Description |
|---|---|---|
| `--extensions`, `-e` | `md,txt,rst,pdf` | File extensions to include |
| `--max-chunk-size` | `2000` | Maximum characters per chunk |
| `--dry-run`, `-n` | off | Preview without creating entries |
| `--json`, `-j` | off | Output as JSON |

Chunking Strategies

| File Type | Strategy | Details |
|---|---|---|
| Python | AST-based | Extracts top-level functions and classes; module-level code collected separately |
| Markdown/RST | Heading-based | Splits on heading boundaries; preamble before first heading preserved |
| PDF | Page-based | Extracts text per page; large pages fall back to paragraph chunking |
| All others | Paragraph-based | Splits on double-newlines; merges until max chunk size |
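The paragraph-based fallback in the last row is the simplest strategy to illustrate. A minimal sketch (greedy merging, on the assumption that the size cap applies to character count as the `--max-chunk-size` flag suggests):

```python
def chunk_paragraphs(text: str, max_chunk_size: int = 2000) -> list[str]:
    """Split on blank lines, then greedily merge adjacent paragraphs
    until adding another would exceed the size cap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chunk_size:
            current = candidate          # still fits: keep merging
        else:
            if current:
                chunks.append(current)   # flush the full chunk
            current = para               # start a new one
    if current:
        chunks.append(current)
    return chunks
```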

Deduplication

Chunks are content-hash deduplicated. Re-running seed repo or seed docs on the same corpus skips already-ingested chunks, making it safe to re-run after adding new files.
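The idempotent re-run behavior follows directly from hashing chunk content. A minimal sketch (SHA-256 and the normalization step are assumptions; kernle's actual hash may differ):

```python
import hashlib

def content_hash(chunk_text: str) -> str:
    """Stable hash of a chunk's normalized text."""
    return hashlib.sha256(chunk_text.strip().encode("utf-8")).hexdigest()

def ingest(chunks: list[str], seen: set[str]) -> list[str]:
    """Ingest only chunks whose hash hasn't been seen; re-runs skip duplicates."""
    new = []
    for chunk in chunks:
        h = content_hash(chunk)
        if h not in seen:
            seen.add(h)
            new.append(chunk)
    return new
```

Running `ingest` twice over the same corpus ingests nothing the second time, which is why adding new files and re-running is safe.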

Check Ingestion Status

kernle -s my-agent seed status
Returns counts of corpus raw entries (total, repo, docs).

Processing to Exhaustion

After seeding, process the raw entries through the memory pipeline:
kernle -s my-agent process exhaust
This runs the processing pipeline in iterative cycles with escalating intensity:
| Cycles | Intensity | Transitions |
|---|---|---|
| 1–3 | Light | raw → episode, raw → note |
| 4–6 | Medium | + episode → belief, episode → goal, episode → relationship, episode → drive |
| 7+ | Heavy | + belief → value |
Processing continues until convergence (2 consecutive cycles with 0 new promotions) or the maximum cycle count is reached.
Processing requires an inference model (e.g., an Anthropic API key must be configured). Without inference, only basic note extraction and deduplication occur.
Options:
| Flag | Default | Description |
|---|---|---|
| `--max-cycles` | `20` | Maximum processing cycles (safety cap) |
| `--no-auto-promote` | off | Create suggestions instead of directly promoting |
| `--dry-run`, `-n` | off | Preview what would run |
| `--json`, `-j` | off | Output as JSON |
Safety: A checkpoint (pre-exhaust) is automatically created before the first cycle.
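The convergence rule described above (stop after 2 consecutive cycles with 0 new promotions, or at the `--max-cycles` cap) can be sketched as a simple loop; `run_cycle` stands in for kernle's per-cycle pipeline and is a hypothetical callable here:

```python
def process_exhaust(run_cycle, max_cycles: int = 20) -> int:
    """Run cycles until convergence or the safety cap; return the cycle
    count at which processing stopped. run_cycle(n) returns the number
    of promotions made in cycle n."""
    zero_streak = 0
    for cycle in range(1, max_cycles + 1):
        promoted = run_cycle(cycle)
        zero_streak = zero_streak + 1 if promoted == 0 else 0
        if zero_streak >= 2:
            return cycle  # converged: two quiet cycles in a row
    return max_cycles     # safety cap reached
```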

Complete Example

# Create a fresh stack
kernle -s project-expert status

# Seed from repo and docs
kernle -s project-expert seed repo ./my-project --extensions py,ts
kernle -s project-expert seed docs ./my-project/docs

# Check what was ingested
kernle -s project-expert seed status

# Process everything
export ANTHROPIC_API_KEY=sk-...
kernle -s project-expert process exhaust

# Inspect the result
kernle -s project-expert status
kernle -s project-expert anxiety

Inspecting Results

After seeding and processing, use the dev dashboard to visually inspect the memory stack:
python dev/dashboard.py --stack project-expert
Or use CLI commands:
# Memory counts
kernle -s project-expert status

# List raw entries
kernle -s project-expert raw list --limit 20

# List beliefs
kernle -s project-expert belief list

# Check anxiety (pipeline health)
kernle -s project-expert anxiety

Provenance

All promoted memories maintain full provenance chains:
Raw Entry (corpus chunk) → Episode/Note → Belief → Value
Every memory records derived_from references back to its source. The dev dashboard shows provenance chains for any memory — click a row and check the “Derived from” and “Derived memories (children)” sections.
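Conceptually, following `derived_from` references back to the corpus chunk is a linked-list walk. A sketch under the assumption that each memory points at one parent (the flat dict here is illustrative; kernle stores these references on the memories themselves):

```python
def provenance_chain(memory_id: str, derived_from: dict) -> list[str]:
    """Walk derived_from links from a memory back to its corpus chunk.
    A memory whose parent is None is an original raw entry."""
    chain = [memory_id]
    while derived_from.get(chain[-1]) is not None:
        chain.append(derived_from[chain[-1]])
    return chain
```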

Directory Exclusions

The following directories are always excluded from corpus ingestion: .git, __pycache__, node_modules, .venv, venv, .tox, .mypy_cache, .pytest_cache, .ruff_cache, dist, build, .eggs, *.egg-info