CodeSmriti is a semantic code memory system that indexes GitHub repositories into a vector database for natural language retrieval. The system employs a bottom-up aggregation pipeline: symbols are summarized first, then aggregated into file summaries, folder modules, and finally repository overviews. Hybrid chunking combines tree-sitter AST parsing with LLM-assisted semantic chunking for languages that resist static analysis (Svelte, embedded SQL, templating). Integration with Claude Desktop and VSCode is provided via the Model Context Protocol (MCP). This brief documents the architecture, chunking strategies, and performance characteristics.
1. Problem Statement
Engineering teams accumulate solutions across dozens of repositories over years of development. Without institutional memory, organizations face compounding inefficiencies:
| Finding | Impact | Source |
|---|---|---|
| 61% of developers spend >30 min/day searching | ~100 hours/year per developer | [1] |
| $47M/year lost per large enterprise | Knowledge silos, duplication | [2] |
| 3-9 months to full productivity | Slow onboarding | [3] |
| 42% of knowledge unique to individuals | Risk when engineers depart | [2] |
Traditional code search (grep, GitHub search) operates on keywords, not semantics. Questions like "How did we solve rate limiting?" or "What's our authentication pattern?" require understanding intent, not matching strings.
2. Architecture
Components
- Ingestion Pipeline: Bottom-up processor (tree-sitter parsing → LLM chunking → symbol summarization → file aggregation → module aggregation → repo summary)
- Vector Database: Couchbase with FTS vector search index (768 dimensions)
- Tool Layer: Shared implementations for all RAG operations (search, explore, file retrieval)
- API Server: FastAPI providing RAG endpoints with hierarchical search (symbol/file/module/repo levels)
- MCP Server: Model Context Protocol for Claude Code—direct tool access without intermediate LLM
- PydanticAI Agent: LLM-driven mode for local models (LMStudio, Ollama)—LLM orchestrates tool calls
Key Technologies
| Component | Technology | Rationale |
|---|---|---|
| Parsing | tree-sitter | 40+ languages, incremental, error-tolerant [4] |
| LLM Enrichment | Local LLM (Qwen/Llama) | Summary generation, semantic chunking, aggregation |
| Embeddings | nomic-embed-text-v1.5 | 768-dim, 8192 token context, local inference |
| Vector DB | Couchbase | Hybrid FTS + vector, multi-tenant, production-ready |
| Integration | MCP | Standard protocol for LLM tool use |
3. Chunking Strategy
Research confirms that naive chunking methods struggle with code: "Naive chunking methods struggle with accurately delineating meaningful segments of code, leading to issues with boundary definition and the inclusion of irrelevant or incomplete information" [5].
CodeSmriti employs a hybrid approach: AST-aware chunking via tree-sitter as the primary method, with LLM-assisted semantic chunking as a fallback for files that resist static analysis. This has been shown to improve Recall@5 by 4.3 points on code retrieval benchmarks [6].
Core Principle: Bottom-Up Aggregation
Summaries are built from the ground up: individual symbols (functions, classes) are summarized first, then aggregated into file-level summaries, then into folder-based module summaries, and finally into a repository overview. This ensures every level has meaningful, accurate context rather than relying on top-down inference.
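For illustration, the aggregation order can be sketched as follows; the function names and the placeholder LLM call are assumptions, not CodeSmriti's actual implementation.
```python
from collections import defaultdict

# Illustrative sketch only: summarize_with_llm stands in for the local LLM call;
# the real pipeline's function names and prompts are not part of this brief.
def summarize_with_llm(text: str) -> str:
    return f"<summary of {len(text)} chars>"  # placeholder for an LLM request

def build_repo_summary(symbols_by_file: dict[str, list[str]],
                       folder_of: dict[str, str]) -> str:
    """Bottom-up aggregation: symbol -> file -> module (folder) -> repo."""
    # 1. Summarize each symbol, then aggregate into a file summary
    file_summaries: dict[str, str] = {}
    for path, symbol_sources in symbols_by_file.items():
        symbol_summaries = [summarize_with_llm(src) for src in symbol_sources]
        file_summaries[path] = summarize_with_llm("\n".join(symbol_summaries))

    # 2. Aggregate files that share a folder into module summaries
    modules: dict[str, list[str]] = defaultdict(list)
    for path, summary in file_summaries.items():
        modules[folder_of[path]].append(summary)
    module_summaries = [summarize_with_llm("\n".join(f)) for f in modules.values()]

    # 3. Aggregate module summaries into the repository overview
    return summarize_with_llm("\n".join(module_summaries))
```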
Hierarchical Document Types
| Document Type | Content | Purpose |
|---|---|---|
| repo_summary | Aggregated from module summaries | Repository-level overview and tech stack |
| module_summary | Aggregated from file summaries | Folder/package-level context |
| file_index | Aggregated from symbol summaries | File purpose, key components, imports |
| symbol_index | LLM summary of function/class | Detailed symbol documentation (≥5 lines) |
Hybrid Chunking: AST + LLM
Tree-sitter provides AST-level symbol extraction for most languages. However, some files resist static analysis—Svelte components with complex template logic, SQL embedded in Python strings, or domain-specific patterns.
For these "underchunked" files (high line count but few detected symbols), CodeSmriti invokes an LLM semantic chunker to identify logical boundaries:
- Primary (tree-sitter): `function_definition`, `class_definition`, `method_definition`
- Fallback (LLM chunker): SQL queries, business logic blocks, configuration sections, template event handlers, data transformations
The LLM chunker identifies semantic units like `user_subscription_data_fetch` or `zone_calculation_algorithm` that tree-sitter cannot detect, ensuring comprehensive coverage even for unconventional code patterns.
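A minimal sketch of the underchunked-file heuristic follows; the thresholds and names below are assumptions, since the brief does not specify the exact values.
```python
# Hypothetical heuristic for routing files to the LLM semantic chunker;
# the actual thresholds used by CodeSmriti are not documented here.
MIN_LINES = 200              # "high line count"
MAX_SYMBOL_DENSITY = 0.01    # fewer than ~1 detected symbol per 100 lines

def needs_llm_chunking(source: str, tree_sitter_symbols: list[str]) -> bool:
    lines = source.count("\n") + 1
    if lines < MIN_LINES:
        return False                     # small files: AST chunks are enough
    return len(tree_sitter_symbols) / lines < MAX_SYMBOL_DENSITY
```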
Junk Filtering
Skip files that add noise without value:
```
node_modules/   package-lock.json   *.min.js
dist/           yarn.lock           *.min.css
__pycache__/    Cargo.lock          *.map
.git/           poetry.lock         *generated*
```
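For illustration, a filter over these patterns could look like the sketch below; the skip list and matching logic are assumptions based on the table above.
```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Illustrative only: CodeSmriti's actual skip rules may differ in detail.
SKIP_DIRS = {"node_modules", "dist", "__pycache__", ".git"}
SKIP_FILES = {"package-lock.json", "yarn.lock", "Cargo.lock", "poetry.lock"}
SKIP_GLOBS = ("*.min.js", "*.min.css", "*.map", "*generated*")

def is_junk(path: str) -> bool:
    parts = PurePosixPath(path).parts
    name = parts[-1]
    if any(part in SKIP_DIRS for part in parts[:-1]):
        return True
    return name in SKIP_FILES or any(fnmatch(name, g) for g in SKIP_GLOBS)
```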
4. Document Schema
Hierarchical Document Structure
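The full schema is not reproduced in this brief; the sketch below shows the fields a symbol-level document could carry, based on the document types, quality tracking, deduplication, and embedding strategy described in this section. Field names are illustrative.
```python
# Illustrative symbol_index document; field names are assumptions inferred
# from this section, not CodeSmriti's exact schema.
example_symbol_doc = {
    "doc_type": "symbol_index",         # repo_summary | module_summary | file_index | symbol_index
    "repo": "code-smriti",
    "path": "services/ingestion/chunker.py",
    "symbol": "needs_llm_chunking",
    "commit": "abc1234",
    "summary": "Heuristic that flags underchunked files for LLM chunking.",
    "code_snippet": "def needs_llm_chunking(source, symbols): ...",
    "enrichment_level": "llm_summary",  # llm_summary | basic | none
    "embedding": [0.0] * 768,           # nomic-embed-text-v1.5 vector
}
```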
Quality Tracking
Each document carries quality metadata indicating its enrichment level:
| Enrichment Level | Description | Fallback Behavior |
|---|---|---|
| `llm_summary` | Full LLM-generated summary | — |
| `basic` | Docstring + structure only | LLM unavailable or timed out |
| `none` | No summary available | Parsing failed |
A circuit breaker pattern prevents cascading failures: if LLM calls fail repeatedly, the system gracefully degrades to basic enrichment without blocking ingestion.
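A minimal sketch of such a circuit breaker, assuming a simple failure count and cooldown; the actual thresholds are not specified in this brief.
```python
import time

# Hypothetical circuit breaker around LLM enrichment; max_failures and
# cooldown_s are illustrative values, not CodeSmriti's configuration.
class LLMCircuitBreaker:
    def __init__(self, max_failures: int = 5, cooldown_s: float = 300.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def enrich(self, summarize, fallback, payload):
        # While the circuit is open, skip the LLM and degrade to "basic"
        if self.opened_at and time.time() - self.opened_at < self.cooldown_s:
            return fallback(payload)
        try:
            result = summarize(payload)       # LLM call
            self.failures = 0
            self.opened_at = None
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # open the circuit
            return fallback(payload)          # degrade without blocking ingestion
```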
Content-Based Deduplication
Document IDs are SHA256 hashes of content keys (repo, path, symbol name, commit). Re-indexing the same commit produces identical IDs, enabling efficient upserts without duplicate detection logic.
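As a sketch, the ID derivation might look like this; the field order and separator are assumptions, since the brief only states which keys feed the hash.
```python
import hashlib

# Field order and separator are assumptions; the brief only states that the
# document ID is a SHA256 hash over repo, path, symbol name, and commit.
def make_doc_id(repo: str, path: str, symbol: str, commit: str) -> str:
    key = "|".join((repo, path, symbol, commit))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Re-indexing the same commit reproduces the same ID, so writes become upserts.
doc_id = make_doc_id("code-smriti", "common/consumer_decorators.py",
                     "job_counter", "abc1234")
```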
Embedding Strategy
Embeddings capture both semantic meaning (from LLM summary) and code patterns (from actual code at index time). For symbols, the embedding combines the summary with a code snippet. This answers both "what does this do?" and "how is it implemented?"
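A hedged sketch of composing that embedding input; the template below is an assumption, only the summary-plus-snippet combination is stated in the brief.
```python
# Illustrative composition of the embedding input for a symbol document;
# the actual template and embedding client in CodeSmriti may differ.
def build_embedding_text(summary: str, code_snippet: str) -> str:
    return f"Summary:\n{summary}\n\nCode:\n{code_snippet}"

text = build_embedding_text(
    "Decorator that tracks task success/failure counts in Redis.",
    "def job_counter(func): ...",
)
# vector = embedding_model.encode(text)   # 768-dim nomic-embed-text-v1.5 output
```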
5. Performance
V4 Production Stats (November 2025)
Full ingestion of 101 repositories on an M3 Ultra Mac running LM Studio (qwen3-30b-a3b-2507):
| Metric | Value | Notes |
|---|---|---|
| Total repositories | 101 | 99 processed, 2 skipped (pre-existing) |
| Documents indexed | 48,795 | 13K files, 31K symbols, 4K modules |
| Total ingestion time | 32 hours | ~20 min average per repo |
| LLM tokens per file | ~1,850 | Estimated input+output |
Throughput
| Operation | Performance | Configuration |
|---|---|---|
| Embedding generation | 1,280 docs/min | MPS GPU, nomic-embed-text-v1.5 |
| Search latency | <100ms | Hybrid vector + FTS |
| Average repo ingestion | ~20 min | With LLM enrichment |
| Ingestion without LLM | 1-2 min | Basic summaries only |
Resumable Ingestion
The `--skip-existing` flag enables resumable batch ingestion. Repos with existing V4 documents are skipped, allowing recovery from failures without re-processing completed work.
6. Case Studies
V4 evaluation suite: 47 questions across 9 categories, tested against 32 proprietary repositories. The evaluation covers symbol lookups, pattern discovery, documentation retrieval, and cross-repo queries.
Evaluation Results (v0.4.0)
| Category | Questions | Avg Score | Notes |
|---|---|---|---|
| Symbol Pattern | 3 | 93% | Decorators, mixins, base classes |
| Symbol Specific | 5 | 85% | Exact function/class lookups |
| Svelte Components | 11 | 78% | Template logic, reactive statements |
| Code Explanation | 1 | 80% | "How does X work?" |
| Documentation | 8 | 76% | RST/MD docs, design principles |
| File Pattern | 7 | 76% | Models, views, configuration |
| Architecture | 6 | 76% | Multi-tenant, permissions |
| Discovery | 4 | 75% | Cross-repo exploration |
| Implementation | 2 | 70% | Complex feature deep-dives |
Overall: 47/47 passing (100%), average quality 78%
Example Output: Symbol Query
Query: "job_counter decorator implementation"
# `job_counter` Decorator Implementation
The `job_counter` decorator is implemented in `common/consumer_decorators.py`
and tracks task execution metrics using Redis.
## Key Features
- Tracks success, failure, and pending task counts
- Uses Redis for state management
- Supports both synchronous and asynchronous functions
## Implementation
```python
def job_counter(func: Callable) -> Callable:
    @wraps(func)
    async def async_wrapper(*args: Any, **kwargs: Any) -> Any:
        key_success, key_failed_list, key_pending_list, job_string = _make_keys(func, *args)
        try:
            redis_connection.lpush(key_pending_list, job_string)
            result = await func(*args, **kwargs)
            redis_connection.incr(key_success)
            return result
        except Exception as e:
            redis_connection.lpush(key_failed_list, f"Error: {str(e)}")
            redis_connection.ltrim(key_failed_list, 0, 50)
            raise
        finally:
            redis_connection.lpop(key_pending_list)

    return async_wrapper  # sync wrapper branch omitted in this snippet
```
Example Output: Documentation Query
Query: "L1 Python best practices"
# L1 Python Best Practices
## Naming Conventions
- Use `snake_case` for function and variable names
- Use `PascalCase` for class names
- Constants should be `UPPER_CASE`
## Error Handling
- Use specific exception types rather than bare `except:`
- Always include error messages for debugging
- Use context managers for resource management
## Type Hints
- Always use type hints for function parameters and return values
- Use `typing` module for complex types
```python
def process_user_data(
    user_list: List[Dict[str, str]],
    filter_criteria: Optional[Dict[str, str]] = None
) -> Tuple[Dict[str, List], int]:
    ...
```
Source: `docs/source/developer_guide/audit_standards/L1_python_best_practices.rst`
Search Level Selection
The LLM automatically selects the appropriate search level based on query intent:
- `symbol`: "Find the job_counter decorator" → specific function/class
- `file`: "How do we handle authentication?" → implementation files
- `module`: "What's in the permissions folder?" → directory overview
- `doc`: "What are our coding standards?" → RST/MD documentation
- `repo`: "What repos handle geospatial data?" → cross-repo discovery
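As an illustration, the resulting tool call for the file-level example might look like the sketch below; only the `level` parameter is documented in the tools table, and the other argument names are assumptions.
```python
# Hypothetical tool call emitted by the LLM; argument names other than
# "level" are assumptions about the search_codebase interface.
tool_call = {
    "tool": "search_codebase",
    "arguments": {
        "query": "How do we handle authentication?",
        "level": "file",    # symbol | file | module | doc | repo
    },
}
```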
Strengths
- Excellent at finding code patterns across repositories
- Strong at comparing implementations (decorators, mixins)
- Good at synthesizing answers from scattered sources
- Documentation queries now use dedicated `doc`-level search
7. Quick Start
```bash
# Clone repository
git clone https://github.com/kbhalerao/code-smriti
cd code-smriti

# Configure environment
cp .env.example .env
# Edit .env with your GitHub token and Couchbase credentials

# Start services
docker-compose up -d

# Verify
curl http://localhost/health
```
MCP Integration (Claude Desktop / Claude Code)
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
```json
{
  "mcpServers": {
    "code-smriti": {
      "command": "uv",
      "args": ["run", "--with", "mcp", "--with", "httpx",
               "path/to/code-smriti/services/mcp-server/rag_mcp_server.py"],
      "env": {
        "CODESMRITI_API_URL": "http://localhost",
        "CODESMRITI_USERNAME": "your-username",
        "CODESMRITI_PASSWORD": "your-password"
      }
    }
  }
}
```
Available Tools
| Tool | Description |
|---|---|
| `list_repos` | Discover available repositories with document counts |
| `explore_structure` | Navigate directory structure, find key files |
| `search_codebase` | Semantic search with `level` param: symbol, file, module, repo, or doc |
| `get_file` | Fetch actual code with optional line ranges |
| `ask_codebase` | RAG query with LLM synthesis (calls backend LLM) |
Two Usage Modes
| Mode | Use Case | How It Works |
|---|---|---|
| MCP Mode | Claude Code / Claude Desktop | Claude calls tools directly, does its own reasoning and synthesis |
| LLM Mode | LMStudio / Ollama | Local LLM orchestrates tools via PydanticAI agent |
References
1. Stack Overflow. 2024 Developer Survey. survey.stackoverflow.co/2024
2. Panopto. Workplace Knowledge and Productivity Report. 2018. PRNewswire.
3. Various industry studies on engineering onboarding. HackerNoon, Pluralsight.
4. Symflower. Parsing Code with Tree-sitter. 2023. symflower.com
5. Qodo. RAG for Large-Scale Code Repos. 2024. qodo.ai
6. cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree. arXiv:2506.15655
7. Weaviate. Evaluation Metrics for Search and Recommendation Systems. weaviate.io
8. Chroma Research. Evaluating Chunking Strategies for Retrieval. research.trychroma.com