CodeSmriti is a semantic code memory system that indexes GitHub repositories into a vector database for natural language retrieval. The system employs a bottom-up aggregation pipeline: symbols are summarized first, then aggregated into file summaries, folder modules, and finally repository overviews. Hybrid chunking combines tree-sitter AST parsing with LLM-assisted semantic chunking for languages that resist static analysis (Svelte, embedded SQL, templating). Integration with Claude Desktop and VSCode is provided via the Model Context Protocol (MCP). This brief documents the architecture, chunking strategies, and performance characteristics.
1. Problem Statement
Engineering teams accumulate solutions across dozens of repositories over years of development. Without institutional memory, organizations face compounding inefficiencies:
| Finding | Impact | Source |
|---|---|---|
| 61% of developers spend >30 min/day searching | ~100 hours/year per developer | [1] |
| $47M/year lost per large enterprise | Knowledge silos, duplication | [2] |
| 3-9 months to full productivity | Slow onboarding | [3] |
| 42% of knowledge unique to individuals | Risk when engineers depart | [2] |
Traditional code search (grep, GitHub search) operates on keywords, not semantics. Questions like "How did we solve rate limiting?" or "What's our authentication pattern?" require understanding intent, not matching strings.
2. Architecture
Components
- Ingestion Pipeline: Bottom-up processor (tree-sitter parsing → LLM chunking → symbol summarization → file aggregation → module aggregation → repo summary)
- Vector Database: Couchbase with FTS vector search index (768 dimensions)
- Tool Layer: Shared implementations for all RAG operations (search, explore, file retrieval)
- API Server: FastAPI providing RAG endpoints with hierarchical search (symbol/file/module/repo levels)
- MCP Server: Model Context Protocol for Claude Code—direct tool access without intermediate LLM
- PydanticAI Agent: LLM-driven mode for local models (LMStudio, Ollama)—LLM orchestrates tool calls
Key Technologies
| Component | Technology | Rationale |
|---|---|---|
| Parsing | tree-sitter | 40+ languages, incremental, error-tolerant [4] |
| LLM Enrichment | Local LLM (Qwen/Llama) | Summary generation, semantic chunking, aggregation |
| Embeddings | nomic-embed-text-v1.5 | 768-dim, 8192 token context, local inference |
| Vector DB | Couchbase | Hybrid FTS + vector, multi-tenant, production-ready |
| Integration | MCP | Standard protocol for LLM tool use |
3. Chunking Strategy
Research confirms that naive chunking methods struggle with code: "Naive chunking methods struggle with accurately delineating meaningful segments of code, leading to issues with boundary definition and the inclusion of irrelevant or incomplete information" [5].
CodeSmriti employs a hybrid approach: AST-aware chunking via tree-sitter as the primary method, with LLM-assisted semantic chunking as a fallback for files that resist static analysis. This has been shown to improve Recall@5 by 4.3 points on code retrieval benchmarks [6].
Core Principle: Bottom-Up Aggregation
Summaries are built from the ground up: individual symbols (functions, classes) are summarized first, then aggregated into file-level summaries, then into folder-based module summaries, and finally into a repository overview. This ensures every level has meaningful, accurate context rather than relying on top-down inference.
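For illustration, the aggregation order can be sketched as follows; the function names and the placeholder LLM call are assumptions, not CodeSmriti's actual implementation.
```python
from collections import defaultdict

# Illustrative sketch only: summarize_with_llm stands in for the local LLM call;
# the real pipeline's function names and prompts are not part of this brief.
def summarize_with_llm(text: str) -> str:
    return f"<summary of {len(text)} chars>"  # placeholder for an LLM request

def build_repo_summary(symbols_by_file: dict[str, list[str]],
                       folder_of: dict[str, str]) -> str:
    """Bottom-up aggregation: symbol -> file -> module (folder) -> repo."""
    # 1. Summarize each symbol, then aggregate into a file summary
    file_summaries: dict[str, str] = {}
    for path, symbol_sources in symbols_by_file.items():
        symbol_summaries = [summarize_with_llm(src) for src in symbol_sources]
        file_summaries[path] = summarize_with_llm("\n".join(symbol_summaries))

    # 2. Aggregate files that share a folder into module summaries
    modules: dict[str, list[str]] = defaultdict(list)
    for path, summary in file_summaries.items():
        modules[folder_of[path]].append(summary)
    module_summaries = [summarize_with_llm("\n".join(f)) for f in modules.values()]

    # 3. Aggregate module summaries into the repository overview
    return summarize_with_llm("\n".join(module_summaries))
```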
Hierarchical Document Types
| Document Type | Content | Purpose |
|---|---|---|
| repo_summary | Aggregated from module summaries | Repository-level overview and tech stack |
| module_summary | Aggregated from file summaries | Folder/package-level context |
| file_index | Aggregated from symbol summaries | File purpose, key components, imports |
| symbol_index | LLM summary of function/class | Detailed symbol documentation (≥5 lines) |
Hybrid Chunking: AST + LLM
Tree-sitter provides AST-level symbol extraction for most languages. However, some files resist static analysis—Svelte components with complex template logic, SQL embedded in Python strings, or domain-specific patterns.
For these "underchunked" files (high line count but few detected symbols), CodeSmriti invokes an LLM semantic chunker to identify logical boundaries:
- Primary (tree-sitter): `function_definition`, `class_definition`, `method_definition`
- Fallback (LLM chunker): SQL queries, business logic blocks, configuration sections, template event handlers, data transformations
The LLM chunker identifies semantic units like `user_subscription_data_fetch` or `zone_calculation_algorithm` that tree-sitter cannot detect, ensuring comprehensive coverage even for unconventional code patterns.
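A minimal sketch of the underchunked-file heuristic follows; the thresholds and names below are assumptions, since the brief does not specify the exact values.
```python
# Hypothetical heuristic for routing files to the LLM semantic chunker;
# the actual thresholds used by CodeSmriti are not documented here.
MIN_LINES = 200              # "high line count"
MAX_SYMBOL_DENSITY = 0.01    # fewer than ~1 detected symbol per 100 lines

def needs_llm_chunking(source: str, tree_sitter_symbols: list[str]) -> bool:
    lines = source.count("\n") + 1
    if lines < MIN_LINES:
        return False                     # small files: AST chunks are enough
    return len(tree_sitter_symbols) / lines < MAX_SYMBOL_DENSITY
```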
Junk Filtering
Skip files that add noise without value:
```
node_modules/   package-lock.json   *.min.js
dist/           yarn.lock           *.min.css
__pycache__/    Cargo.lock          *.map
.git/           poetry.lock         *generated*
```
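For illustration, a filter over these patterns could look like the sketch below; the skip list and matching logic are assumptions based on the table above.
```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Illustrative only: CodeSmriti's actual skip rules may differ in detail.
SKIP_DIRS = {"node_modules", "dist", "__pycache__", ".git"}
SKIP_FILES = {"package-lock.json", "yarn.lock", "Cargo.lock", "poetry.lock"}
SKIP_GLOBS = ("*.min.js", "*.min.css", "*.map", "*generated*")

def is_junk(path: str) -> bool:
    parts = PurePosixPath(path).parts
    name = parts[-1]
    if any(part in SKIP_DIRS for part in parts[:-1]):
        return True
    return name in SKIP_FILES or any(fnmatch(name, g) for g in SKIP_GLOBS)
```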
4. Document Schema
Hierarchical Document Structure
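The full schema is not reproduced in this brief; the sketch below shows the fields a symbol-level document could carry, based on the document types, quality tracking, deduplication, and embedding strategy described in this section. Field names are illustrative.
```python
# Illustrative symbol_index document; field names are assumptions inferred
# from this section, not CodeSmriti's exact schema.
example_symbol_doc = {
    "doc_type": "symbol_index",         # repo_summary | module_summary | file_index | symbol_index
    "repo": "code-smriti",
    "path": "services/ingestion/chunker.py",
    "symbol": "needs_llm_chunking",
    "commit": "abc1234",
    "summary": "Heuristic that flags underchunked files for LLM chunking.",
    "code_snippet": "def needs_llm_chunking(source, symbols): ...",
    "enrichment_level": "llm_summary",  # llm_summary | basic | none
    "embedding": [0.0] * 768,           # nomic-embed-text-v1.5 vector
}
```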
Quality Tracking
Each document carries quality metadata indicating its enrichment level:
| Enrichment Level | Description | Fallback Behavior |
|---|---|---|
| `llm_summary` | Full LLM-generated summary | — |
| `basic` | Docstring + structure only | LLM unavailable or timed out |
| `none` | No summary available | Parsing failed |
A circuit breaker pattern prevents cascading failures: if LLM calls fail repeatedly, the system gracefully degrades to basic enrichment without blocking ingestion.
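A minimal sketch of such a circuit breaker, assuming a simple failure count and cooldown; the actual thresholds are not specified in this brief.
```python
import time

# Hypothetical circuit breaker around LLM enrichment; max_failures and
# cooldown_s are illustrative values, not CodeSmriti's configuration.
class LLMCircuitBreaker:
    def __init__(self, max_failures: int = 5, cooldown_s: float = 300.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def enrich(self, summarize, fallback, payload):
        # While the circuit is open, skip the LLM and degrade to "basic"
        if self.opened_at and time.time() - self.opened_at < self.cooldown_s:
            return fallback(payload)
        try:
            result = summarize(payload)       # LLM call
            self.failures = 0
            self.opened_at = None
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # open the circuit
            return fallback(payload)          # degrade without blocking ingestion
```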
Content-Based Deduplication
Document IDs are SHA256 hashes of content keys (repo, path, symbol name, commit). Re-indexing the same commit produces identical IDs, enabling efficient upserts without duplicate detection logic.
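As a sketch, the ID derivation might look like this; the field order and separator are assumptions, since the brief only states which keys feed the hash.
```python
import hashlib

# Field order and separator are assumptions; the brief only states that the
# document ID is a SHA256 hash over repo, path, symbol name, and commit.
def make_doc_id(repo: str, path: str, symbol: str, commit: str) -> str:
    key = "|".join((repo, path, symbol, commit))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Re-indexing the same commit reproduces the same ID, so writes become upserts.
doc_id = make_doc_id("code-smriti", "common/consumer_decorators.py",
                     "job_counter", "abc1234")
```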
Embedding Strategy
Embeddings capture both semantic meaning (from LLM summary) and code patterns (from actual code at index time). For symbols, the embedding combines the summary with a code snippet. This answers both "what does this do?" and "how is it implemented?"
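A hedged sketch of composing that embedding input; the template below is an assumption, only the summary-plus-snippet combination is stated in the brief.
```python
# Illustrative composition of the embedding input for a symbol document;
# the actual template and embedding client in CodeSmriti may differ.
def build_embedding_text(summary: str, code_snippet: str) -> str:
    return f"Summary:\n{summary}\n\nCode:\n{code_snippet}"

text = build_embedding_text(
    "Decorator that tracks task success/failure counts in Redis.",
    "def job_counter(func): ...",
)
# vector = embedding_model.encode(text)   # 768-dim nomic-embed-text-v1.5 output
```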
5. Performance
V4 Production Stats (November 2025)
Full ingestion of 101 repositories on an M3 Ultra Mac running LM Studio (qwen3-30b-a3b-2507):
| Metric | Value | Notes |
|---|---|---|
| Total repositories | 101 | 99 processed, 2 skipped (pre-existing) |
| Documents indexed | 48,795 | 13K files, 31K symbols, 4K modules |
| Total ingestion time | 32 hours | ~20 min average per repo |
| LLM tokens per file | ~1,850 | Estimated input+output |
Throughput
| Operation | Performance | Configuration |
|---|---|---|
| Embedding generation | 1,280 docs/min | MPS GPU, nomic-embed-text-v1.5 |
| Search latency | <100ms | Hybrid vector + FTS |
| Average repo ingestion | ~20 min | With LLM enrichment |
| Ingestion without LLM | 1-2 min | Basic summaries only |
Resumable Ingestion
The `--skip-existing` flag enables resumable batch ingestion. Repos with existing V4 documents are skipped, allowing recovery from failures without re-processing completed work.
6. Case Studies
V4 evaluation suite: 47 questions across 9 categories, tested against 32 proprietary repositories. The evaluation covers symbol lookups, pattern discovery, documentation retrieval, and cross-repo queries.
Evaluation Results (v0.4.0)
| Category | Questions | Avg Score | Notes |
|---|---|---|---|
| Symbol Pattern | 3 | 93% | Decorators, mixins, base classes |
| Symbol Specific | 5 | 85% | Exact function/class lookups |
| Svelte Components | 11 | 78% | Template logic, reactive statements |
| Code Explanation | 1 | 80% | "How does X work?" |
| Documentation | 8 | 76% | RST/MD docs, design principles |
| File Pattern | 7 | 76% | Models, views, configuration |
| Architecture | 6 | 76% | Multi-tenant, permissions |
| Discovery | 4 | 75% | Cross-repo exploration |
| Implementation | 2 | 70% | Complex feature deep-dives |
Overall: 47/47 passing (100%), average quality 78%
Example Output: Symbol Query
Query: "job_counter decorator implementation"
# `job_counter` Decorator Implementation
The `job_counter` decorator is implemented in `common/consumer_decorators.py`
and tracks task execution metrics using Redis.
## Key Features
- Tracks success, failure, and pending task counts
- Uses Redis for state management
- Supports both synchronous and asynchronous functions
## Implementation
```python
def job_counter(func: Callable) -> Callable:
    @wraps(func)
    async def async_wrapper(*args: Any, **kwargs: Any) -> Any:
        key_success, key_failed_list, key_pending_list, job_string = _make_keys(func, *args)
        try:
            redis_connection.lpush(key_pending_list, job_string)
            result = await func(*args, **kwargs)
            redis_connection.incr(key_success)
            return result
        except Exception as e:
            redis_connection.lpush(key_failed_list, f"Error: {str(e)}")
            redis_connection.ltrim(key_failed_list, 0, 50)
            raise
        finally:
            redis_connection.lpop(key_pending_list)

    return async_wrapper  # sync wrapper branch omitted in this snippet
```
Example Output: Documentation Query
Query: "L1 Python best practices"
# L1 Python Best Practices
## Naming Conventions
- Use `snake_case` for function and variable names
- Use `PascalCase` for class names
- Constants should be `UPPER_CASE`
## Error Handling
- Use specific exception types rather than bare `except:`
- Always include error messages for debugging
- Use context managers for resource management
## Type Hints
- Always use type hints for function parameters and return values
- Use `typing` module for complex types
```python
def process_user_data(
    user_list: List[Dict[str, str]],
    filter_criteria: Optional[Dict[str, str]] = None
) -> Tuple[Dict[str, List], int]:
    ...
```
Source: `docs/source/developer_guide/audit_standards/L1_python_best_practices.rst`
Search Level Selection
The LLM automatically selects the appropriate search level based on query intent:
- `symbol`: "Find the job_counter decorator" → specific function/class
- `file`: "How do we handle authentication?" → implementation files
- `module`: "What's in the permissions folder?" → directory overview
- `doc`: "What are our coding standards?" → RST/MD documentation
- `repo`: "What repos handle geospatial data?" → cross-repo discovery
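As an illustration, the resulting tool call for the file-level example might look like the sketch below; only the `level` parameter is documented in the tools table, and the other argument names are assumptions.
```python
# Hypothetical tool call emitted by the LLM; argument names other than
# "level" are assumptions about the search_codebase interface.
tool_call = {
    "tool": "search_codebase",
    "arguments": {
        "query": "How do we handle authentication?",
        "level": "file",    # symbol | file | module | doc | repo
    },
}
```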
Strengths
- Excellent at finding code patterns across repositories
- Strong at comparing implementations (decorators, mixins)
- Good at synthesizing answers from scattered sources
- Documentation queries now use dedicated `doc`-level search
7. Quick Start
```bash
# Clone repository
git clone https://github.com/kbhalerao/code-smriti
cd code-smriti

# Configure environment
cp .env.example .env
# Edit .env with your GitHub token and Couchbase credentials

# Start services
docker-compose up -d

# Verify
curl http://localhost/health
```
MCP Integration (Claude Desktop / Claude Code)
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
```json
{
  "mcpServers": {
    "code-smriti": {
      "command": "uv",
      "args": ["run", "--with", "mcp", "--with", "httpx",
               "path/to/code-smriti/services/mcp-server/rag_mcp_server.py"],
      "env": {
        "CODESMRITI_API_URL": "http://localhost",
        "CODESMRITI_USERNAME": "your-username",
        "CODESMRITI_PASSWORD": "your-password"
      }
    }
  }
}
```
Available Tools
| Tool | Description |
|---|---|
| `list_repos` | Discover available repositories with document counts |
| `explore_structure` | Navigate directory structure, find key files |
| `search_codebase` | Semantic search with `level` param: symbol, file, module, repo, or doc |
| `get_file` | Fetch actual code with optional line ranges |
| `ask_codebase` | RAG query with LLM synthesis (calls backend LLM) |
Two Usage Modes
| Mode | Use Case | How It Works |
|---|---|---|
| MCP Mode | Claude Code / Claude Desktop | Claude calls tools directly, does its own reasoning and synthesis |
| LLM Mode | LMStudio / Ollama | Local LLM orchestrates tools via PydanticAI agent |
References
1. Stack Overflow. 2024 Developer Survey. survey.stackoverflow.co/2024
2. Panopto. Workplace Knowledge and Productivity Report. 2018. PRNewswire.
3. Various industry studies on engineering onboarding. HackerNoon, Pluralsight.
4. Symflower. Parsing Code with Tree-sitter. 2023. symflower.com
5. Qodo. RAG for Large-Scale Code Repos. 2024. qodo.ai
6. cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree. arXiv:2506.15655
7. Weaviate. Evaluation Metrics for Search and Recommendation Systems. weaviate.io
8. Chroma Research. Evaluating Chunking Strategies for Retrieval. research.trychroma.com