CodeSmriti

स्मृति — memory, that which is remembered
Technical Brief v0.4.0 | November 2025 | github.com/kbhalerao/code-smriti
Abstract

CodeSmriti is a semantic code memory system that indexes GitHub repositories into a vector database for natural language retrieval. The system employs a bottom-up aggregation pipeline: symbols are summarized first, then aggregated into file summaries, folder modules, and finally repository overviews. Hybrid chunking combines tree-sitter AST parsing with LLM-assisted semantic chunking for languages that resist static analysis (Svelte, embedded SQL, templating). Integration with Claude Desktop and VSCode is provided via the Model Context Protocol (MCP). This brief documents the architecture, chunking strategies, and performance characteristics.

1. Problem Statement

Engineering teams accumulate solutions across dozens of repositories over years of development. Without institutional memory, organizations face compounding inefficiencies:

Finding                                         Impact                          Source
61% of developers spend >30 min/day searching   ~100 hours/year per developer   [1]
$47M/year lost per large enterprise             Knowledge silos, duplication    [2]
3-9 months to full productivity                 Slow onboarding                 [3]
42% of knowledge unique to individuals          Risk when engineers depart      [2]

Traditional code search (grep, GitHub search) operates on keywords, not semantics. Questions like "How did we solve rate limiting?" or "What's our authentication pattern?" require understanding intent, not matching strings.

2. Architecture

┌──────────────┐     ┌─────────────────────────────────────┐     ┌──────────────┐
│ GitHub Repos │────▶│         Ingestion Pipeline          │────▶│  Couchbase   │
└──────────────┘     │                                     │     │  Vector DB   │
                     │ 1. Parse files (tree-sitter)        │     └──────┬───────┘
                     │ 2. LLM chunk underchunked files     │            │
                     │ 3. Summarize symbols → files        │   768-dim embeddings
                     │ 4. Aggregate files → modules        │   (nomic-embed-text-v1.5)
                     │ 5. Aggregate modules → repo         │            │
                     └─────────────────────────────────────┘            │
                                                                        │
      ┌─────────────────────────────────────────────────────────────────┘
      │
      │     ┌─────────────────────────────────────────────────────────────┐
      │     │                     TOOL LAYER (shared)                     │
      └────▶│   list_repos | explore_structure | search_code | get_file   │
            └──────────────┬──────────────────────────────┬───────────────┘
                           │                              │
                           ▼                              ▼
            ┌─────────────────────────┐     ┌─────────────────────────────┐
            │       MCP Server        │     │      PydanticAI Agent       │
            │    (rag_mcp_server)     │     │    (pydantic_rag_agent)     │
            │                         │     │                             │
            │  - Direct tool call     │     │  - LLM orchestrates tools   │
            │  - Claude reasons       │     │  - ask_codebase() wraps     │
            │  - No LLM needed        │     │  - Local LLM synthesizes    │
            └────────────┬────────────┘     └──────────────┬──────────────┘
                         │                                 │
                         ▼                                 ▼
                    Claude Code                    LMStudio / Ollama

Components

  • Ingestion Pipeline: Bottom-up processor (tree-sitter parsing → LLM chunking → symbol summarization → file aggregation → module aggregation → repo summary)
  • Vector Database: Couchbase with FTS vector search index (768 dimensions)
  • Tool Layer: Shared implementations for all RAG operations (search, explore, file retrieval)
  • API Server: FastAPI providing RAG endpoints with hierarchical search (symbol/file/module/repo levels)
  • MCP Server: Model Context Protocol for Claude Code—direct tool access without intermediate LLM
  • PydanticAI Agent: LLM-driven mode for local models (LMStudio, Ollama)—LLM orchestrates tool calls

Key Technologies

Component        Technology               Rationale
Parsing          tree-sitter              40+ languages, incremental, error-tolerant [4]
LLM Enrichment   Local LLM (Qwen/Llama)   Summary generation, semantic chunking, aggregation
Embeddings       nomic-embed-text-v1.5    768-dim, 8192-token context, local inference
Vector DB        Couchbase                Hybrid FTS + vector, multi-tenant, production-ready
Integration      MCP                      Standard protocol for LLM tool use

3. Chunking Strategy

Research confirms the limits of naive chunking for code: "Naive chunking methods struggle with accurately delineating meaningful segments of code, leading to issues with boundary definition and the inclusion of irrelevant or incomplete information" [5].

CodeSmriti employs a hybrid approach: AST-aware chunking via tree-sitter as the primary method, with LLM-assisted semantic chunking as a fallback for files that resist static analysis. This has been shown to improve Recall@5 by 4.3 points on code retrieval benchmarks [6].

Core Principle: Bottom-Up Aggregation

Summaries are built from the ground up: individual symbols (functions, classes) are summarized first, then aggregated into file-level summaries, then into folder-based module summaries, and finally into a repository overview. This ensures every level has meaningful, accurate context rather than relying on top-down inference.
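
The sketch below captures that order; llm_summarize and llm_aggregate are hypothetical stand-ins for the pipeline's enrichment calls, not the actual implementation.

```python
# Compact sketch of the aggregation order; llm_summarize and llm_aggregate are
# injected stand-ins for the pipeline's enrichment calls.
def summarize_repo(repo, llm_summarize, llm_aggregate):
    module_summaries = []
    for folder in repo.folders:
        file_summaries = []
        for file in folder.files:
            # 1. Symbols first: each function/class gets its own summary.
            symbol_summaries = [llm_summarize(sym.code) for sym in file.symbols]
            # 2. File summary is built from its symbol summaries.
            file_summaries.append(llm_aggregate(symbol_summaries, level="file"))
        # 3. Module (folder) summary is built from its file summaries.
        module_summaries.append(llm_aggregate(file_summaries, level="module"))
    # 4. Repository overview is built from the module summaries.
    return llm_aggregate(module_summaries, level="repo")
```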

Hierarchical Document Types

Document Type    Content                            Purpose
repo_summary     Aggregated from module summaries   Repository-level overview and tech stack
module_summary   Aggregated from file summaries     Folder/package-level context
file_index       Aggregated from symbol summaries   File purpose, key components, imports
symbol_index     LLM summary of function/class      Detailed symbol documentation (≥5 lines)

Hybrid Chunking: AST + LLM

Tree-sitter provides AST-level symbol extraction for most languages. However, some files resist static analysis—Svelte components with complex template logic, SQL embedded in Python strings, or domain-specific patterns.
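
On the primary path, extraction looks roughly like the sketch below; py-tree-sitter APIs differ across versions, so the setup is illustrative rather than the actual ingestion code.

```python
# Minimal AST-level extraction sketch with py-tree-sitter (API varies by
# version; this assumes tree-sitter >= 0.23 and the tree_sitter_python grammar).
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def extract_symbols(source: bytes):
    """Yield (name, start_line, end_line) for top-level functions and classes."""
    tree = parser.parse(source)
    for node in tree.root_node.children:
        if node.type in ("function_definition", "class_definition"):
            name_node = node.child_by_field_name("name")
            name = source[name_node.start_byte:name_node.end_byte].decode()
            yield name, node.start_point[0] + 1, node.end_point[0] + 1
```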

For these "underchunked" files (high line count but few detected symbols), CodeSmriti invokes an LLM semantic chunker to identify logical boundaries:

Primary (tree-sitter):     function_definition, class_definition, method_definition
Fallback (LLM chunker):    SQL queries, business logic blocks, configuration sections,
                           template event handlers, data transformations

The LLM chunker identifies semantic units like user_subscription_data_fetch or zone_calculation_algorithm that tree-sitter cannot detect, ensuring comprehensive coverage even for unconventional code patterns.
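
A plausible version of that routing is sketched below; the threshold values and helper names are assumptions rather than the shipped implementation.

```python
# Routing sketch: fall back to LLM semantic chunking when tree-sitter finds
# few symbols in a long file. Threshold values and helper names are assumptions.
def needs_llm_chunking(line_count: int, symbol_count: int) -> bool:
    """High line count but few detected symbols suggests an 'underchunked' file."""
    return line_count > 100 and symbol_count < 3

def chunk_file(source: str, tree_sitter_chunks, llm_semantic_chunks):
    """tree_sitter_chunks / llm_semantic_chunks are injected chunker callables."""
    symbols = tree_sitter_chunks(source)            # primary path
    line_count = source.count("\n") + 1
    if needs_llm_chunking(line_count, len(symbols)):
        return llm_semantic_chunks(source)          # LLM fallback
    return symbols
```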

Junk Filtering

Skip files that add noise without value:

node_modules/    package-lock.json    *.min.js
dist/            yarn.lock            *.min.css
__pycache__/     Cargo.lock           *.map
.git/            poetry.lock          *generated*
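
A straightforward filter over these patterns could look like the following sketch (the pattern list is taken from above; the function itself is illustrative).

```python
import fnmatch

# Junk filter over the skip patterns listed above.
SKIP_DIRS = {"node_modules", "dist", "__pycache__", ".git"}
SKIP_PATTERNS = ["package-lock.json", "yarn.lock", "Cargo.lock", "poetry.lock",
                 "*.min.js", "*.min.css", "*.map", "*generated*"]

def is_junk(path: str) -> bool:
    """Return True for files that add noise without value."""
    parts = path.split("/")
    if any(part in SKIP_DIRS for part in parts):
        return True
    return any(fnmatch.fnmatch(parts[-1], pattern) for pattern in SKIP_PATTERNS)

assert is_junk("frontend/node_modules/react/index.js")
assert is_junk("static/app.min.js")
assert not is_junk("common/consumer_decorators.py")
```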

4. Document Schema

Key Insight: Since we have access to the actual repository, we don't need to store code redundantly. Store summaries + line references, fetch code on demand. Content-based hashing ensures deduplication across re-indexes.

Hierarchical Document Structure

repo_summary ─────────────────────────────────────────────────────────
│  Aggregated from module summaries. Tech stack, total files/lines.
│  document_id: hash(repo:{repo_id}:{commit})
│
└─ module_summary ────────────────────────────────────────────────────
   │  Folder-based. Aggregated from file summaries.
   │  document_id: hash(module:{repo_id}:{path}:{commit})
   │
   └─ file_index ─────────────────────────────────────────────────────
      │  Aggregated from symbol summaries. Imports, language, all symbols.
      │  document_id: hash(file:{repo_id}:{path}:{commit})
      │
      └─ symbol_index ──────────────────────────────────────────────
            Individual function/class with LLM summary. ≥5 lines only.
            document_id: hash(symbol:{repo_id}:{path}:{name}:{commit})
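
For illustration, a symbol_index document might carry fields like the following; the names and values are assumptions based on this schema, not the exact stored format.

```python
# Illustrative shape of a symbol_index document; field names and values are
# assumptions based on the schema above, not the exact stored format.
symbol_doc = {
    "type": "symbol_index",
    "document_id": "sha256 of symbol:{repo_id}:{path}:{name}:{commit}",
    "repo_id": "acme/api",
    "path": "common/consumer_decorators.py",
    "symbol": "job_counter",
    "kind": "function",
    "lines": [12, 41],                 # line references instead of stored code
    "summary": "Decorator that tracks task success/failure counts in Redis.",
    "enrichment": "llm_summary",
    "embedding": [0.013, -0.042],      # truncated; 768 dimensions in practice
}
```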

Quality Tracking

Each document carries quality metadata indicating its enrichment level:

Enrichment Level   Description                  Fallback Behavior
llm_summary        Full LLM-generated summary   (default; no fallback)
basic              Docstring + structure only   LLM unavailable or timed out
none               No summary available         Parsing failed

A circuit breaker pattern prevents cascading failures: if LLM calls fail repeatedly, the system gracefully degrades to basic enrichment without blocking ingestion.
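
A minimal sketch of such a breaker, with the summarizers injected as callables (the threshold and names are assumptions, not the actual implementation):

```python
# Circuit-breaker sketch for the LLM enrichment path. Threshold, names, and
# the enrichment labels are assumptions matching the table above.
class LLMCircuitBreaker:
    def __init__(self, failure_threshold: int = 5):
        self.failures = 0
        self.threshold = failure_threshold

    @property
    def open(self) -> bool:
        # Once too many calls fail in a row, stop trying the LLM.
        return self.failures >= self.threshold

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1


def enrich(code: str, breaker: LLMCircuitBreaker, llm_summarize, basic_summarize) -> dict:
    """Return a summary plus the enrichment level actually achieved."""
    if breaker.open:
        return {"summary": basic_summarize(code), "enrichment": "basic"}
    try:
        summary = llm_summarize(code)
        breaker.record_success()
        return {"summary": summary, "enrichment": "llm_summary"}
    except Exception:
        breaker.record_failure()
        return {"summary": basic_summarize(code), "enrichment": "basic"}
```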

Content-Based Deduplication

Document IDs are SHA256 hashes of content keys (repo, path, symbol name, commit). Re-indexing the same commit produces identical IDs, enabling efficient upserts without duplicate detection logic.
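
A sketch of the ID construction, following the content-key formats shown in the schema diagram (the example values are illustrative):

```python
import hashlib

def document_id(kind: str, *parts: str) -> str:
    """SHA256 of a content key such as symbol:{repo_id}:{path}:{name}:{commit}."""
    key = ":".join((kind, *parts))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Re-indexing the same commit reproduces the same ID, so writes become upserts.
doc_id = document_id("symbol", "acme/api", "common/consumer_decorators.py",
                     "job_counter", "3f2a9c1")
```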

Embedding Strategy

Embeddings capture both semantic meaning (from LLM summary) and code patterns (from actual code at index time). For symbols, the embedding combines the summary with a code snippet. This answers both "what does this do?" and "how is it implemented?"
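
A minimal sketch of that combination, assuming the sentence-transformers loading path and Nomic's documented task prefixes; the production embedding code may differ.

```python
from sentence_transformers import SentenceTransformer

# nomic-embed-text-v1.5 runs locally and expects task prefixes on input text.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

def embed_symbol(summary: str, code_snippet: str):
    # Summary captures intent ("what does this do?"); the snippet captures
    # implementation patterns ("how is it implemented?").
    text = f"search_document: {summary}\n\n{code_snippet}"
    return model.encode(text)          # 768-dim vector

def embed_query(question: str):
    return model.encode(f"search_query: {question}")
```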

5. Performance

V4 Production Stats (November 2025)

Full ingestion of 101 repositories on Mac M3 Ultra with LM Studio (qwen3-30b-a3b-2507):

Metric                 Value      Notes
Total repositories     101        99 processed, 2 skipped (pre-existing)
Documents indexed      48,795     13K files, 31K symbols, 4K modules
Total ingestion time   32 hours   ~20 min average per repo
LLM tokens per file    ~1,850     Estimated input + output

Throughput

Operation                Performance      Configuration
Embedding generation     1,280 docs/min   MPS GPU, nomic-embed-text-v1.5
Search latency           <100 ms          Hybrid vector + FTS
Average repo ingestion   ~20 min          With LLM enrichment
Ingestion without LLM    1-2 min          Basic summaries only

Resumable Ingestion

The --skip-existing flag enables resumable batch ingestion. Repos with existing V4 documents are skipped, allowing recovery from failures without re-processing completed work.
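
Conceptually, the skip logic reduces to a filter like this sketch (names are illustrative, not the actual CLI code):

```python
# Sketch of --skip-existing: yield only repos that still need V4 indexing.
def repos_to_process(all_repos, existing_v4_repo_ids: set, skip_existing: bool):
    for repo in all_repos:
        if skip_existing and repo.id in existing_v4_repo_ids:
            print(f"Skipping {repo.name}: V4 documents already indexed")
            continue
        yield repo
```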

6. Case Studies

V4 evaluation suite: 47 questions across 9 categories, tested against 32 proprietary repositories. The evaluation covers symbol lookups, pattern discovery, documentation retrieval, and cross-repo queries.

Evaluation Results (v0.4.0)

Category            Questions   Avg Score   Notes
Symbol Pattern      3           93%         Decorators, mixins, base classes
Symbol Specific     5           85%         Exact function/class lookups
Svelte Components   11          78%         Template logic, reactive statements
Code Explanation    1           80%         "How does X work?"
Documentation       8           76%         RST/MD docs, design principles
File Pattern        7           76%         Models, views, configuration
Architecture        6           76%         Multi-tenant, permissions
Discovery           4           75%         Cross-repo exploration
Implementation      2           70%         Complex feature deep-dives

Overall: 47/47 passing (100%), average quality 78%

Example Output: Symbol Query

Query: "job_counter decorator implementation"

# `job_counter` Decorator Implementation

The `job_counter` decorator is implemented in `common/consumer_decorators.py`
and tracks task execution metrics using Redis.

## Key Features
- Tracks success, failure, and pending task counts
- Uses Redis for state management
- Supports both synchronous and asynchronous functions

## Implementation
```python
def job_counter(func: Callable) -> Callable:
    @wraps(func)
    async def async_wrapper(*args: Any, **kwargs: Any) -> Any:
        key_success, key_failed_list, key_pending_list, job_string = _make_keys(func, *args)
        try:
            redis_connection.lpush(key_pending_list, job_string)
            result = await func(*args, **kwargs)
            redis_connection.incr(key_success)
            return result
        except Exception as e:
            redis_connection.lpush(key_failed_list, f"Error: {str(e)}")
            redis_connection.ltrim(key_failed_list, 0, 50)
            raise
        finally:
            redis_connection.lpop(key_pending_list)

    # (synchronous wrapper omitted in this excerpt)
    return async_wrapper
```

Example Output: Documentation Query

Query: "L1 Python best practices"

# L1 Python Best Practices

## Naming Conventions
- Use `snake_case` for function and variable names
- Use `PascalCase` for class names
- Constants should be `UPPER_CASE`

## Error Handling
- Use specific exception types rather than bare `except:`
- Always include error messages for debugging
- Use context managers for resource management

## Type Hints
- Always use type hints for function parameters and return values
- Use `typing` module for complex types

```python
from typing import Dict, List, Optional, Tuple

def process_user_data(
    user_list: List[Dict[str, str]],
    filter_criteria: Optional[Dict[str, str]] = None
) -> Tuple[Dict[str, List], int]:
    ...
```

Source: `docs/source/developer_guide/audit_standards/L1_python_best_practices.rst`

Search Level Selection

The LLM automatically selects the appropriate search level based on query intent:

  • symbol — "Find the job_counter decorator" → specific function/class
  • file — "How do we handle authentication?" → implementation files
  • module — "What's in the permissions folder?" → directory overview
  • doc — "What are our coding standards?" → RST/MD documentation
  • repo — "What repos handle geospatial data?" → cross-repo discovery
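
For reference, the same mapping written out as data; the search_codebase call shape mentioned in the comment is an assumption, not a documented signature.

```python
# The level routing above, written out as data. The corresponding tool call
# would be search_codebase(query=..., level=...) per Section 7; that signature
# is an assumption, not a documented API.
QUERY_LEVEL_EXAMPLES = {
    "Find the job_counter decorator":     "symbol",
    "How do we handle authentication?":   "file",
    "What's in the permissions folder?":  "module",
    "What are our coding standards?":     "doc",
    "What repos handle geospatial data?": "repo",
}
```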

Strengths

  • Excellent at finding code patterns across repositories
  • Strong at comparing implementations (decorators, mixins)
  • Good at synthesizing answers from scattered sources
  • Documentation queries now use a dedicated doc-level search

7. Quick Start

# Clone repository
git clone https://github.com/kbhalerao/code-smriti
cd code-smriti

# Configure environment
cp .env.example .env
# Edit .env with your GitHub token and Couchbase credentials

# Start services
docker-compose up -d

# Verify
curl http://localhost/health

MCP Integration (Claude Desktop / Claude Code)

Add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "code-smriti": {
      "command": "uv",
      "args": ["run", "--with", "mcp", "--with", "httpx",
               "path/to/code-smriti/services/mcp-server/rag_mcp_server.py"],
      "env": {
        "CODESMRITI_API_URL": "http://localhost",
        "CODESMRITI_USERNAME": "your-username",
        "CODESMRITI_PASSWORD": "your-password"
      }
    }
  }
}

Available Tools

Tool                Description
list_repos          Discover available repositories with document counts
explore_structure   Navigate directory structure, find key files
search_codebase     Semantic search with level param: symbol, file, module, repo, or doc
get_file            Fetch actual code with optional line ranges
ask_codebase        RAG query with LLM synthesis (calls backend LLM)
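
As a sketch of how one of these tools could be exposed, assuming the official MCP Python SDK's FastMCP helper; the actual rag_mcp_server.py may be structured differently, and the /repos endpoint plus basic-auth scheme are assumptions.

```python
# Sketch: expose list_repos over MCP via the official SDK's FastMCP helper.
import os
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("code-smriti")
API_URL = os.environ.get("CODESMRITI_API_URL", "http://localhost")
AUTH = (os.environ.get("CODESMRITI_USERNAME", ""),
        os.environ.get("CODESMRITI_PASSWORD", ""))

@mcp.tool()
def list_repos() -> str:
    """Discover available repositories with document counts."""
    resp = httpx.get(f"{API_URL}/repos", auth=AUTH)   # hypothetical endpoint
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    mcp.run()
```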

Two Usage Modes

Mode       Use Case                       How It Works
MCP Mode   Claude Code / Claude Desktop   Claude calls tools directly, does its own reasoning and synthesis
LLM Mode   LMStudio / Ollama              Local LLM orchestrates tools via PydanticAI agent

References

  1. Stack Overflow. 2024 Developer Survey. survey.stackoverflow.co/2024
  2. Panopto. Workplace Knowledge and Productivity Report. 2018. PRNewswire
  3. Various industry studies on engineering onboarding. HackerNoon, Pluralsight
  4. Symflower. Parsing Code with Tree-sitter. 2023. symflower.com
  5. Qodo. RAG for Large-Scale Code Repos. 2024. qodo.ai
  6. cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree. arXiv:2506.15655
  7. Weaviate. Evaluation Metrics for Search and Recommendation Systems. weaviate.io
  8. Chroma Research. Evaluating Chunking Strategies for Retrieval. research.trychroma.com