@arockwell
Created July 4, 2025 05:54
PostgreSQL Knowledge Base for Millions of Markdown Files

A blazing-fast markdown document management system built with PostgreSQL and Fish shell. Handles everything from personal notes to millions of documents with smart tab completion and beautiful mdcat rendering.

Features

  • 🚀 Scales to millions - Optimized indexes, parallel queries, batch operations
  • 🔍 Full-text + Fuzzy search - PostgreSQL FTS with trigram support
  • 📁 Project organization - Group documents by project with multi-level tab completion
  • ⚡ Sub-second queries - Even with millions of documents
  • 🎨 Beautiful rendering - mdcat integration for terminal viewing

Quick Start

# Source the functions
source claude-functions.fish

# Save a document (auto-detects project from git)
claude-save README.md

# Save with explicit project
claude-save api-notes.md "API Design" "myproject"

# View with smart tab completion
claude-view <TAB>                    # Shows recent + prefixes
claude-view project:<TAB>            # Browse by project
claude-view project:myapp/<TAB>      # Documents in project

# Search across everything
claude-find "docker compose"
claude-find "api" --project myapp --limit 50

# Fuzzy search for typos
claude-find-fuzzy "configuraiton"    # Finds "configuration"
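For orientation, the searches above presumably map to PostgreSQL queries along these lines. This is a sketch against the `claude.knowledge` schema shown under Installation; the actual ranking weights, result limits, and query functions used by `claude-find` are assumptions.

```sql
-- Hypothetical query behind claude-find: full-text search, ranked.
SELECT id, title, ts_rank(search_vector, query) AS rank
FROM claude.knowledge,
     websearch_to_tsquery('english', 'docker compose') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 20;

-- Hypothetical query behind claude-find-fuzzy: trigram similarity on titles.
SELECT id, title, similarity(title, 'configuraiton') AS sim
FROM claude.knowledge
WHERE title % 'configuraiton'   -- the % operator can use the trigram index
ORDER BY sim DESC
LIMIT 20;
```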

Installation

  1. Create the database schema:
CREATE SCHEMA claude;

CREATE TABLE claude.knowledge (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL,
    project TEXT,
    tags TEXT[],
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    accessed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    access_count INTEGER DEFAULT 0,
    search_vector tsvector
);

-- Create indexes for performance
CREATE INDEX idx_claude_project ON claude.knowledge(project) WHERE project IS NOT NULL;
CREATE INDEX idx_claude_accessed ON claude.knowledge(accessed_at DESC);
CREATE INDEX idx_claude_search ON claude.knowledge USING GIN(search_vector);

-- Enable fuzzy search
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_claude_title_trgm ON claude.knowledge USING gin(title gin_trgm_ops);
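Note that nothing in the schema above populates `search_vector`. One way to keep it current is a trigger (a sketch; the Fish functions could equally set the column in `claude-save` instead):

```sql
-- Keep search_vector in sync with title and content on every write.
CREATE FUNCTION claude.knowledge_tsv_update() RETURNS trigger AS $$
BEGIN
    NEW.search_vector :=
        setweight(to_tsvector('english', coalesce(NEW.title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(NEW.content, '')), 'B');
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_knowledge_tsv
    BEFORE INSERT OR UPDATE OF title, content ON claude.knowledge
    FOR EACH ROW EXECUTE FUNCTION claude.knowledge_tsv_update();
```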
  2. Source the Fish functions:
source claude-functions.fish
source claude-completions.fish

Multi-Level Tab Completion

The completion system is context-aware and supports multiple levels:

# Level 1: Prefixes and recent docs
claude-view <TAB>
api-design.md                    Recent document (modified today)
project:                         Browse by project →
tag:                             Browse by tag →
date:                            Browse by date →

# Level 2: Within a context
claude-view project:my<TAB>
project:myapp                    myapp project (12 documents)
project:mysite                   mysite project (8 documents)

# Level 3: Specific documents
claude-view project:myapp/<TAB>
project:myapp/api-design.md      API endpoint planning document
project:myapp/docker-setup.md    Docker configuration
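A level-2 completion like `project:my<TAB>` presumably boils down to a grouped prefix query; a hypothetical sketch:

```sql
-- Candidate completions for "project:my", with the document counts
-- shown in the completion descriptions.
SELECT project, COUNT(*) AS doc_count
FROM claude.knowledge
WHERE project LIKE 'my%'
GROUP BY project
ORDER BY doc_count DESC;
```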

Scaling to Millions

Current optimizations (handles 100K+ docs):

  • 12 specialized indexes including trigrams
  • Parallel query execution
  • Batch import operations
  • Smart search function with ranking

When you hit millions:

-- 1. Partition by date
CREATE TABLE claude.knowledge (...) PARTITION BY RANGE (created_at);

-- 2. Materialized views for tab completion
CREATE MATERIALIZED VIEW claude.project_summary AS
SELECT project, COUNT(*) AS doc_count, MAX(updated_at) AS last_updated
FROM claude.knowledge GROUP BY project;

-- 3. Separate search index
CREATE TABLE claude.search_index (
    id BIGINT PRIMARY KEY,
    title TEXT,
    search_vector tsvector,
    preview TEXT
);
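A materialized view goes stale until refreshed. To refresh it without blocking the tab-completion reads, `REFRESH ... CONCURRENTLY` can be used, which requires a unique index on the view (sketch; index name is an assumption):

```sql
-- GROUP BY project makes project unique per row, so this index is valid.
CREATE UNIQUE INDEX idx_project_summary ON claude.project_summary (project);

-- Refresh without locking out readers (e.g. from a cron job).
REFRESH MATERIALIZED VIEW CONCURRENTLY claude.project_summary;
```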

Bulk Import

# Import entire directory
claude-import-dir ~/Documents/notes "personal"

# Import with progress for large sets
claude-import-dir ~/massive-docs "knowledge-base"
πŸ“ Found 50000 markdown files
πŸš€ Starting batch import...
........ (dots show progress)
βœ… Imported 50000 documents
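The batch import is fast because it avoids one round-trip per file. One plausible shape for the SQL side (a sketch; the staging-table name, CSV path, and exact load path are assumptions, not the gist's actual implementation):

```sql
-- Stage all rows in one COPY, then insert in a single statement.
CREATE TEMP TABLE staging (title TEXT, content TEXT, project TEXT);
COPY staging FROM '/tmp/import.csv' WITH (FORMAT csv);

INSERT INTO claude.knowledge (title, content, project)
SELECT title, content, project FROM staging;
```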

Performance Monitoring

# Check performance stats
claude-perf
⚡ Performance Diagnostics
=========================
metric          | value   | age
total_documents | 50000   | 00:01:23
Index Usage:
index_name              | scans  | tuples_read
idx_claude_search       | 1523   | 4821933
idx_claude_project_title| 823    | 2847221

# Run maintenance
claude-vacuum
🧹 Running maintenance...
✅ Maintenance complete
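The index-usage table above can be produced directly from PostgreSQL's statistics views; a sketch of the kind of query `claude-perf` might run:

```sql
-- Per-index usage counters for the claude schema, busiest first.
SELECT indexrelname AS index_name,
       idx_scan     AS scans,
       idx_tup_read AS tuples_read
FROM pg_stat_user_indexes
WHERE schemaname = 'claude'
ORDER BY idx_scan DESC;
```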

Architecture Decisions

  1. PostgreSQL over SQLite - Better concurrent access, full-text search, scales to millions
  2. Fish over Bash - Better completions, cleaner syntax, amazing interactive features
  3. Direct SQL over ORM - Maximum performance, full control over queries
  4. Trigram indexes - Handles typos and fuzzy matching efficiently

Advanced Usage

# Export entire project
claude-export-project myapp ~/Desktop/myapp-docs/

# Search with context
claude-find "authentication" --project api --limit 100

# View by ID (faster for scripts)
claude-view id:42

# Recent documents with custom limit
claude-recent 50

Why This Design Scales

  1. Smart Indexes - PostgreSQL chooses the best index automatically
  2. Parallel Queries - Multi-core search execution
  3. Batch Operations - Import thousands of files efficiently
  4. Incremental Updates - Only reindex what changes
  5. Partitioning Ready - Easy to add when needed
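On point 5: once the table is range-partitioned by `created_at` as shown earlier, adding capacity is a single DDL statement per range. A sketch (partition name and yearly granularity are assumptions):

```sql
-- Attach a new yearly partition to the range-partitioned table.
CREATE TABLE claude.knowledge_2025 PARTITION OF claude.knowledge
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
```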

Testing Scale

# Generate test data
for i in (seq 1 10000)
    # fish's echo does not expand \n inside double quotes; use printf instead
    printf '# Document %d\n\nThis is test content for document %d\n' $i $i > /tmp/doc-$i.md
end

# Bulk import
claude-import-dir /tmp "test-docs"

# Test search performance
time claude-find "document 5423"

# Test fuzzy search
time claude-find-fuzzy "documnet"  # Typo intentional

# Check stats
claude-stats

Built with ❤️ for people who have too many markdown files and want sub-second search on all of them.
