Building Production Hybrid Search: BM25 + pgvector in Supabase
I had a working RAG pipeline. Queries were returning documents. Answers were being generated. And then I ran RAGAS.
Context Recall: 0.821. Target: 0.85.
The retrieval was finding relevant stuff, but it was missing things. Specific things. A user asked about "Form 1040-NR" and the vector search came back with generic "filing requirements" docs. The right document existed in the database. It just wasn't being found.
That's the moment I went from vector-only search to hybrid.

Why Vector Search Alone Isn't Enough
Vector search is brilliant at understanding meaning. Ask "what can I deduct working from home?" and it'll find the home office deduction chapter even if those exact words don't appear together.
But it has a blind spot: exact terms. Tax documents are full of them.
- Form 1040-NR, non-resident alien return
- Schedule K-1, partnership income
- 1031 exchange, like-kind property swap
- IRS Publication 970, education tax benefits
These terms have embeddings that land near related concepts. A query for "1031 exchange" might return docs about capital gains, depreciation recapture, and real estate investing: all semantically close, none of them the actual 1031 exchange rules.
BM25 doesn't care about semantic closeness. It looks for the token. 1031 is 1031.
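You can see this directly in Postgres by running the query text through the same parser the search uses. The output below is what the standard english configuration should produce; worth verifying against your own database:
SELECT websearch_to_tsquery('english', '1031 exchange rules');
-- likely output: '1031' & 'exchang' & 'rule'
The words get stemmed, but the numeric token survives untouched, which is exactly what exact-term queries need.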
The Architecture: Three Layers
Query
│
├── BM25 Search (PostgreSQL full-text) ──┐
│                                        ├── RRF Fusion → Temporal Re-ranking → Top K
└── Vector Search (pgvector) ────────────┘
Each layer does something the other can't. The fusion layer is where the magic happens.
Layer 1: The SQL Function
Everything runs inside Supabase via a single PostgreSQL RPC call. No external services, no extra hops.
CREATE OR REPLACE FUNCTION hybrid_search_knowledge_documents(
    query_text TEXT,
    query_embedding VECTOR(384),
    match_count INT,
    bm25_weight DOUBLE PRECISION,
    vector_weight DOUBLE PRECISION
)
RETURNS TABLE (
    id UUID,
    title TEXT,
    content TEXT,
    embedding VECTOR(384),
    metadata JSONB,
    created_at TIMESTAMPTZ,
    combined_score DOUBLE PRECISION
) AS $$
BEGIN
    RETURN QUERY
    SELECT
        kd.id, kd.title, kd.content, kd.embedding, kd.metadata, kd.created_at,
        -- BM25-like score via PostgreSQL ts_rank_cd
        (bm25_weight * ts_rank_cd(
            to_tsvector('english', kd.content),
            websearch_to_tsquery('english', query_text)
        )) +
        -- Cosine similarity via pgvector
        (vector_weight * (1 - (kd.embedding <=> query_embedding)))
        AS combined_score
    FROM knowledge_documents kd
    WHERE
        to_tsvector('english', kd.content) @@ websearch_to_tsquery('english', query_text)
        OR (1 - (kd.embedding <=> query_embedding)) > 0.3
    ORDER BY combined_score DESC
    LIMIT match_count;
END;
$$ LANGUAGE plpgsql;
The WHERE clause is the key decision here: a document qualifies if it matches either BM25 or vector, with a 0.3 cosine threshold as the floor for vector matches. Too low and you get noise; too high and you miss semantically adjacent but differently-worded content. 0.3 was the sweet spot after testing.
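On the application side, the whole thing is one RPC call through the Supabase client. A minimal sketch, assuming supabase-py, environment variables named SUPABASE_URL / SUPABASE_SERVICE_KEY, and an embed() helper standing in for whatever 384-dimension embedding model you use:
import os
from supabase import create_client

# Credential and variable names here are illustrative, not taken from the project
supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_SERVICE_KEY"])

def hybrid_search(query: str, k: int = 20, bm25_w: float = 0.6, vector_w: float = 0.4) -> list:
    # embed() is a placeholder for a 384-dim embedding call (e.g. a MiniLM-class encoder)
    response = supabase.rpc(
        "hybrid_search_knowledge_documents",
        {
            "query_text": query,
            "query_embedding": embed(query),
            "match_count": k,
            "bm25_weight": bm25_w,
            "vector_weight": vector_w,
        },
    ).execute()
    return response.data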
Layer 2: Dynamic Weights (The Real Story Behind 0.5/0.5)
The existing blog shows 0.5/0.5. That's actually the fallback. The real defaults in production are 0.6 BM25 / 0.4 vector, and they shift dynamically based on the query.
The logic: detect whether the query contains exact-match terms, and boost BM25 when it does.
exact_patterns = [
    r'\bForm\s+\d+',             # "Form 1040", "Form 8889"
    r'\b\d{4}\b',                # Years: "2024", "2023"
    r'\bSchedule\s+[A-Z]\b',     # "Schedule C", "Schedule A"
    r'\bW-?\d+\b',               # "W-2", "W4"
    r'\b1099-\w+\b',             # "1099-INT", "1099-MISC"
    r'\bIRS\s+Publication\s+\d+',
]

def _get_dynamic_weights(self, query: str) -> Dict[str, float]:
    has_exact_terms = any(
        re.search(pattern, query, re.IGNORECASE)
        for pattern in exact_patterns
    )
    if has_exact_terms:
        return {"bm25": 0.7, "vector": 0.3}  # Exact query, trust keywords
    else:
        return {"bm25": 0.6, "vector": 0.4}  # Conceptual query, balanced
"What is the standard deduction?" → 0.6/0.4. Conceptual, no exact tokens. "What are the limits for Form 8889 in 2024?" → 0.7/0.3. Two exact tokens (Form 8889, 2024). BM25 should lead.
You can override both via env vars (BM25_WEIGHT, VECTOR_WEIGHT) without redeploying. Useful when tuning against RAGAS scores.
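How that override might be wired in, as a sketch (the env var names come from the post; the fallback behavior and function name are assumptions):
import os

def load_weight_overrides(defaults: dict) -> dict:
    """If BM25_WEIGHT and VECTOR_WEIGHT are both set, they win over the dynamic defaults."""
    bm25 = os.getenv("BM25_WEIGHT")
    vector = os.getenv("VECTOR_WEIGHT")
    if bm25 is not None and vector is not None:
        return {"bm25": float(bm25), "vector": float(vector)}
    return defaults

# inside the retriever: weights = load_weight_overrides(self._get_dynamic_weights(query))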
Layer 3: Reciprocal Rank Fusion
Both searches run in parallel, return their own ranked lists, and then RRF combines them. The formula is beautifully simple:
score(doc) = Σ 1 / (k + rank_in_list)
The sum runs over each ranked list the document appears in, and k = 60 is a constant that keeps any single top-ranked document from dominating too heavily.
async def retrieve(self, query: str, k: int = 20) -> List[Dict]:
    weights = self._get_dynamic_weights(query)
    initial_k = k * 3  # Cast wide, trim later

    bm25_results, vector_results = await asyncio.gather(
        self.retrieve_bm25(query, initial_k, weight=weights["bm25"]),
        self.retrieve_vector(query, initial_k)
    )

    fused = self._reciprocal_rank_fusion(bm25_results, vector_results, query=query)
    return fused[:k]
The initial_k = k * 3 is intentional. You cast a wider net at retrieval, let RRF sort it, then take the top K. Narrowing too early means RRF has nothing to work with.
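The fusion function itself isn't shown above. Here's a minimal standalone sketch of RRF with k = 60, assuming each result dict carries an 'id' key; the real _reciprocal_rank_fusion also takes the query, presumably for extra signal, which this sketch skips:
def reciprocal_rank_fusion(bm25_results: list, vector_results: list, k: int = 60) -> list:
    """score(doc) = sum over the lists it appears in of 1 / (k + rank)."""
    scores, docs = {}, {}
    for results in (bm25_results, vector_results):
        for rank, doc in enumerate(results, start=1):
            doc_id = doc["id"]
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
            docs[doc_id] = doc
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [{**docs[doc_id], "rrf_score": scores[doc_id]} for doc_id in ranked_ids]
Documents that show up in both lists accumulate two reciprocal terms, which is why they float to the top without either score needing to be normalized.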
The Temporal Layer: A Tax-Specific Problem
Here's something that doesn't come up in generic RAG tutorials: tax law changes every year. A document about the 2022 standard deduction is actively harmful if someone asks about 2024.
So after RRF, there's a temporal re-ranking layer that extracts the year from the query and scores documents by how well their tax_years metadata matches.
def _extract_target_year(self, query: str) -> Optional[int]:
    current_month = datetime.datetime.now().month

    # Filing season awareness: Jan-Apr means last year's taxes
    if 1 <= current_month <= 4:
        current_tax_year = datetime.datetime.now().year - 1
    else:
        current_tax_year = datetime.datetime.now().year

    if re.search(r'\bthis year\b', query, re.IGNORECASE):
        return current_tax_year  # Returns 2025 in Feb 2026, not 2026
    if re.search(r'\blast year\b', query, re.IGNORECASE):
        return current_tax_year - 1

    # Explicit year: "what are the 2023 limits for..."
    match = re.search(r'\b(\d{4})\b', query)
    if match:
        year = int(match.group(1))
        if 2000 <= year <= datetime.datetime.now().year + 1:
            return year

    return None
The filing season awareness is subtle but important. In February 2026, "this year's taxes" means tax year 2025, not 2026. Without this, temporal re-ranking would surface 2026 docs that barely exist yet.
The temporal score then blends with the RRF score:
final_score = (1 - recency_weight) * rrf_score + recency_weight * temporal_score
recency_weight itself is dynamic: historical-year queries (e.g., "2022 rules") get 0.7 temporal weight, while current-year queries get 0.5.
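Putting those pieces together, the temporal pass might look something like this sketch. The tax_years metadata field and the 0.5/0.7 weights come from the post; the per-document scoring (exact year match or nothing) and the field names are assumptions:
import datetime
from typing import Optional

def temporal_rerank(docs: list, target_year: Optional[int]) -> list:
    if target_year is None:
        return docs  # no year signal in the query, keep the RRF order

    # Queries about a past year lean harder on the temporal match (0.7) than
    # current-year queries (0.5); "current" is simplified here and ignores
    # the filing-season adjustment shown above
    recency_weight = 0.7 if target_year < datetime.datetime.now().year else 0.5

    for doc in docs:
        tax_years = doc["metadata"].get("tax_years", [])
        temporal_score = 1.0 if target_year in tax_years else 0.0  # assumed scoring
        doc["final_score"] = (1 - recency_weight) * doc["rrf_score"] + recency_weight * temporal_score

    return sorted(docs, key=lambda d: d["final_score"], reverse=True)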
Chunk Sizing: The "Search Small, Feed Big" Decision
The chunks in the database are capped at 500 characters. That feels small. Here's why it works.
Small chunks mean:
- Precise vector matching: the embedding represents a tight concept, not a whole page
- Higher BM25 precision: exact terms dominate the chunk, not get diluted by surrounding prose
But small chunks mean thin context for the LLM. So after retrieval, the top documents get expanded, fetching one neighbor chunk before and after from the same chapter, then concatenating:
async def expand_chunks(self, documents: List[Dict], expand_by: int = 1) -> List[Dict]:
    for doc in documents:
        chunk_id = doc['metadata'].get('chunk_id')
        chapter = doc['metadata']['chapter']

        neighbors = await self.supabase.table('knowledge_documents')\
            .select('content, metadata')\
            .filter('metadata->>chapter', 'eq', str(chapter))\
            .filter('metadata->>chunk_id', 'gte', str(chunk_id - expand_by))\
            .filter('metadata->>chunk_id', 'lte', str(chunk_id + expand_by))\
            .order('metadata->>chunk_id')\
            .execute()

        expanded_content = '\n'.join(n['content'] for n in neighbors.data)
        doc['content'] = expanded_content
        doc['was_expanded'] = True

    return documents
Search with 500-char chunks. Feed the LLM 1500-char expanded chunks. Best of both worlds.
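The expansion only works because ingestion stamps every chunk with chunk_id and chapter metadata. Here's a sketch of what the 500-character splitter might look like; the field names come from the post, the paragraph-based splitting and everything else are assumptions:
def chunk_chapter(chapter_text: str, chapter: int, max_chars: int = 500) -> list:
    """Split one chapter into ~500-char chunks tagged with the metadata
    that hybrid search and expand_chunks rely on."""
    chunks, buffer, chunk_id = [], "", 0
    for paragraph in chapter_text.split("\n\n"):
        if buffer and len(buffer) + len(paragraph) > max_chars:
            chunks.append({
                "content": buffer.strip(),
                # tax_years would also go here for the temporal layer
                "metadata": {"chunk_id": chunk_id, "chapter": chapter},
            })
            chunk_id += 1
            buffer = ""
        buffer += paragraph + "\n\n"
    if buffer.strip():
        chunks.append({
            "content": buffer.strip(),
            "metadata": {"chunk_id": chunk_id, "chapter": chapter},
        })
    return chunks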
When to Leave pgvector Behind
pgvector inside Supabase is the right call for this project. One database, one connection, conversations and vectors living together. But there's a ceiling.
Stay with pgvector when:
- Your knowledge base is under ~500k documents
- You want everything in one place (auth, storage, vectors)
- You're on a free or hobby budget
- Query latency under 150ms is acceptable
Consider a dedicated vector DB when:
- You need sub-10ms retrieval at scale
- You're doing multi-tenant isolation at the vector level
- Your knowledge base is growing into millions of documents
- You need advanced filtering that pgvector's HNSW index can't handle efficiently
For Concierge AI, a tax assistant with a few thousand knowledge chunks, pgvector keeps 95th-percentile retrieval latency under 150ms. That's the right tradeoff. Reach for Pinecone when you've genuinely outgrown it, not before.
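If you do start pushing toward that ceiling, these are the standard indexes for each side of the search. The syntax is stock pgvector and Postgres; whether the planner actually uses them inside the function above depends on the query shape (the OR in the WHERE clause makes this worth checking with EXPLAIN), so treat it as a starting point rather than something pulled from the project:
-- Approximate nearest-neighbor index for the cosine distance operator (<=>)
CREATE INDEX ON knowledge_documents USING hnsw (embedding vector_cosine_ops);

-- Full-text index matching the expression used in the function's WHERE clause
CREATE INDEX ON knowledge_documents USING gin (to_tsvector('english', content));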
Putting It All Together
Here's the full retrieval flow in one mental model:
Query: "What are the 2024 limits for Schedule C home office?"
1. Dynamic weights → has "2024" + "Schedule C" → BM25: 0.7, Vector: 0.3
2. BM25 search → finds "Schedule C" + "home office" exact matches
3. Vector search → finds semantic neighbors about home office deductions
4. RRF fusion → docs appearing in both lists float to the top
5. Temporal re-rank → 2024 docs boosted, 2021 docs pushed down
6. Reranker (Cohere) → cross-encoder picks the 2 best from top 20 (sketched below)
7. Chunk expansion → top 2 chunks → 3x context each → fed to LLM
Context Recall went from 0.821 to 0.89 after the temporal layer was added, clearing the 0.85 target from the RAGAS run that started this whole journey.
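One piece of that flow hasn't shown up in code yet: the step-6 reranker. A minimal sketch of what the Cohere call might look like, with the model name, field names, and top_n all assumed rather than taken from the project:
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])  # assumes the key lives in the environment

def rerank_top(query: str, docs: list, top_n: int = 2) -> list:
    """Cross-encoder rerank of the fused candidates; keeps the top_n."""
    response = co.rerank(
        model="rerank-english-v3.0",  # assumed model name
        query=query,
        documents=[d["content"] for d in docs],
        top_n=top_n,
    )
    return [docs[r.index] for r in response.results]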
Next up: "Deploying a Python + Next.js AI App on Vercel Free Tier", the 50MB size limit, splitting FastAPI into serverless functions, and why .vercelignore is more important than you'd think.