How I Built Intelligent Routing Without ML Dependencies
I want to tell you about a decision I reversed.
The original routing system was pure regex: keyword patterns, a complexity scorer, no LLM calls. It worked: 85% routing accuracy on tax-domain queries, under 5ms per query, zero ML dependencies. I was proud of it.
Then I added LLM-as-judge routing. And then I kept the regex as the fallback.
The punchline: both are still running in production. And the system is better for having both.

The Problem: What Is Routing?
Every query that hits Concierge AI needs to go somewhere:
- AI: RAG pipeline handles it, answer in ~950ms
- Human expert: routed to a specialist, async response
- Clarification: query is too vague to act on
Get this wrong and you either waste expert time on questions a chatbot could answer, or give someone an AI response to "I just got an IRS audit notice." Neither is acceptable.
The routing layer runs before RAG. It has to be fast, reliable, and cheap. Ideally free.
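Throughout this post, a routing decision is just a plain dict. For orientation, here's the rough shape, with field names inferred from the snippets that follow (the TypedDict itself is my annotation, not code from the repo):

from typing import Literal, TypedDict

class RoutingDecision(TypedDict, total=False):
    route: Literal["ai", "human", "clarification"]
    intent: str              # e.g. "simple_tax", "complex_tax", "urgent"
    complexity_score: int    # 1-5; 4 and up goes to a human
    confidence: float        # 0.0-1.0, set by the LLM router
    reasoning: str           # brief explanation, useful for observability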
Layer 1: The Keyword Backbone
The complexity scorer is the foundation. No imports beyond re. No API calls. Runs in under 5ms.
It works in three passes:
Pass 1: Classify keywords into buckets:
self.urgency_keywords = ['audit', 'penalty', 'notice', 'deadline', 'emergency', ...]
self.complex_keywords = ['international', 'capital gains', 'staking', '1031', 'amt', ...]
self.moderate_keywords = ['self-employed', 'crypto', 'home office', 'rental income', ...]
self.simple_keywords = ['standard deduction', 'w-2', 'refund', 'tax bracket', ...]
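Pass 1 produces counts, not routes. Given the "pure regex, no imports beyond re" constraint, the matching plausibly looks something like this inside the scorer (a sketch; count_hits is my helper name):

import re

query_lower = query.lower()

def count_hits(keywords):
    # \b word boundaries keep 'audit' from firing inside 'auditorium'
    return sum(bool(re.search(rf"\b{re.escape(kw)}\b", query_lower)) for kw in keywords)

complex_count = count_hits(self.complex_keywords)
moderate_count = count_hits(self.moderate_keywords)
has_urgency = count_hits(self.urgency_keywords) > 0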
Pass 2: Start from intent, adjust with evidence:
base_score = intent_scores.get(intent, 3)  # simple_tax=2, complex_tax=3, urgent=5
if complex_count > 0:
    base_score = max(base_score, 4)  # Always escalate
elif moderate_count > 0:
    if intent == "simple_tax" and moderate_count == 1:
        base_score = max(base_score, 2)  # Don't over-escalate
    else:
        base_score = max(base_score, 3)
The intent-awareness in pass 2 is the critical detail. Without it, "Can I deduct home office expenses?" scores a 3 because home office is a moderate keyword, one structural bump away from a human escalation. With it, the simple_tax intent keeps the score at 2 and lets the RAG pipeline handle it.
Pass 3: Structural overrides:
if word_count > 30: base_score = min(5, base_score + 1)
if has_multiple_questions: base_score = min(5, base_score + 1)
if has_urgency: base_score = 5 # Always human
Long queries are almost always complex. Multiple questions mean the user is dealing with a compound situation. Urgency overrides everything, no exceptions.
Score → Route:
1-3 → AI
4-5 → Human expert
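Put together, the whole scorer fits on one screen. A condensed, self-contained sketch (the pass logic mirrors the snippets above; the trimmed keyword lists and the multi-question proxy are mine):

import re

URGENCY_KEYWORDS = ['audit', 'penalty', 'notice', 'deadline', 'emergency']
COMPLEX_KEYWORDS = ['international', 'capital gains', 'staking', '1031', 'amt']
MODERATE_KEYWORDS = ['self-employed', 'crypto', 'home office', 'rental income']

def keyword_score(query: str, intent: str) -> int:
    """Three-pass complexity score: 1-3 routes to AI, 4-5 to a human."""
    q = query.lower()
    hits = lambda kws: sum(bool(re.search(rf"\b{re.escape(k)}\b", q)) for k in kws)

    # Pass 1: keyword evidence
    complex_count = hits(COMPLEX_KEYWORDS)
    moderate_count = hits(MODERATE_KEYWORDS)
    has_urgency = hits(URGENCY_KEYWORDS) > 0

    # Pass 2: start from intent, adjust with evidence
    score = {"simple_tax": 2, "complex_tax": 3, "urgent": 5}.get(intent, 3)
    if complex_count > 0:
        score = max(score, 4)
    elif moderate_count > 0:
        score = max(score, 2 if intent == "simple_tax" and moderate_count == 1 else 3)

    # Pass 3: structural overrides
    if len(query.split()) > 30:
        score = min(5, score + 1)
    if query.count("?") > 1:  # crude stand-in for multi-question detection
        score = min(5, score + 1)
    if has_urgency:
        score = 5
    return score

keyword_score("Can I deduct home office expenses?", "simple_tax")   # 2 → AI
keyword_score("I received an IRS audit notice yesterday", "urgent")  # 5 → human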
Layer 2: LLM-as-Judge
The keyword system is good at what it knows. It's blind to what it doesn't.
"As a US citizen living abroad, do I need to file FBAR?", no urgency keywords, no complex keywords in the list, moderate query length. Keyword scorer gives it a 3. It gets routed to AI.
That's wrong. FBAR non-compliance carries $10,000+ penalties. This is a human question.
LLM-as-judge catches it:
routing_prompt = """
**INTENT CLASSIFICATION:**
1. simple_tax = Clear question about tax topic
2. complex_tax = Multi-state, crypto, trusts, estate, international
3. urgent = IRS audit, penalty, deadline TODAY
...
**CRITICAL: Stock Options & Equity = Complex!**
- Any mention of ISO, RSU, stock options, equity → complex_tax, complexity=5
Respond ONLY with JSON:
{
  "intent": "...",
  "route": "ai" | "human" | "clarification",
  "technical_complexity": 1-5,
  "urgency": 1-5,
  "risk_exposure": 1-5,
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation"
}
"""
The prompt scores three complexity dimensions (technical, urgency, risk), and the final complexity_score is the max of all three, not the average. A query can be technically simple but high risk (e.g., "I forgot to report foreign income for 3 years") and still route to a human.
The LLM also unifies intent classification and routing into one call:
# One LLM call returns both intent AND route
complexity_score = max(
    result.get('technical_complexity', 1),
    result.get('urgency', 1),
    result.get('risk_exposure', 1)
)
intent = result.get('intent', 'complex_tax')
That saves one full LLM call per request compared to running intent classification and routing separately. On a free tier's tokens-per-minute (TPM) budget, that matters.
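The function behind that call is thin. A sketch of what it plausibly does (call_llm is a hypothetical wrapper over the Gemini/Groq clients, sketched in the next section; the brace extraction guards against models that wrap JSON in prose):

import json
import re

def _get_llm_routing_decision(query: str) -> str:
    raw = call_llm(routing_prompt + "\n\nQuery: " + query)
    # Models sometimes wrap the JSON in markdown fences or chatter;
    # extract the first {...} block and validate it before it gets cached.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON in LLM response: {raw[:80]!r}")
    json.loads(match.group(0))  # raises on malformed JSON → triggers the keyword fallback
    return match.group(0)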
The Fallback Architecture
Here's the part that makes the system actually reliable. The LLM router and keyword router aren't competing. They're stacked:
Query
  │
  ▼
LLM Router (Gemini Flash → Groq Llama fallback)
  │
  ├── Success → return LLM decision
  │
  └── Failure (rate limit, timeout, error)
        │
        ▼
      Keyword Fallback Router
        │
        ├── Success → return keyword decision
        │
        └── Failure (shouldn't happen, but...)
              │
              ▼
            Default → route="ai", complexity=2
async def route(self, query: str) -> Dict:
    if not self.enabled:
        return self._use_fallback(query)
    try:
        result_json = cached_llm_routing(query)
        result = json.loads(result_json)
        return result
    except Exception as e:
        logger.warning(f"LLM routing failed ({e}), using fallback keyword routing")
        return self._use_fallback(query)
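The diagram's top box hides one more tier: provider-level fallback. One plausible shape for it, with hypothetical client wrappers (this is the call_llm assumed in the Layer 2 sketch):

def call_llm(prompt: str) -> str:
    # Try providers in order of preference; any failure falls through to the next
    for provider in (call_gemini_flash, call_groq_llama):  # hypothetical wrappers
        try:
            return provider(prompt)
        except Exception as e:
            logger.warning(f"{provider.__name__} failed: {e}")
    raise RuntimeError("All LLM providers exhausted")  # caught by route() above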
The default of last resort (route="ai", complexity=2) is intentionally optimistic. If everything breaks, send the query to the RAG pipeline. An imperfect AI answer is recoverable. Silently dropping a request isn't.
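For completeness, a sketch of _use_fallback under those assumptions (the keyword router interface is my placeholder):

def _use_fallback(self, query: str) -> Dict:
    try:
        return self.keyword_router.route(query)  # Layer 1: <5ms, no network
    except Exception:
        logger.error("Keyword fallback failed; returning optimistic default")
        # Last resort: an imperfect AI answer is recoverable, a dropped request isn't
        return {"route": "ai", "complexity_score": 2, "intent": "unknown"}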
The Cache Layer
Routing decisions for identical queries are cached:
from functools import lru_cache

@lru_cache(maxsize=100)
def cached_llm_routing(query: str) -> str:
    return _get_llm_routing_decision(query)
Within a warm Vercel container, the 100 most recently used unique queries skip the LLM call entirely. For a tax app, many users ask the exact same short questions ("what is the standard deduction", "when is the tax deadline"), and since lru_cache keys on the exact query string, the cache absorbs that repeat load cleanly.
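functools exposes hit statistics on the wrapped function, which makes it easy to confirm the cache is actually absorbing repeats:

cached_llm_routing("what is the standard deduction")  # miss → one LLM call
cached_llm_routing("what is the standard deduction")  # hit → no LLM call
print(cached_llm_routing.cache_info())
# CacheInfo(hits=1, misses=1, maxsize=100, currsize=1)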
What 85% Accuracy Actually Means
The keyword fallback hitting 85% on a tax domain isn't a coincidence. It's a property of the domain.
Tax has unambiguous vocabulary. K-1, 1031 exchange, FBAR, AMT, QBI: these terms don't appear in casual conversation. When they show up in a query, they carry strong signal. The regex doesn't need to understand language; it just needs to recognize these tokens.
The 15% the keywords miss are edge cases: novel phrasing, implicit urgency ("I'm really stressed about something the IRS sent"), or domain-adjacent questions that don't match any pattern. That's exactly what the LLM layer handles.
Two systems, complementary blind spots. Neither would be as good alone.
The Philosophy: Information First
One design decision worth calling out. The routing philosophy is explicitly encoded in the LLM prompt:
**PHILOSOPHY: Information First, Clarification Second**
- If the query asks a CLEAR question about a tax topic, route to AI
- Only route to clarification if the query is a FRAGMENT with no clear tax topic
"Can I deduct my car?" → AI (provides self-employed vs employee rules, then asks for details)
"Car deduction?" → Clarification (fragment, unclear what they're asking)
A routing system that asks "what do you mean?" too often trains users to abandon it. The bar for requesting clarification is deliberately high. Only true fragments with no recoverable intent. Everything else gets an answer, and the answer itself can surface follow-up questions.