Running an AI App End-to-End for Free: LLM Sleep & Token Optimizations
Free tiers are not free. They're a negotiation.
You get compute, tokens, and requests; in exchange, your app gets put to sleep when nobody's using it, rate limited when too many people are, and occasionally just... cut off by a timeout. Learning to work with these constraints instead of fighting them is what actually keeps the stack at $0/month.
This is everything I learned keeping Concierge AI alive for free.

The Full Free Stack
Before the optimizations, here's the stack and what each free tier gives you:
| Service | Free Tier | What I Use It For |
|---|---|---|
| Vercel | 100GB bandwidth, 100k function invocations | Frontend + Python API |
| Supabase | 500MB DB, 2GB bandwidth | Postgres + pgvector |
| Groq | 14,400 TPM, 30 RPM (Llama 3.3 70B) | LLM inference |
| HuggingFace Inference API | 1000 requests/day | Embeddings |
| Cohere | 10,000 rerank calls/month | Reranking |
Every one of these has a ceiling. The art is designing the system so you hit the ceiling as rarely as possible.
Problem 1: The LLM Token Budget
Groq's free tier for Llama 3.3 70B gives you 14,400 tokens per minute. That sounds like a lot until you look at what a single RAG request actually sends:
```text
System prompt:        ~800 tokens
Conversation history: ~400 tokens (3 messages)
Retrieved context:  ~2,000 tokens (8,000-char context cap, ~5 docs)
User query:            ~50 tokens
LLM response:         ~500 tokens
─────────────────────────────────
Total per request:  ~3,750 tokens
```
At that rate, 14,400 TPM buys you roughly four requests per minute, call it four concurrent users, before hitting the limit. So the optimizations aren't optional. They're what makes the free tier viable.
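The arithmetic, as a quick sanity check:

```python
TPM_LIMIT = 14_400          # Groq free tier, tokens per minute
TOKENS_PER_REQUEST = 3_750  # from the breakdown above

print(TPM_LIMIT / TOKENS_PER_REQUEST)  # 3.84 -> roughly 4 requests per minute
```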
The Context Budget
The entire RAG context is hard-capped at MAX_TOTAL_CONTEXT = 8000 characters. Not tokens: characters. Characters are faster to count in Python, and at roughly 4 characters per token, an 8,000-character context is about 2,000 tokens, which keeps you well under the TPM limit.
```python
import os

MAX_TOTAL_CONTEXT = int(os.getenv("MAX_TOTAL_CONTEXT", "8000"))

context_parts = []
total_chars = 0
for i, doc in enumerate(documents):
    # Header format here is illustrative; the real one cites the doc's source
    source_header = f"[Source {i + 1}]\n"
    available_space = MAX_TOTAL_CONTEXT - total_chars - len(source_header)
    if available_space < 200:
        break  # No room for a meaningful chunk, stop adding docs
    content_to_use = doc['content'][:available_space]
    context_parts.append(f"{source_header}{content_to_use}")
    total_chars += len(source_header) + len(content_to_use)
    if total_chars >= MAX_TOTAL_CONTEXT:
        break
```
It's a rolling budget. Each document gets the remaining space. The last document might get truncated. That's fine. The reranker already put the most relevant ones first.
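To make the rolling budget concrete, here's a toy run with invented document sizes (source headers ignored for simplicity):

```python
def rolling_budget_demo():
    doc_sizes = [3500, 3500, 3000]  # hypothetical reranked docs, in characters
    budget = 8000
    for size in doc_sizes:
        kept = min(size, budget)
        print(f"doc of {size} chars -> kept {kept}, budget left {budget - kept}")
        budget -= kept
        if budget < 200:  # same floor as the real loop
            break

rolling_budget_demo()
# doc of 3500 chars -> kept 3500, budget left 4500
# doc of 3500 chars -> kept 3500, budget left 1000
# doc of 3000 chars -> kept 1000, budget left 0
```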
The Conversation History Cap
Conversation history is capped at 3 messages. Not 10, not 20. Just 3.
```python
async def get_conversation_history(self, conversation_id: str, limit: int = 3) -> str:
    result = self.supabase.table('messages')\
        .select('role, content')\
        .eq('conversation_id', conversation_id)\
        .order('created_at', desc=True)\
        .limit(limit)\
        .execute()
    if not result.data:
        return "No prior conversation"
    # Newest-first from the query; reverse to chronological. Format is illustrative.
    return "\n".join(f"{m['role']}: {m['content']}" for m in reversed(result.data))
```
3 messages is roughly the last exchange, enough for query contextualization ("what about for seniors?" → "what is the standard deduction for seniors in 2024?"), not so much that it balloons every request.
The max_tokens: 1000 Ceiling
LLM responses are capped too:
```python
from litellm import completion

response = completion(
    model=self.model,
    messages=[...],
    temperature=0.4,
    timeout=30,
    max_tokens=1000  # Hard ceiling on output tokens
)
```
Tax answers don't need to be essays. Most good answers are 200-400 tokens. The 1000 cap prevents runaway generations that eat through the TPM budget.
Problem 2: When Groq Goes to Sleep
Groq's free tier models are available, until they're not. When the service is under load or you've hit rate limits, calls fail. This is where the fallback chain comes in.
```python
self.fallbacks = [
    "groq/llama-3.3-70b-versatile",                # Primary: fastest
    "openrouter/google/gemini-2.0-flash-exp:free"  # Fallback: also free
]

response = completion(
    model=self.model,          # groq/llama-3.3-70b-versatile (primary)
    messages=[...],
    fallbacks=self.fallbacks,  # LiteLLM tries these in order on failure
    timeout=30
)
```
LiteLLM handles the fallback logic automatically. If the primary model returns a rate limit error or times out, it moves to the next one in the list, transparent to the rest of the code. Two free models mean you'd have to exhaust both simultaneously to see a failure.
The timeout is set to 30 seconds for generation, 10 seconds for the lighter query contextualization call. Contextualization is a small LLM call that rewrites the user's question. It doesn't need 30 seconds, and failing fast here means the main generation call still has budget.
(Yes, LiteLLM. Yes, I know. The vibes were different when this was built.)
Problem 3: Cold Starts
Vercel serverless functions sleep after inactivity. When the first request arrives, the function wakes up, imports the entire Python module tree, initializes services, and then handles the request. That can take 5-8 seconds.
The initialization is lazy. It runs on first request, not at deploy time:
```python
_services_initialized = False

def initialize_services():
    global _services_initialized
    if _services_initialized:
        return
    from services import initialize_all
    initialize_all()
    _services_initialized = True
```
And inside initialize(), the embeddings model gets a warmup call immediately after init:
```python
def initialize():
    global service_instance
    service_instance = RAGService()
    logger.info("🔥 Pre-warming RAG service...")
    _ = get_embeddings().embed_query("warmup query")  # Forces HF API connection
    logger.info("✅ RAG service ready")
```
The warmup query fires the HuggingFace Inference API call during initialization, not during the user's first real request. The cold start still happens, but the first real query doesn't also pay the HuggingFace connection overhead on top of it.
On the frontend, the honest solution is a loading state that sets expectations. A spinner with "Thinking..." covers 5 seconds gracefully. Users wait 5 seconds for a good answer all the time. They just need to know something is happening.
Problem 4: The HuggingFace 1000 Request/Day Limit
Every query fires at least one embedding call to embed the user's query for vector search. A follow-up question fires two (one to contextualize, one for retrieval). At 1000 requests/day that's ~500 conversations before you hit the ceiling.
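The back-of-envelope math behind that ~500 figure, assuming an average of two HF calls per conversation:

```python
DAILY_HF_LIMIT = 1000       # HuggingFace Inference API requests per day
CALLS_PER_CONVERSATION = 2  # rough average across first messages and follow-ups

print(DAILY_HF_LIMIT / CALLS_PER_CONVERSATION)  # 500 conversations per day
```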
Two things keep this manageable:
1. Singleton caching. The embeddings client is initialized once per container and reused:
```python
import os
from functools import lru_cache
from langchain_huggingface import HuggingFaceEndpointEmbeddings

@lru_cache(maxsize=1)
def get_embeddings():
    # Inference-API-backed embeddings (not a locally loaded model)
    return HuggingFaceEndpointEmbeddings(
        model="sentence-transformers/all-MiniLM-L6-v2",
        huggingfacehub_api_token=os.getenv("HF_TOKEN")
    )
```
No reconnection overhead, no re-authentication on every request.
2. Skip contextualization when unnecessary. If there's no conversation history, there's no need to rewrite the query:
```python
async def contextualize_query(self, query: str, conversation_history: str) -> str:
    if conversation_history == "No prior conversation":
        return query  # Skip the LLM call entirely
```
First message in a conversation? Zero extra tokens, zero extra embedding calls.
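When there is history, the rest of the method makes the rewrite call. A sketch of that branch; the prompt wording, temperature, and max_tokens are assumptions on my part, while the 10-second timeout is the one described under Problem 2:

```python
# Inside contextualize_query, after the early return above.
rewrite_prompt = (
    "Rewrite the user's latest question as a standalone question, "
    f"using this conversation for context:\n{conversation_history}\n\n"
    f"Question: {query}"
)
response = completion(
    model=self.model,
    messages=[{"role": "user", "content": rewrite_prompt}],
    temperature=0.0,  # rewrites should be deterministic
    timeout=10,       # fail fast; leave budget for the main generation call
    max_tokens=100,   # a rewritten question is short
)
return response.choices[0].message.content.strip()
```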
Problem 5: Supabase Connection Limits
Supabase free tier allows 60 concurrent connections. Serverless functions can spin up many instances simultaneously. Each instance naively creating its own connection would blow through that fast.
The fix is the same singleton pattern:
```python
import os
from functools import lru_cache
from supabase import create_client

@lru_cache(maxsize=1)
def get_supabase():
    return create_client(
        os.getenv("SUPABASE_URL"),
        os.getenv("SUPABASE_KEY")
    )
```
Within a single warm container, all requests share one Supabase client. Across containers it's still one-per-container, but the free tier's workload never justifies more than a handful of warm containers simultaneously.
The Graceful Failure Contract
When everything goes wrong (both LLMs rate limited, Supabase down, HuggingFace quota hit) the app has one job: don't show the user an error page.
```python
except Exception:
    logger.error("⚠️ RAG generation failed: All providers exhausted.")
    return {
        "answer": "I'm having trouble providing a complete answer right now. Let me connect you with an expert who can help.",
        "sources": [],
        "confidence": 0.2
    }
```
A confidence of 0.2 triggers the routing layer to escalate to a human expert. The failure becomes a feature. The user gets handed off gracefully rather than hitting a wall.
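The routing layer gets its own post (see below), but the handoff contract is easy to sketch. The threshold and function names here are hypothetical; only the 0.2 failure value comes from the handler above:

```python
CONFIDENCE_THRESHOLD = 0.5  # hypothetical cutoff

def escalate_to_expert(result: dict) -> dict:
    result["routed_to"] = "human_expert"  # placeholder for the real queueing logic
    return result

def route_response(result: dict) -> dict:
    # Anything below the cutoff, including the 0.2 failure sentinel,
    # gets handed to a human instead of shown as a final answer
    if result["confidence"] < CONFIDENCE_THRESHOLD:
        return escalate_to_expert(result)
    return result
```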
The Numbers
At steady state with these optimizations:
- Average tokens per request: ~3,200 (down from ~5,500 without caps)
- Requests before Groq TPM limit: ~4-5 concurrent (sufficient for a portfolio/demo app)
- HuggingFace calls per conversation: 1-2 (vs 3-4 without the skip logic)
- Monthly cost: $0
The stack isn't built to handle thousands of concurrent users on free tier. That's not the point. It's built to handle real traffic for a portfolio project, a demo, or an early-stage product without spending a dollar until you've earned the right to spend one.
Next up: "How I Built Intelligent Routing Without Any ML Dependencies", the most distinctive part of this codebase and the least written about. Pure regex, 85% accuracy, zero ML.