Jan 16, 2026

Deploying a Python + Next.js AI App on Vercel Free Tier

I want to tell you about the day I thought the deployment was broken.

The app worked perfectly locally. FastAPI running on 8000, Next.js on 3000, everything talking to each other. I pushed to Vercel, opened the live URL, sent a message, and got a 500 error.

It wasn't broken. I'd packed numpy, sentence-transformers, and semantic-router into a serverless function with a 50MB size limit. Combined, they're over 600MB. Vercel didn't crash; it just refused to let any of it in. Rightfully so.

[Image: Shrek meme, "What are you doing in my swamp?"]

What Vercel Actually Is

Vercel is not a server. There's no machine sitting somewhere running your FastAPI app. Every request spins up an isolated function, runs your code, and dies. No persistent memory, no background threads, no warm process guaranteed to be waiting.

This changes everything about how you structure a Python + Next.js app.

The architecture for Concierge AI ended up looking like this:

Browser
  │
  ├── Next.js (Vercel Edge)          → UI, static assets, React components
  │
  └── /api/py/* (Vercel Function)    → FastAPI, all Python AI logic
         │
         ├── Supabase                → Database + vectors (external, always on)
         ├── Groq API                → LLM inference (external)
         └── HuggingFace API         → Embeddings (external)

The key insight: your stateful services live outside Vercel. Vercel just handles the compute.

The Single Function Pattern

Vercel supports Python serverless functions, but each file in /api becomes its own function with its own cold start, its own memory, its own bundle. Splitting your FastAPI app into multiple files multiplies your problems.

The decision: one entry point, one function.

api/
  index.py       ← entire FastAPI app lives here

And vercel.json routes everything there:

{
  "rewrites": [
    {
      "source": "/api/py/:path*",
      "destination": "/api?path=:path*"
    }
  ],
  "functions": {
    "api/index.py": {
      "maxDuration": 60
    }
  }
}

Every call to /api/py/chat, /api/py/experts, /api/py/metrics hits the same index.py. The path gets passed as a query parameter, and a middleware layer inside the app reconstructs it.

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def handle_vercel_routing(request: Request, call_next):
    initialize_services()  # lazy init on first request (defined below)

    # Vercel rewrites /api/py/chat/send to /api?path=chat/send.
    # Restore the original path so FastAPI's router can match it.
    path = request.query_params.get("path")
    if path:
        request.scope["path"] = "/" + path
        request.scope["query_string"] = b""

    return await call_next(request)

Vercel hands you /api?path=chat/send. This middleware turns it back into /chat/send so FastAPI's router can handle it normally.
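
A quick way to sanity-check the rewrite locally is FastAPI's TestClient. This is a sketch: it assumes an /experts route exists and that initialize_services() is safe to run outside Vercel.

from fastapi.testclient import TestClient

client = TestClient(app)

# The direct path, as local dev sees it:
direct = client.get("/experts")

# The Vercel-style rewrite: same handler, reached via the query param.
rewritten = client.get("/api", params={"path": "experts"})

assert direct.status_code == rewritten.status_code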

The ASGI/WSGI Bridge

FastAPI is an ASGI app. Vercel's Python runtime expects WSGI. These are different protocols. ASGI is async-native, WSGI is synchronous.

The bridge is one line:

import os

from a2wsgi import ASGIMiddleware

# Wrap only on Vercel; locally the raw ASGI app is served by uvicorn.
if os.environ.get("VERCEL"):
    app = ASGIMiddleware(app)

Only wrap it when running on Vercel. Locally, you run the raw ASGI app with uvicorn and don't need the wrapper. Without the VERCEL env check, your local dev breaks.
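
For completeness, a minimal local entry point. This is a sketch: the "api.index:app" module string assumes the file lives at api/index.py and that you launch from the repo root.

if __name__ == "__main__":
    import uvicorn

    # Local dev only; Vercel never executes this block.
    uvicorn.run("api.index:app", port=8000, reload=True)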

The 50MB Limit: A Forcing Function

Vercel free tier caps deployed function size at 50MB. This sounds like a constraint. It's actually a design principle in disguise.

Every dependency you add, you justify. Here's what that looked like:

Dropped:

  • semantic-router: 150MB with its ML dependencies. Replaced with pure regex pattern matching (sketched after this list). Same accuracy for a tax domain, zero bundle size.
  • numpy: pulled in by half a dozen packages. Replaced all vector math with Python list comprehensions. 384-dimensional cosine similarity in pure Python is fast enough (second sketch below).
  • Local embedding models: sentence-transformers alone is 500MB+. Replaced with the HuggingFace Inference API. Zero bundle size, ~150ms network latency per call.
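
Both replacements are small enough to sketch. First, regex-based routing; the route names and patterns here are illustrative, not the actual Concierge AI rules:

import re

# Pure-regex intent routing: zero dependencies, zero bundle cost.
# Patterns are placeholder examples for a tax domain.
ROUTES = {
    "deductions": re.compile(r"\b(deduct\w*|write[- ]?off|expense)\b", re.IGNORECASE),
    "filing":     re.compile(r"\b(fil(e|ing)|deadline|extension)\b", re.IGNORECASE),
}

def route_query(query: str) -> str | None:
    for name, pattern in ROUTES.items():
        if pattern.search(query):
            return name
    return None  # no match: fall through to the default chain

Second, cosine similarity without numpy (the function name is mine, not the codebase's):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 384 dimensions is small enough that plain Python stays fast.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)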

Kept:

  • langchain core: needed for the LLM chain abstractions
  • supabase client: lightweight, no choice
  • fastapi + uvicorn: unavoidable
  • ragas + datasets: only in requirements-dev.txt, never deployed (split sketched below)
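
The split itself is one extra file. A sketch, assuming a pip-style setup; the exact package list is illustrative:

# requirements.txt (deployed with the function)
fastapi
a2wsgi
supabase
langchain-core
langchain-groq

# requirements-dev.txt (local only, never ships)
-r requirements.txt
ragas
datasets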

The .vercelignore is just as important as the dependencies:

knowledge_data/    ← PDFs already ingested into Supabase, never needed at runtime
scripts/           ← ingestion scripts, local only
__pycache__/
*.pyc

knowledge_data/ alone would've blown the size budget. The PDFs exist locally for ingestion, get chunked and stored in Supabase, and never need to touch Vercel again.

Lazy Initialization: Don't Pay on Cold Start

The services (RAG pipeline, embeddings client, Supabase connection) take a second or two to initialize. If you initialize them at module import time, every cold start pays that cost before the first request even arrives.

The pattern: initialize on first request, cache forever after.

_services_initialized = False

def initialize_services():
    """Idempotent: the expensive setup runs once per container."""
    global _services_initialized
    if _services_initialized:
        return

    from services import initialize_all  # deferred import keeps module load cheap
    initialize_all()
    _services_initialized = True

It's called inside the middleware on every request, but the _services_initialized guard means the body runs only once per container lifetime. Subsequent requests in the same warm container skip it entirely.

The LLM and embeddings clients use the same pattern via @lru_cache:

from functools import lru_cache

from langchain_groq import ChatGroq
from supabase import create_client

@lru_cache(maxsize=1)
def get_llm():
    return ChatGroq(...)

@lru_cache(maxsize=1)
def get_supabase():
    return create_client(...)

lru_cache(maxsize=1) is effectively a singleton: the function body runs exactly once per container, and every call after that returns the cached object.
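
Which is easy to verify in a local shell:

assert get_llm() is get_llm()  # same object on every call, constructed once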

Environment Variables: The Easy Part

Vercel's environment variable UI replaces .env.local completely. The app detects which context it's running in:

import os

from dotenv import load_dotenv

if not os.environ.get("VERCEL"):
    load_dotenv(dotenv_path=".env.local")

On Vercel, VERCEL=1 is set automatically. The dotenv load is skipped and the platform-injected vars are used directly. Locally, .env.local loads as normal. One codebase, two contexts, no changes needed.

The maxDuration: 60 Decision

The default Vercel function timeout is 10 seconds. An AI response (embedding lookup, hybrid search, reranking, LLM generation) takes 800ms to 2 seconds in practice. But on a cold start, add another 3-5 seconds for initialization.

Setting maxDuration: 60 in vercel.json gives the function room to breathe on cold starts without timing out on the user's first message.

Free tier allows up to 60 seconds. Don't leave it at 10.

What the Deployment Actually Looks Like

Push to main
  │
  Vercel builds Next.js → static output + edge functions
  Vercel packages api/index.py → serverless function bundle
  │
  Deploy
  │
  First request → cold start (~3-5s init) → response
  Subsequent requests → warm container → response (~950ms)

The cold start is the honest cost of serverless. Blog 4 covers how to handle it gracefully: keep-alive pings, user-facing loading states, and Groq's sleep behaviour. That's where the free tier optimization story really lives.


Next up: "Running an AI App End-to-End for Free: LLM Sleep & Token Optimizations", covering Groq's TPM limits, cold start handling, and keeping the whole stack alive on $0/month.
