Evaluating Your RAG Pipeline with RAGAS, Completely Free
So you've built a RAG pipeline. Your retrieval is returning something. Your LLM is generating something. But is it actually good?
I asked myself the same question while building Concierge AI, a tax assistant that routes questions between AI and human experts. The stakes felt high. Tax advice that's confidently wrong is worse than no advice at all.
So I went looking for a way to measure the quality. That's when I found RAGAS.

What Even Is RAGAS?
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that evaluates your RAG pipeline across 5 metrics, no human labellers needed. It uses an LLM as a judge to score your system automatically.
The 5 metrics split into two buckets:
Retrieval Quality: did you find the right stuff?
- Context Precision: of everything you retrieved, how much was actually relevant?
- Context Recall: did you retrieve all the information needed to answer?
- Context Relevance: how tightly does the retrieved content match the question?
Generation Quality: did you say the right stuff?
- Faithfulness: is the answer grounded in the retrieved context, or is the LLM hallucinating?
- Answer Relevancy: does the answer actually address what was asked?
Think of it like this: Context Precision is about signal-to-noise, Context Recall is about completeness, and Faithfulness is your hallucination detector. For a tax app, Faithfulness was my north star. I set its threshold at 0.90, higher than everything else, because a hallucinated tax deduction can cost someone real money.
The Dataset Problem
RAGAS needs a golden dataset: question, expected answer, retrieved contexts. Building this by hand is tedious. Here's what worked for me:
# evaluation/golden_dataset.json structure
{
  "question": "What is the standard deduction for a single filer in 2024?",
  "ground_truth": "The standard deduction for single filers in 2024 is $14,600.",
  "contexts": [
    "For tax year 2024, the standard deduction amounts are..."
  ]
}
I generated mine semi-automatically. I ran real queries through the pipeline, kept the ones where the answer was verifiably correct, and saved those as ground truth. Takes an afternoon, saves you weeks of guessing.
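Here's a minimal sketch of that loop. The `ask_pipeline` callable is a placeholder for your own RAG entry point; everything else is standard library:

import json

def build_golden_dataset(questions, ask_pipeline,
                         output_path="evaluation/golden_dataset.json"):
    """Run real queries, keep only the verifiably correct ones."""
    records = []
    for question in questions:
        # ask_pipeline is your RAG entry point: returns (answer, contexts)
        answer, contexts = ask_pipeline(question)
        print(f"\nQ: {question}\nA: {answer}")
        if input("Keep as ground truth? [y/N] ").strip().lower() == "y":
            records.append({
                "question": question,
                "ground_truth": answer,
                "contexts": contexts,
            })
    with open(output_path, "w") as f:
        json.dump(records, f, indent=2)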
The Free Tier Problem
RAGAS needs an LLM to act as the judge. The default config points at GPT-4, but I'm keeping this free. The solution: Gemini Flash Lite (via LiteLLM) as the judge LLM, and the HuggingFace Inference API for embeddings. Both have generous free tiers.

import os
from langchain_community.chat_models import ChatLiteLLM
from ragas.llms import LangchainLLMWrapper
# Free judge: Gemini Flash Lite via LiteLLM
model_name = os.getenv("RAGAS_EVALUATOR_MODEL", "gemini/gemini-2.5-flash-lite-preview-09-2025")
llm = ChatLiteLLM(model=model_name, temperature=0)
evaluator_llm = LangchainLLMWrapper(llm)
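Before burning a full evaluation run, it's worth one cheap call to confirm the API key and model name actually work. This just exercises the standard LangChain chat interface; the prompt is arbitrary:

# Smoke test: one request to verify the judge is reachable
print(llm.invoke("Reply with OK").content)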
And for embeddings:
from services.hf_embeddings import HuggingFaceEmbeddings
evaluator_embeddings = HuggingFaceEmbeddings(
    model="sentence-transformers/all-MiniLM-L6-v2",
    api_token=os.getenv("HF_TOKEN"),
)
Zero cost. Zero local models. Bundle size stays lean.
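Note that `HuggingFaceEmbeddings` here is a thin wrapper from this project's services module, not a library class. If you'd rather not clone it, here's a sketch of what such a wrapper can look like, hitting the HF Inference API feature-extraction endpoint with `requests`. The URL pattern and response shape are assumptions on my part, so check the current HF docs:

import requests

class HuggingFaceEmbeddings:
    """Minimal LangChain-style embeddings client for the HF Inference API."""

    def __init__(self, model: str, api_token: str):
        self.url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model}"
        self.headers = {"Authorization": f"Bearer {api_token}"}

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        resp = requests.post(self.url, headers=self.headers, json={"inputs": texts})
        resp.raise_for_status()
        return resp.json()  # list of embedding vectors, one per input text

    def embed_query(self, text: str) -> list[float]:
        return self.embed_documents([text])[0]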
Wiring It Into Your Pipeline
The core RAGAS call looks like this:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    ContextRelevance,
    faithfulness,
    answer_relevancy,
)

context_relevance = ContextRelevance()

metrics = [
    context_precision,
    context_recall,
    context_relevance,
    faithfulness,
    answer_relevancy,
]

# Build the dataset RAGAS expects.
# test_cases comes from your golden dataset (see the JSON structure above).
dataset = Dataset.from_dict({
    "question": [case["question"] for case in test_cases],
    "answer": [case["answer"] for case in test_cases],
    "contexts": [case["contexts"] for case in test_cases],
    "ground_truth": [case.get("ground_truth", "") for case in test_cases],
})

result = evaluate(
    dataset,
    metrics=metrics,
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    raise_exceptions=False,
)
print(result.to_pandas())
Simple enough. But here's the part nobody tells you.
The Rate Limit Reality
If you run the full dataset in one shot on a free tier, you will get rate limited. The Gemini free tier hits its requests-per-minute cap fast, because RAGAS fires several judge calls per metric for every sample.
The fix: evaluate one item at a time with a 60-second sleep between each.

import time

import pandas as pd

all_scores = []
print("🐢 Running RAGAS in slow mode (60s delay/item)...")

for i, item in enumerate(dataset):
    print(f"  Evaluating item {i + 1}/{len(dataset)}...")
    single_ds = Dataset.from_list([item])
    try:
        result = evaluate(
            single_ds,
            metrics=metrics,
            llm=evaluator_llm,
            embeddings=evaluator_embeddings,
            raise_exceptions=False,
        )
        all_scores.append(result.to_pandas())
    except Exception as e:
        print(f"  ⚠️ Failed item {i + 1}: {e}")
    # Don't sleep after the last item
    if i < len(dataset) - 1:
        print("  ⏳ Sleeping 60s for rate limits...")
        time.sleep(60)  # use `await asyncio.sleep(60)` if you're inside an async function

combined_df = pd.concat(all_scores, ignore_index=True)
Yes, it's slow. For a 10-item dataset that's ~10 minutes. But it's free and it works. Run it overnight.
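One last step before interpreting anything: collapse the per-item frames into a flat scores dict, which is what the next section consumes. A sketch; RAGAS column names vary across versions, so inspect `combined_df.columns` if anything looks off:

# Average each numeric metric column into one score per metric
metric_cols = combined_df.select_dtypes(include="number").columns
scores = {col: float(combined_df[col].mean()) for col in metric_cols}
print(scores)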
Reading the Results
Once you have scores, you need thresholds to know what's actually good. These are the ones I landed on for Concierge AI:
thresholds = {
    "context_precision": 0.80,
    "context_recall": 0.85,
    "context_relevance": 0.70,
    "faithfulness": 0.90,  # Highest; tax domain, high stakes
    "answer_relevancy": 0.85,
}
And here's the interpretation layer that turns raw floats into actionable output:
def interpret_scores(scores: dict) -> dict:
    interpretation = {}
    for metric, score in scores.items():
        if metric == "overall_score":
            continue
        target = thresholds.get(metric, 0.80)
        interpretation[metric] = {
            "score": round(score, 3),
            "target": target,
            "status": "PASS" if score >= target else "FAIL",
            "gap": round(score - target, 3),
        }
    return interpretation
The gap field is the most useful part. It tells you exactly how far off you are, not just pass/fail. A gap of -0.02 is very different from -0.25.
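The report itself is just a loop over that interpretation dict. The exact shape is up to you; here's a sketch assuming the metric names and two-bucket grouping used above:

RETRIEVAL = ["context_precision", "context_recall", "context_relevance"]
GENERATION = ["faithfulness", "answer_relevancy"]

def print_report(interpretation: dict) -> None:
    print("=" * 60)
    print("RAGAS EVALUATION - Industry Standard RAG Metrics")
    print("=" * 60)
    for title, names in [("📊 RETRIEVAL QUALITY:", RETRIEVAL),
                         ("📊 GENERATION QUALITY:", GENERATION)]:
        print(title)
        for name in names:
            r = interpretation.get(name)
            if r is None:
                continue  # metric name mismatch across RAGAS versions
            icon = "✅ PASS" if r["status"] == "PASS" else "❌ FAIL"
            label = name.replace("_", " ").title()
            print(f"  {label}: {r['score']:.3f} {icon} (target: {r['target']:.2f})")
    passed = sum(1 for r in interpretation.values() if r["status"] == "PASS")
    print(f"🎯 OVERALL: {passed}/{len(interpretation)} metrics passed")
    print("=" * 60)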
The console output ends up looking like this:
============================================================
RAGAS EVALUATION - Industry Standard RAG Metrics
============================================================
📊 RETRIEVAL QUALITY:
  Context Precision: 0.847 ✅ PASS (target: 0.80)
  Context Recall:    0.821 ❌ FAIL (target: 0.85)
  Context Relevance: 0.743 ✅ PASS (target: 0.70)
📊 GENERATION QUALITY:
  Faithfulness:      0.912 ✅ PASS (target: 0.90)
  Answer Relevancy:  0.876 ✅ PASS (target: 0.85)
🎯 OVERALL: 4/5 metrics passed
Rating: ✨ STRONG (production-worthy)
============================================================
Context Recall failing told me something real: I was retrieving relevant documents but missing some needed context. That's what pushed me to build the chunk expansion strategy, but that's a story for the next blog.
What the Scores Actually Tell You
| Symptom | What it means | What to fix |
|---|---|---|
| Context Precision low | Retrieving noisy, irrelevant docs | Tighten your similarity threshold |
| Context Recall low | Missing needed information | Add more knowledge, expand chunks |
| Faithfulness low | LLM is hallucinating beyond context | Stricter system prompt, lower temperature |
| Answer Relevancy low | Answer drifts off-topic | Better query contextualization |
| Context Relevance low | Retrieved docs loosely match query | Tune BM25/vector weights, add reranking |
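If you want the thresholds to act as a regression gate rather than a one-off report, a short pytest check over the interpretation dict works. A sketch, reusing interpret_scores from above and assuming the scores dict is available to the test (e.g., computed in a fixture or loaded from a saved run):

def test_rag_quality():
    # Fail CI if any metric slips below its threshold
    interpretation = interpret_scores(scores)
    failures = {m: r["gap"] for m, r in interpretation.items()
                if r["status"] == "FAIL"}
    assert not failures, f"RAG metrics below threshold (gaps): {failures}"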
The Full Setup (5 Minutes)
pip install ragas datasets langchain-community litellm
Set your env vars:
RAGAS_EVALUATOR_MODEL=gemini/gemini-2.5-flash-lite-preview-09-2025
GEMINI_API_KEY=your_key_here # Free at aistudio.google.com
HF_TOKEN=your_token_here # Free at huggingface.co
Then point it at your pipeline output and run. The full evaluator is in evaluation/ragas_evaluator.py if you want to clone and adapt it directly.
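For reference, a bare-bones entry point might start like this. A sketch only; it assumes each golden-dataset record already carries your pipeline's live answer under an "answer" key:

import json

from datasets import Dataset

# Load the golden dataset built earlier
with open("evaluation/golden_dataset.json") as f:
    test_cases = json.load(f)

dataset = Dataset.from_dict({
    "question": [c["question"] for c in test_cases],
    "answer": [c["answer"] for c in test_cases],
    "contexts": [c["contexts"] for c in test_cases],
    "ground_truth": [c.get("ground_truth", "") for c in test_cases],
})
# From here: run the slow-mode loop, average the metric columns,
# then interpret_scores() and print_report() on the result.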
Lessons Learned
- Run RAGAS before you optimize, not after. I almost spent a week tuning BM25 weights blindly. Running RAGAS first told me Context Recall was my actual problem, not precision. Completely changed where I focused.
- The LLM judge matters less than you think. Gemini Flash Lite agreed with my manual spot-checks ~90% of the time. You don't need GPT-4 to evaluate unless you're in a very nuanced domain.
- Faithfulness is your hallucination smoke alarm. If this one dips below 0.85, stop everything else and fix it. Nothing else matters if your answers aren't grounded.
Next up: "Building Production Hybrid Search: BM25 + pgvector in Supabase", where I go deep on why 0.5/0.5 weights, how to tune them, and what actually happened to Context Recall when I added chunk expansion.
→ Live Demo