# Troubleshooting RAG Evals
This guide helps you troubleshoot common issues when using RAG Evals.
## Common Issues

### LLM API Errors
Problem: Errors when connecting to the LLM provider.
Possible Solutions:

- Verify that your API key is valid and correctly set in your environment
- Check API rate limits and quotas
- Ensure you have the correct Instructor version for your provider
- Verify network connectivity to the LLM provider

```python
# Example of proper client initialization
import instructor
import os

# Ensure the API key is set
assert os.environ.get("OPENAI_API_KEY"), "API key not found in environment"

# Initialize the client with error handling
try:
    client = instructor.from_provider("openai/gpt-4o-mini")
except Exception as e:
    print(f"Failed to initialize client: {e}")
    # Handle the error appropriately
```
### Context Window Limitations
Problem: Evaluation fails due to exceeding the model's context window.
Possible Solutions:

- Reduce the size of context chunks
- Use a model with a larger context window
- Limit the amount of context provided per evaluation
- Break large evaluations into smaller batches

```python
# Example: Limiting context size
from rag_evals.score_faithfulness import Faithfulness

def limit_context_size(context, max_chunks=10, max_chunk_size=500):
    """Limit context to prevent exceeding context window limits."""
    # Limit the number of chunks
    limited_context = context[:max_chunks]
    # Limit the size of each chunk
    limited_context = [chunk[:max_chunk_size] for chunk in limited_context]
    return limited_context

# Use the limited context in the evaluation
faithfulness_result = Faithfulness.grade(
    question=question,
    answer=answer,
    context=limit_context_size(context),
    client=client,
)
```
### Validation Errors
Problem: Errors related to the validation of response models.
Possible Solutions:

- Check that your prompt aligns with the expected response model
- Verify that chunk IDs are correctly referenced
- Ensure the LLM is producing output that matches the response model schema
- Add more explicit instructions in your prompt about the required output format

```python
# Example: Adding explicit format guidance to the prompt
from rag_evals import base
from rag_evals.score_faithfulness import FaithfulnessResult

# Create a custom evaluator with explicit format instructions
ExplicitFormatFaithfulness = base.ContextEvaluation(
    prompt="""
    You are an expert evaluator assessing faithfulness.

    IMPORTANT: Your output MUST follow this exact JSON structure:
    {
        "statements": [
            {
                "statement": "The exact claim from the answer",
                "is_supported": true or false,
                "supporting_chunk_ids": [list of integer IDs or null]
            },
            ...
        ]
    }

    [Rest of prompt instructions...]
    """,
    response_model=FaithfulnessResult,
)
```
### Incorrect Chunk IDs
Problem: The evaluation references chunk IDs that don't exist in the context.
Possible Solutions:

- Ensure chunk IDs start from 0 and are sequential
- Verify that the LLM understands the chunk ID format
- Check for chunk ID consistency in your prompts and examples
- Make the consequences of invalid chunk IDs explicit in your prompt (a result-side check is sketched after the example below)

```python
# Example: Validating context chunks before evaluation
from rag_evals.score_faithfulness import Faithfulness

def validate_context(context):
    """Ensure context is a non-empty list of string chunks (chunk IDs are their 0-based positions)."""
    if not context:
        raise ValueError("Context cannot be empty")
    # Ensure context is a list
    if not isinstance(context, list):
        raise TypeError("Context must be a list of strings")
    # Check that all context items are strings
    for i, chunk in enumerate(context):
        if not isinstance(chunk, str):
            raise TypeError(f"Context chunk {i} is not a string")
    return context

# Use the validated context
faithfulness_result = Faithfulness.grade(
    question=question,
    answer=answer,
    context=validate_context(context),
    client=client,
)
```
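The example above validates the input context before the call. You can also check the result afterwards for references to chunks that don't exist. This is a minimal sketch, assuming the result exposes a `statements` list with `supporting_chunk_ids` as in the schema shown earlier; adjust the attribute names to your response model.

```python
# Sketch: flag any chunk IDs cited in the result that are not present in the context
def check_result_chunk_ids(result, context):
    valid_ids = set(range(len(context)))  # chunk IDs are 0-based positions in the context list
    for statement in result.statements:
        cited = statement.supporting_chunk_ids or []
        invalid = [chunk_id for chunk_id in cited if chunk_id not in valid_ids]
        if invalid:
            print(f"Statement {statement.statement!r} cites unknown chunk IDs: {invalid}")

check_result_chunk_ids(faithfulness_result, context)
```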
### Inconsistent Evaluation Results
Problem: Evaluations produce inconsistent or unexpected results.
Possible Solutions:

- Use a more capable LLM for evaluation
- Provide explicit scoring criteria in the prompt
- Add few-shot examples to guide the evaluation (see the sketch after the code below)
- Run multiple evaluations and average the results
- Review your prompt for clarity and potential ambiguities

```python
# Example: Running multiple evaluations for consistency
def evaluate_with_redundancy(question, answer, context, client, evaluator, n=3):
    """Run multiple evaluations and aggregate the results for more consistency."""
    results = []
    for _ in range(n):
        result = evaluator.grade(
            question=question,
            answer=answer,
            context=context,
            client=client,
        )
        results.append(result)

    # For faithfulness evaluations, average the overall scores
    if hasattr(results[0], 'overall_faithfulness_score'):
        avg_score = sum(r.overall_faithfulness_score for r in results) / len(results)
        print(f"Average Faithfulness Score: {avg_score:.2f}")
    # For precision evaluations, average across chunk scores
    elif hasattr(results[0], 'avg_score'):
        avg_score = sum(r.avg_score for r in results) / len(results)
        print(f"Average Precision Score: {avg_score:.2f}")

    return results
```
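If redundancy alone does not stabilize results, few-shot guidance can also help. Below is a minimal sketch that follows the same custom-evaluator pattern shown in the Validation Errors section; the worked example text in the prompt is illustrative, not part of the library.

```python
# Sketch: a custom evaluator whose prompt embeds a worked example (few-shot guidance)
from rag_evals import base
from rag_evals.score_faithfulness import FaithfulnessResult

FewShotFaithfulness = base.ContextEvaluation(
    prompt="""
    You are an expert evaluator assessing faithfulness.

    Worked example:
    Context chunk 0: "The Eiffel Tower is located in Paris."
    Answer: "The Eiffel Tower is in Paris and was completed in 1889."
    Expected judgment: the location claim is supported by chunk 0;
    the completion-date claim is not supported by any chunk.

    Now evaluate the provided question, answer, and context in the same way.
    """,
    response_model=FaithfulnessResult,
)
```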
### Performance Issues
Problem: Evaluations are too slow.
Possible Solutions:
- Use parallel processing with `agrade` and `asyncio`
- Batch evaluations when possible
- Use a faster LLM for evaluations that don't require high capability
- Optimize context size to reduce token count
- Consider using client-side caching for repeated evaluations (a caching sketch follows the parallel example below)
```python
# Example: Parallel evaluation of multiple metrics
import asyncio

import instructor
from rag_evals.score_faithfulness import Faithfulness
from rag_evals.score_precision import ChunkPrecision

# Create an async client (async_client=True returns an AsyncInstructor)
async_client = instructor.from_provider("openai/gpt-4o-mini", async_client=True)

async def evaluate_example(question, answer, context):
    """Run all evaluations for a single example in parallel."""
    faithfulness_task = Faithfulness.agrade(
        question=question,
        answer=answer,
        context=context,
        client=async_client,
    )
    precision_task = ChunkPrecision.agrade(
        question=question,
        answer=answer,
        context=context,
        client=async_client,
    )
    # Run both evaluations in parallel
    return await asyncio.gather(faithfulness_task, precision_task)

# Use with asyncio.run(evaluate_example(...))
```
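For the caching suggestion above, a simple in-process cache keyed on the evaluation inputs avoids re-calling the LLM for repeated (question, answer, context) triples. This is a minimal sketch with an illustrative helper name; swap in a persistent store if results need to survive restarts.

```python
# Sketch: client-side caching keyed on the evaluation inputs
_eval_cache = {}

def cached_grade(evaluator, question, answer, context, client):
    key = (question, answer, tuple(context))
    if key not in _eval_cache:
        _eval_cache[key] = evaluator.grade(
            question=question, answer=answer, context=context, client=client
        )
    return _eval_cache[key]
```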
## Model-Specific Issues

### GPT Models
- Ensure you're using the correct model name format (e.g., "openai/gpt-4o-mini")
- Monitor token usage to avoid unexpected costs
- Be aware of rate limits, especially with parallel evaluations
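One way to respect rate limits while still evaluating in parallel is to cap concurrency with a semaphore. This is a minimal sketch; the limit of 5 is a placeholder to tune against your own quota.

```python
import asyncio

# Cap the number of concurrent requests to stay under provider rate limits
semaphore = asyncio.Semaphore(5)

async def limited_grade(evaluator, **kwargs):
    async with semaphore:
        return await evaluator.agrade(**kwargs)
```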
### Claude Models
- Properly format the system and user messages for Claude
- Be aware of Claude's handling of structured output
- Adjust prompts to accommodate Claude's reasoning style
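Instructor's `from_provider` also accepts Anthropic model strings, so the same initialization pattern applies. The model name below is illustrative, and `ANTHROPIC_API_KEY` must be set in your environment.

```python
import instructor

# Model name is illustrative; use whichever Claude model you have access to
claude_client = instructor.from_provider("anthropic/claude-3-5-sonnet-20240620")
```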
### Local Models
- Ensure the model supports structured JSON output
- Be prepared for less consistent results with smaller models
- Consider using more explicit prompts for local models
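One option is to point an OpenAI-compatible client at your local server and wrap it with Instructor in JSON mode, which tends to be more reliable for smaller models. This sketch assumes an Ollama-style server; the URL and serving setup are examples, not requirements.

```python
import instructor
from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. Ollama) at this URL
local_client = instructor.from_openai(
    OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed"),
    mode=instructor.Mode.JSON,  # JSON mode is often more reliable for local models
)
```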
## Debugging Tips
- Inspect Raw API Responses: Look at the raw API responses to understand what the LLM is returning
- Log Intermediate Steps: Add logging to track the evaluation process (see the sketch after this list)
- Test With Simple Examples: Verify functionality with simple, known examples first
- Compare With Manual Evaluation: Periodically validate results against human judgments
- Check Context Processing: Verify that context is being correctly processed and enumerated
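For the logging tip, a thin wrapper around `grade` is usually enough to see what each evaluation received and returned. This is a minimal sketch; the wrapper and logger names are chosen here for illustration.

```python
# Sketch: log inputs and outputs around each evaluation call
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_evals_debug")

def logged_grade(evaluator, question, answer, context, client):
    logger.info("Evaluating question=%r with %d context chunks", question, len(context))
    result = evaluator.grade(
        question=question, answer=answer, context=context, client=client
    )
    logger.info("Result: %r", result)
    return result
```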
If you continue to experience issues, please check the project repository for known issues or submit a new issue with details about your problem.