Troubleshooting RAG Evals
This guide helps you troubleshoot common issues when using RAG Evals.
Common Issues
LLM API Errors
Problem: Errors when connecting to the LLM provider.
Possible Solutions:
- Verify API key is valid and correctly set in your environment
- Check API rate limits and quotas
- Ensure you have the correct Instructor version for your provider
- Verify network connectivity to the LLM provider
# Example of proper client initialization
import instructor
import os
# Ensure API key is set (raise explicitly: assert statements are skipped under python -O)
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY not found in environment")
# Initialize client with error handling
try:
    client = instructor.from_provider("openai/gpt-4o-mini")
except Exception as e:
    print(f"Failed to initialize client: {e}")
    # Handle the error appropriately
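Rate-limit and connectivity errors are often transient, so retrying with a growing delay can help. The following is a minimal sketch using only the standard library; `call_with_retries` and its parameters are hypothetical names introduced here, not part of RAG Evals or Instructor.

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying with exponential backoff and jitter on failure.

    Hypothetical helper: narrow retry_on to the exception types your
    provider raises for rate limits or timeouts.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...)
            delay = base_delay * 2 ** (attempt - 1)
            # Add up to 10% jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
```

A call such as `call_with_retries(lambda: Faithfulness.grade(question=question, answer=answer, context=context, client=client))` would wrap a single evaluation. Catching every `Exception` is a deliberately broad default for the sketch; in practice it is usually better to retry only on errors known to be transient.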
Context Window Limitations
Problem: Evaluation fails due to exceeding the model's context window.
Possible Solutions:
- Reduce the size of context chunks
- Use a model with a larger context window
- Limit the amount of context provided per evaluation
- Break large evaluations into smaller batches
# Example: Limiting context size
from rag_evals.score_faithfulness import Faithfulness

def limit_context_size(context, max_chunks=10, max_chunk_size=500):
    """Limit context to at most max_chunks chunks of max_chunk_size characters each"""
    # Limit number of chunks
    limited_context = context[:max_chunks]
    # Limit size of each chunk
    limited_context = [chunk[:max_chunk_size] for chunk in limited_context]
    return limited_context
# Use limited context in evaluation
faithfulness_result = Faithfulness.grade(
    question=question,
    answer=answer,
    context=limit_context_size(context),
    client=client
)
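Two of the other suggestions, limiting total context per evaluation and splitting work into batches, can be sketched in a similar way. The helpers below are hypothetical and measure size in characters rather than tokens, which is only a rough proxy for the model's actual limit.

```python
def fit_to_char_budget(context, max_total_chars=8000):
    """Keep leading chunks whose combined length fits within max_total_chars.

    Only a prefix is kept, so the surviving chunks stay at their
    original positions in the list.
    """
    kept, used = [], 0
    for chunk in context:
        if used + len(chunk) > max_total_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

For example, `for batch in batched(examples, 20): ...` would process a hypothetical list of `examples` twenty at a time. The default budget of 8000 characters is an arbitrary placeholder; pick a value that suits your model's context window and prompt length.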
Validation Errors
Problem: Errors related to the validation of response models.
Possible Solutions:
- Check if your prompt aligns with the expected response model
- Verify that chunk IDs are correctly referenced
- Ensure the LLM is producing output that matches the response model schema
- Add more explicit instructions in your prompt about the required output format
# Example: Adding explicit format guidance to prompt
from rag_evals import base
from rag_evals.score_faithfulness import FaithfulnessResult
# Create custom evaluator with explicit format instructions
ExplicitFormatFaithfulness = base.ContextEvaluation(
    prompt="""
    You are an expert evaluator assessing faithfulness.
    IMPORTANT: Your output MUST follow this exact JSON structure:
    {
      "statements": [
        {
          "statement": "The exact claim from the answer",
          "is_supported": true or false,
          "supporting_chunk_ids": [list of integer IDs or null]
        },
        ...
      ]
    }
    [Rest of prompt instructions...]
    """,
    response_model=FaithfulnessResult
)
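When validation keeps failing, it can help to check a captured raw response against the expected shape by hand. The sketch below assumes the JSON structure shown in the prompt above (`statements`, `statement`, `is_supported`, `supporting_chunk_ids`); the actual `FaithfulnessResult` schema may differ, and `find_shape_problems` is a hypothetical helper.

```python
import json


def find_shape_problems(raw_json):
    """Return a list of human-readable problems found in a raw response string."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    if not isinstance(data, dict) or not isinstance(data.get("statements"), list):
        return ['top level must be an object with a "statements" list']

    problems = []
    for i, item in enumerate(data["statements"]):
        if not isinstance(item, dict):
            problems.append(f"statements[{i}] is not an object")
            continue
        if not isinstance(item.get("statement"), str):
            problems.append(f"statements[{i}].statement must be a string")
        if not isinstance(item.get("is_supported"), bool):
            problems.append(f"statements[{i}].is_supported must be true or false")
        ids = item.get("supporting_chunk_ids")
        # Accept null or a list of integers (bool is excluded explicitly,
        # because bool is a subclass of int in Python)
        ids_ok = ids is None or (
            isinstance(ids, list)
            and all(isinstance(x, int) and not isinstance(x, bool) for x in ids)
        )
        if not ids_ok:
            problems.append(
                f"statements[{i}].supporting_chunk_ids must be a list of integers or null"
            )
    return problems
```

An empty list means the response matches the assumed shape, which points the investigation toward the response model definition rather than the LLM output.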
Incorrect Chunk IDs
Problem: The evaluation references chunk IDs that don't exist in the context.
Possible Solutions:
- Ensure chunk IDs start from 0 and are sequential
- Verify that the LLM understands the chunk ID format
- Check for chunk ID consistency in your prompts and examples
- Make the consequences of invalid chunk IDs explicit in your prompt
# Example: Validating context chunks before evaluation
def validate_context(context):
    """Ensure context is a non-empty list of strings"""
    # Ensure context is a list (checked first so other empty values are not misreported)
    if not isinstance(context, list):
        raise TypeError("Context must be a list of strings")
    if not context:
        raise ValueError("Context cannot be empty")
    # Check that all context items are strings
    for i, chunk in enumerate(context):
        if not isinstance(chunk, str):
            raise TypeError(f"Context chunk {i} is not a string")
    return context
# Use validated context
faithfulness_result = Faithfulness.grade(
    question=question,
    answer=answer,
    context=validate_context(context),
    client=client
)
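After an evaluation, the chunk IDs cited in the result can also be checked against the context that was supplied. This sketch assumes IDs are zero-based positions in the context list, as the first suggestion above implies; `find_invalid_chunk_ids` is a hypothetical helper.

```python
def find_invalid_chunk_ids(chunk_ids, num_chunks):
    """Return the IDs that do not refer to a chunk in a context of num_chunks items."""
    return [
        cid for cid in chunk_ids
        if not (isinstance(cid, int) and 0 <= cid < num_chunks)
    ]
```

For instance, with a three-chunk context, `find_invalid_chunk_ids([0, 2, 5, -1], 3)` returns `[5, -1]`. A non-empty result suggests the model is miscounting chunks or inventing IDs, which is a cue to revisit how IDs are presented in the prompt.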
Inconsistent Evaluation Results
Problem: Evaluations produce inconsistent or unexpected results.
Possible Solutions:
- Use a more capable LLM for evaluation
- Provide explicit scoring criteria in the prompt
- Add few-shot examples to guide the evaluation
- Run multiple evaluations and average the results
- Review your prompt for clarity and potential ambiguities
# Example: Running multiple evaluations for consistency
def evaluate_with_redundancy(question, answer, context, client, evaluator, n=3):
    """Run multiple evaluations and aggregate results for more consistency"""
    results = []
    for _ in range(n):
        result = evaluator.grade(
            question=question,
            answer=answer,
            context=context,
            client=client
        )
        results.append(result)
    # For faithfulness evaluations, average overall scores
    # (the "results and" guard avoids an IndexError when n is 0)
    if results and hasattr(results[0], 'overall_faithfulness_score'):
        avg_score = sum(r.overall_faithfulness_score for r in results) / len(results)
        print(f"Average Faithfulness Score: {avg_score:.2f}")
    # For precision evaluations, average across chunk scores
    elif results and hasattr(results[0], 'avg_score'):
        avg_score = sum(r.avg_score for r in results) / len(results)
        print(f"Average Precision Score: {avg_score:.2f}")
    return results
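An average alone hides how much the repeated runs disagree. As a rough sketch, the spread can be reported alongside the mean; `summarize_scores` is a hypothetical helper that works on any list of numeric scores.

```python
from statistics import mean, pstdev


def summarize_scores(scores):
    """Return mean, population standard deviation, and range for repeated-run scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),  # 0.0 when every run agrees
        "min": min(scores),
        "max": max(scores),
    }
```

With the function above, `summarize_scores([r.overall_faithfulness_score for r in results])` would summarize the faithfulness runs. A large standard deviation relative to the mean is a hint that the prompt or the model, rather than the example, is the source of the inconsistency.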
Performance Issues
Problem: Evaluations are too slow.
Possible Solutions:
- Use parallel processing with agrade and asyncio
- Batch evaluations when possible
- Use a faster LLM for evaluations that don't require high capability
- Optimize context size to reduce token count
- Consider using client-side caching for repeated evaluations
# Example: Parallel evaluation of multiple metrics
import asyncio
import instructor
from rag_evals.score_faithfulness import Faithfulness
from rag_evals.score_precision import ChunkPrecision
# Create an async client (from_provider returns one when async_client=True)
async_client = instructor.from_provider("openai/gpt-4o-mini", async_client=True)
async def evaluate_example(question, answer, context):
    """Run all evaluations for a single example in parallel"""
    faithfulness_task = Faithfulness.agrade(
        question=question,
        answer=answer,
        context=context,
        client=async_client
    )
    precision_task = ChunkPrecision.agrade(
        question=question,
        answer=answer,
        context=context,
        client=async_client
    )
    # Run both evaluations in parallel
    return await asyncio.gather(faithfulness_task, precision_task)
# Use with asyncio.run(evaluate_example(...))
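For repeated evaluations of identical inputs, a small in-memory cache avoids paying for the same LLM call twice. This is a minimal sketch, not a feature of RAG Evals: `cache_key` and `cached_grade` are hypothetical helpers, and the cache is a plain dictionary that lives only as long as the process.

```python
import hashlib
import json


def cache_key(evaluator_name, question, answer, context):
    """Build a stable key from the evaluator name and its inputs."""
    payload = json.dumps(
        [evaluator_name, question, answer, list(context)], ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_grade(cache, evaluator_name, grade_fn, question, answer, context):
    """Call grade_fn(question, answer, context) at most once per distinct input."""
    key = cache_key(evaluator_name, question, answer, context)
    if key not in cache:
        cache[key] = grade_fn(question, answer, context)
    return cache[key]
```

Here `grade_fn` could be something like `lambda q, a, c: Faithfulness.grade(question=q, answer=a, context=c, client=client)`. Because LLM output is not deterministic, caching pins the first result for a given input, so it is best skipped when the goal is to measure run-to-run variation.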
Model-Specific Issues
GPT Models
- Ensure you're using the correct model name format (e.g., "openai/gpt-4o-mini")
- Monitor token usage to avoid unexpected costs
- Be aware of rate limits, especially with parallel evaluations
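One way to stay under rate limits when running many evaluations in parallel is to cap the number of requests in flight. The sketch below uses `asyncio.Semaphore` from the standard library; `gather_limited` is a hypothetical helper, and the right value for `max_concurrency` depends on your account's limits.

```python
import asyncio


async def gather_limited(coro_factories, max_concurrency=5):
    """Run zero-argument coroutine factories with at most max_concurrency in flight.

    Results are returned in the same order as coro_factories.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(factory):
        async with semaphore:
            # The coroutine is only created once a slot is free
            return await factory()

    return await asyncio.gather(*(run(f) for f in coro_factories))
```

Each factory would typically be a `lambda` that calls an `agrade` method with its own inputs. Passing factories instead of already-created coroutines keeps each request from starting until the semaphore admits it.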
Claude Models
- Properly format the system and user messages for Claude
- Be aware of Claude's handling of structured output
- Adjust prompts to accommodate Claude's reasoning style
Local Models
- Ensure the model supports structured JSON output
- Be prepared for less consistent results with smaller models
- Consider using more explicit prompts for local models
Debugging Tips
- Inspect Raw API Responses: Look at the raw API responses to understand what the LLM is returning
- Log Intermediate Steps: Add logging to track the evaluation process
- Test With Simple Examples: Verify functionality with simple, known examples first
- Compare With Manual Evaluation: Periodically validate results against human judgments
- Check Context Processing: Verify that context is being correctly processed and enumerated
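As an illustration of logging intermediate steps and checking how context is enumerated, a thin wrapper around a grading call could look like the following. It is a sketch using the standard `logging` module; `logged_grade` and the logger name `eval_debug` are hypothetical.

```python
import logging
import time

logger = logging.getLogger("eval_debug")  # hypothetical logger name


def logged_grade(grade_fn, *, question, answer, context):
    """Log inputs, duration, and outcome around a single grading call."""
    logger.info(
        "grading: question=%r answer_chars=%d chunks=%d",
        question, len(answer), len(context),
    )
    # Show each chunk with its position so enumeration problems are visible
    for i, chunk in enumerate(context):
        logger.debug("chunk %d: %r", i, chunk[:80])
    start = time.perf_counter()
    try:
        result = grade_fn(question=question, answer=answer, context=context)
    except Exception:
        logger.exception("grading failed after %.2fs", time.perf_counter() - start)
        raise
    logger.info("grading finished in %.2fs: %r", time.perf_counter() - start, result)
    return result
```

To see the output, enable logging with `logging.basicConfig(level=logging.DEBUG)` and pass something like `functools.partial(Faithfulness.grade, client=client)` as `grade_fn`.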
If you continue to experience issues, please check the project repository for known issues or submit a new issue with details about your problem.