
Troubleshooting RAG Evals

This guide helps you troubleshoot common issues when using RAG Evals.

Common Issues

LLM API Errors

Problem: Errors when connecting to the LLM provider.

Possible Solutions:

  • Verify your API key is valid and correctly set in your environment
  • Check API rate limits and quotas (a retry sketch follows the example below)
  • Ensure you have the correct Instructor version for your provider
  • Verify network connectivity to the LLM provider

# Example of proper client initialization
import instructor
import os

# Ensure API key is set
assert os.environ.get("OPENAI_API_KEY"), "API key not found in environment"

# Initialize client with error handling
try:
    client = instructor.from_provider("openai/gpt-4o-mini")
except Exception as e:
    print(f"Failed to initialize client: {e}")
    # Handle the error appropriately
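
If calls fail intermittently because of rate limits or transient network errors, retrying with exponential backoff is usually enough. Below is a minimal sketch that wraps a Faithfulness.grade call with the tenacity library (an extra dependency, not part of RAG Evals); adjust the attempt count and wait times to your provider's limits.

# Example: Retrying a grading call on transient API errors
# (sketch -- assumes the tenacity package is installed; it is not a RAG Evals dependency)
from tenacity import retry, stop_after_attempt, wait_exponential
from rag_evals.score_faithfulness import Faithfulness

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=30))
def grade_with_retry(question, answer, context, client):
    """Retry up to 3 times, backing off exponentially between attempts."""
    return Faithfulness.grade(
        question=question,
        answer=answer,
        context=context,
        client=client
    )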

Context Window Limitations

Problem: Evaluation fails due to exceeding the model's context window.

Possible Solutions:

  • Reduce the size of context chunks (a token-based variant is sketched below)
  • Use a model with a larger context window
  • Limit the amount of context provided per evaluation
  • Break large evaluations into smaller batches

# Example: Limiting context size
def limit_context_size(context, max_chunks=10, max_chunk_size=500):
    """Limit context to prevent exceeding context window limits"""
    # Limit number of chunks
    limited_context = context[:max_chunks]

    # Limit size of each chunk
    limited_context = [chunk[:max_chunk_size] for chunk in limited_context]

    return limited_context

from rag_evals.score_faithfulness import Faithfulness

# Use limited context in evaluation
faithfulness_result = Faithfulness.grade(
    question=question,
    answer=answer,
    context=limit_context_size(context),
    client=client
)
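
Character counts are only a rough proxy for tokens. If you need a tighter bound, you can truncate each chunk by token count instead; the sketch below assumes the tiktoken package is installed and uses the o200k_base encoding (the one used by the gpt-4o family). Pick a budget that leaves room for the prompt, question, and answer.

# Example: Token-based truncation
# (sketch -- assumes the tiktoken package is installed; o200k_base matches the gpt-4o family)
import tiktoken

def limit_context_tokens(context, max_chunks=10, max_tokens_per_chunk=400):
    """Truncate each chunk to a token budget instead of a character count"""
    encoding = tiktoken.get_encoding("o200k_base")
    limited_context = []
    for chunk in context[:max_chunks]:
        tokens = encoding.encode(chunk)[:max_tokens_per_chunk]
        limited_context.append(encoding.decode(tokens))
    return limited_context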

Validation Errors

Problem: Errors related to the validation of response models.

Possible Solutions:

  • Check that your prompt aligns with the expected response model
  • Verify that chunk IDs are correctly referenced
  • Ensure the LLM is producing output that matches the response model schema (a sketch for catching validation failures follows the example below)
  • Add more explicit instructions in your prompt about the required output format

# Example: Adding explicit format guidance to prompt
from rag_evals import base
from rag_evals.score_faithfulness import FaithfulnessResult

# Create custom evaluator with explicit format instructions
ExplicitFormatFaithfulness = base.ContextEvaluation(
    prompt="""
    You are an expert evaluator assessing faithfulness.

    IMPORTANT: Your output MUST follow this exact JSON structure:
    {
      "statements": [
        {
          "statement": "The exact claim from the answer",
          "is_supported": true or false,
          "supporting_chunk_ids": [list of integer IDs or null]
        },
        ...
      ]
    }

    [Rest of prompt instructions...]
    """,
    response_model=FaithfulnessResult
)
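
If the model still produces output that does not fit the schema, the grading call will eventually raise an error once Instructor's internal retries are exhausted. The exact exception type varies by Instructor version, so the sketch below catches pydantic's ValidationError plus a generic fallback and prints the details for inspection.

# Example: Surfacing validation failures for debugging
# (sketch -- the exception types raised can vary by Instructor version)
from pydantic import ValidationError

try:
    result = ExplicitFormatFaithfulness.grade(
        question=question,
        answer=answer,
        context=context,
        client=client
    )
except ValidationError as e:
    # The LLM output did not match the FaithfulnessResult schema
    print(f"Schema validation failed: {e}")
except Exception as e:
    # Retries exhausted, transport errors, etc.
    print(f"Evaluation failed: {e}")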

Incorrect Chunk IDs

Problem: The evaluation references chunk IDs that don't exist in the context.

Possible Solutions:

  • Ensure chunk IDs start from 0 and are sequential
  • Verify that the LLM understands the chunk ID format
  • Check for chunk ID consistency in your prompts and examples (a post-hoc check is sketched after the example below)
  • Make the consequences of invalid chunk IDs explicit in your prompt

# Example: Validating context chunks before evaluation
def validate_context(context):
    """Ensure context is properly formatted with sequential IDs"""
    if not context:
        raise ValueError("Context cannot be empty")

    # Ensure context is a list
    if not isinstance(context, list):
        raise TypeError("Context must be a list of strings")

    # Check that all context items are strings
    for i, chunk in enumerate(context):
        if not isinstance(chunk, str):
            raise TypeError(f"Context chunk {i} is not a string")

    return context

# Use validated context
faithfulness_result = Faithfulness.grade(
    question=question,
    answer=answer,
    context=validate_context(context),
    client=client
)
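
You can also check the graded result against the context after the fact. The sketch below assumes the statements / supporting_chunk_ids fields shown in the prompt structure earlier; valid chunk IDs are simply the list indices 0 through len(context) - 1.

# Example: Flagging out-of-range chunk IDs in a faithfulness result
# (sketch -- assumes the statements / supporting_chunk_ids fields shown earlier)
def find_invalid_chunk_ids(result, context):
    """Return (statement, bad_ids) pairs that cite chunk IDs not present in the context"""
    valid_ids = set(range(len(context)))
    invalid = []
    for statement in result.statements:
        cited = statement.supporting_chunk_ids or []
        bad_ids = [chunk_id for chunk_id in cited if chunk_id not in valid_ids]
        if bad_ids:
            invalid.append((statement.statement, bad_ids))
    return invalid

for text, bad_ids in find_invalid_chunk_ids(faithfulness_result, context):
    print(f"Statement cites unknown chunk IDs {bad_ids}: {text}")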

Inconsistent Evaluation Results

Problem: Evaluations produce inconsistent or unexpected results.

Possible Solutions:

  • Use a more capable LLM for evaluation
  • Provide explicit scoring criteria in the prompt
  • Add few-shot examples to guide the evaluation (a prompt sketch follows the example below)
  • Run multiple evaluations and average the results
  • Review your prompt for clarity and potential ambiguities

# Example: Running multiple evaluations for consistency
def evaluate_with_redundancy(question, answer, context, client, evaluator, n=3):
    """Run multiple evaluations and aggregate results for more consistency"""
    results = []

    for _ in range(n):
        result = evaluator.grade(
            question=question,
            answer=answer,
            context=context,
            client=client
        )
        results.append(result)

    # For faithfulness evaluations, average overall scores
    if hasattr(results[0], 'overall_faithfulness_score'):
        avg_score = sum(r.overall_faithfulness_score for r in results) / len(results)
        print(f"Average Faithfulness Score: {avg_score:.2f}")

    # For precision evaluations, average across chunk scores
    elif hasattr(results[0], 'avg_score'):
        avg_score = sum(r.avg_score for r in results) / len(results)
        print(f"Average Precision Score: {avg_score:.2f}")

    return results
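
To act on the few-shot suggestion, you can embed worked examples directly in the evaluator prompt, reusing the ContextEvaluation pattern from the validation section. The example content below is illustrative, not part of the library.

# Example: Embedding few-shot guidance in the evaluator prompt
# (sketch -- the worked example text is illustrative, not part of the library)
from rag_evals import base
from rag_evals.score_faithfulness import FaithfulnessResult

FewShotFaithfulness = base.ContextEvaluation(
    prompt="""
    You are an expert evaluator assessing faithfulness.

    Example:
    Context chunk 0: "The Eiffel Tower is in Paris."
    Statement: "The Eiffel Tower is located in Paris."
    -> is_supported: true, supporting_chunk_ids: [0]

    Statement: "The Eiffel Tower was completed in 1890."
    -> is_supported: false, supporting_chunk_ids: null

    [Rest of prompt instructions...]
    """,
    response_model=FaithfulnessResult
)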

Performance Issues

Problem: Evaluations are too slow.

Possible Solutions:

  • Use parallel processing with agrade and asyncio
  • Batch evaluations when possible (a bounded-concurrency sketch follows the example below)
  • Use a faster LLM for evaluations that don't require high capability
  • Optimize context size to reduce token count
  • Consider using client-side caching for repeated evaluations

# Example: Parallel evaluation of multiple metrics
import asyncio

import instructor
from rag_evals.score_faithfulness import Faithfulness
from rag_evals.score_precision import ChunkPrecision

# Create an async client so evaluations can be awaited concurrently
async_client = instructor.from_provider("openai/gpt-4o-mini", async_client=True)

async def evaluate_example(question, answer, context):
    """Run all evaluations for a single example in parallel"""
    faithfulness_task = Faithfulness.agrade(
        question=question,
        answer=answer,
        context=context,
        client=async_client
    )

    precision_task = ChunkPrecision.agrade(
        question=question,
        answer=answer,
        context=context,
        client=async_client
    )

    # Run both evaluations in parallel
    return await asyncio.gather(faithfulness_task, precision_task)

# Use with asyncio.run(evaluate_example(...))
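
For larger datasets, you can extend the same pattern with a semaphore so that many examples are evaluated concurrently without exceeding provider rate limits. A minimal sketch follows; the limit of 5 concurrent examples is an arbitrary starting point, so tune it to your quotas.

# Example: Batch evaluation with bounded concurrency
# (sketch -- the limit of 5 concurrent examples is arbitrary; tune it to your rate limits)
async def evaluate_dataset(examples, max_concurrency=5):
    """examples: iterable of (question, answer, context) tuples"""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_eval(question, answer, context):
        async with semaphore:
            return await evaluate_example(question, answer, context)

    tasks = [bounded_eval(q, a, c) for q, a, c in examples]
    return await asyncio.gather(*tasks)

# Use with asyncio.run(evaluate_dataset(examples))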

Model-Specific Issues

GPT Models

  • Ensure you're using the correct model name format (e.g., "openai/gpt-4o-mini")
  • Monitor token usage to avoid unexpected costs
  • Be aware of rate limits, especially with parallel evaluations

Claude Models

  • Properly format the system and user messages for Claude
  • Be aware of Claude's handling of structured output
  • Adjust prompts to accommodate Claude's reasoning style
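
With Instructor's from_provider interface, switching the evaluator to Claude is mostly a matter of changing the provider string. The sketch below assumes the anthropic package is installed and ANTHROPIC_API_KEY is set; the model name is only an example.

# Example: Using a Claude model for evaluation
# (sketch -- assumes the anthropic package is installed and ANTHROPIC_API_KEY is set;
# the model name is only an example)
import instructor
from rag_evals.score_faithfulness import Faithfulness

claude_client = instructor.from_provider("anthropic/claude-3-5-sonnet-latest")

faithfulness_result = Faithfulness.grade(
    question=question,
    answer=answer,
    context=context,
    client=claude_client
)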

Local Models

  • Ensure the model supports structured JSON output
  • Be prepared for less consistent results with smaller models
  • Consider using more explicit prompts for local models
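
One common setup is a local server such as Ollama that exposes an OpenAI-compatible API. The sketch below assumes a recent Instructor release whose from_provider supports the ollama/ prefix, a local Ollama server, and that the named model is pulled and capable of structured JSON output; older Instructor versions may require wrapping an OpenAI-compatible client instead.

# Example: Using a local model served by Ollama
# (sketch -- the provider string and model name are assumptions; adjust for your local setup)
import instructor

local_client = instructor.from_provider("ollama/llama3.1")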

Debugging Tips

  1. Inspect Raw API Responses: Look at the raw API responses to understand what the LLM is returning
  2. Log Intermediate Steps: Add logging to track the evaluation process (see the sketch after this list)
  3. Test With Simple Examples: Verify functionality with simple, known examples first
  4. Compare With Manual Evaluation: Periodically validate results against human judgments
  5. Check Context Processing: Verify that context is being correctly processed and enumerated
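
For tip 2, even basic logging around each grading call makes it much easier to see where an evaluation goes wrong. A minimal sketch using Python's standard logging module:

# Example: Logging around an evaluation call (sketch for tip 2 above)
import logging

from rag_evals.score_faithfulness import Faithfulness

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_evals_debug")

logger.info("Grading question: %s", question)
logger.info("Context contains %d chunks", len(context))

result = Faithfulness.grade(
    question=question,
    answer=answer,
    context=context,
    client=client
)

logger.info("Overall faithfulness score: %s", result.overall_faithfulness_score)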

If you continue to experience issues, please check the project repository for known issues or submit a new issue with details about your problem.