Context Precision Evaluation
Context Precision (also known as Chunk Relevancy) measures whether each retrieved context chunk is relevant to the original question. This metric helps assess the efficiency of your retrieval system.
What Context Precision Measures
- Definition: Context Precision evaluates, for each individual retrieved context chunk, how relevant its content is to the user's original question, regardless of whether that chunk was actually used in the final generated answer.
- Focus: The relationship is (Individual Retrieved Chunk → User's Question).
- Purpose: To determine if the retriever is fetching chunks that are relevant to the user's query. If many chunks are retrieved but aren't relevant to the question, the retrieval might be inefficient.
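For example (an illustrative question and retrieved chunks, not taken from the library):

```python
question = "What is the capital of France?"

retrieved_chunks = [
    "Paris is the capital and most populous city of France.",  # relevant to the question
    "French cuisine is famous for its pastries and wine.",     # topically related, but not relevant
]
```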
How It Works
The Context Precision evaluator:
- Examines each context chunk independently
- Determines if the information in the chunk is relevant to answering the original question
- Assigns a binary score (relevant/not relevant) to each chunk
- Calculates an overall precision score based on the proportion of relevant chunks
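The overall score is simply the proportion of chunks judged relevant. A minimal sketch of that arithmetic (not the library's code):

```python
def overall_precision(chunk_judgments: list[bool]) -> float:
    """Fraction of retrieved chunks judged relevant to the question."""
    if not chunk_judgments:
        return 0.0
    return sum(chunk_judgments) / len(chunk_judgments)

overall_precision([True, True, False])  # 0.666... -> two of three chunks were relevant
```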
Implementation Details
The context precision implementation uses a straightforward approach:
```python
from rag_evals.score_precision import ChunkPrecision

# Grade each retrieved context chunk for relevance to the original question.
precision_result = ChunkPrecision.grade(
    question=question,  # the user's original question
    answer=answer,      # the generated answer
    context=context,    # the retrieved context chunks
    client=client       # the LLM client used to run the evaluation
)
```
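The call returns a ChunkGradedBinary object (described in the next section), so both the overall score and the per-chunk judgments can be read directly. A minimal sketch, assuming the precision_result from the call above:

```python
# Overall precision: proportion of retrieved chunks judged relevant
print(precision_result.avg_score)

# Per-chunk judgments
for chunk in precision_result.graded_chunks:
    print(chunk.id_chunk, chunk.score)
```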
Response Model
The ChunkPrecision evaluator uses the base ChunkGradedBinary class:
```python
class ChunkBinaryScore(BaseModel):
    id_chunk: int  # ID of the chunk being evaluated
    score: bool    # Whether the chunk is relevant (True) or not (False)

class ChunkGradedBinary(BaseModel, ContextValidationMixin):
    graded_chunks: list[ChunkBinaryScore]  # All evaluated chunks

    @property
    def avg_score(self) -> float:
        # Calculates the proportion of relevant chunks
        # (illustrative implementation; the library's exact code may differ)
        if not self.graded_chunks:
            return 0.0
        return sum(c.score for c in self.graded_chunks) / len(self.graded_chunks)
```
Example Output
```python
# Result example
precision_result = ChunkGradedBinary(
    graded_chunks=[
        ChunkBinaryScore(id_chunk=0, score=True),   # Chunk is relevant to the question
        ChunkBinaryScore(id_chunk=1, score=True),   # Chunk is relevant to the question
        ChunkBinaryScore(id_chunk=2, score=False),  # Chunk is not relevant to the question
    ]
)

# Overall score: 0.6667 (2/3 chunks were relevant)
```
Customizing the Evaluation
You can customize the context precision prompt to adjust the criteria for what makes a chunk "relevant":
```python
from rag_evals.score_precision import ChunkPrecision
from rag_evals import base

# Access the original prompt
original_prompt = ChunkPrecision.prompt

# Create a customized evaluator with a modified prompt
CustomPrecision = base.ContextEvaluation(
    prompt="Your custom prompt here...",
    response_model=base.ChunkGradedBinary
)
```
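Assuming the customized evaluator exposes the same grade() interface as ChunkPrecision (an assumption, not verified against the library), it would be called the same way:

```python
custom_result = CustomPrecision.grade(
    question=question,
    answer=answer,
    context=context,
    client=client
)
```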
Context Precision vs. Chunk Utility
It's important to understand the difference between:
- Context Precision: Measures if a chunk is relevant to the question (regardless of whether it was used in the answer)
- Chunk Utility: Measures if a chunk was actually used in generating the answer
A chunk might be highly relevant to the question but not used in the answer, or it might be used in the answer despite having low relevance to the question.
Considerations When Using Context Precision
- Partial Relevance: Consider how to score chunks that are only partially relevant to the question
- Topical vs. Factual Relevance: A chunk might be topically relevant but not contain the specific facts needed
- Question Decomposition: For complex questions with multiple parts, chunks may be relevant to only some parts
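These edge cases are typically handled through the prompt customization shown above. A sketch (the prompt wording below is illustrative, not the library's default):

```python
from rag_evals import base

# Stricter relevance criteria for partial matches and multi-part questions
# (illustrative wording; adjust to your own definition of "relevant").
StrictPrecision = base.ContextEvaluation(
    prompt=(
        "For each context chunk, return True only if it contains information "
        "that directly helps answer at least one part of the user's question. "
        "Treat chunks that are merely on-topic but add no usable facts as not relevant."
    ),
    response_model=base.ChunkGradedBinary
)
```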
Best Practices
- Use context precision alongside other metrics like faithfulness for a complete evaluation
- Analyze chunks marked as "not relevant" to improve your retrieval system (see the helper sketch after this list)
- Consider both precision and recall metrics for a comprehensive view of retrieval performance
- Try different chunk sizes to find the optimal granularity for your use case
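For instance, the graded output makes it easy to pull out the chunks the evaluator rejected (a small helper sketch, not part of the library):

```python
def irrelevant_chunk_ids(result) -> list[int]:
    """IDs of retrieved chunks judged not relevant to the question."""
    return [c.id_chunk for c in result.graded_chunks if not c.score]

irrelevant_chunk_ids(precision_result)  # -> [2] for the example result above
```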
For implementation examples, see the usage examples.