Evaluation Framework¶
Instructor Classify includes a comprehensive evaluation framework for testing and comparing model performance on classification tasks.
Overview¶
The evaluation framework provides:
- Performance metrics (accuracy, precision, recall, F1 score)
- Statistical analysis with bootstrap confidence intervals
- Error analysis with confusion matrices
- Cost and latency tracking
- Visualizations and detailed reports
Running Evaluations¶
Evaluations can be run from the CLI or programmatically (see Programmatic Access to Results below); both are driven by an evaluation configuration file, described next.
Configuration¶
The evaluation configuration is defined in a YAML file:
# Models to evaluate (every model listed here is run against each eval set)
models:
  - "gpt-3.5-turbo"
  - "gpt-4o-mini"
# Evaluation datasets
# Each dataset can optionally be segmented by split
eval_sets:
  - "datasets/evalset_multi.yaml"
  - "datasets/evalset_single.yaml"
# Analysis parameters
bootstrap_samples: 1000
confidence_level: 0.95
# Optional parameters
output_dir: "results" # Where to save results
verbose: true # Show detailed progress
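If you want to sanity-check a configuration before launching a run, you can load it with PyYAML and inspect the fields. This is a minimal sketch (assuming only the pyyaml package and the field names shown above), not part of the framework's API:

import yaml

# Load the evaluation config; the keys mirror the example above
with open("configs/example.yaml") as f:
    config = yaml.safe_load(f)

print("Models:", config["models"])
print("Eval sets:", config["eval_sets"])
print("Bootstrap samples:", config.get("bootstrap_samples", 1000))
print("Confidence level:", config.get("confidence_level", 0.95))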
Evaluation Datasets¶
Evaluation datasets are defined in YAML files:
name: "Custom Classification Evaluation Set"
description: "A set of examples for testing intent classification"
classification_type: "single" # or "multi"
examples:
  - text: "How do I reset my password?"
    expected_label: "account_question"
  - text: "I need to update my billing information."
    expected_label: "billing_request"
  - text: "What time do you close today?"
    expected_label: "general_question"
  # Add more examples...
For multi-label classification, use expected_labels instead:
- text: "I'm having trouble logging in and need to update my payment method."
  expected_labels: ["account_question", "billing_request"]
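Before running an evaluation, it can be worth checking that every example carries the fields the framework expects. The snippet below is an illustrative check using only PyYAML; the file path and field names are taken from the examples above:

import yaml

with open("datasets/evalset_single.yaml") as f:
    eval_set = yaml.safe_load(f)

# Single-label sets use expected_label; multi-label sets use expected_labels
label_key = "expected_labels" if eval_set["classification_type"] == "multi" else "expected_label"

for i, example in enumerate(eval_set["examples"]):
    assert "text" in example, f"example {i} is missing 'text'"
    assert label_key in example, f"example {i} is missing '{label_key}'"

print(f"{eval_set['name']}: {len(eval_set['examples'])} examples look valid")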
Outputs¶
The evaluation framework generates a comprehensive set of outputs:
1. Summary Report¶
A text file with overall results:
# Evaluation Summary Report
Date: 2025-04-18 19:31:16
## Models Evaluated
- gpt-3.5-turbo
- gpt-4o-mini
## Datasets
- Complex Classification Evaluation Set
- Custom Classification Evaluation Set
## Overall Performance
| Metric        | gpt-3.5-turbo | gpt-4o-mini |
|---------------|---------------|-------------|
| Accuracy      | 0.8500        | 0.9250      |
| Macro F1      | 0.8479        | 0.9268      |
| Avg. Latency  | 0.6521s       | 0.8752s     |
| Cost (tokens) | 21,450        | 24,680      |
## Bootstrap Confidence Intervals (95%)
| Metric   | gpt-3.5-turbo   | gpt-4o-mini     |
|----------|-----------------|-----------------|
| Accuracy | 0.8025 - 0.8975 | 0.8850 - 0.9650 |
2. Metrics Files¶
Detailed JSON files for each model and dataset:
{
  "accuracy": 0.925,
  "macro_precision": 0.9325,
  "macro_recall": 0.9231,
  "macro_f1": 0.9268,
  "per_label_metrics": {
    "account_question": {
      "precision": 0.9545,
      "recall": 0.9545,
      "f1": 0.9545
    },
    "billing_request": {
      "precision": 0.9333,
      "recall": 0.9333,
      "f1": 0.9333
    },
    "general_question": {
      "precision": 0.9091,
      "recall": 0.8824,
      "f1": 0.8955
    }
  }
}
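The aggregate values in this file correspond to standard metric definitions. As a point of reference (not the framework's internal implementation), you could reproduce them from gold and predicted labels with scikit-learn:

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy gold/predicted labels purely for illustration
true_labels = ["account_question", "billing_request", "general_question", "account_question"]
pred_labels = ["account_question", "billing_request", "billing_request", "account_question"]

accuracy = accuracy_score(true_labels, pred_labels)
precision, recall, f1, _ = precision_recall_fscore_support(
    true_labels, pred_labels, average="macro", zero_division=0
)

print(f"accuracy={accuracy:.4f}, macro_precision={precision:.4f}, "
      f"macro_recall={recall:.4f}, macro_f1={f1:.4f}")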
3. Visualizations¶
The framework generates various visualizations:
- Confusion matrices for each model/dataset
- Error distribution charts
- Bootstrap confidence interval visualizations
- Cost and latency comparisons
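The generated figures are written to the output directory. If you want to redraw one yourself, for example from the confusion-matrix JSON shown further down, a small matplotlib/scikit-learn sketch could look like this (treating the outer key as the true label and the inner keys as predictions, which is an assumption about the JSON layout):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay

labels = ["account_question", "billing_request", "general_question"]
# Counts copied from the example in "Confusion Matrix Analysis" below;
# rows are assumed to be true labels, columns predicted labels
matrix = np.array([
    [21, 1, 0],
    [1, 14, 0],
    [0, 1, 16],
])

disp = ConfusionMatrixDisplay(confusion_matrix=matrix, display_labels=labels)
disp.plot(cmap="Blues", xticks_rotation=45)
plt.tight_layout()
plt.savefig("confusion_matrix_gpt-3.5-turbo.png")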
4. Cost and Latency Analysis¶
JSON files with detailed cost and latency data:
{
  "models": {
    "gpt-3.5-turbo": {
      "total_tokens": 21450,
      "estimated_cost_usd": 0.0429,
      "avg_tokens_per_prediction": 214.5,
      "avg_latency_seconds": 0.6521
    },
    "gpt-4o-mini": {
      "total_tokens": 24680,
      "estimated_cost_usd": 0.0494,
      "avg_tokens_per_prediction": 246.8,
      "avg_latency_seconds": 0.8752
    }
  }
}
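The cost figures are simple arithmetic over token counts and a per-token price. The sketch below uses a flat placeholder rate of $0.002 per 1K tokens, which happens to reproduce the example numbers above; real pricing varies by model and changes over time, so check your provider's current rates:

# Placeholder rate; actual per-model prices differ and change over time
PRICE_PER_1K_TOKENS = 0.002

token_usage = {"gpt-3.5-turbo": 21450, "gpt-4o-mini": 24680}

for model, total_tokens in token_usage.items():
    estimated_cost_usd = total_tokens / 1000 * PRICE_PER_1K_TOKENS
    print(f"{model}: {total_tokens} tokens -> ${estimated_cost_usd:.4f}")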
Advanced Analysis¶
Bootstrap Analysis¶
The evaluation framework uses bootstrap resampling to estimate confidence intervals:
{
  "bootstrap_samples": 1000,
  "confidence_level": 0.95,
  "metrics": {
    "gpt-3.5-turbo": {
      "Complex Classification Evaluation Set": {
        "accuracy": {
          "mean": 0.85,
          "lower_bound": 0.8025,
          "upper_bound": 0.8975
        }
      }
    }
  }
}
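Conceptually, the bootstrap draws many resamples of the evaluated examples (with replacement), recomputes the metric on each resample, and reads the interval bounds from the resulting distribution. A minimal NumPy sketch of that idea, using toy per-example correctness flags rather than real results:

import numpy as np

rng = np.random.default_rng(42)

# 1 = prediction was correct, 0 = incorrect (toy data for illustration)
correct = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1])

bootstrap_samples = 1000
confidence_level = 0.95

# Resample with replacement and recompute accuracy for each replicate
replicates = np.array([
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(bootstrap_samples)
])

alpha = (1 - confidence_level) / 2
lower, upper = np.percentile(replicates, [100 * alpha, 100 * (1 - alpha)])
print(f"accuracy mean={correct.mean():.4f}, CI=[{lower:.4f}, {upper:.4f}]")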
Confusion Matrix Analysis¶
Detailed confusion matrices help identify specific error patterns:
{
  "gpt-3.5-turbo": {
    "Custom Classification Evaluation Set": {
      "account_question": {
        "account_question": 21,
        "billing_request": 1,
        "general_question": 0
      },
      "billing_request": {
        "account_question": 1,
        "billing_request": 14,
        "general_question": 0
      },
      "general_question": {
        "account_question": 0,
        "billing_request": 1,
        "general_question": 16
      }
    }
  }
}
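To turn a nested matrix like this into concrete error patterns, you can rank the off-diagonal cells by count. A small sketch over the example counts above (again assuming outer keys are true labels and inner keys are predictions):

# confusion[true_label][predicted_label] = count (values from the example above)
confusion = {
    "account_question": {"account_question": 21, "billing_request": 1, "general_question": 0},
    "billing_request": {"account_question": 1, "billing_request": 14, "general_question": 0},
    "general_question": {"account_question": 0, "billing_request": 1, "general_question": 16},
}

# Collect misclassified (off-diagonal) cells and sort by frequency
errors = sorted(
    (count, true_label, predicted)
    for true_label, row in confusion.items()
    for predicted, count in row.items()
    if true_label != predicted and count > 0
)

for count, true_label, predicted in reversed(errors):
    print(f"{true_label} -> {predicted}: {count}")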
Programmatic Access to Results¶
You can access evaluation results programmatically:
from instructor_classify.eval_harness.unified_eval import UnifiedEvaluator

# Run evaluation
evaluator = UnifiedEvaluator("configs/example.yaml")
evaluator.prepare()
results = evaluator.run()

# Access results
for model_name, model_results in results.items():
    print(f"Model: {model_name}")
    for eval_set_name, metrics in model_results.items():
        print(f"  Dataset: {eval_set_name}")
        print(f"  Accuracy: {metrics['accuracy']:.4f}")

        # Look up the bootstrap confidence interval for this model/dataset pair
        bootstrap_data = evaluator.bootstrap_results["metrics"][model_name][eval_set_name]["accuracy"]
        print(f"  95% CI: [{bootstrap_data['lower_bound']:.4f}, {bootstrap_data['upper_bound']:.4f}]")
Custom Evaluation Metrics¶
You can extend the evaluation framework with custom metrics:
from instructor_classify.eval_harness.unified_eval import UnifiedEvaluator
from sklearn.metrics import matthews_corrcoef


class CustomEvaluator(UnifiedEvaluator):
    def calculate_metrics(self, true_labels, pred_labels):
        # Get standard metrics
        metrics = super().calculate_metrics(true_labels, pred_labels)

        # Add Matthews Correlation Coefficient
        mcc = matthews_corrcoef(true_labels, pred_labels)
        metrics["matthews_corrcoef"] = mcc
        return metrics


# Use custom evaluator
evaluator = CustomEvaluator("configs/example.yaml")
evaluator.prepare()
results = evaluator.run()
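Note that matthews_corrcoef expects flat 1-d label arrays, so a custom metric like this only applies to single-label evaluation sets; multi-label results would need a metric that accepts binary indicator matrices.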