Documentation page:
https://docs.judgmentlabs.ai/documentation/evaluation/scorers/introduction
Context:
The documentation provides a conceptual overview of each scorer and what it evaluates (e.g., execution order, tool usage, faithfulness). However, it does not explain how to configure these scorers or tune them for specific applications. Developers are left guessing:
- What parameters can be adjusted?
- What are the default thresholds for cosine similarity, ranking accuracy, etc.?
- What score range is considered "good" or "acceptable"?
- How do you interpret or act on a 0.6 versus 0.9 answer correctness score?
Suggestion:
Add a "Configuration & Interpretation" section to each scorer page with:
- Tunable Parameters
  - E.g., similarity_threshold, tool_penalty_weight, faithfulness_sensitivity
  - Show how to initialize a scorer with custom parameters (see the sketch after this list)
- Defaults & Ranges
  - List the default values used internally
  - Provide typical value ranges for real-world datasets
- Interpreting Scores
  - Explain what score bands mean, e.g.: "A tool-order score of 1.0 means tools were called in the ideal order, while <0.5 may signal planning drift or tool misuse."
- Best Practices
  - For example: "Start with similarity_threshold = 0.75 for answer correctness, then calibrate based on the false positive rate."
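To make the request concrete, here is a minimal sketch of what such an initialization example could look like in the docs. The class name (AnswerCorrectnessScorer) and parameters (similarity_threshold, strict_mode) are hypothetical placeholders taken from the bullets above, not the actual judgeval API; the class is defined locally so the snippet is self-contained.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a configurable scorer; real judgeval class
# and parameter names may differ.
@dataclass
class AnswerCorrectnessScorer:
    similarity_threshold: float = 0.75  # assumed default, per the suggestion above
    strict_mode: bool = False

    def is_successful(self, score: float) -> bool:
        # A score at or above the threshold counts as a pass.
        return score >= self.similarity_threshold

# Default configuration
default_scorer = AnswerCorrectnessScorer()

# Custom configuration tuned for a stricter, domain-specific agent
strict_scorer = AnswerCorrectnessScorer(similarity_threshold=0.9, strict_mode=True)

print(default_scorer.is_successful(0.8))  # True: 0.8 >= 0.75
print(strict_scorer.is_successful(0.8))   # False: 0.8 < 0.9
```

Even a short block like this on each scorer page would show which knobs exist, what their defaults are, and how a pass/fail decision follows from the threshold.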
Why It Matters:
This improvement would help users:
- Tailor scorers to domain-specific agents
- Make informed decisions based on scores rather than treating them as black boxes
- Increase transparency and trust in automated evaluation pipelines