Judge Reliability Harness Overview
Project Overview
The Judge Reliability Harness (JRH) orchestrates end-to-end evaluations of automated judges. It standardizes how datasets are prepared, perturbations are generated, and model outputs are rescored so teams can compare judge reliability across tasks. The harness coordinates data ingestion, perturbation pipelines, automated grading, and reporting into reproducible runs.
Why JRH?
LLMs are increasingly used as judges to score, rank, or classify AI outputs in evaluations. Human evaluation yields high-quality judgments but is expensive and difficult to scale, which has motivated the widespread use of LLMs as judges in place of human annotators. However, the reliability of the judge system configuration, including the LLM judge model, rubric, and prompt templates, is rarely evaluated in a systematic manner or reported alongside benchmark results. Point estimates of agreement with human raters on small validation sets provide limited assurance about how a judge will respond to realistic variations in inputs, such as changes in formatting, paraphrasing, verbosity, or sampling parameters. This gap between the central role of LLM judges and the limited tools available to characterize their reliability makes it difficult for practitioners and decision makers to know how much confidence to place in AI evaluation results.
We introduce the Judge Reliability Harness (JRH), an open-source library that generates validation suites for any LLM judge on both agentic and free-response benchmarks. JRH generates reliability tests that measure grading accuracy via label-flipped responses, invariance to formatting and paraphrasing, susceptibility to verbosity bias, stochastic stability under repeated sampling, and calibration across an ordinal grading scale. JRH includes a human-in-the-loop review process for generated reliability tests through a user interface that gives reviewers full control to accept, reject, or edit each test. Across a range of candidate judges, it aggregates pass rates, confidence intervals, and cost curves into standardized reports. By making reliability testing configurable, reproducible, and inexpensive, JRH aims to support more transparent and trustworthy use of LLM judges in both research and deployment contexts.
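To make the invariance tests concrete, here is a minimal sketch of a formatting- or paraphrase-invariance check in the spirit of the tests listed above. The function names (`invariance_pass_rate`, `judge_fn`, `perturb`) are illustrative placeholders, not JRH's actual interfaces; see reliability_tests.qmd for the tests the harness actually generates.

```python
from typing import Callable

def invariance_pass_rate(
    judge_fn: Callable[[str, str], bool],   # (question, response) -> pass/fail verdict
    items: list[tuple[str, str]],           # (question, response) pairs to test
    perturb: Callable[[str], str],          # e.g. a reformatting or paraphrase transform
) -> float:
    """Fraction of items whose verdict is unchanged after perturbation."""
    if not items:
        return 0.0
    agree = sum(judge_fn(q, r) == judge_fn(q, perturb(r)) for q, r in items)
    return agree / len(items)

# Example perturbation: a trivial formatting change that bulletizes the response.
bulletize = lambda text: "\n".join(f"- {line}" for line in text.splitlines())
```

A reliable judge should return the same verdict for semantically equivalent inputs, so pass rates well below 1.0 on checks like this signal sensitivity to surface form rather than content.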
Core Workflow
- Load configuration describing the evaluation module, input assets, and runtime parameters.
- Build perturbation and evaluation registries that determine which reliability tests execute.
- Generate synthetic variants when required and persist the merged configuration for auditing.
- Run harness execution to score perturbations, summarize performance, and emit artifacts inside outputs/. A sketch of this flow follows the list below.
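
The following is a minimal sketch of a driver that mirrors the four workflow steps, under the assumption of a dictionary-style configuration. The field names (`evaluation_module`, `input_assets`, `runtime`, `perturbations`) and file names are hypothetical; consult user_guide.qmd for JRH's actual configuration schema.

```python
import json
from pathlib import Path

# Step 1: configuration describing the evaluation module, inputs, and runtime parameters.
config = {
    "evaluation_module": "open_ended_judge",            # one of the available modes
    "input_assets": {"dataset": "data/responses.jsonl"},
    "runtime": {"seed": 7, "samples_per_item": 3},
    # Step 2: the perturbation registry determines which reliability tests execute.
    "perturbations": ["label_flip", "paraphrase", "formatting", "verbosity"],
}

# Step 3: persist the merged configuration for auditing before any scoring runs.
out_dir = Path("outputs/run_001")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "resolved_config.json").write_text(json.dumps(config, indent=2))

# Step 4: harness execution would then score perturbations, summarize performance,
# and emit its reports and artifacts alongside the resolved configuration in outputs/.
```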

Available Modes
- Open-ended Judge: Binary grading for free-text responses.
- Open-ended Autograder: Ordinal grading for free-text responses.
- Agentic Judge: Agent transcript grading for binary criteria.
- Agentic Autograder: Agent transcript grading for ordinal criteria.
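
The main difference between the Judge and Autograder modes is the shape of the grade they produce. The sketch below illustrates that distinction with assumed type names (`BinaryGrade`, `OrdinalGrade`); JRH's actual output schema is documented in user_guide.qmd.

```python
from typing import Literal, TypedDict, Union

class BinaryGrade(TypedDict):
    """Output shape for the Judge modes (Open-ended Judge, Agentic Judge)."""
    verdict: Literal["pass", "fail"]
    rationale: str

class OrdinalGrade(TypedDict):
    """Output shape for the Autograder modes (Open-ended Autograder, Agentic Autograder)."""
    score: int          # a point on an ordinal grading scale, e.g. 1-5
    rationale: str

Grade = Union[BinaryGrade, OrdinalGrade]
```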

Documentation Map
- installation.qmd covers prerequisites, uv setup, and required credentials.
- quickstart.qmd shows common run commands, how to toggle the review UI, and where outputs land.
- user_guide.qmd documents configuration fields, review UI behavior, outputs, and rerun expectations.
- reliability_tests.qmd details each synthetic reliability test and when to use it.
- agentic_guide.qmd explains the Inspect-log-driven agent perturbation workflow.
- developer_guide.qmd outlines the project layout for contributors.
Acknowledgments
The material developed during this research was sponsored by the Government of the United States under Contract Number FA8702-15-D-0002. The views, opinions, and/or findings contained in this material are those of the author(s) and should not be construed as an official position, policy, or decision of the Government of the United States, Carnegie Mellon University, or the Software Engineering Institute unless designated by other documentation.