Developer Guide
This guide provides the context developers need to continue evolving the Judge Reliability Harness (JRH).
Folder Structure Overview
inputs
- configs/: custom config files that override the default parameters.
- data/{module_name}/: module-specific assets, typically including instruction.txt, rubric.txt, and data.csv.
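As a rough illustration of the data/{module_name}/ layout described above, the following sketch reads the three typical assets from disk. The function name load_module_assets and the inputs_root parameter are hypothetical and are not part of JRH. No assumption is made about the columns in data.csv.

```python
import csv
from pathlib import Path


def load_module_assets(inputs_root, module_name):
    """Read instruction.txt, rubric.txt, and data.csv for one module.

    Hypothetical helper: JRH's own loading code may differ.
    """
    module_dir = Path(inputs_root) / "data" / module_name
    instruction = (module_dir / "instruction.txt").read_text(encoding="utf-8")
    rubric = (module_dir / "rubric.txt").read_text(encoding="utf-8")
    # newline="" lets the csv module handle line endings itself.
    with (module_dir / "data.csv").open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    return {"instruction": instruction, "rubric": rubric, "rows": rows}
```

Calling load_module_assets("inputs", "my_module") would return a dictionary holding the two text files and one dict per CSV row, assuming all three files exist for that module.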
outputs
- {module_name}_{time_stamp}/: run artifacts, including the copied config (same filename you passed to main), {test_name}_results_{llm_model_name}.{csv|xlsx}, {llm_model_name}_report.json, and synthetic_{test_name}.{csv|xlsx} files (format controlled by admin.output_file_format).
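The naming pattern above can be sketched as a small helper that builds the expected artifact paths for one run. This is only an illustration: the function expected_artifacts is hypothetical, and the timestamp format used when none is supplied is an assumption, since the guide does not state how JRH formats {time_stamp}.

```python
from datetime import datetime
from pathlib import Path


def expected_artifacts(module_name, test_name, llm_model_name,
                       file_format="csv", time_stamp=None):
    """Build the artifact paths implied by the naming pattern (hypothetical helper)."""
    if file_format not in ("csv", "xlsx"):
        raise ValueError(f"unsupported output format: {file_format!r}")
    if time_stamp is None:
        # Assumed format; JRH's real timestamp layout may differ.
        time_stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path("outputs") / f"{module_name}_{time_stamp}"
    return {
        "results": run_dir / f"{test_name}_results_{llm_model_name}.{file_format}",
        "report": run_dir / f"{llm_model_name}_report.json",
        "synthetic": run_dir / f"synthetic_{test_name}.{file_format}",
    }
```

A helper like this can be handy in tests that check whether a run produced every expected file, with file_format mirroring the value of admin.output_file_format.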
prompts
- synthetic_generation_prompts/: includes basic_perturbation_instructions.json plus agentic context files such as perturbation_context.txt and perturbation_rubric.txt.
- templates/judge/: aiautograder evaluation templates (also used during synthetic-data validation).
- templates/synthetic/: generation templates for the perturbation pipelines.
src
- harness.py: main JRH program that executes the triage specified by the default config.
- utils.py: helper functions used by main.py.
- agent_helpers/ (downstream): support files for agentic mode.
- configs/ (upstream): houses the default configuration.
- core/ (upstream): helpers that interpret inputs according to the schemas.
- reliability_tests/ (downstream): evaluation, metrics, and cost-curve logic.
- review_server/ (downstream): human-in-the-loop (HITL) review server utilities.
- schemas/ (upstream): Pydantic schemas used prior to entering harness.py.
- synthetic_data_pipeline/ (downstream): primary synthetic-generation pipeline.
tests
- Unit tests for the project.
Reference assets
- make_cost_curves.md, Quickstart.ipynb, README.md, and walkthrough_figs/: walkthrough materials for users.
Root entry point
- main.py: launches the JRH using command-line arguments.
Detailed Notes
Information flows through the upstream modules before reaching harness.py, then continues through the downstream components. This separation keeps the inputs and outputs of each stage easy to tell apart.
Note
core/, configs/, and schemas/ are treated as upstream dependencies, while directories such as synthetic_data_pipeline/, reliability_tests/, and review_server/ are downstream consumers of the processed data.
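The upstream/downstream split can be pictured with a toy example: raw input is validated first, the harness receives only the validated object, and downstream steps consume it. Everything here is a stand-in. RunConfig, validate, generate_synthetic, evaluate, and run_harness are hypothetical names, and a frozen dataclass is used in place of the Pydantic schemas so the sketch runs without third-party packages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Stand-in for an upstream schema (the real ones are Pydantic models)."""
    module_name: str
    n_samples: int


def validate(raw):
    """Upstream step: turn raw input into a validated config."""
    if "module_name" not in raw:
        raise ValueError("module_name is required")
    n_samples = int(raw.get("n_samples", 1))
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return RunConfig(module_name=str(raw["module_name"]), n_samples=n_samples)


def generate_synthetic(config):
    """Downstream step: produce placeholder samples from the config."""
    return [f"{config.module_name}-sample-{i}" for i in range(config.n_samples)]


def evaluate(samples):
    """Downstream step: summarize the generated samples."""
    return {"n_evaluated": len(samples)}


def run_harness(raw):
    """Harness: validate upstream, then hand off to downstream steps."""
    config = validate(raw)
    samples = generate_synthetic(config)
    return evaluate(samples)
```

Because the downstream functions accept only a validated RunConfig or its outputs, malformed input fails in one place before any generation or evaluation work starts.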
synthetic_data_pipeline contents
- agent_perturbation.py: perturbations for agentic mode.
- base_pipeline.py: shared base class referenced by:
  - basic_perturbation_pipeline.py
  - synthetic_ordinal_pipeline.py
- data_registry.py: I/O helpers for saving or uploading generated data.
- registry.py (upstream): links test_name values to their configuration.
- review_server_manager.py: orchestrates the review server when invoked by JRH.
- stochastic_stability.py: perturbations for stochastic stability tests.
- synthetic_data_adapter.py: triage logic used by harness.py to determine which test to run.
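The relationship between a registry that links test_name values to configuration and an adapter that decides which test to run might look like the sketch below. It is not the JRH implementation: RegistryEntry, TEST_REGISTRY, run_test, and the two toy test names are hypothetical, chosen only to show the lookup-then-dispatch pattern.

```python
from typing import Callable, Dict, List, NamedTuple


class RegistryEntry(NamedTuple):
    """Pairs a pipeline callable with its default parameters (hypothetical)."""
    pipeline: Callable[[List[str], dict], List[str]]
    defaults: dict


def uppercase_pipeline(rows, params):
    """Toy pipeline: uppercase every row."""
    return [row.upper() for row in rows]


def repeat_pipeline(rows, params):
    """Toy pipeline: repeat each row params['times'] times."""
    return [row for row in rows for _ in range(params["times"])]


# Registry role: map each test_name to its pipeline and default configuration.
TEST_REGISTRY: Dict[str, RegistryEntry] = {
    "toy_uppercase": RegistryEntry(uppercase_pipeline, {}),
    "toy_repeat": RegistryEntry(repeat_pipeline, {"times": 2}),
}


def run_test(test_name, rows, overrides=None):
    """Adapter role: look up the test, merge overrides, run the pipeline."""
    try:
        entry = TEST_REGISTRY[test_name]
    except KeyError:
        raise ValueError(f"unknown test_name: {test_name!r}") from None
    params = {**entry.defaults, **(overrides or {})}
    return entry.pipeline(rows, params)
```

With this shape, adding a test means adding one registry entry, and the dispatch code itself stays unchanged.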