Developer Guide

This guide provides the context developers need to continue evolving the Judge Reliability Harness (JRH).

Folder Structure Overview

inputs

configs/: custom config files that override the default parameters.
data/{module_name}/: module-specific assets, typically including instruction.txt, rubric.txt, and data.csv.
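As an illustration of how the typical module assets could be read, here is a minimal sketch. The function name load_module_assets and its return shape are hypothetical and are not part of JRH. Only the three filenames come from the layout described above.

```python
import csv
from pathlib import Path


def load_module_assets(data_root: Path, module_name: str) -> dict:
    """Read the typical assets for one module (hypothetical helper, not JRH code).

    Files that are absent map to None, since the guide only says these
    files are "typically" present.
    """
    module_dir = data_root / module_name
    assets = {}

    # Plain-text assets are returned as strings.
    for name in ("instruction.txt", "rubric.txt"):
        path = module_dir / name
        assets[name] = path.read_text(encoding="utf-8") if path.exists() else None

    # data.csv is returned as a list of row dictionaries keyed by header.
    csv_path = module_dir / "data.csv"
    if csv_path.exists():
        with csv_path.open(newline="", encoding="utf-8") as handle:
            assets["data.csv"] = list(csv.DictReader(handle))
    else:
        assets["data.csv"] = None

    return assets
```

Returning None for a missing file, rather than raising, is a design choice of this sketch; the real loaders in core/ may behave differently.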

outputs

{module_name}_{time_stamp}/: run artifacts, including the copied config (same filename you passed to main), {test_name}_results_{llm_model_name}.{csv|xlsx}, {llm_model_name}_report.json, and synthetic_{test_name}.{csv|xlsx} files (format controlled by admin.output_file_format).
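The naming pattern above can be sketched as a small path builder. This is an illustrative helper only: the name expected_artifacts is hypothetical, and the timestamp is passed in as an opaque string because this guide does not state its format.

```python
from pathlib import Path


def expected_artifacts(
    outputs_root: str,
    module_name: str,
    time_stamp: str,
    test_name: str,
    llm_model_name: str,
    file_format: str = "csv",
) -> dict:
    """Build the artifact paths for one run (hypothetical helper, not JRH code).

    file_format mirrors admin.output_file_format and is assumed to be
    either "csv" or "xlsx".
    """
    if file_format not in ("csv", "xlsx"):
        raise ValueError(f"unsupported output format: {file_format!r}")

    run_dir = Path(outputs_root) / f"{module_name}_{time_stamp}"
    return {
        "results": run_dir / f"{test_name}_results_{llm_model_name}.{file_format}",
        "report": run_dir / f"{llm_model_name}_report.json",
        "synthetic": run_dir / f"synthetic_{test_name}.{file_format}",
    }
```

A helper like this can be handy in tests that need to assert which files a run should have produced.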

prompts

synthetic_generation_prompts/: includes basic_perturbation_instructions.json plus agentic context files such as perturbation_context.txt and perturbation_rubric.txt.
templates/judge/: aiautograder evaluation templates (also used during synthetic-data validation).
templates/synthetic/: generation templates for the perturbation pipelines.

src

harness.py: main JRH program that executes the triage specified by the default config.
utils.py: helper functions used by main.py.
agent_helpers/ (downstream): support files for agentic mode.
configs/ (upstream): houses the default configuration.
core/ (upstream): helpers that interpret inputs according to the schemas.
reliability_tests/ (downstream): evaluation, metrics, and cost-curve logic.
review_server/ (downstream): human-in-the-loop (HITL) review server utilities.
schemas/ (upstream): Pydantic schemas used prior to entering harness.py.
synthetic_data_pipeline/ (downstream): primary synthetic-generation pipeline.

tests

Unit tests for the project.

Reference assets

make_cost_curves.md, Quickstart.ipynb, README.md, and walkthrough_figs/: walkthrough materials for users.

Root entry point

main.py: launches the JRH using command-line arguments.

Detailed Notes

Information flows through the upstream modules before reaching harness.py, then continues into the downstream components. Keeping the two sides separate makes it clear which modules prepare and validate inputs and which modules consume the processed data.

Note

core/, configs/, and schemas/ are treated as upstream dependencies, while directories such as synthetic_data_pipeline/, reliability_tests/, and review_server/ are downstream consumers of the processed data.
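The upstream-to-downstream flow can be pictured with a toy example. Everything in this sketch is hypothetical: RunConfig stands in for a Pydantic schema (a standard-library dataclass is used so the example runs without dependencies), and DEFAULTS, build_config, run_harness, and downstream_step are illustrative names, not JRH functions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical stand-in for a schema defined in schemas/."""
    module_name: str
    test_name: str


# Hypothetical default values, standing in for the default config in configs/.
DEFAULTS = {"module_name": "example_module", "test_name": "example_test"}


def build_config(overrides: dict) -> RunConfig:
    """Upstream step: merge overrides onto defaults and validate the keys."""
    allowed = {f.name for f in fields(RunConfig)}
    unknown = set(overrides) - allowed
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return RunConfig(**{**DEFAULTS, **overrides})


def downstream_step(config: RunConfig) -> str:
    """Downstream step: consumes only the already-validated config."""
    return f"running {config.test_name} for {config.module_name}"


def run_harness(config: RunConfig) -> str:
    """Harness step: hands the validated config to a downstream component."""
    return downstream_step(config)
```

The point of the sketch is the direction of the dependency: the downstream function never sees raw, unvalidated input.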

synthetic_data_pipeline contents

agent_perturbation.py: perturbations for agentic mode.
base_pipeline.py: shared base class referenced by:
- basic_perturbation_pipeline.py
- synthetic_ordinal_pipeline.py
data_registry.py: I/O helpers for saving or uploading generated data.
registry.py (upstream): links test_name values to their configuration.
review_server_manager.py: orchestrates the review server when invoked by JRH.
stochastic_stability.py: perturbations for stochastic stability tests.
synthetic_data_adapter.py: triage logic used by harness.py to determine which test to run.
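The relationship between a registry that maps test_name values to configuration and an adapter that decides which test to run might look roughly like the following. This is a sketch under assumptions: the registry entries, the test_name strings, and the function select_pipeline are hypothetical and do not reflect the actual contents of registry.py or synthetic_data_adapter.py.

```python
from typing import Dict

# Hypothetical registry: each test_name maps to the settings needed to run it.
# The keys and values below are illustrative, not real JRH test names.
TEST_REGISTRY: Dict[str, dict] = {
    "example_perturbation": {"pipeline": "basic", "needs_review": True},
    "example_stability": {"pipeline": "stochastic", "needs_review": False},
}


def select_pipeline(test_name: str) -> str:
    """Triage step: look up test_name and return the pipeline to run."""
    try:
        entry = TEST_REGISTRY[test_name]
    except KeyError:
        known = ", ".join(sorted(TEST_REGISTRY))
        raise ValueError(
            f"unknown test_name {test_name!r}; known values: {known}"
        ) from None
    return entry["pipeline"]
```

Failing loudly on an unregistered test_name, with the known values listed in the message, keeps misconfigured runs from silently doing nothing.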