User Guide
Before You Run
- Place inputs under `inputs/data/{module_name}/` and point `admin.dataset_config.dataset_name` at the CSV you want to use; a minimal override sketch follows this list. Instruction/rubric text files referenced in `default_params_path` should live in the same folder.
- Ensure your dataset can be mapped to the internal `request`/`response`/`expected` schema via `admin.perturbation_config.preprocess_columns_map`. If an `expected` column is missing, set `use_original_data_as_expected: True` to let the autograder populate it during preprocessing.
- Outputs default to CSV. Set `admin.output_file_format: "xlsx"` when you need Excel artifacts; JRH does this automatically when `module_name` starts with `stratus`.
- Add API keys (for example `OPENAI_API_KEY`) to `.env`. Toggle `admin.test_debug_mode: True` when you need to dry-run without LLM calls.
- `admin.time_stamp` controls the output folder name (`outputs/{module_name}_{time_stamp}`). Reuse a timestamp to append to an existing run; leave it `null` to create a new one.
- `admin.perturbation_config.use_HITL_process` enables the review UI at `http://127.0.0.1:8765` during generation. Set it to `False` for unattended runs.
- Agentic runs require an Inspect `.eval` archive and a rubric JSON; see the Agentic Mode Guide for the required shape.
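A minimal custom config for a first run might collect just these overrides. The module name and dataset below are hypothetical placeholders, not values shipped with JRH:

```yaml
admin:
  module_name: "my_module"               # hypothetical; maps to ./inputs/data/my_module/
  time_stamp: null                       # null -> create a fresh outputs/ folder
  test_debug_mode: False                 # True dry-runs without LLM calls
  dataset_config:
    dataset_name: "my_data.csv"          # hypothetical CSV under inputs/data/my_module/
    use_original_data_as_expected: True  # let the autograder fill a missing expected column
  perturbation_config:
    use_HITL_process: False              # skip the review UI for unattended runs
```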
How the Harness Executes
- Load and merge the provided YAML with defaults from `src/configs/default_config.yml`, then create an output directory.
- Preprocess the dataset: rename columns, optionally grade the original data to generate an `expected` column, and write a `_preprocessed.{csv|xlsx}` copy if needed (format follows `admin.output_file_format`).
- Generate perturbations for each `tests_to_run` entry (launching the review UI when HITL is enabled) and persist them to `synthetic_{test}.{csv|xlsx}`.
- Evaluate each `tests_to_evaluate` item (or all `tests_to_run` when left empty), skipping already-scored rows. Results are written only when `evaluation_config.overwrite_results` is `true`; otherwise existing result files are reused without saving new scores.
- Compute metrics and emit `{model}_report.json`; optional cost curves render to `cost_curve_heatmap.png`.
Default Config Overview
The default config, located at `./src/configs/default_config.yml`, has four top-level blocks; the `admin` block additionally nests three lower-level parameter blocks:

```yaml
admin:
  dataset_config:
  perturbation_config:
  evaluation_config:
test_stochastic_stability_config:
synthetic_data_params:
test_agent_perturbation_config:
```

The `admin` block governs the workflow, selecting datasets, perturbation modes, and evaluation logic. The remaining blocks provide specialized configuration for their respective test modes.
admin
- `module_name`: project name mapped to `./inputs/data/{module_name}/`, which also determines the output folder prefix.
- `time_stamp`: optional pointer to a prior output folder; when omitted, a new folder is created (named `{module_name}_{timestamp}`).
- `test_debug_mode`: replaces LLM calls with fixed defaults to conserve API tokens.
- `output_file_format`: extension for saved artifacts (`csv` by default; automatically set to `xlsx` when `module_name` starts with `stratus`).
- `dataset_config`:
  - `dataset_name`: source dataset resolved to `./inputs/data/{module_name}/{dataset_name}` (CSV or Excel).
  - `default_params_path`: text files loaded from `./inputs/data/{module_name}/...`; contents are merged into `default_params` (warnings are emitted when files are missing; see the sketch after this list).
  - `use_original_data_as_expected`: treats the original dataset as the gold standard during evaluation.
  - `default_params`:
    - `min_score`: minimum rubric score for `synthetic_ordinal` and `single_autograder` modes.
    - `max_score`: maximum rubric score for those modes.
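To illustrate how `default_params_path` feeds `default_params`: the referenced text files are loaded from the module folder and their contents merged into `default_params` alongside any scalar values you set. The file names and score bounds below are placeholders, not shipped defaults:

```yaml
admin:
  dataset_config:
    default_params_path:
      instruction: "my_instruction.txt"   # hypothetical file under ./inputs/data/{module_name}/
      rubric: "my_rubric.txt"             # file contents are merged into default_params
    default_params:
      min_score: "1"                      # placeholder rubric bounds
      max_score: "5"
```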
- `perturbation_config`:
  - `use_HITL_process`: enables human-in-the-loop review during perturbation generation (creates `review/` under the output directory).
  - `tests_to_run`: list of perturbation tests to execute (supports basic perturbations, `synthetic_ordinal`, `stochastic_stability`, `agent_perturbation`, and optional `agent_positives`).
  - `preprocess_columns_map`: maps dataset columns to the logical request/response/expected schema.
- `evaluation_config`:
  - `template`: evaluation template (e.g., `single_judge`).
  - `autograder_model_name`: LLM responsible for scoring.
  - `overwrite_results`: when `true`, writes `{test}_results_{model}.{csv|xlsx}`; when `false`, reuses existing result files without saving new scores.
  - `max_workers`: degree of parallelism during evaluation.
  - `tests_to_evaluate`: list of perturbations to score; falls back to `tests_to_run` when empty.
  - `metric`: scikit-learn metric applied to predictions.
  - `bootstrap_size`: dataset fraction used for bootstrap resampling.
  - `bootstrap_repetitions`: number of bootstrap iterations.
  - `get_cost_curves`: toggles cost-performance visualizations.
test_stochastic_stability_config
- `sample_num_from_orig`: samples drawn from the original dataset per stability test.
- `number_of_seeds`: random seeds for stability analysis.
- `repetitions`: trial count per seed.
- `seed`: base random seed (sample values below).
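The values used in the full config example at the end of this guide:

```yaml
test_stochastic_stability_config:
  sample_num_from_orig: 10   # rows sampled from the original dataset per stability test
  number_of_seeds: 1         # distinct random seeds
  repetitions: 1             # trials per seed
  seed: 87                   # base random seed
```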
synthetic_data_params
- `generation_model_name`: model used to generate synthetic data.
- `validation_model_name`: model used to validate generated samples.
- `max_tokens_generation`: token cap for generation prompts.
- `max_tokens_validation`: token cap for validation prompts.
- `max_workers`: parallel workers for generation.
- `use_similarity_filter`: drops samples that are too similar to originals.
- `sample_num_from_orig`: original examples sampled as seeds.
- `target_num_per_bucket`: desired synthetic samples per bucket or class.
- `similarity_threshold`: cosine similarity threshold for filtering.
- `rescore_sample_size`: candidate generations to rescore for quality.
- `initial_temp`: starting sampling temperature.
- `num_seed_examples_per_generation`: seeds included per prompt.
- `temp_increment`: temperature increase step.
- `max_temp_cap`: upper bound on temperature adjustments.
- `max_consecutive_failures`: abort threshold for repeated failures (the sketch after this list groups the filter and temperature knobs).
- `seed`: base seed for deterministic sampling.
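A partial sketch of the filtering and temperature knobs side by side. Values match the full example below, except `use_similarity_filter`, which is switched on here purely for illustration:

```yaml
synthetic_data_params:
  use_similarity_filter: True    # drop candidates too similar to the originals
  similarity_threshold: 0.9      # cosine similarity threshold for the filter
  initial_temp: 1.0              # starting sampling temperature
  temp_increment: 0.1            # temperature increase step
  max_temp_cap: 1.1              # upper bound on temperature adjustments
  max_consecutive_failures: 2    # abort after this many repeated failures
```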
test_agent_perturbation_config
- `input_log_path`: Inspect `.eval` archive to perturb (resolved relative paths must exist).
- `rubric_path`: rubric JSON; entries typically include `id`, `instructions`, and optional `score_levels`.
- `target_rubric_ids`: optional subset of rubric IDs to target; empty means use all rubric rows.
- `max_summary_messages`: cap on summary messages.
- `max_edit_rounds`: maximum refinement iterations.
- `trace_messages`: log planner decisions when enabled.
- `sample_num_from_orig` / `sampling_seed`: optional limit and seed for sampling runs from the Inspect archive.
- `transcript_preprocessors` (alias `transcript_preprocessor`): list of callables or dotted import paths run on each normalized transcript before planning.
- `autograder_template`: defaults to `evaluation_config.template` when omitted.
- `autograder_default_params`: defaults to `admin.dataset_config.default_params`.
- `score_targets`: optional ordinal scores to target in autograder mode.
- `objective`: `"fail"` (default) induces rubric violations; `"pass"` preserves rubric satisfaction.
- `pass_required`: enforces verifier PASS when `objective` is `"pass"`.
- `planner` / `editor` / `summary` / `verifier`: stage configs with `model`, `prompt_path`, and `temperature`. Missing blocks fall back to default prompts and the generation model; an empty `summary` block (`{}`) inherits the editor settings, and a verifier is only instantiated when provided.
- `output`: destination for agent JSONL + debug bundles (`dir`), toggle for `write_jsonl`, and `overwrite` behavior when rerunning. A minimal sketch of the whole block follows this list.
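A minimal sketch of this block, assuming hypothetical paths and an illustrative model id; none of the values below are shipped defaults:

```yaml
test_agent_perturbation_config:
  input_log_path: "inputs/data/my_agent_module/run.eval"   # hypothetical Inspect archive
  rubric_path: "inputs/data/my_agent_module/rubric.json"   # hypothetical rubric JSON
  target_rubric_ids: []          # empty -> target all rubric rows
  objective: "fail"              # induce rubric violations (the default)
  max_edit_rounds: 3             # illustrative refinement budget
  planner:
    model: "openai/gpt-4o-mini"  # prompt_path omitted -> default prompt is used
  editor:
    model: "openai/gpt-4o-mini"
  summary: {}                    # empty block inherits the editor settings
  output:
    dir: "outputs/agent_perturbation"   # destination for JSONL + debug bundles
    write_jsonl: true
    overwrite: false
```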
Human Review UI (HITL)
- The review UI is enabled when `admin.perturbation_config.use_HITL_process` is `true` (the default in sample configs).
- A local server starts at `http://127.0.0.1:8765` while perturbations stream in. Accept, reject, or edit rows, then click Finalize (or press Enter in the terminal) to continue the run.
- Decisions are persisted under `outputs/{module}/review/`; accepted/edited items are reflected in `synthetic_{test}.csv`, rejected ones are removed.
- Set the flag to `false` for unattended or CI runs (as in the override below); generated items will be accepted as-is.
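For example, to run without the review UI, override:

```yaml
admin:
  perturbation_config:
    use_HITL_process: False   # accept generated items as-is
```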
Outputs, Caching, and Reruns
- `outputs/{module}_{timestamp}/{config_name}.yml`: merged config snapshot saved with the same filename you passed to `main` (for example `default_config.yml` or `config_agentharm.yml`).
- `synthetic_{test}.{csv|xlsx}`: perturbations saved incrementally in the configured `output_file_format`. Reruns append only missing items via the `DataRegistry`; when both agent modes run, `agent_perturbation` and `agent_positives` share this file (filter by `test_name`).
- `{test}_results_{model}.{csv|xlsx}`: autograder scores written only when `evaluation_config.overwrite_results` is `true`. With the flag `false`, JRH reuses existing result files and keeps any new scores in-memory for the current run.
- `{model}_report.json`: aggregated metrics for all evaluated tests; `cost_curve_heatmap.png` appears alongside when `get_cost_curves` is enabled.
- Agentic runs also write `agent_perturbations.jsonl`, a summary JSON, and debug bundles under `test_agent_perturbation_config.output.dir` (defaults to an `outputs/agent_perturbation/...` folder).
- To continue a partial run, reuse the same `time_stamp` so JRH reads prior artifacts; flip `overwrite_results` to `true` when you want fresh result files (see the override sketch below).
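A plausible override for resuming an earlier run and persisting fresh scores; the timestamp value is illustrative and should match the suffix of an existing `outputs/` folder:

```yaml
admin:
  time_stamp: "20251103_1222"    # illustrative; reuse an existing {module_name}_{time_stamp} suffix
  evaluation_config:
    overwrite_results: True      # write new result files instead of reusing cached ones
```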
Overwriting the Default Config
JRH loads `./src/configs/default_config.yml` by default and applies overrides from `./inputs/configs/CUSTOM_CONFIG.yml`. Any missing parameters fall back to the default configuration. Run a custom configuration with:

```bash
python -m main ./inputs/configs/CUSTOM_CONFIG.yml
```

For example, `config_persuade.yml` applies these overrides:
```yaml
admin:
  module_name: "persuade"
  dataset_config:
    dataset_name: "persuade_corpus_2.0_test_sample.csv"
    default_params_path:
      instruction: "persuade_instruction.txt"
      rubric: "persuade_rubric.txt"
  perturbation_config:
    tests_to_run:
      - "synthetic_ordinal"
    preprocess_columns_map:
      request: "assignment"
      response: "full_text"
      expected: "holistic_essay_score"
  evaluation_config:
    template: "single_autograder"
    tests_to_evaluate:
      - "synthetic_ordinal"
```

This configuration designates the dataset, instruction, and rubric paths for the persuade module, maps dataset columns to the required schema, and specifies `synthetic_ordinal` for both generation and evaluation.
Operation 1: Data Preprocess
Data must be standardized before use. The preprocess step (1) renames columns according to `preprocess_columns_map` and (2) evaluates the original dataset with the AI autograder when requested.
```yaml
admin:
  dataset_config:
    use_original_data_as_expected:
  perturbation_config:
    preprocess_columns_map:
```

By default, the `harmbench_binary` project uses:
```yaml
preprocess_columns_map:
  request: "test_case"
  response: "generation"
  expected: "human_consensus"
```

The `test_case` column becomes `request`, and so on. When `use_original_data_as_expected` is enabled, the AI autograder produces ground-truth labels unless the dataset already contains an `expected` column.
JRH reads CSV or Excel automatically; for unknown extensions it falls back to `admin.output_file_format` to pick a reader.

If the AI autograder runs during preprocessing, a new dataset named `./inputs/data/{module_name}/{stem(dataset_name)}_preprocessed.{csv|xlsx}` is emitted (matching `output_file_format`). Use this file as the source for subsequent runs.
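For example, if the original dataset were a hypothetical `my_data.csv`, later runs could point directly at the emitted copy:

```yaml
admin:
  dataset_config:
    dataset_name: "my_data_preprocessed.csv"   # hypothetical; follows {stem(dataset_name)}_preprocessed.csv
```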
Operation 2: Synthetic data generation
Synthetic data generation follows the tests_to_run list:
```yaml
admin:
  perturbation_config:
    tests_to_run:
```

Available tests:

- `stochastic_stability`, `synthetic_ordinal`, and `agent_perturbation` each use their dedicated configs.
- `agent_positives` shares the agent config; when both agent modes run, they are stored together in `synthetic_agent_perturbation.{csv|xlsx}` (filter by the `test_name` column to split them).
- Basic perturbations (`label_flip`, `format_invariance_{1,2,3}`, `semantic_paraphrase`, `answer_ambiguity`, `verbosity_bias`) share `synthetic_data_params` and the instructions in `prompts/synthetic_generation_prompts/basic_perturbation_instructions.json`.
Add new basic tests by extending `./prompts/synthetic_generation_prompts/basic_perturbation_instructions.json`. A sample `tests_to_run` list combining several of the tests above is sketched below.
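For instance, `tests_to_run` could combine basic perturbations with the ordinal generator; this particular combination is illustrative, not a recommended set:

```yaml
admin:
  perturbation_config:
    tests_to_run:
      - "label_flip"
      - "semantic_paraphrase"
      - "format_invariance_1"
      - "synthetic_ordinal"
```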
Operation 3: Synthetic data evaluation
Synthetic outputs are scored according to:
```yaml
admin:
  evaluation_config:
    template: "single_judge"
    autograder_model_name: "openai/gpt-4o-mini"
    overwrite_results: False
    max_workers: 10
    tests_to_evaluate:
      - "label_flip"
```

When `overwrite_results: true`, each evaluated test is written to `./outputs/{module_name}_{time_stamp}/{test_name}_results_{model_name}.{csv|xlsx}` (matching `output_file_format`). For the default configuration, this yields files such as `./outputs/harmbench_binary_20251103_1222/label_flip_results_openai_gpt-4o-mini.csv`. If a results file already exists and `overwrite_results` is `false`, JRH reuses the cached rows; new evaluations are not persisted unless you flip the flag to `true`.
Operation 4: Metrics gathering
Metrics compare evaluated scores against the `expected` column using the configured scikit-learn metric. Bootstrapping is available via `bootstrap_size` and `bootstrap_repetitions`.
```yaml
admin:
  evaluation_config:
    metric: "accuracy_score"
    bootstrap_size: 0.1
    bootstrap_repetitions: 10
    tests_to_evaluate:
      - "label_flip"
```

Operation 5: Cost curves
Set `get_cost_curves` to generate seaborn heatmaps of metric scores across synthetic tests and AI autograder models.
```yaml
admin:
  evaluation_config:
    get_cost_curves: False
```

Cost curves aggregate every `{model}_report.json` in `./outputs`; if multiple reports share the same model id, later files in the scan overwrite earlier ones in the heatmap.
Full Config Example for Persuade Benchmark:
```yaml
admin:
  module_name: "persuade"
  time_stamp: null
  test_debug_mode: False
  dataset_config:
    dataset_name: "persuade_corpus_2.0_test_sample.csv"
    default_params_path:
      instruction: "persuade_instruction.txt"
      rubric: "persuade_rubric.txt"
    use_original_data_as_expected: False
    default_params:
      lowest_score: "1"
      highest_score: "6"
  perturbation_config:
    use_HITL_process: True
    tests_to_run:
      - "synthetic_ordinal"
    preprocess_columns_map:
      original_idx: "essay_id"
      request: "assignment"
      response: "full_text"
      expected: "holistic_essay_score"
  evaluation_config:
    template: "single_autograder"
    autograder_model_name: "openai/gpt-4o-mini"
    overwrite_results: False
    max_workers: 10
    tests_to_evaluate:
      - "synthetic_ordinal"
    metric: "accuracy_score"
    bootstrap_size: 0.1
    bootstrap_repetitions: 10
    get_cost_curves: False
test_stochastic_stability_config:
  sample_num_from_orig: 10
  number_of_seeds: 1
  repetitions: 1
  seed: 87
synthetic_data_params:
  generation_model_name: "openai/gpt-4o-mini"
  validation_model_name: "openai/gpt-4o-mini"
  max_tokens_generation: 1200
  max_tokens_validation: 1200
  max_workers: 10
  use_similarity_filter: False
  sample_num_from_orig: 1
  target_num_per_bucket: 1
  similarity_threshold: 0.9
  rescore_sample_size: 2
  initial_temp: 1.0
  num_seed_examples_per_generation: 1
  temp_increment: 0.1
  max_temp_cap: 1.1
  max_consecutive_failures: 2
  seed: 87
```