Quick Start

Prepare Your Inputs
- Place your dataset, rubric, and instruction files under `inputs/data/{module_name}/`.
  - Default binary example: `inputs/data/harmbench_binary/harmbench.csv`, `harmbench_binary_instruction.txt`, `harmbench_binary_rubric.txt`.
  - Ordinal example: `inputs/data/persuade/persuade_corpus_2.0_test_sample.csv`, `persuade_instruction.txt`, `persuade_rubric.txt`.
- Confirm the config maps your dataset columns to JRH’s request/response/expected schema via `admin.perturbation_config.preprocess_columns_map`.
- Set API keys in `.env` (for example `OPENAI_API_KEY`) and optionally enable `admin.test_debug_mode: True` for dry runs that avoid LLM calls.
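A minimal sketch of those keys, assuming hypothetical column names (`prompt`, `completion`, `label`) and an assumed mapping direction; check `src/configs/default_config.yml` for the exact shape:

```yaml
# Sketch only: key nesting follows the dotted paths above; "prompt",
# "completion", and "label" are hypothetical dataset column names, and the
# mapping direction is an assumption -- verify against default_config.yml.
admin:
  test_debug_mode: True            # optional: dry runs that skip LLM calls
  perturbation_config:
    preprocess_columns_map:
      request: prompt              # column holding the model input
      response: completion         # column holding the model output
      expected: label              # column holding the reference label/score
```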
Run Common Workflows
Activate the uv environment you created during installation, or prefix commands with `uv run` to reuse it automatically.
- Default Harmbench label-flip (with the review UI on): `uv run python -m main`
- Default Harmbench without the review UI (useful for headless runs): `uv run python -m main inputs/configs/config_binary_harmbench.yml` with `admin.perturbation_config.use_HITL_process: False` set in the config.
- Ordinal essay scoring (Persuade): `uv run python -m main inputs/configs/config_persuade.yml`
- Build your own run: copy `src/configs/default_config.yml` to `inputs/configs/your_config.yml`, adjust paths/tests, then execute `uv run python -m main inputs/configs/your_config.yml`.
- Agentic mode: add `"agent_perturbation"` to `tests_to_run`, fill `test_agent_perturbation_config` with your Inspect `.eval` archive and rubric, pick the `evaluation_config.template` (`agent_judge` or `agent_autograder`), and launch with the same command pattern. See the Agentic Mode Guide for details and the config sketch below.
Use `uv run python -m main --help` to view CLI arguments.
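A rough sketch of the agentic additions. Only `tests_to_run`, `test_agent_perturbation_config`, `evaluation_config.template`, and `output.dir` are named in this guide; `eval_archive` and `rubric_path` are placeholder key names, so consult the Agentic Mode Guide for the real ones:

```yaml
# Hypothetical sketch -- eval_archive and rubric_path are placeholder keys.
tests_to_run:
  - agent_perturbation
test_agent_perturbation_config:
  eval_archive: inputs/data/my_agent_run.eval   # your Inspect .eval archive (placeholder key)
  rubric_path: inputs/data/agent_rubric.txt     # rubric for grading agent runs (placeholder key)
  output:
    dir: outputs/agent_perturbation             # where per-run debug bundles land
evaluation_config:
  template: agent_judge                         # or agent_autograder
```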
Human Review UI (optional)
If `admin.perturbation_config.use_HITL_process` is true, JRH starts a lightweight review UI at `http://127.0.0.1:8765` while perturbations stream in. Accept, reject, or edit rows in the browser, then click Finalize (or press Enter in the terminal) to continue. Set the flag to false to skip the UI and accept all generated items.
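In config-file form, the headless setting is the same flag described above:

```yaml
admin:
  perturbation_config:
    use_HITL_process: False   # skip the review UI and accept all generated rows
```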
Inspect Outputs
Artifacts live under `outputs/{module_name}_{timestamp}/`:
- `{config_name}.yml`: merged configuration snapshot, saved with the filename you passed to `main`.
- `synthetic_{test}.{csv|xlsx}`: generated perturbations (after any human edits) in the configured `output_file_format`.
- `{test}_results_{model}.{csv|xlsx}`: autograder scores per perturbation (written when `evaluation_config.overwrite_results` is `true`; otherwise existing results are reused).
- `{model}_report.json`: metrics summary; includes the cost-curve image `cost_curve_heatmap.png` if enabled.
- `review/`: review UI decisions and logs when HITL is enabled.
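The two settings referenced above can be sketched as follows; the exact placement of `output_file_format` in the config hierarchy is an assumption, so check `src/configs/default_config.yml`:

```yaml
# Sketch only: placement of output_file_format is assumed, not confirmed.
output_file_format: csv          # or xlsx
evaluation_config:
  overwrite_results: true        # re-run grading and overwrite existing results files
```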
Agentic runs also emit `agent_perturbations.jsonl` plus per-run debug bundles under `test_agent_perturbation_config.output.dir` (defaults to `outputs/agent_perturbation/...`).