User Guide

Before You Run

  • Place inputs under inputs/data/{module_name}/ and point admin.dataset_config.dataset_name at the CSV (or Excel) file you want to use. Instruction/rubric text files referenced in default_params_path should live in the same folder.
  • Ensure your dataset can be mapped to the internal request / response / expected schema via admin.perturbation_config.preprocess_columns_map. If an expected column is missing, set use_original_data_as_expected: True to let the autograder populate it during preprocessing.
  • Outputs default to CSV. Set admin.output_file_format: "xlsx" when you need Excel artifacts; JRH does this automatically when module_name starts with stratus.
  • Add API keys (for example OPENAI_API_KEY) to .env. Toggle admin.test_debug_mode: True when you need to dry-run without LLM calls.
  • admin.time_stamp controls the output folder name (outputs/{module_name}_{time_stamp}). Reuse a timestamp to append to an existing run; leave it null to create a new one.
  • admin.perturbation_config.use_HITL_process enables the review UI at http://127.0.0.1:8765 during generation. Set it to False for unattended runs.
  • Agentic runs require an Inspect .eval archive and a rubric JSON; see the Agentic Mode Guide for the required shape.
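
Taken together, a minimal pre-run override might look like the following sketch; the module name, dataset filename, and source column names are placeholders, not values from a shipped project:

admin:
  module_name: "my_module"               # placeholder; resolves to ./inputs/data/my_module/
  test_debug_mode: False                 # set True to dry-run without LLM calls
  output_file_format: "csv"
  dataset_config:
    dataset_name: "my_dataset.csv"       # placeholder filename under inputs/data/my_module/
    use_original_data_as_expected: True  # only if the dataset lacks an expected column
  perturbation_config:
    use_HITL_process: False              # unattended run; no review UI
    preprocess_columns_map:
      request: "prompt"                  # placeholder source column names
      response: "model_output"
      expected: "label"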

How the Harness Executes

  1. Load and merge the provided YAML with defaults from src/configs/default_config.yml, then create an output directory.
  2. Preprocess the dataset: rename columns, optionally grade the original data to generate an expected column, and write a _preprocessed.{csv|xlsx} copy if needed (format follows admin.output_file_format).
  3. Generate perturbations for each tests_to_run entry (launching the review UI when HITL is enabled) and persist them to synthetic_{test}.{csv|xlsx}.
  4. Evaluate each tests_to_evaluate item (or all tests_to_run when left empty), skipping already-scored rows. Results are written only when evaluation_config.overwrite_results is true; otherwise existing result files are reused without saving new scores.
  5. Compute metrics and emit {model}_report.json; optional cost curves render to cost_curve_heatmap.png.
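
Several of these stages are driven by parameters under admin; a hedged sketch of the mapping for steps 3–5 (values are illustrative, not recommended defaults):

admin:
  perturbation_config:
    tests_to_run:                # step 3: which perturbations to generate
      - "label_flip"
  evaluation_config:
    tests_to_evaluate: []        # step 4: empty list falls back to tests_to_run
    overwrite_results: True      # step 4: persist {test}_results_{model} files
    metric: "accuracy_score"     # step 5: metric behind {model}_report.json
    get_cost_curves: False       # step 5: optional cost_curve_heatmap.png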

Default Config Overview

The default config, located at ./src/configs/default_config.yml, groups its parameters into the admin block (which nests dataset_config, perturbation_config, and evaluation_config) and three specialized test-mode blocks:

admin:
  dataset_config:
  perturbation_config:
  evaluation_config:

test_stochastic_stability_config:
synthetic_data_params:
test_agent_perturbation_config:

The admin block governs the overall workflow: it selects datasets, perturbation modes, and evaluation logic. The remaining blocks provide specialized configuration for their respective test modes.

admin

  • module_name: project name mapped to ./inputs/data/{module_name}/, which also determines the output folder prefix.
  • time_stamp: optional pointer to a prior output folder; when omitted, a new folder is created (named {module_name}_{timestamp}).
  • test_debug_mode: replaces LLM calls with fixed defaults to conserve API tokens.
  • output_file_format: extension for saved artifacts (csv by default; automatically set to xlsx when module_name starts with stratus).
  • dataset_config:
    • dataset_name: source dataset resolved to ./inputs/data/{module_name}/{dataset_name} (CSV or Excel).
    • default_params_path: text files loaded from ./inputs/data/{module_name}/...; contents are merged into default_params (a warning is emitted when a referenced file is missing).
    • use_original_data_as_expected: treats the original dataset as the gold standard during evaluation.
    • default_params:
      • min_score: minimum rubric score for synthetic_ordinal and single_autograder modes.
      • max_score: maximum rubric score for those modes.
  • perturbation_config:
    • use_HITL_process: enables human-in-the-loop review during perturbation generation (creates review/ under the output directory).
    • tests_to_run: list of perturbation tests to execute (supports basic perturbations, synthetic_ordinal, stochastic_stability, agent_perturbation, and optional agent_positives).
    • preprocess_columns_map: maps dataset columns to the logical request/response/expected schema.
  • evaluation_config:
    • template: evaluation template (e.g., single_judge).
    • autograder_model_name: LLM responsible for scoring.
    • overwrite_results: when true, writes {test}_results_{model}.{csv|xlsx}; when false, reuses existing result files without saving new scores.
    • max_workers: degree of parallelism during evaluation.
    • tests_to_evaluate: list of perturbations to score; falls back to tests_to_run when empty.
    • metric: scikit-learn metric applied to predictions.
    • bootstrap_size: dataset fraction used for bootstrap resampling.
    • bootstrap_repetitions: number of bootstrap iterations.
    • get_cost_curves: toggles cost-performance visualizations.

test_stochastic_stability_config

  • sample_num_from_orig: samples drawn from the original dataset per stability test.
  • number_of_seeds: random seeds for stability analysis.
  • repetitions: trial count per seed.
  • seed: base random seed.

synthetic_data_params

  • generation_model_name: model used to generate synthetic data.
  • validation_model_name: model used to validate generated samples.
  • max_tokens_generation: token cap for generation prompts.
  • max_tokens_validation: token cap for validation prompts.
  • max_workers: parallel workers for generation.
  • use_similarity_filter: drops samples that are too similar to originals.
  • sample_num_from_orig: original examples sampled as seeds.
  • target_num_per_bucket: desired synthetic samples per bucket or class.
  • similarity_threshold: cosine similarity threshold for filtering.
  • rescore_sample_size: candidate generations to rescore for quality.
  • initial_temp: starting sampling temperature.
  • num_seed_examples_per_generation: seeds included per prompt.
  • temp_increment: temperature increase step.
  • max_temp_cap: upper bound on temperature adjustments.
  • max_consecutive_failures: abort threshold for repeated failures.
  • seed: base seed for deterministic sampling.

test_agent_perturbation_config

  • input_log_path: Inspect .eval archive to perturb (resolved relative paths must exist).
  • rubric_path: rubric JSON; entries typically include id, instructions, and optional score_levels.
  • target_rubric_ids: optional subset of rubric IDs to target; empty means use all rubric rows.
  • max_summary_messages: cap on summary messages.
  • max_edit_rounds: maximum refinement iterations.
  • trace_messages: log planner decisions when enabled.
  • sample_num_from_orig / sampling_seed: optional limit and seed for sampling runs from the Inspect archive.
  • transcript_preprocessors (alias transcript_preprocessor): list of callables or dotted import paths run on each normalized transcript before planning.
  • autograder_template: defaults to evaluation_config.template when omitted; autograder_default_params: defaults to admin.dataset_config.default_params; score_targets: optional ordinal scores to target in autograder mode.
  • objective: "fail" (default) induces rubric violations; "pass" preserves rubric satisfaction. pass_required enforces verifier PASS when objective is "pass".
  • planner / editor / summary / verifier: stage configs with model, prompt_path, and temperature. Missing blocks fall back to default prompts and the generation model; an empty {} summary block inherits the editor settings, and a verifier is only instantiated when provided.
  • output: destination for agent JSONL + debug bundles (dir), toggle for write_jsonl, and overwrite behavior when rerunning.
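
The full config example at the end of this guide does not include an agentic block, so here is a minimal, hedged sketch; the paths and model names are placeholders, and the Agentic Mode Guide remains the authoritative reference for the required shapes:

test_agent_perturbation_config:
  input_log_path: "inputs/data/my_module/run.eval"   # placeholder Inspect archive
  rubric_path: "inputs/data/my_module/rubric.json"   # placeholder rubric JSON
  target_rubric_ids: []           # empty = target all rubric rows
  objective: "fail"               # induce rubric violations (default)
  sample_num_from_orig: 5         # optional sampling limit
  sampling_seed: 87
  planner:
    model: "openai/gpt-4o-mini"   # placeholder model name
  editor:
    model: "openai/gpt-4o-mini"
  summary: {}                     # empty block inherits the editor settings
  output:
    write_jsonl: True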

Human Review UI (HITL)

  • The review UI is enabled when admin.perturbation_config.use_HITL_process is true (default in sample configs).
  • A local server starts at http://127.0.0.1:8765 while perturbations stream in. Accept, reject, or edit rows, then click Finalize (or press Enter in the terminal) to continue the run.
  • Decisions are persisted under outputs/{module}/review/; accepted/edited items are reflected in synthetic_{test}.csv, rejected ones are removed.
  • Set the flag to false for unattended or CI runs; generated items will be accepted as-is.
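
For example, an unattended or CI run only needs the flag itself overridden:

admin:
  perturbation_config:
    use_HITL_process: False   # generated items are accepted as-is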

Outputs, Caching, and Reruns

  • outputs/{module}_{timestamp}/{config_name}.yml: merged config snapshot saved with the same filename you passed to main (for example default_config.yml or config_agentharm.yml).
  • synthetic_{test}.{csv|xlsx}: perturbations saved incrementally in the configured output_file_format. Reruns append only missing items via the DataRegistry; when both agent modes run, agent_perturbation and agent_positives share this file (filter by test_name).
  • {test}_results_{model}.{csv|xlsx}: autograder scores written only when evaluation_config.overwrite_results is true. With the flag false, JRH reuses existing result files and keeps any new scores in memory for the current run.
  • {model}_report.json: aggregated metrics for all evaluated tests; cost_curve_heatmap.png appears alongside when get_cost_curves is enabled.
  • Agentic runs also write agent_perturbations.jsonl, a summary JSON, and debug bundles under test_agent_perturbation_config.output.dir (defaults to an outputs/agent_perturbation/... folder).
  • To continue a partial run, reuse the same time_stamp so JRH reads prior artifacts; flip overwrite_results to true when you want fresh result files.
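
A hedged sketch of a rerun override that continues an earlier run and refreshes its result files (the timestamp is a placeholder for an existing output folder suffix):

admin:
  time_stamp: "20251103_1222"    # placeholder; must match an existing outputs/{module}_{time_stamp} folder
  evaluation_config:
    overwrite_results: True      # write fresh {test}_results_{model} files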

Overwriting the Default Config

JRH loads ./src/configs/default_config.yml by default and applies overrides from ./inputs/configs/CUSTOM_CONFIG.yml. Any missing parameters fall back to the default configuration. Run a custom configuration with:

python -m main ./inputs/configs/CUSTOM_CONFIG.yml

For example, config_persuade.yml applies these overrides:

admin:
  module_name: "persuade"
  dataset_config:
    dataset_name: "persuade_corpus_2.0_test_sample.csv"
    default_params_path:
      instruction: "persuade_instruction.txt"
      rubric: "persuade_rubric.txt"

  perturbation_config:
    tests_to_run:
      - "synthetic_ordinal"

    preprocess_columns_map:
      request: "assignment"
      response: "full_text"
      expected: "holistic_essay_score"
  
  evaluation_config:
    template: "single_autograder"
    tests_to_evaluate:
      - "synthetic_ordinal"

This configuration designates the dataset, instruction, and rubric paths for the persuade module, maps dataset columns to the required schema, and specifies synthetic_ordinal for both generation and evaluation.

Operation 1: Data preprocessing

Data must be standardized before use. The preprocess step (1) renames columns according to preprocess_columns_map and (2), when use_original_data_as_expected is enabled, evaluates the original dataset with the autograder to produce an expected column.

admin:
  dataset_config:
    use_original_data_as_expected:

  perturbation_config:
    preprocess_columns_map:

By default, the harmbench_binary project uses:

preprocess_columns_map:
  request: "test_case"
  response: "generation"
  expected: "human_consensus"

The test_case column becomes request, and so on. When use_original_data_as_expected is enabled, the autograder produces ground-truth labels unless the dataset already contains an expected column.
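
Putting the two preprocessing controls together for harmbench_binary might look like the sketch below; whether use_original_data_as_expected should be enabled depends on your dataset, so treat that line as an assumption:

admin:
  dataset_config:
    use_original_data_as_expected: True   # assumption: the source data has no usable expected column
  perturbation_config:
    preprocess_columns_map:
      request: "test_case"
      response: "generation"
      expected: "human_consensus"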

JRH reads CSV or Excel automatically; for unknown extensions it falls back to admin.output_file_format to pick a reader.

Note

If the autograder runs during preprocessing, a new dataset named ./inputs/data/{module_name}/{stem(dataset_name)}_preprocessed.{csv|xlsx} is emitted (matching output_file_format). Use this file as the source for subsequent runs.
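
A follow-up run can then point directly at that file; the filename below is hypothetical and depends on the original dataset name:

admin:
  dataset_config:
    dataset_name: "my_dataset_preprocessed.csv"   # hypothetical: {stem(dataset_name)}_preprocessed.csv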

Operation 2: Synthetic data generation

Synthetic data generation follows the tests_to_run list:

admin:
  perturbation_config:
    tests_to_run:

Available tests:

  • stochastic_stability, synthetic_ordinal, and agent_perturbation each use their dedicated configs.
  • agent_positives shares the agent config; when both agent modes run, they are stored together in synthetic_agent_perturbation.{csv|xlsx} (filter by the test_name column to split them).
  • Basic perturbations (label_flip, format_invariance_{1,2,3}, semantic_paraphrase, answer_ambiguity, verbosity_bias) share synthetic_data_params and the instructions in prompts/synthetic_generation_prompts/basic_perturbation_instructions.json.

Add new basic tests by extending ./prompts/synthetic_generation_prompts/basic_perturbation_instructions.json.
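
A tests_to_run list mixing basic perturbations with a dedicated-config test could look like this (an illustrative selection, not a recommended default):

admin:
  perturbation_config:
    tests_to_run:
      - "label_flip"              # basic perturbation (shared prompts + synthetic_data_params)
      - "semantic_paraphrase"     # basic perturbation
      - "stochastic_stability"    # uses test_stochastic_stability_config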

Operation 3: Synthetic data evaluation

Synthetic outputs are scored according to:

admin:
  evaluation_config:
    template: "single_judge"
    autograder_model_name: "openai/gpt-4o-mini"
    overwrite_results: False
    max_workers: 10

    tests_to_evaluate:
      - "label_flip"

When overwrite_results: true, each evaluated test is written to ./outputs/{module_name}_{time_stamp}/{test_name}_results_{model_name}.{csv|xlsx} (matching output_file_format). For the default configuration, this yields files such as ./outputs/harmbench_binary_20251103_1222/label_flip_results_openai_gpt-4o-mini.csv. If a results file already exists and overwrite_results is false, JRH reuses the cached rows; new evaluations are not persisted unless you flip the flag to true.

Operation 4: Metrics gathering

Metrics compare evaluated scores against the expected column using the configured scikit-learn metric. Bootstrapping is available via bootstrap_size and bootstrap_repetitions.

admin:
  evaluation_config:
    metric: "accuracy_score"
    bootstrap_size: 0.1
    bootstrap_repetitions: 10
    
    tests_to_evaluate:
      - "label_flip"

Operation 5: Cost curves

Set get_cost_curves: True to generate seaborn heatmaps of metric scores across synthetic tests and autograder models.

admin:
  evaluation_config:
    get_cost_curves: False

Note

Cost curves aggregate every {model}_report.json in ./outputs; if multiple reports share the same model id, later files in the scan overwrite earlier ones in the heatmap.
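
To populate the heatmap across models, enable the toggle and rerun evaluation with a different autograder so each run emits its own {model}_report.json; the second model name below is only an example:

admin:
  evaluation_config:
    get_cost_curves: True
    autograder_model_name: "openai/gpt-4o"   # example of a second autograder; its report feeds the heatmap alongside earlier runs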

Full Config Example for Persuade Benchmark:

admin:
  module_name: "persuade"
  time_stamp: null
  test_debug_mode: False

  dataset_config:
    dataset_name: "persuade_corpus_2.0_test_sample.csv"
    default_params_path:
      instruction: "persuade_instruction.txt"
      rubric: "persuade_rubric.txt"

    use_original_data_as_expected: False

    default_params:
      lowest_score: "1"
      highest_score: "6"

  perturbation_config:
    use_HITL_process: True
    tests_to_run:
      - "synthetic_ordinal"

    preprocess_columns_map:
      original_idx: "essay_id"
      request: "assignment"
      response: "full_text"
      expected: "holistic_essay_score"

  evaluation_config:
    template: "single_autograder"
    autograder_model_name: "openai/gpt-4o-mini"
    overwrite_results: False
    max_workers: 10

    tests_to_evaluate:
      - "synthetic_ordinal"

    metric: "accuracy_score"
    bootstrap_size: 0.1
    bootstrap_repetitions: 10
    get_cost_curves: False

test_stochastic_stability_config:
  sample_num_from_orig: 10
  number_of_seeds: 1
  repetitions: 1
  seed: 87

synthetic_data_params:
  generation_model_name: "openai/gpt-4o-mini"
  validation_model_name: "openai/gpt-4o-mini"
  max_tokens_generation: 1200
  max_tokens_validation: 1200

  max_workers: 10
  use_similarity_filter: False
  sample_num_from_orig: 1
  target_num_per_bucket: 1
  similarity_threshold: 0.9
  rescore_sample_size: 2
  initial_temp: 1.0
  num_seed_examples_per_generation: 1
  temp_increment: 0.1
  max_temp_cap: 1.1
  max_consecutive_failures: 2
  seed: 87