FEAT: Add Image functionality to TAP #1036

Open
awksrj wants to merge 34 commits into microsoft:main from awksrj:feature/tap-image-target

@awksrj awksrj commented Jul 30, 2025

Description

This PR makes TAP (Tree of Attacks with Pruning) work with image generation targets and improves its resilience to blocked/content-filtered responses. Originally started by @awksrj, then brought up to date with main and significantly expanded.

Core changes

1. error_score_map — Resilient error handling for TAP

Adds an error_score_map parameter to TreeOfAttacksWithPruningAttack that maps response error types (e.g., "blocked") to fixed scores instead of letting the scorer crash and the branch get pruned. This prevents premature termination when all initial branches hit content filters.

  • Default: {"blocked": 0.0} — blocked branches survive with score 0 and are only pruned when width is exceeded
  • Pass {} to disable and restore previous behavior
  • Validated at construction time: keys must be valid PromptResponseError values, scores in [0, 1]
  • Synthetic scores are persisted to memory for audit trail
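
For reviewers who want a concrete picture of the rules above, here is a minimal sketch of the construction-time validation. It is not the code in this PR: `resolve_error_score_map` and `VALID_ERRORS` are hypothetical names, and the set of error values is an assumed stand-in for the real `PromptResponseError` values.

```python
from typing import Mapping, Optional

# Assumption: illustrative stand-in for the valid PromptResponseError values.
VALID_ERRORS = {"none", "blocked", "processing", "empty", "unknown"}

# Default described in this PR: blocked responses get a fixed score of 0.0.
DEFAULT_ERROR_SCORE_MAP = {"blocked": 0.0}


def resolve_error_score_map(
    error_score_map: Optional[Mapping[str, float]] = None,
) -> dict[str, float]:
    """Return a validated, private copy of the map (hypothetical helper)."""
    if error_score_map is None:
        return dict(DEFAULT_ERROR_SCORE_MAP)

    # Copy so callers and sibling nodes never share mutable state.
    resolved = dict(error_score_map)
    for error, score in resolved.items():
        if error not in VALID_ERRORS:
            raise ValueError(f"Unknown response error type: {error!r}")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"Score for {error!r} must be in [0, 1], got {score}")
    return resolved
```

In this sketch, passing `None` yields the default, while passing `{}` yields an empty map, which corresponds to the opt-out described above.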

2. Single-turn target support via TargetCapabilities

TAP now checks objective_target.capabilities.supports_multi_turn and generates a fresh conversation_id before each prompt send for single-turn targets (like image generators). No special configuration needed.
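
A rough sketch of the conversation-ID decision, under the assumption that it reduces to a single capability check. The `TargetCapabilities` dataclass and `conversation_id_for_send` below are simplified, hypothetical stand-ins, not the classes in this PR.

```python
import uuid
from dataclasses import dataclass


@dataclass
class TargetCapabilities:
    """Hypothetical stand-in exposing only the flag discussed here."""

    supports_multi_turn: bool


def conversation_id_for_send(current_id: str, capabilities: TargetCapabilities) -> str:
    """Reuse the conversation for multi-turn targets, start a fresh one otherwise."""
    if capabilities.supports_multi_turn:
        return current_id
    # Single-turn targets (e.g., image generators) get a clean conversation each send.
    return str(uuid.uuid4())
```

Calling this before every send means a single-turn target never sees prior history, while a multi-turn target keeps its thread.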

3. Multimodal scoring fixes

  • SelfAskScaleScorer._score_piece_async now handles non-text content (images, audio) correctly by sending the raw content with its original data type and prepending the objective as a text piece — matching the pattern already used by SelfAskTrueFalseScorer
  • FloatScaleScorer._score_value_with_llm passes through prepended_text_message_piece to the parent class
  • TAP's default scorer auto-detects target output modalities and configures ScorerPromptValidator with the right supported types
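
To illustrate the first bullet, here is a simplified sketch of how the scorer request might be assembled. `Piece` and `build_scorer_pieces` are hypothetical names, and the text prompt layout is an assumption; only the overall pattern (objective as a prepended text piece, raw content kept with its original data type) comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """Hypothetical stand-in for a message piece."""

    value: str
    data_type: str


def build_scorer_pieces(response: Piece, objective: str) -> list[Piece]:
    """Assemble what gets sent to the scorer LLM for one response piece."""
    if response.data_type == "text":
        # Text responses can be folded into a single text prompt (assumed layout).
        return [Piece(f"objective: {objective}\nresponse: {response.value}", "text")]
    # Non-text content is sent as-is; the objective travels as a separate text piece.
    return [Piece(f"objective: {objective}", "text"), response]
```

The point of the non-text branch is that an image path is never concatenated into a text prompt.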

4. Default scoring scale fix

  • TAP default scorer now uses TASK_ACHIEVED_SCALE instead of TREE_OF_ATTACKS_SCALE
  • Changed task_achieved_scale.yaml category from "jailbreak" to "task_achievement" so the scorer LLM evaluates objective completion rather than harmfulness
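
As a worked example of why the scale matters, the sketch below assumes the raw scale value is mapped linearly onto [0, 1]. That linear mapping and the helper name `normalize_scale_score` are assumptions for illustration; the scale ranges (1 to 10 and 0 to 100) are the ones mentioned in the commit messages of this PR.

```python
def normalize_scale_score(raw: float, minimum: float, maximum: float) -> float:
    """Linearly map a raw scale value onto [0, 1] (assumed normalization)."""
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    if not minimum <= raw <= maximum:
        raise ValueError(f"raw score {raw} is outside [{minimum}, {maximum}]")
    return (raw - minimum) / (maximum - minimum)


# A benign image rated 1 on a 1-10 harmfulness scale lands at the bottom.
harm_based = normalize_scale_score(1, 1, 10)

# The same image rated 95 on a 0-100 task-achievement scale lands near the top.
task_based = normalize_scale_score(95, 0, 100)
```

Under this assumption, the same successful cat-with-hat image would fall below any success threshold on the harmfulness scale and above it on the task-achievement scale.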

5. TAPSystemPromptPaths enum

Added TAPSystemPromptPaths enum (matching RTASystemPromptPaths pattern) with TEXT_GENERATION and IMAGE_GENERATION variants. The image generation system prompt is tailored for single-turn image models.

Documentation

  • Added image generation target example to doc/code/executor/attack/tap_attack.py and .ipynb
  • Notebook executed with real APIs — image TAP scores 0.95 (SUCCESS) for a cat-with-hat objective

Tests

  • 11 new unit tests for error_score_map behavior (validation, interception, propagation, opt-out)
  • 8 parametrized scenario tests exercising the full TAP loop with mocked nodes across multiple depths, covering blocked recovery, mixed errors, gradual improvement, off-topic recovery, and threshold plateau
  • 2 integration tests for TAP with text and image targets

Related Issue

Closes: #585

Behavioral changes

  • Default error_score_map: All existing TAP users will automatically get {"blocked": 0.0}. Pass error_score_map={} to restore previous behavior where blocked responses go through normal scoring.
  • Default scoring scale: Changed from jailbreak-oriented to task-achievement-oriented. This affects the default SelfAskScaleScorer used when no attack_scoring_config is provided.

@romanlutz romanlutz self-assigned this Jul 30, 2025
Comment thread doc/code/orchestrators/tree_of_attacks_with_pruning.py Outdated
Comment thread pyrit/executor/attack/multi_turn/tree_of_attacks.py Outdated
Comment thread pyrit/executor/attack/multi_turn/tree_of_attacks.py Outdated
Comment thread doc/code/orchestrators/tree_of_attacks_with_pruning.ipynb Outdated
Comment thread doc/code/orchestrators/tree_of_attacks_with_pruning.ipynb Outdated

awksrj commented Aug 1, 2025

Thanks for all the comments. I'll go through them and push changes soon!

awksrj commented Aug 6, 2025

I added two unit tests to cover the pruning logic, ensuring blocked responses are scored as 0.0 and pruning only occurs when we exceed tree_width. I also updated the example in tree_of_attacks_with_pruning.py, which used to show how the old TreeOfAttacksWithPruningOrchestrator worked with text targets. I replaced it with the new TAPAttack class to reflect the current implementation, which hopefully makes the documentation more complete.

@romanlutz romanlutz left a comment

One of the maintainers should run the notebook as well once it exists, just to make sure we aren't missing anything.

Comment thread pyrit/attacks/multi_turn/tree_of_attacks.py Outdated
Comment thread pyrit/attacks/multi_turn/tree_of_attacks.py Outdated
Comment thread tests/unit/attacks/test_tree_of_attacks.py Outdated
awksrj and others added 15 commits September 11, 2025 15:10
Resolve conflicts: keep both error_score_map (PR) and initial_prompt/prepended_conversation_config (main).
Take main version for doc files (need separate follow-up).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add error_score_map parameter to TreeOfAttacksWithPruningAttack and
_TreeOfAttacksNode that maps response error types to fixed scores.
This prevents premature branch pruning when targets return blocked
or content-filtered responses (e.g., image generation targets).

Key changes:
- Default error_score_map maps 'blocked' -> 0.0 (pass {} to disable)
- Intercepts mapped errors in _score_response_async before calling scorer
- Creates synthetic float_scale Score for mapped errors
- Propagates map through duplicate() and _create_attack_node()
- Copies dict to avoid shared mutable state

Updates from original PR microsoft#1036:
- Adapted to current Message/MessagePiece API (was PromptRequestResponse)
- Fixed Score constructor args (message_piece_id, score_category as list)
- Made default None -> {'blocked': 0.0} per reviewer feedback
- Added comprehensive unit tests for error interception, scoring, and
  map propagation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add input validation: keys must be valid PromptResponseError values,
  scores must be in [0, 1] range. Errors caught at construction time.
- Persist synthetic scores to memory via add_scores_to_memory()
- Fix multi-piece handling: iterate all message_pieces to find the
  error piece, not just the first piece
- Add validation unit tests for invalid key and out-of-range value
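
The multi-piece fix can be pictured with the following sketch. It assumes each piece carries a `response_error` string; `Piece` and `synthetic_score_for` are hypothetical names rather than the code in this commit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Piece:
    """Hypothetical stand-in for a message piece with an error marker."""

    value: str
    response_error: str = "none"


def synthetic_score_for(
    pieces: Sequence[Piece], error_score_map: dict[str, float]
) -> Optional[float]:
    """Return the mapped score for the first piece with a mapped error, else None."""
    # Check every piece, not just pieces[0]: the error may sit on a later piece.
    for piece in pieces:
        if piece.response_error in error_score_map:
            return error_score_map[piece.response_error]
    # None means: no interception, fall through to the normal scorer.
    return None
```

With an empty map the function always returns `None`, so every response goes through normal scoring.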

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add TAPSystemPromptPaths enum with TEXT_GENERATION and IMAGE_GENERATION
  variants, matching the RTASystemPromptPaths pattern
- Export TAPSystemPromptPaths from pyrit.executor.attack
- Add image generation target example to TAP doc (tap_attack.py/.ipynb)
  demonstrating use of IMAGE_GENERATION system prompt
- Add TAP integration tests for both text and image targets
- Regenerate tap_attack.ipynb from updated .py

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two issues fixed:

1. Single-turn target conversation IDs: TAP now generates a fresh
   conversation_id before each send when the objective target does not
   support multi-turn (detected via capabilities.supports_multi_turn).
   This ensures single-turn targets like image generators always receive
   a clean conversation without prior history.

2. Default scorer supports image output: When no scoring config is
   provided, TAP now inspects the objective target's output_modalities
   to determine supported data types for the default SelfAskScaleScorer.
   For image targets this includes 'image_path', allowing the scorer
   to evaluate generated images via a multimodal adversarial chat LLM.
   Previously, the default scorer only accepted 'text', silently failing
   on image responses and preventing successful scoring.
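
A small sketch of the modality detection in item 2. It assumes `output_modalities` can be treated as an iterable of data-type collections and that text is always accepted; both are assumptions, and `supported_scorer_types` is a hypothetical helper name.

```python
from typing import Iterable


def supported_scorer_types(output_modalities: Iterable[Iterable[str]]) -> list[str]:
    """Collect the data types the default scorer should accept."""
    types = {"text"}  # assumption: text is always supported
    for modality in output_modalities:
        types.update(modality)
    # Sorted for a deterministic order, e.g. when building a validator config.
    return sorted(types)
```

For an image target this would yield a list containing both `image_path` and `text`, so image responses are no longer rejected before scoring.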

Updated doc example with notes on scoring, single-turn handling, and
re-executed notebook with real API calls.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Root cause of image scoring failure: SelfAskScaleScorer._score_piece_async
concatenated image paths into a text prompt and passed it with data type
'image_path', causing the scorer LLM to crash. Fixed by adopting the same
pattern as SelfAskTrueFalseScorer: for non-text content, send the raw
content (image path) with its original data type and prepend the objective
as a text piece. This lets multimodal LLMs evaluate images directly.

Also fixed FloatScaleScorer._score_value_with_llm to pass through the
prepended_text_message_piece parameter to the parent class.

Re-executed notebook — image scoring now works (Score: 1/10 for benign
cat image, correctly below jailbreak threshold).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The default TAP scorer used TREE_OF_ATTACKS_SCALE which rates jailbreak
harmfulness (1=safe, 10=harmful). This caused correct task completions
(e.g., a generated cat-with-hat image) to score 1/10 because the image
isn't harmful — even though the objective was fully met.

Changed default to TASK_ACHIEVED_SCALE which rates whether the task was
completed (0=refusal, 100=fully achieved). This correctly scores
successful image generation and text responses based on objective
achievement rather than harmfulness.

Note: Notebook execution pending — image generation API returning 500
errors (infrastructure issue, not code). The .ipynb has the .py changes
but no execution output for the image cell.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…vement

The task_achieved_scale.yaml had category 'jailbreak', causing the scorer
LLM to evaluate all responses through a jailbreak lens. Benign task
completions (e.g., generating a cat-with-hat image) scored 0 because
they don't violate ethical guidelines — even though the objective was
fully achieved.

Changes:
- Changed category from 'jailbreak' to 'task_achievement'
- Updated min/max descriptions to be task-neutral
- Added Example7 showing image generation task achievement
- Updated doc example to use explicit image endpoint for reliability
- Re-executed notebook: image TAP now scores 0.95 (SUCCESS)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 8 scenario-driven tests exercising the full TAP _perform_async loop
with mocked nodes. Each scenario is a compact data structure defining
per-node behavior (score, blocked error, failure, off-topic, JSON error)
at each depth level. The test harness wires up mock nodes whose
send_prompt_async applies the prescribed behavior, runs _perform_async,
and asserts outcome, best score, and max depth.

Scenarios covered:
- immediate_success_depth1: high score on first attempt
- blocked_depth1_recovers_depth2: all blocked at depth 1, succeed at depth 2
- mixed_errors_and_success: combination of blocked, fail, off-topic, success
- all_fail_all_depths: all nodes fail (json_err + exception)
- gradual_improvement_succeeds_depth3: scores improve over 3 depths
- close_but_never_reaches_threshold: scores plateau below threshold
- empty_error_map_blocked_prunes_all: disabled error_score_map
- off_topic_sibling_recovers: off-topic node, sibling continues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
romanlutz and others added 9 commits April 27, 2026 16:40
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Parametrize TestTAPScenarios over supports_multi_turn (True/False) so
every scenario runs twice — once with a multi-turn target and once with
a single-turn target. Also added with_supports_multi_turn() builder
method and wired capabilities into _create_mock_target.

16 scenario tests total (8 scenarios × 2 target types), 84 tests total.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ruff format: auto-formatted 2 files
- ruff ERA001: replaced commented-out code block with concise comment
- mypy: cast sorted output_types to list[PromptDataType] for
  ScorerPromptValidator type compatibility

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pre-commit fixes:
- Run nbstripout to strip notebook outputs
- Run sanitize_notebook_paths to remove user paths

Coverage fixes (was 87%, need 90%):
- Add test for single-turn target fresh conversation ID generation
- Add test for multi-turn target retaining conversation ID
- Add test for default scorer detecting text/image output modalities
- Add test for SelfAskScaleScorer non-text scoring path (prepended text)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
romanlutz and others added 2 commits April 29, 2026 03:57
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each TAP depth iteration now creates a new child node in the visualization
tree rather than appending scores to the same node's tag. This makes the
tree output match the actual tree structure:

Before: 1: Score: 1/10 || Score: 1/10 || Pruned (width)
After:
  1: Score: 1/10
    2: Score: 1/10 Pruned (width)

- Add _vis_node_id to _TreeOfAttacksNode to track current vis position
- _send_prompts_to_all_nodes_async creates child vis nodes per depth
- Prepended conversation nodes are chained (not flat under root)
- Remove trailing || separator from score format
- Update image objective to raccoon heist and re-execute notebook

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# appending scores to the same node.
for node in context.nodes:
    vis_id = f"{node.node_id}_d{context.executed_turns}"
    context.tree_visualization.create_node(f"{context.executed_turns}: ", vis_id, parent=node._vis_node_id)

Because _vis_node_id gets reset and this is an async function, I think there's a race condition: one branch could write it while another is reading it.

@romanlutz romanlutz Apr 29, 2026

Good eye, but I think this is safe — the _vis_node_id update loop (lines 1850-1853) is fully synchronous and runs to completion before any async work starts. The asyncio.gather on line 1872 only calls send_prompt_async on each node, which never reads or writes _vis_node_id. Each node also gets its own unique vis_id ({node.node_id}_d{context.executed_turns}) so there's no shared state between nodes.

The only time two nodes share the same _vis_node_id value is right after duplicate() copies it from parent to clone — but that's intentional (clone branches from the same vis position), and both get reassigned unique ids in this synchronous loop before any concurrency begins.
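
The argument above can be checked with a small, self-contained sketch. `Node` and `run_depth` are hypothetical simplifications of the real classes; the only point is that a loop with no `await` in it cannot be interleaved with other coroutines on the same event loop.

```python
import asyncio


class Node:
    """Hypothetical simplified node with a visualization position."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.vis_node_id = "root"

    async def send_prompt_async(self) -> str:
        # Yields to the event loop but never reads or writes vis_node_id.
        await asyncio.sleep(0)
        return self.node_id


async def run_depth(nodes: list[Node], turn: int) -> list[str]:
    # Synchronous section: no await here, so no other task can run in between.
    for node in nodes:
        node.vis_node_id = f"{node.node_id}_d{turn}"
    # Concurrency starts only at this point.
    return await asyncio.gather(*(node.send_prompt_async() for node in nodes))


nodes = [Node("a"), Node("b")]
results = asyncio.run(run_depth(nodes, turn=1))
vis_ids = [node.vis_node_id for node in nodes]
```

This holds for single-threaded asyncio; it would not carry over if the same nodes were mutated from multiple threads.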

romanlutz and others added 2 commits April 29, 2026 16:15
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

FEAT: Add Image functionality to TAP

4 participants