FEAT: Add Image functionality to TAP #1036
Thanks for all the comments. I'll go through them and push changes soon!
…PyRIT into feature/tap-image-target
I added two unit tests to cover the pruning logic, ensuring blocked responses are scored as 0.0 and pruning only occurs when we exceed |
romanlutz
left a comment
One of the maintainers should run the notebook as well once it exists, just to make sure we aren't missing anything.
…p-image-target
…attack notebooks, run notebooks
Resolve conflicts: keep both error_score_map (PR) and initial_prompt/prepended_conversation_config (main). Take main version for doc files (need separate follow-up).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add error_score_map parameter to TreeOfAttacksWithPruningAttack and
_TreeOfAttacksNode that maps response error types to fixed scores.
This prevents premature branch pruning when targets return blocked
or content-filtered responses (e.g., image generation targets).
Key changes:
- Default error_score_map maps 'blocked' -> 0.0 (pass {} to disable)
- Intercepts mapped errors in _score_response_async before calling scorer
- Creates synthetic float_scale Score for mapped errors
- Propagates map through duplicate() and _create_attack_node()
- Copies dict to avoid shared mutable state
Updates from original PR microsoft#1036:
- Adapted to current Message/MessagePiece API (was PromptRequestResponse)
- Fixed Score constructor args (message_piece_id, score_category as list)
- Made default None -> {'blocked': 0.0} per reviewer feedback
- Added comprehensive unit tests for error interception, scoring, and
map propagation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add input validation: keys must be valid PromptResponseError values, scores must be in [0, 1] range. Errors caught at construction time.
- Persist synthetic scores to memory via add_scores_to_memory()
- Fix multi-piece handling: iterate all message_pieces to find the error piece, not just the first piece
- Add validation unit tests for invalid key and out-of-range value
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
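The validation and interception steps described in the two commits above can be sketched roughly as follows. This is a minimal standalone illustration, not PyRIT code: VALID_ERRORS, Piece, validate_error_score_map, and mapped_score are hypothetical names, and the set of error values is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed, illustrative set of response error values; the real
# PromptResponseError values are defined in PyRIT and may differ.
VALID_ERRORS = {"none", "blocked", "processing", "empty", "unknown"}


def validate_error_score_map(error_score_map: dict[str, float]) -> dict[str, float]:
    """Check keys and score range at construction time and return a copy."""
    for error, score in error_score_map.items():
        if error not in VALID_ERRORS:
            raise ValueError(f"Unknown response error type: {error!r}")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"Score for {error!r} must be in [0, 1], got {score}")
    # Copy so that callers and nodes never share one mutable dict.
    return dict(error_score_map)


@dataclass
class Piece:
    """Hypothetical stand-in for a message piece carrying an error marker."""
    response_error: str = "none"


def mapped_score(pieces: list[Piece], error_score_map: dict[str, float]) -> Optional[float]:
    """Return the fixed score of the first piece with a mapped error, else None.

    All pieces are inspected, not just the first one. None means the
    response should go through the normal scorer.
    """
    for piece in pieces:
        if piece.response_error in error_score_map:
            return error_score_map[piece.response_error]
    return None
```

With the default map, a response whose second piece is blocked yields 0.0 without calling the scorer; with an empty map the function returns None and normal scoring applies.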
- Add TAPSystemPromptPaths enum with TEXT_GENERATION and IMAGE_GENERATION variants, matching the RTASystemPromptPaths pattern
- Export TAPSystemPromptPaths from pyrit.executor.attack
- Add image generation target example to TAP doc (tap_attack.py/.ipynb) demonstrating use of IMAGE_GENERATION system prompt
- Add TAP integration tests for both text and image targets
- Regenerate tap_attack.ipynb from updated .py
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two issues fixed:
1. Single-turn target conversation IDs: TAP now generates a fresh conversation_id before each send when the objective target does not support multi-turn (detected via capabilities.supports_multi_turn). This ensures single-turn targets like image generators always receive a clean conversation without prior history.
2. Default scorer supports image output: When no scoring config is provided, TAP now inspects the objective target's output_modalities to determine supported data types for the default SelfAskScaleScorer. For image targets this includes 'image_path', allowing the scorer to evaluate generated images via a multimodal adversarial chat LLM. Previously, the default scorer only accepted 'text', silently failing on image responses and preventing successful scoring.
Updated doc example with notes on scoring, single-turn handling, and re-executed notebook with real API calls.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
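The single-turn handling in item 1 can be sketched as below. This is a simplified illustration under stated assumptions: Capabilities and Node are hypothetical stand-ins, and only the supports_multi_turn flag and the conversation_id name come from the commit message.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Capabilities:
    """Hypothetical stand-in for the target's capability flags."""
    supports_multi_turn: bool = True


@dataclass
class Node:
    """Hypothetical attack node holding the conversation it sends on."""
    capabilities: Capabilities
    conversation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def conversation_id_for_next_send(self) -> str:
        # Single-turn targets get a fresh ID before every send, so the
        # target never sees history from an earlier turn.
        if not self.capabilities.supports_multi_turn:
            self.conversation_id = str(uuid.uuid4())
        return self.conversation_id
```

A multi-turn node returns the same ID on every call, while a single-turn node returns a new one each time.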
Root cause of image scoring failure: SelfAskScaleScorer._score_piece_async concatenated image paths into a text prompt and passed it with data type 'image_path', causing the scorer LLM to crash.
Fixed by adopting the same pattern as SelfAskTrueFalseScorer: for non-text content, send the raw content (image path) with its original data type and prepend the objective as a text piece. This lets multimodal LLMs evaluate images directly.
Also fixed FloatScaleScorer._score_value_with_llm to pass through the prepended_text_message_piece parameter to the parent class.
Re-executed notebook: image scoring now works (Score: 1/10 for benign cat image, correctly below jailbreak threshold).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The default TAP scorer used TREE_OF_ATTACKS_SCALE which rates jailbreak harmfulness (1=safe, 10=harmful). This caused correct task completions (e.g., a generated cat-with-hat image) to score 1/10 because the image isn't harmful, even though the objective was fully met.
Changed default to TASK_ACHIEVED_SCALE which rates whether the task was completed (0=refusal, 100=fully achieved). This correctly scores successful image generation and text responses based on objective achievement rather than harmfulness.
Note: Notebook execution pending. The image generation API is returning 500 errors (infrastructure issue, not code). The .ipynb has the .py changes but no execution output for the image cell.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…vement
The task_achieved_scale.yaml had category 'jailbreak', causing the scorer LLM to evaluate all responses through a jailbreak lens. Benign task completions (e.g., generating a cat-with-hat image) scored 0 because they don't violate ethical guidelines, even though the objective was fully achieved.
Changes:
- Changed category from 'jailbreak' to 'task_achievement'
- Updated min/max descriptions to be task-neutral
- Added Example7 showing image generation task achievement
- Updated doc example to use explicit image endpoint for reliability
- Re-executed notebook: image TAP now scores 0.95 (SUCCESS)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 8 scenario-driven tests exercising the full TAP _perform_async loop with mocked nodes. Each scenario is a compact data structure defining per-node behavior (score, blocked error, failure, off-topic, JSON error) at each depth level. The test harness wires up mock nodes whose send_prompt_async applies the prescribed behavior, runs _perform_async, and asserts outcome, best score, and max depth.
Scenarios covered:
- immediate_success_depth1: high score on first attempt
- blocked_depth1_recovers_depth2: all blocked at depth 1, succeed at depth 2
- mixed_errors_and_success: combination of blocked, fail, off-topic, success
- all_fail_all_depths: all nodes fail (json_err + exception)
- gradual_improvement_succeeds_depth3: scores improve over 3 depths
- close_but_never_reaches_threshold: scores plateau below threshold
- empty_error_map_blocked_prunes_all: disabled error_score_map
- off_topic_sibling_recovers: off-topic node, sibling continues
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Parametrize TestTAPScenarios over supports_multi_turn (True/False) so every scenario runs twice: once with a multi-turn target and once with a single-turn target. Also added with_supports_multi_turn() builder method and wired capabilities into _create_mock_target.
16 scenario tests total (8 scenarios × 2 target types), 84 tests total.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ruff format: auto-formatted 2 files
- ruff ERA001: replaced commented-out code block with concise comment
- mypy: cast sorted output_types to list[PromptDataType] for ScorerPromptValidator type compatibility
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pre-commit fixes:
- Run nbstripout to strip notebook outputs
- Run sanitize_notebook_paths to remove user paths
Coverage fixes (was 87%, need 90%):
- Add test for single-turn target fresh conversation ID generation
- Add test for multi-turn target retaining conversation ID
- Add test for default scorer detecting text/image output modalities
- Add test for SelfAskScaleScorer non-text scoring path (prepended text)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each TAP depth iteration now creates a new child node in the visualization
tree rather than appending scores to the same node's tag. This makes the
tree output match the actual tree structure:
Before: 1: Score: 1/10 || Score: 1/10 || Pruned (width)
After:
  1: Score: 1/10
    2: Score: 1/10 Pruned (width)
- Add _vis_node_id to _TreeOfAttacksNode to track current vis position
- _send_prompts_to_all_nodes_async creates child vis nodes per depth
- Prepended conversation nodes are chained (not flat under root)
- Remove trailing || separator from score format
- Update image objective to raccoon heist and re-execute notebook
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
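The per-depth child nodes described in this commit can be sketched with a tiny in-memory tree. This is an illustration only: VisTree is a hypothetical stand-in for the visualization tree, and vis_node_id mirrors the _vis_node_id attribute named above. The create_node argument order follows the snippet under review.

```python
from dataclasses import dataclass


class VisTree:
    """Hypothetical minimal tree: stores a tag and a parent per node id."""

    def __init__(self) -> None:
        self.tags = {"root": "Root"}
        self.parent = {"root": None}

    def create_node(self, tag: str, identifier: str, parent: str) -> None:
        if parent not in self.tags:
            raise KeyError(f"Unknown parent: {parent}")
        self.tags[identifier] = tag
        self.parent[identifier] = parent

    def depth(self, identifier: str) -> int:
        """Count edges from the given node up to the root."""
        steps = 0
        while self.parent[identifier] is not None:
            identifier = self.parent[identifier]
            steps += 1
        return steps


@dataclass
class Node:
    node_id: str
    vis_node_id: str = "root"  # current position in the visualization tree


def add_depth_nodes(tree: VisTree, nodes: list[Node], turn: int) -> None:
    """Create one child vis node per attack node and move each node onto it."""
    for node in nodes:
        vis_id = f"{node.node_id}_d{turn}"
        tree.create_node(f"{turn}: ", vis_id, parent=node.vis_node_id)
        node.vis_node_id = vis_id
```

Calling add_depth_nodes once per turn chains the entries for a node, so the entry for depth 2 hangs under the entry for depth 1 instead of being appended to the same tag.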
# appending scores to the same node.
for node in context.nodes:
    vis_id = f"{node.node_id}_d{context.executed_turns}"
    context.tree_visualization.create_node(f"{context.executed_turns}: ", vis_id, parent=node._vis_node_id)
Because _vis_node_id gets reset and this is an async function, I think there's a race condition: one branch could write it while another is reading it.
Good eye, but I think this is safe — the _vis_node_id update loop (lines 1850-1853) is fully synchronous and runs to completion before any async work starts. The asyncio.gather on line 1872 only calls send_prompt_async on each node, which never reads or writes _vis_node_id. Each node also gets its own unique vis_id ({node.node_id}_d{context.executed_turns}) so there's no shared state between nodes.
The only time two nodes share the same _vis_node_id value is right after duplicate() copies it from parent to clone — but that's intentional (clone branches from the same vis position), and both get reassigned unique ids in this synchronous loop before any concurrency begins.
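The argument in this reply can be illustrated with a small asyncio sketch. It is not the PR's code: Node and run_turn are hypothetical, and the point is only that a loop without any await runs to completion before the gathered coroutines start.

```python
import asyncio


class Node:
    """Hypothetical node; vis_node_id mirrors the attribute under discussion."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.vis_node_id = "root"

    async def send_prompt_async(self) -> str:
        # Yields to the event loop but never reads or writes vis_node_id.
        await asyncio.sleep(0)
        return self.node_id


async def run_turn(nodes: list[Node], turn: int) -> list[str]:
    # There is no await inside this loop, so the event loop cannot switch
    # to another task until every node has its own unique id.
    for node in nodes:
        node.vis_node_id = f"{node.node_id}_d{turn}"
    # Concurrency starts only here, after all assignments are done.
    return await asyncio.gather(*(node.send_prompt_async() for node in nodes))
```

In asyncio, task switches happen only at await points, which is why the synchronous assignment loop cannot interleave with the sends.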
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
This PR makes TAP (Tree of Attacks with Pruning) work with image generation targets and improves its resilience to blocked/content-filtered responses. Originally started by @awksrj, then brought up to date with main and significantly expanded.
Core changes
1. error_score_map — Resilient error handling for TAP
Adds an error_score_map parameter to TreeOfAttacksWithPruningAttack that maps response error types (e.g., "blocked") to fixed scores instead of letting the scorer crash and the branch get pruned. This prevents premature termination when all initial branches hit content filters.
- Default: {"blocked": 0.0}. Blocked branches survive with score 0 and are only pruned when width is exceeded
- Pass {} to disable and restore previous behavior
- Validation: keys must be valid PromptResponseError values, scores in [0, 1]
2. Single-turn target support via TargetCapabilities
TAP now checks objective_target.capabilities.supports_multi_turn and generates a fresh conversation_id before each prompt send for single-turn targets (like image generators). No special configuration needed.
3. Multimodal scoring fixes
- SelfAskScaleScorer._score_piece_async now handles non-text content (images, audio) correctly by sending the raw content with its original data type and prepending the objective as a text piece, matching the pattern already used by SelfAskTrueFalseScorer
- FloatScaleScorer._score_value_with_llm passes through prepended_text_message_piece to the parent class
- The default scorer inspects the objective target's output_modalities and configures ScorerPromptValidator with the right supported types
4. Default scoring scale fix
- The default TAP scorer now uses TASK_ACHIEVED_SCALE instead of TREE_OF_ATTACKS_SCALE
- Changed the task_achieved_scale.yaml category from "jailbreak" to "task_achievement" so the scorer LLM evaluates objective completion rather than harmfulness
5. TAPSystemPromptPaths enum
Added TAPSystemPromptPaths enum (matching RTASystemPromptPaths pattern) with TEXT_GENERATION and IMAGE_GENERATION variants. The image generation system prompt is tailored for single-turn image models.
Documentation
- Added an image generation target example to doc/code/executor/attack/tap_attack.py and .ipynb
Tests
- Unit tests for error_score_map behavior (validation, interception, propagation, opt-out)
Related Issue
Closes: #585
Behavioral changes
- error_score_map now defaults to {"blocked": 0.0}. Pass error_score_map={} to restore previous behavior where blocked responses go through normal scoring.
- The default scale change affects the SelfAskScaleScorer used when no attack_scoring_config is provided.
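To make the default concrete, here is a rough sketch of how a blocked branch scored 0.0 interacts with width-based pruning. It assumes pruning keeps the highest-scoring nodes up to the configured width and drops nodes that have no score at all; prune_to_width is a hypothetical helper, not the PyRIT implementation.

```python
from typing import Optional


def prune_to_width(scored: list[tuple[str, Optional[float]]], width: int) -> list[str]:
    """Keep at most `width` node names, ordered by score from high to low."""
    # Assumption: a node without a score (e.g. the scorer failed) cannot be
    # ranked and is dropped, which is how a blocked branch can be lost when
    # no error score is mapped.
    ranked = [(name, score) for name, score in scored if score is not None]
    # list.sort is stable, so nodes with equal scores keep their input order.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:width]]
```

With width 3, a blocked node that received the mapped score 0.0 stays in the tree behind higher-scoring siblings. With width 1, it is pruned only because the width is exceeded.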