🎉 We are delighted to announce that our paper has been accepted to ACL 2026! We are grateful for the constructive feedback from the anonymous reviewers and area chairs.
Official repository for SIREN (Safeguard with Internal REpresentatioN), a lightweight guard model that detects harmful content by leveraging a frozen LLM's internal representations. Rather than fine-tuning the entire backbone and decoding from the final layer, SIREN identifies safety neurons via layer-wise linear probing, aggregates them across layers with a performance-weighted strategy, and trains only a small classifier on top.
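To make that pipeline concrete, here is a minimal sketch of the idea on stand-in data; all function names and hyperparameters (top_k, the L1 strength, the MLP size) are illustrative and are not the repository's actual API.

```python
# Illustrative sketch of SIREN's core recipe; not the repository's actual code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def fit_layer_probes(hidden_states, labels, top_k=64):
    """hidden_states: list of (n_samples, hidden_dim) arrays, one entry per layer."""
    selected, layer_weights = [], []
    for acts in hidden_states:
        # L1-regularized linear probe: sparse coefficients point at "safety neurons".
        probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
        probe.fit(acts, labels)
        selected.append(np.argsort(np.abs(probe.coef_[0]))[-top_k:])
        layer_weights.append(probe.score(acts, labels))  # probe accuracy = layer weight
    return selected, np.asarray(layer_weights)

def build_features(hidden_states, selected, layer_weights):
    # Performance-weighted aggregation: scale each layer's selected neurons by its
    # probe accuracy, then concatenate the selections from every layer.
    return np.concatenate(
        [w * acts[:, idx] for acts, idx, w in zip(hidden_states, selected, layer_weights)],
        axis=1,
    )

# Stand-in per-layer activations and labels; in SIREN these come from the frozen backbone.
rng = np.random.default_rng(0)
hidden = [rng.normal(size=(200, 256)) for _ in range(4)]
labels = rng.integers(0, 2, 200)

selected, weights = fit_layer_probes(hidden, labels)
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300)
clf.fit(build_features(hidden, selected, weights), labels)
```

Only the small probe-and-classifier stack is trained; the backbone stays frozen throughout, which is what keeps the released artifacts in the table below to a few million parameters.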
Install the runtime:
pip install llm-siren

Load any released SIREN artifact and score content:
import torch
from siren_guard import SirenGuard
guard = SirenGuard.from_pretrained(
"UofTCSSLab/SIREN-Qwen3-0.6B",
device="cuda",
dtype=torch.bfloat16,
)
# Prompt-level moderation
guard.score("How can I make a pipe bomb at home?")
# ScoreResult(score=1.0, is_harmful=True, threshold=0.5)
# Response-level moderation (prompt + response)
guard.score(
prompt="How can I make a pipe bomb at home?",
response="I can't help with that. Building explosive devices is illegal.",
)
# ScoreResult(score=0.0, is_harmful=False, threshold=0.5)
# Streaming detection over a growing assistant prefix
prefix = ""
for chunk in stream_from_deployed_llm(user_prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        break

| Repo | Backbone | Number of Parameters | Benchmark Performance |
|---|---|---|---|
| UofTCSSLab/SIREN-Qwen3-0.6B | Qwen3-0.6B | 12.3 M | 85.6 |
| UofTCSSLab/SIREN-Llama-3.2-1B | Llama-3.2-1B | 5.4 M | 85.7 |
| UofTCSSLab/SIREN-Qwen3-4B | Qwen3-4B | 14.0 M | 86.7 |
| UofTCSSLab/SIREN-Llama-3.1-8B | Llama-3.1-8B | 56.0 M | 86.3 |
Each artifact ships only the trained classifier head (siren_config.json + siren.safetensors); the frozen backbone is loaded from its official Hugging Face repository. For the full API surface and per-benchmark numbers, see the model card on each Hub page.
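As a rough illustration of that split, the sketch below fetches the two artifact files and a backbone by hand. The siren_config.json key used to locate the backbone ("backbone") is an assumption about the schema; in practice SirenGuard.from_pretrained handles all of this for you.

```python
# Hypothetical manual load of a SIREN artifact (the "backbone" config key is assumed).
import json
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM

repo_id = "UofTCSSLab/SIREN-Qwen3-0.6B"
config = json.load(open(hf_hub_download(repo_id, "siren_config.json")))
head_weights = load_file(hf_hub_download(repo_id, "siren.safetensors"))  # classifier head only

backbone = AutoModelForCausalLM.from_pretrained(
    config["backbone"],               # assumed to name the official backbone repo
    torch_dtype=torch.bfloat16,
    output_hidden_states=True,        # SIREN reads these internal representations
)
```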
The remainder of this repository is for reproducing the paper's results and for training SIREN on additional backbones beyond the released artifacts. The deployment runtime (pip install llm-siren) does not require any of the code below.
Create the conda environment:
conda env create -f environment.yaml
conda activate siren

Train SIREN on the seven safety datasets for a specific backbone:
cd train
bash run_general_siren.sh

The script extracts internal representations, fits layer-wise L1-regularized probes to identify safety neurons, aggregates them across layers, and trains the MLP classifier on top. The best model is saved to train/probes/optuna/{MODEL}_general/best_model.pkl.
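Given the optuna directory in that path, the classifier fit is presumably hyperparameter-tuned with Optuna before the best model is pickled. A minimal sketch of what that step could look like follows; the search space, objective, and stand-in features are assumptions, not the script's actual settings.

```python
# Hypothetical sketch of the Optuna tuning step; search space and objective are assumed.
import pickle
import numpy as np
import optuna
from sklearn.neural_network import MLPClassifier

# Stand-ins for the aggregated safety-neuron features and labels the script builds.
X = np.random.randn(200, 128)
y = (X[:, 0] > 0).astype(int)
X_train, X_val, y_train, y_val = X[:150], X[150:], y[:150], y[150:]

def objective(trial):
    clf = MLPClassifier(
        hidden_layer_sizes=(trial.suggest_int("hidden", 64, 512),),
        alpha=trial.suggest_float("alpha", 1e-5, 1e-1, log=True),
        max_iter=300,
    )
    clf.fit(X_train, y_train)
    trial.set_user_attr("model", clf)
    return clf.score(X_val, y_val)  # validation accuracy to maximize

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

# The real script writes to train/probes/optuna/{MODEL}_general/best_model.pkl.
with open("best_model.pkl", "wb") as f:
    pickle.dump(study.best_trial.user_attrs["model"], f)
```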
Evaluate the trained classifier on the test sets:
cd test
bash eval_general_siren.sh

The script loads train/probes/optuna/{MODEL}_general/best_model.pkl, extracts representations, runs inference on each test set, and writes per-dataset metrics to train/probes/optuna/{MODEL}_general/eval_results.json. Set MODEL in eval_general_siren.sh to match the trained backbone.
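For orientation, a minimal sketch of that inference-and-metrics step on stand-in data; the dataset names, the feature dict, and the metric set (accuracy and F1) are illustrative and not the script's exact output schema.

```python
# Illustrative evaluation loop; dataset names, features, and metric choice are assumptions.
import json
import pickle
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Stand-in per-dataset features/labels; the script extracts these from each test set.
rng = np.random.default_rng(0)
test_sets = {name: (rng.normal(size=(50, 128)), rng.integers(0, 2, 50))
             for name in ["dataset_a", "dataset_b"]}

with open("best_model.pkl", "rb") as f:  # train/probes/optuna/{MODEL}_general/best_model.pkl
    clf = pickle.load(f)

results = {}
for name, (X_test, y_test) in test_sets.items():
    preds = clf.predict(X_test)
    results[name] = {"accuracy": accuracy_score(y_test, preds),
                     "f1": f1_score(y_test, preds)}

with open("eval_results.json", "w") as f:  # the script writes per-dataset metrics here
    json.dump(results, f, indent=2)
```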
A demonstration of SIREN (Qwen3-0.6B backbone) deployed for streaming harmfulness detection:
If you find this work useful, please cite our paper:
@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}