LLM Safety From Within: Detecting Harmful Content with Internal Representations

🎉 We are delighted to announce that our paper has been accepted to ACL 2026! We are grateful for the constructive feedback from the anonymous reviewers and area chairs.

Official repository for SIREN (Safeguard with Internal REpresentatioN), a lightweight guard model that detects harmful content by leveraging a frozen LLM's internal representations. Rather than fine-tuning the entire backbone and decoding from the terminal layer, SIREN identifies safety neurons via layer-wise linear probing, aggregates them across layers with a performance-weighted strategy, and trains only a small classifier on top.

Use SIREN as a guard model

Install the runtime:

pip install llm-siren

Load any released SIREN artifact and score content:

import torch
from siren_guard import SirenGuard

guard = SirenGuard.from_pretrained(
    "UofTCSSLab/SIREN-Qwen3-0.6B",
    device="cuda",
    dtype=torch.bfloat16,
)

# Prompt-level moderation
guard.score("How can I make a pipe bomb at home?")
# ScoreResult(score=1.0, is_harmful=True, threshold=0.5)

# Response-level moderation (prompt + response)
guard.score(
    prompt="How can I make a pipe bomb at home?",
    response="I can't help with that. Building explosive devices is illegal.",
)
# ScoreResult(score=0.0, is_harmful=False, threshold=0.5)

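# Streaming detection over a growing assistant prefix.
# stream_from_deployed_llm and user_prompt are placeholders for your own
# generation stream and user input; they are not part of llm-siren.
prefix = ""
for chunk in stream_from_deployed_llm(user_prompt):
    prefix += chunk
    # Stop consuming the stream as soon as the accumulated prefix is flagged
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        break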
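
The streaming loop above can be wrapped in a small generator so that the calling code only ever sees chunks that passed the guard. The following is a minimal sketch, not part of the llm-siren package: guarded_stream and toy_is_harmful are hypothetical names, and the keyword-based scorer is a stand-in so the example runs without a GPU or model download.

```python
from typing import Callable, Iterable, Iterator


def guarded_stream(
    chunks: Iterable[str],
    is_harmful: Callable[[str], bool],
) -> Iterator[str]:
    """Yield chunks until the accumulated prefix is flagged as harmful."""
    prefix = ""
    for chunk in chunks:
        prefix += chunk
        if is_harmful(prefix):
            # Stop before emitting the chunk that tripped the guard.
            return
        yield chunk


# Stand-in scorer for illustration only. With SIREN loaded, this could be
# lambda text: guard.score_streaming(text, threshold=0.5).is_harmful
def toy_is_harmful(prefix: str) -> bool:
    return "explosive" in prefix.lower()


emitted = list(
    guarded_stream(["Sure, ", "here is ", "an explosive ", "recipe"], toy_is_harmful)
)
print(emitted)  # ['Sure, ', 'here is ']
```

Withholding the flagged chunk is a design choice of this sketch; a deployment might instead replace it with a refusal message.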

Available models

Repo	Backbone	Number of Parameters	Benchmark Performance
UofTCSSLab/SIREN-Qwen3-0.6B	Qwen3-0.6B	12.3 M	85.6
UofTCSSLab/SIREN-Llama-3.2-1B	Llama-3.2-1B	5.4 M	85.7
UofTCSSLab/SIREN-Qwen3-4B	Qwen3-4B	14.0 M	86.7
UofTCSSLab/SIREN-Llama-3.1-8B	Llama-3.1-8B	56.0 M	86.3

Each artifact ships only the trained classifier head (siren_config.json + siren.safetensors); the frozen backbone is loaded from its official Hugging Face repository. For the full API surface and per-benchmark numbers, see the model card on each Hub page.
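
As a small illustration of choosing among the released artifacts, the sketch below picks the repo with the highest benchmark score whose listed parameter count fits a budget. The numbers are copied from the table above; best_under_budget is a hypothetical helper, not part of llm-siren.

```python
from typing import List, Optional, Tuple

# (repo, listed parameters in millions, benchmark performance), from the table above.
SIREN_MODELS: List[Tuple[str, float, float]] = [
    ("UofTCSSLab/SIREN-Qwen3-0.6B", 12.3, 85.6),
    ("UofTCSSLab/SIREN-Llama-3.2-1B", 5.4, 85.7),
    ("UofTCSSLab/SIREN-Qwen3-4B", 14.0, 86.7),
    ("UofTCSSLab/SIREN-Llama-3.1-8B", 56.0, 86.3),
]


def best_under_budget(max_params_m: float) -> Optional[str]:
    """Return the best-scoring repo whose listed parameter count fits the budget."""
    candidates = [m for m in SIREN_MODELS if m[1] <= max_params_m]
    if not candidates:
        return None
    # Highest benchmark score wins among the candidates.
    return max(candidates, key=lambda m: m[2])[0]


print(best_under_budget(20.0))  # UofTCSSLab/SIREN-Qwen3-4B
```

Because the frozen backbone is loaded separately, the listed parameter count alone does not capture the memory or latency cost of running each guard.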

Reproducing paper results / Training SIREN on a new backbone

The remainder of this repository is for reproducing the paper's results and for training SIREN on additional backbones beyond the released artifacts. The deployment runtime (pip install llm-siren) does not require any of the code below.

Setup

Create the conda environment:

conda env create -f environment.yaml
conda activate siren

Training

Train SIREN on the seven safety datasets for a specific backbone:

cd train
bash run_general_siren.sh

The script extracts internal representations, fits layer-wise L1-regularized probes to identify safety neurons, aggregates them across layers, and trains the MLP classifier on top. The best model is saved to train/probes/optuna/{MODEL}_general/best_model.pkl.
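
To make the pipeline concrete, here is a toy sketch of the two middle steps: selecting neurons from sparse probe weights and combining layers with performance-based weights. It does not reproduce the repository's training code. The selection rule (keep neurons with nonzero probe weight) and the weighting rule (validation score normalized to sum to one) are assumptions made for illustration, and all function names and data are hypothetical.

```python
from typing import Dict, List


def select_neurons(probe_weights: List[float], eps: float = 1e-8) -> List[int]:
    """Indices whose L1-regularized probe weight was not driven to zero."""
    return [i for i, w in enumerate(probe_weights) if abs(w) > eps]


def layer_weights(val_scores: Dict[int, float]) -> Dict[int, float]:
    """Assumed weighting: each layer's validation score, normalized to sum to 1."""
    total = sum(val_scores.values())
    return {layer: score / total for layer, score in val_scores.items()}


def aggregate(
    activations: Dict[int, List[float]],
    probes: Dict[int, List[float]],
    val_scores: Dict[int, float],
) -> List[float]:
    """Concatenate selected neurons across layers, scaled by the layer weight."""
    weights = layer_weights(val_scores)
    features: List[float] = []
    for layer in sorted(probes):
        for idx in select_neurons(probes[layer]):
            features.append(weights[layer] * activations[layer][idx])
    return features


# Toy data: two layers with four neurons each.
probes = {0: [0.0, 0.7, 0.0, -0.2], 1: [0.5, 0.0, 0.0, 0.0]}
val_scores = {0: 0.6, 1: 0.9}
activations = {0: [1.0, 2.0, 3.0, 4.0], 1: [5.0, 6.0, 7.0, 8.0]}

features = aggregate(activations, probes, val_scores)
print([round(f, 3) for f in features])  # [0.8, 1.6, 3.0]
```

In this sketch, the resulting feature vector is what a downstream classifier such as the MLP would consume.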

Evaluation

Evaluate the trained classifier on the test sets:

cd test
bash eval_general_siren.sh

The script loads train/probes/optuna/{MODEL}_general/best_model.pkl, extracts representations, runs inference on each test set, and writes per-dataset metrics to train/probes/optuna/{MODEL}_general/eval_results.json. Set MODEL in eval_general_siren.sh to match the trained backbone.
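
The schema of eval_results.json is not documented here, so the sketch below makes no assumption about it beyond being JSON: it flattens any nested objects into dotted metric names and prints the numeric leaves. flatten_metrics and load_metrics are hypothetical helpers, and the MODEL value is a placeholder that must match the one set in eval_general_siren.sh.

```python
import json
from pathlib import Path
from typing import Any, Dict


def flatten_metrics(node: Any, prefix: str = "") -> Dict[str, float]:
    """Collect numeric leaves of nested JSON objects as {"a.b.c": value}."""
    flat: Dict[str, float] = {}
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten_metrics(value, path))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        flat[prefix] = float(node)
    # Strings, booleans, nulls, and lists are ignored in this sketch.
    return flat


def load_metrics(path: Path) -> Dict[str, float]:
    with open(path, encoding="utf-8") as f:
        return flatten_metrics(json.load(f))


MODEL = "Qwen3-0.6B"  # hypothetical value; use the MODEL set in eval_general_siren.sh
results_path = Path("train/probes/optuna") / f"{MODEL}_general" / "eval_results.json"

# Run from the repository root; nothing is printed if the file does not exist yet.
if results_path.exists():
    for name, value in sorted(load_metrics(results_path).items()):
        print(f"{name}: {value:.4f}")
```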

Demo

A demonstration of our trained SIREN on Qwen3-0.6B in deployment for streaming harmfulness detection:

Citation

If you find this work useful, please cite our paper:

@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}
