AI Safety Research Dashboard

Richard J. Young, Ph.D. — AI Safety & Alignment Researcher

TEMPEST Paper  |  Instruction Following Paper  |  Hugging Face  |  ORCID

Models Tested: 10
Average ASR: 83.9%
Models with 90%+ ASR: 6/10
Best Defense: Kimi K2 (Thinking), 42.0% ASR

Key finding: Enabling extended reasoning (thinking mode) on Kimi K2 reduced the attack success rate (ASR) from 97% to 42%, the single most effective mitigation observed across all models tested.
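To put the Kimi K2 figures in perspective, the drop can be expressed both in percentage points and relative to the starting rate. The following is a minimal sketch using only the two numbers quoted above; the function name `asr_reduction` is illustrative and not part of the dashboard or the paper.

```python
def asr_reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return absolute, relative


# Figures quoted in the key finding: 97% without thinking mode, 42% with it.
absolute, relative = asr_reduction(97.0, 42.0)
print(f"{absolute:.1f} percentage points, {relative:.1f}% relative reduction")
```

Under this reading, thinking mode removes 55 percentage points, or a little over half of the successful attacks observed without it.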

Leaderboard

Model  |  Family  |  Type  |  Behaviors  |  Succeeded  |  Refused  |  ASR (%)
Mistral Large 3 (675B)  |  DeepSeek  |  Frontier (Thinking)  |  100  |  100  |  22  |  100

Attack Success Rate by Model


Higher ASR means a model is more vulnerable. Data are from richardyoung/tempest-replication: 100 harmful behaviors per model, with up to 5 multi-turn attack rounds each. Read the paper (arXiv:2512.07059).
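As a rough illustration of how the headline numbers relate to per-model counts, the sketch below assumes ASR is the number of behaviors on which an attack succeeded, divided by the number of behaviors tested, expressed as a percentage. The source does not spell out this formula, so treat it as an assumption. The function names and the example counts are hypothetical and are not taken from the dataset.

```python
from statistics import mean


def attack_success_rate(succeeded: int, behaviors: int = 100) -> float:
    """ASR in percent, assuming ASR = succeeded / behaviors * 100."""
    if behaviors <= 0:
        raise ValueError("behaviors must be positive")
    if not 0 <= succeeded <= behaviors:
        raise ValueError("succeeded must be between 0 and behaviors")
    return 100.0 * succeeded / behaviors


def summarize(asr_by_model: dict, threshold: float = 90.0) -> dict:
    """Average ASR and the number of models at or above the threshold."""
    values = list(asr_by_model.values())
    return {
        "average_asr": mean(values),
        "models_at_or_above_threshold": sum(v >= threshold for v in values),
    }


# Hypothetical success counts, for illustration only (not real results).
example_counts = {"model_a": 97, "model_b": 42, "model_c": 90}
example_asr = {name: attack_success_rate(n) for name, n in example_counts.items()}
summary = summarize(example_asr)
print(summary)
```

With 100 behaviors per model, the ASR percentage equals the raw success count, which is why the two columns can be read interchangeably at that sample size.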