AI Safety Research Dashboard

Models Tested: 10

Average ASR: 83.9%

Models with 90%+ ASR: 6/10

Best Defense: Kimi K2 (Thinking), 42.0% ASR

Key finding: Enabling extended reasoning (thinking mode) on Kimi K2 reduced the attack success rate (ASR) from 97% to 42%, the single most effective mitigation observed across all models tested.
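The size of that drop can be expressed either in percentage points or relative to the baseline. The snippet below is a minimal sketch of both calculations using only the two figures quoted in the key finding; the function name `asr_reduction` is a hypothetical helper introduced here for illustration, not part of the dashboard or the paper's code.

```python
def asr_reduction(baseline_pct, mitigated_pct):
    """Return (absolute drop in percentage points, relative drop in percent).

    Hypothetical helper for illustration; inputs are ASR values in percent.
    """
    absolute = baseline_pct - mitigated_pct
    relative = 100.0 * absolute / baseline_pct
    return absolute, round(relative, 1)


# Kimi K2 without thinking mode (97%) vs. with thinking mode (42%)
points, relative = asr_reduction(97, 42)
print(f"{points} percentage points, {relative}% relative reduction")
```

Under these figures the mitigation removes a little over half of the successful attacks, yet 42 of 100 behaviors still succeed.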

Leaderboard

Model	Family	Type	Behaviors	Succeeded	Refused	ASR (%)
Mistral Large 3 (675B)	Mistral	Frontier	100	100	0	100
Gemma 3 (12B)	Gemma	Frontier	100	100	0	100
GLM-4.6	GLM	Frontier	100	99	1	99
DeepSeek V3.1 (671B)	DeepSeek	Frontier	100	99	1	99
Kimi K2	Kimi	Frontier	100	97	3	97
Cogito 2.1 (671B)	Cogito	Frontier	100	96	4	96
GPT-OSS (20B)	OpenAI	Frontier	100	78	22	78
GPT-OSS (120B)	OpenAI	Frontier	100	73	27	73
MiniMax M2	MiniMax	Frontier	100	55	45	55
Kimi K2 (Thinking)	Kimi	Frontier (Thinking)	100	42	58	42

Attack Success Rate by Model

Higher ASR = more vulnerable. 100 harmful behaviors per model, up to 5 multi-turn attack rounds each. Dataset: richardyoung/tempest-replication. Paper: arxiv:2512.07059
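The summary figures for this section (average ASR, number of models at or above 90%, and the lowest-ASR model) follow directly from the Behaviors and Succeeded columns of the leaderboard. The sketch below assumes ASR is simply succeeded behaviors divided by total behaviors, which is consistent with every row in the table; the names `ROWS`, `asr`, and `summarize` are hypothetical and chosen for illustration only.

```python
# (model, behaviors, succeeded) rows copied from the leaderboard table
ROWS = [
    ("Mistral Large 3 (675B)", 100, 100),
    ("Gemma 3 (12B)", 100, 100),
    ("GLM-4.6", 100, 99),
    ("DeepSeek V3.1 (671B)", 100, 99),
    ("Kimi K2", 100, 97),
    ("Cogito 2.1 (671B)", 100, 96),
    ("GPT-OSS (20B)", 100, 78),
    ("GPT-OSS (120B)", 100, 73),
    ("MiniMax M2", 100, 55),
    ("Kimi K2 (Thinking)", 100, 42),
]


def asr(succeeded, behaviors):
    """Attack success rate in percent (assumed definition: succeeded / behaviors)."""
    return 100.0 * succeeded / behaviors


def summarize(rows):
    """Recompute the headline statistics from the per-model rows."""
    rates = {model: asr(succeeded, behaviors) for model, behaviors, succeeded in rows}
    most_robust = min(rates, key=rates.get)  # lowest ASR = best defense
    return {
        "models": len(rates),
        "average_asr": round(sum(rates.values()) / len(rates), 1),
        "at_or_above_90": sum(rate >= 90 for rate in rates.values()),
        "lowest_asr_model": most_robust,
        "lowest_asr": rates[most_robust],
    }


if __name__ == "__main__":
    for key, value in summarize(ROWS).items():
        print(f"{key}: {value}")
```

The average here is an unweighted mean over models, which matches a pooled rate only because every model was tested on the same 100 behaviors.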

Models Tested: 256

Overall Pass Rate: 43.7%

Diagnostic Tests: 20

Hardest Test Pass Rate: 2.7%

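Key finding: Most LLMs cannot reliably follow precise instructions: the overall pass rate is 43.7%, and only one model (qwen/qwen-plus-2025-07-28:thinking) passes all 20 tests. The hardest test (String Manipulation Chain) is passed by only 7 of 256 models (2.7%). Even top-tier models fail basic formatting and counting tasks.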

Model Leaderboard

Rank	Model	Passed	Total	Pass Rate (%)
1	qwen/qwen-plus-2025-07-28:thinking	20	20	100
2	x-ai/grok-4-fast	19	20	95
3	x-ai/grok-code-fast-1	18	20	90
4	x-ai/grok-4	18	20	90
5	openai/gpt-oss-120b	17	20	85
6	openai/gpt-oss-20b:free	17	20	85
7	qwen/qwen3-vl-235b-a22b-thinking	15	20	75
8	qwen/qwen3-max	15	20	75
9	openai/gpt-oss-20b	15	20	75
10	openai/gpt-5-codex	15	20	75
11	qwen/qwen3-vl-235b-a22b-instruct	15	20	75
12	tencent/hunyuan-a13b-instruct	15	20	75
13	x-ai/grok-3-mini	15	20	75
14	anthropic/claude-3.7-sonnet	15	20	75
15	anthropic/claude-3.7-sonnet:thinking	15	20	75
16	perplexity/sonar-reasoning-pro	15	20	75
17	nousresearch/deephermes-3-mistral-24b-preview	14	20	70
18	x-ai/grok-3	14	20	70
19	nousresearch/hermes-4-405b	14	20	70
20	x-ai/grok-3-beta	14	20	70
21	sao10k/l3.1-70b-hanami-x1	14	20	70
22	google/gemini-2.5-flash-image	14	20	70
23	opengvlab/internvl3-78b	14	20	70
24	switchpoint/router	14	20	70
25	openrouter/auto	14	20	70
26	anthropic/claude-3.5-sonnet	14	20	70
27	google/gemini-2.5-flash-lite-preview-09-2025	13	20	65
28	qwen/qwen3-coder-plus	13	20	65
29	nousresearch/hermes-4-70b	13	20	65
30	google/gemini-2.5-flash-image-preview	13	20	65
31	moonshotai/kimi-k2-0905	13	20	65
32	z-ai/glm-4-32b	13	20	65
33	google/gemini-2.5-flash-lite-preview-06-17	13	20	65
34	arcee-ai/coder-large	13	20	65
35	anthropic/claude-sonnet-4	13	20	65
36	x-ai/grok-3-mini-beta	13	20	65
37	allenai/olmo-2-0325-32b-instruct	13	20	65
38	openai/o3	13	20	65
39	alfredpros/codellama-7b-instruct-solidity	13	20	65
40	sao10k/l3-euryale-70b	13	20	65
41	cognitivecomputations/dolphin3.0-mistral-24b:free	13	20	65
42	cognitivecomputations/dolphin3.0-mistral-24b	13	20	65
43	qwen/qwen-plus-2025-07-28	13	20	65
44	qwen/qwen3-coder	13	20	65
45	qwen/qwen3-235b-a22b-2507	13	20	65
46	moonshotai/kimi-k2	13	20	65
47	mistralai/devstral-medium	13	20	65
48	inclusionai/ling-1t	13	20	65
49	openai/gpt-4-0314	13	20	65
50	cohere/command-r-08-2024	13	20	65

Test Difficulty (hardest first)

Test	Category	Passed	Total	Pass Rate (%)
String Manipulation Chain	String Manipulation	7	256	2.7
Remove Repeated Letters JSON	Data Processing	14	256	5.5
Multi-step String Manipulation	String Manipulation	20	256	7.8
Vowel Count Sorting	Data Processing	20	256	7.8
Same Start/End Letter	String Manipulation	22	256	8.6
Matrix Diagonal Difference	Mathematical	32	256	12.5
Deduplication and Position Multiply	Mathematical	43	256	16.8
String Replace with Newlines	String Manipulation	74	256	28.9
Prime JSON	Format Conversion	123	256	48
Base64 Encoding	Format Conversion	147	256	57.4
Complex Password	Constraint Compliance	148	256	57.8
Digit Sum Categorization	Mathematical	150	256	58.6
Complex List Processing	Data Processing	163	256	63.7
Prime After 10000	Mathematical	170	256	66.4
Perfect Squares Table	Format Conversion	174	256	68
CSV Filter Markdown	Data Processing	180	256	70.3
Roman Numerals	Mathematical	180	256	70.3
Sentence Without E	Constraint Compliance	181	256	70.7
Safety Refusal	Constraint Compliance	185	256	72.3
Selective Text Processing	Data Processing	205	256	80.1
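The overall pass rate, the hardest-test pass rate, and per-category rates can all be re-derived from the Passed column of the difficulty table. The following is a minimal sketch, assuming the overall rate is pooled over all model-test pairs (total passes divided by 256 models times 20 tests); the names `TESTS`, `pass_rate`, `overall_pass_rate`, and `category_pass_rates` are hypothetical and not taken from the dataset or paper.

```python
from collections import defaultdict

TOTAL_MODELS = 256

# (test, category, models passed) copied from the difficulty table
TESTS = [
    ("String Manipulation Chain", "String Manipulation", 7),
    ("Remove Repeated Letters JSON", "Data Processing", 14),
    ("Multi-step String Manipulation", "String Manipulation", 20),
    ("Vowel Count Sorting", "Data Processing", 20),
    ("Same Start/End Letter", "String Manipulation", 22),
    ("Matrix Diagonal Difference", "Mathematical", 32),
    ("Deduplication and Position Multiply", "Mathematical", 43),
    ("String Replace with Newlines", "String Manipulation", 74),
    ("Prime JSON", "Format Conversion", 123),
    ("Base64 Encoding", "Format Conversion", 147),
    ("Complex Password", "Constraint Compliance", 148),
    ("Digit Sum Categorization", "Mathematical", 150),
    ("Complex List Processing", "Data Processing", 163),
    ("Prime After 10000", "Mathematical", 170),
    ("Perfect Squares Table", "Format Conversion", 174),
    ("CSV Filter Markdown", "Data Processing", 180),
    ("Roman Numerals", "Mathematical", 180),
    ("Sentence Without E", "Constraint Compliance", 181),
    ("Safety Refusal", "Constraint Compliance", 185),
    ("Selective Text Processing", "Data Processing", 205),
]


def pass_rate(passed, total=TOTAL_MODELS):
    """Percentage of models that passed, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


def overall_pass_rate(tests, total=TOTAL_MODELS):
    """Pooled pass rate over every (model, test) pair."""
    return pass_rate(sum(passed for _, _, passed in tests), total * len(tests))


def category_pass_rates(tests, total=TOTAL_MODELS):
    """Pooled pass rate per category."""
    passed_by_cat, count_by_cat = defaultdict(int), defaultdict(int)
    for _, category, passed in tests:
        passed_by_cat[category] += passed
        count_by_cat[category] += 1
    return {
        cat: pass_rate(passed_by_cat[cat], total * count_by_cat[cat])
        for cat in passed_by_cat
    }


if __name__ == "__main__":
    hardest = min(TESTS, key=lambda row: row[2])
    print("overall:", overall_pass_rate(TESTS))
    print("hardest:", hardest[0], pass_rate(hardest[2]))
    for cat, rate in sorted(category_pass_rates(TESTS).items(), key=lambda kv: kv[1]):
        print(cat, rate)
```

Pooling by category makes the spread visible: string manipulation tests sit far below the other four categories under this calculation.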

20 diagnostic tests across 5 categories. Dataset: richardyoung/llm-instruction-following-eval. Paper: arxiv:2510.18892

Paper	Topic	Link
Replicating TEMPEST at Scale	Multi-turn adversarial attacks on 10 frontier models	arxiv:2512.07059
Abliteration Methods Compared	4 tools across 16 architectures	arxiv:2512.13655
Instruction Adherence (256 LLMs)	20 diagnostic tests	arxiv:2510.18892
Guardrail Robustness	10 guardrails, 1,445 adversarial prompts	arxiv:2511.22047
PHI Leakage in Medical OCR	Vision masking fails at 42.9% ceiling	arxiv:2511.18272
CardioEmbed	Domain embeddings for cardiology	arxiv:2511.10930
LoRA Embeddings Compared	10 models for clinical text	arxiv:2511.19739