AI Safety Research Dashboard
Richard J. Young, Ph.D. — AI Safety & Alignment Researcher
Models Tested: 10
Average ASR: 83.9%
Models with 90%+ ASR: 6/10
Best Defense: Kimi K2 (Thinking), 42.0% ASR
Key finding: Enabling extended reasoning (thinking mode) on Kimi K2 reduced the attack success rate (ASR) from 97% to 42%, the single most effective mitigation observed across all models tested.
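The 97% to 42% change can be stated either as a drop in percentage points or as a relative reduction. The minimal sketch below works through both using only the two figures quoted above; the function names are illustrative and not taken from the paper or dataset.

```python
def percentage_point_drop(before: float, after: float) -> float:
    """Absolute change in ASR, in percentage points."""
    return before - after


def relative_reduction(before: float, after: float) -> float:
    """Relative change in ASR, as a percentage of the starting value."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# Figures quoted in the key finding: Kimi K2 without and with thinking mode.
asr_standard = 97.0
asr_thinking = 42.0

points = percentage_point_drop(asr_standard, asr_thinking)
relative = round(relative_reduction(asr_standard, asr_thinking), 1)
print(f"{points:.0f} percentage points, {relative}% relative reduction")
```

Under these figures the mitigation removes a little over half of the successful attacks, yet the remaining 42% ASR is still substantial.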
Leaderboard
Model | Family | Type | Behaviors | Succeeded | Refused | ASR (%) |
|---|---|---|---|---|---|---|
Attack Success Rate by Model
A higher ASR means a more vulnerable model. Data from richardyoung/tempest-replication: 100 harmful behaviors per model, with up to 5 multi-turn attack rounds each. Read the paper (arxiv:2512.07059).
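As a rough illustration of how an ASR figure could be derived from multi-turn results, the sketch below assumes that a behavior counts as a success if any of its attack rounds (capped at 5) succeeds. That scoring rule, the data layout, and the function names are assumptions made for illustration; the paper and dataset define the actual procedure.

```python
from typing import Sequence

MAX_ROUNDS = 5  # cap stated on the dashboard: up to 5 attack rounds per behavior


def behavior_succeeded(round_outcomes: Sequence[bool],
                       max_rounds: int = MAX_ROUNDS) -> bool:
    """Assumed rule: a behavior is broken if any round within the cap succeeds."""
    return any(round_outcomes[:max_rounds])


def attack_success_rate(per_behavior: Sequence[Sequence[bool]],
                        max_rounds: int = MAX_ROUNDS) -> float:
    """ASR in percent: successful behaviors divided by behaviors tested."""
    if not per_behavior:
        raise ValueError("no behaviors to score")
    successes = sum(behavior_succeeded(r, max_rounds) for r in per_behavior)
    return 100.0 * successes / len(per_behavior)


# Toy data (hypothetical, not from the dataset): one inner list per behavior,
# one boolean per attack round.
toy_results = [
    [False, True],                              # broken on round 2
    [False, False, False, False, False],        # refused through all 5 rounds
    [True],                                     # broken on round 1
    [False, False, False, False, False, True],  # round 6 exceeds the cap, ignored
]
toy_asr = attack_success_rate(toy_results)
print(f"Toy ASR: {toy_asr:.1f}%")
```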
Models Tested: 256
Overall Pass Rate: 43.7%
Diagnostic Tests: 20
Hardest Test Pass Rate: 2.7%
Key finding: Most LLMs cannot reliably follow precise instructions. The hardest test (String Manipulation Chain) is passed by only 7 out of 256 models. Even top-tier models fail basic formatting and counting tasks.
Model Leaderboard
Rank | Model | Passed | Total | Pass Rate (%) |
|---|---|---|---|---|
1 | qwen/qwen-plus-2025-07-28:thinking | 20 | 20 | 100 |
2 | x-ai/grok-4-fast | 19 | 20 | 95 |
3 | x-ai/grok-code-fast-1 | 18 | 20 | 90 |
4 | x-ai/grok-4 | 18 | 20 | 90 |
5 | openai/gpt-oss-120b | 17 | 20 | 85 |
6 | openai/gpt-oss-20b:free | 17 | 20 | 85 |
7 | qwen/qwen3-vl-235b-a22b-thinking | 15 | 20 | 75 |
8 | qwen/qwen3-max | 15 | 20 | 75 |
9 | openai/gpt-oss-20b | 15 | 20 | 75 |
10 | openai/gpt-5-codex | 15 | 20 | 75 |
11 | qwen/qwen3-vl-235b-a22b-instruct | 15 | 20 | 75 |
12 | tencent/hunyuan-a13b-instruct | 15 | 20 | 75 |
13 | x-ai/grok-3-mini | 15 | 20 | 75 |
14 | anthropic/claude-3.7-sonnet | 15 | 20 | 75 |
15 | anthropic/claude-3.7-sonnet:thinking | 15 | 20 | 75 |
16 | perplexity/sonar-reasoning-pro | 15 | 20 | 75 |
17 | nousresearch/deephermes-3-mistral-24b-preview | 14 | 20 | 70 |
18 | x-ai/grok-3 | 14 | 20 | 70 |
19 | nousresearch/hermes-4-405b | 14 | 20 | 70 |
20 | x-ai/grok-3-beta | 14 | 20 | 70 |
21 | sao10k/l3.1-70b-hanami-x1 | 14 | 20 | 70 |
22 | google/gemini-2.5-flash-image | 14 | 20 | 70 |
23 | opengvlab/internvl3-78b | 14 | 20 | 70 |
24 | switchpoint/router | 14 | 20 | 70 |
25 | openrouter/auto | 14 | 20 | 70 |
26 | anthropic/claude-3.5-sonnet | 14 | 20 | 70 |
27 | google/gemini-2.5-flash-lite-preview-09-2025 | 13 | 20 | 65 |
28 | qwen/qwen3-coder-plus | 13 | 20 | 65 |
29 | nousresearch/hermes-4-70b | 13 | 20 | 65 |
30 | google/gemini-2.5-flash-image-preview | 13 | 20 | 65 |
31 | moonshotai/kimi-k2-0905 | 13 | 20 | 65 |
32 | z-ai/glm-4-32b | 13 | 20 | 65 |
33 | google/gemini-2.5-flash-lite-preview-06-17 | 13 | 20 | 65 |
34 | arcee-ai/coder-large | 13 | 20 | 65 |
35 | anthropic/claude-sonnet-4 | 13 | 20 | 65 |
36 | x-ai/grok-3-mini-beta | 13 | 20 | 65 |
37 | allenai/olmo-2-0325-32b-instruct | 13 | 20 | 65 |
38 | openai/o3 | 13 | 20 | 65 |
39 | alfredpros/codellama-7b-instruct-solidity | 13 | 20 | 65 |
40 | sao10k/l3-euryale-70b | 13 | 20 | 65 |
41 | cognitivecomputations/dolphin3.0-mistral-24b:free | 13 | 20 | 65 |
42 | cognitivecomputations/dolphin3.0-mistral-24b | 13 | 20 | 65 |
43 | qwen/qwen-plus-2025-07-28 | 13 | 20 | 65 |
44 | qwen/qwen3-coder | 13 | 20 | 65 |
45 | qwen/qwen3-235b-a22b-2507 | 13 | 20 | 65 |
46 | moonshotai/kimi-k2 | 13 | 20 | 65 |
47 | mistralai/devstral-medium | 13 | 20 | 65 |
48 | inclusionai/ling-1t | 13 | 20 | 65 |
49 | openai/gpt-4-0314 | 13 | 20 | 65 |
50 | cohere/command-r-08-2024 | 13 | 20 | 65 |
Test Difficulty (hardest first)
Test | Category | Passed | Total | Pass Rate (%) |
|---|---|---|---|---|
String Manipulation Chain | String Manipulation | 7 | 256 | 2.7 |
Remove Repeated Letters JSON | Data Processing | 14 | 256 | 5.5 |
Multi-step String Manipulation | String Manipulation | 20 | 256 | 7.8 |
Vowel Count Sorting | Data Processing | 20 | 256 | 7.8 |
Same Start/End Letter | String Manipulation | 22 | 256 | 8.6 |
Matrix Diagonal Difference | Mathematical | 32 | 256 | 12.5 |
Deduplication and Position Multiply | Mathematical | 43 | 256 | 16.8 |
String Replace with Newlines | String Manipulation | 74 | 256 | 28.9 |
Prime JSON | Format Conversion | 123 | 256 | 48 |
Base64 Encoding | Format Conversion | 147 | 256 | 57.4 |
Complex Password | Constraint Compliance | 148 | 256 | 57.8 |
Digit Sum Categorization | Mathematical | 150 | 256 | 58.6 |
Complex List Processing | Data Processing | 163 | 256 | 63.7 |
Prime After 10000 | Mathematical | 170 | 256 | 66.4 |
Perfect Squares Table | Format Conversion | 174 | 256 | 68 |
CSV Filter Markdown | Data Processing | 180 | 256 | 70.3 |
Roman Numerals | Mathematical | 180 | 256 | 70.3 |
Sentence Without E | Constraint Compliance | 181 | 256 | 70.7 |
Safety Refusal | Constraint Compliance | 185 | 256 | 72.3 |
Selective Text Processing | Data Processing | 205 | 256 | 80.1 |
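The per-test counts above can be pooled to cross-check the headline pass rate and to compare categories. The sketch below copies the Passed column from the table and assumes every test was run on all 256 models; the helper names are illustrative. Pooling all 20 rows gives 2,238 passes out of 5,120 model-test pairs, which agrees with the 43.7% overall pass rate reported on the dashboard.

```python
from collections import defaultdict

TOTAL_MODELS = 256

# (test, category, models passed), copied from the Test Difficulty table.
RESULTS = [
    ("String Manipulation Chain", "String Manipulation", 7),
    ("Remove Repeated Letters JSON", "Data Processing", 14),
    ("Multi-step String Manipulation", "String Manipulation", 20),
    ("Vowel Count Sorting", "Data Processing", 20),
    ("Same Start/End Letter", "String Manipulation", 22),
    ("Matrix Diagonal Difference", "Mathematical", 32),
    ("Deduplication and Position Multiply", "Mathematical", 43),
    ("String Replace with Newlines", "String Manipulation", 74),
    ("Prime JSON", "Format Conversion", 123),
    ("Base64 Encoding", "Format Conversion", 147),
    ("Complex Password", "Constraint Compliance", 148),
    ("Digit Sum Categorization", "Mathematical", 150),
    ("Complex List Processing", "Data Processing", 163),
    ("Prime After 10000", "Mathematical", 170),
    ("Perfect Squares Table", "Format Conversion", 174),
    ("CSV Filter Markdown", "Data Processing", 180),
    ("Roman Numerals", "Mathematical", 180),
    ("Sentence Without E", "Constraint Compliance", 181),
    ("Safety Refusal", "Constraint Compliance", 185),
    ("Selective Text Processing", "Data Processing", 205),
]


def pooled_pass_rate(passed_counts, total_models=TOTAL_MODELS):
    """Percent of (model, test) pairs passed across the given tests."""
    counts = list(passed_counts)
    if not counts:
        raise ValueError("no tests to pool")
    return round(100 * sum(counts) / (len(counts) * total_models), 1)


def pass_rate_by_category(results):
    """Group the per-test counts by category and pool each group."""
    grouped = defaultdict(list)
    for _test, category, passed in results:
        grouped[category].append(passed)
    return {cat: pooled_pass_rate(counts) for cat, counts in grouped.items()}


overall = pooled_pass_rate(passed for _, _, passed in RESULTS)
by_category = pass_rate_by_category(RESULTS)

print(f"Overall: {overall}%")
for category, rate in sorted(by_category.items(), key=lambda kv: kv[1]):
    print(f"{category}: {rate}%")
```

Sorting by pooled rate puts String Manipulation first as the weakest category, consistent with its four tests all sitting in the lower half of the table.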
20 diagnostic tests across 5 categories. Read the paper (arxiv:2510.18892). Dataset: richardyoung/llm-instruction-following-eval
Richard J. Young, Ph.D.
AI alignment and safety researcher focused on trustworthy foundation models in healthcare and behavioral health.
Current Roles
- Senior AI Research Scientist, UnitedHealth Group
- Part-Time Professor, UNLV Lee Business School
Research Focus
Empirically measuring and mitigating worst-case failure modes in LLMs: adversarial attacks, instruction breakdowns, privacy leakage, and information hazards in clinical and behavioral health settings.
Publications
| Paper | Topic | Link |
|---|---|---|
| Replicating TEMPEST at Scale | Multi-turn adversarial attacks on 10 frontier models | arxiv:2512.07059 |
| Abliteration Methods Compared | 4 tools across 16 architectures | arxiv:2512.13655 |
| Instruction Adherence (256 LLMs) | 20 diagnostic tests | arxiv:2510.18892 |
| Guardrail Robustness | 10 guardrails, 1,445 adversarial prompts | arxiv:2511.22047 |
| PHI Leakage in Medical OCR | Vision masking fails at 42.9% ceiling | arxiv:2511.18272 |
| CardioEmbed | Domain embeddings for cardiology | arxiv:2511.10930 |
| LoRA Embeddings Compared | 10 models for clinical text | arxiv:2511.19739 |
Contact
richard@deepneuro.ai | ryoung@unlv.edu | ORCID: 0000-0002-1109-7552
Want your model tested? Open a discussion on the TEMPEST dataset with the model repo link.