Stop Running Entire Models When You Only Need a Few Layers
Purpose-built guard models, full-layer fine-tuning, brute-force inference: the industry's default approach to model optimization assumes you need to run everything. Starseer's interpretability platform challenges that assumption. By reading the internal representations of neural networks, we isolate exactly which layers, neurons, and activation patterns drive specific behaviors, and discard the rest. The results: near-identical accuracy to leading 7B guard models at ~38ms latency, fine-tuning that exceeds full-layer performance with 31% fewer layers, and safety classification at 85%+ accuracy using just 3 layers across 17 models.
Guardrails Don't Need Their Own Model
The industry standard for jailbreak and content safety detection is a purpose-built guard model, typically 7 to 9 billion parameters, running alongside your production model. These work well, but they add latency, cost, and operational complexity.
Starseer takes a fundamentally different approach. Instead of running a separate model, we train lightweight probes that read the AI-native mathematical representations already present in the host model's activations during inference. We're not adding computation, we're extracting signal that's already there.
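Mechanically, a probe of this kind is just a lightweight classifier trained on activation vectors. The sketch below is illustrative rather than Starseer's implementation: synthetic vectors and scikit-learn's logistic regression stand in for hidden states that would, in practice, be captured with a forward hook on a chosen layer of the host model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for hidden-state activations captured during inference.
# In practice these come from a forward hook on a chosen layer of the
# host model; here we synthesize two separable clusters for illustration.
hidden_dim = 256
benign = rng.normal(loc=0.0, scale=1.0, size=(500, hidden_dim))
jailbreak = rng.normal(loc=0.5, scale=1.0, size=(500, hidden_dim))

X = np.vstack([benign, jailbreak])
y = np.concatenate([np.zeros(500), np.ones(500)])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "probe" is just a linear classifier over activations: no extra
# forward pass through a separate guard model is required.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```

Because the activations already exist as a byproduct of normal inference, the only added cost at serving time is one matrix-vector product per request.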
On a standardized jailbreak detection benchmark using a mixed corpus of benign and malicious prompts, our probe achieved a 0.9918 AUC and 96.3% accuracy, within striking distance of WildGuard-7B (0.9953 AUC, 98.0% accuracy), the leading open-source guard model. Our probe's true positive rate at 5% false positive rate hit 97.0%, compared to WildGuard's 98.8%.
The cost difference is where it gets interesting. Starseer's probe runs 2.7 times faster than WildGuard-7B and 15 times faster than ShieldGemma-9B. For models where you control the weights, probes run directly on the model's own internal representations during inference. For external APIs, we use a canary architecture that operates asynchronously.
The gap between interpretability-guided detection and conventional approaches isn't marginal. It's structural. Guard models in the lower tier (ShieldGemma-9B, Llama Guard 3-8B, Granite Guardian, and Llama Prompt Guard 2) landed between 61% and 72% accuracy, with true positive rates in the 14–44% range.
Standardized Jailbreak Detection Benchmark
Mixed benign/malicious prompt corpus · Higher is better across all metrics
| Model | AUC | Accuracy | TPR @ 5% FPR | Latency |
|---|---|---|---|---|
| WildGuard-7B | 0.9953 | 98.0% | 98.8% | 106 ms |
| ✦ Starseer | 0.9918 | 96.3% | 97.0% | ~38 ms |
| ShieldGemma-9B | 0.7806 | 71.0% | 44.1% | 570 ms |
| Llama Guard 3-8B | 0.7761 | 71.8% | 44.1% | 121 ms |
| Granite Guardian 3.3-8B | 0.7601 | 69.8% | 22.4% | 130 ms |
| Llama Prompt Guard 2-86M | 0.6344 | 61.2% | 14.3% | 11 ms |
AUC = Area Under the ROC Curve (overall detection quality, 0–1). TPR @ 5% FPR = True Positive Rate while misclassifying only 5% of safe content.
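Both metrics are straightforward to reproduce on your own classifier scores with scikit-learn. The labels and scores below are synthetic stand-ins, not benchmark data; the mechanics are what matter.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)
# Hypothetical probe scores: higher means "more likely jailbreak".
labels = np.concatenate([np.zeros(1000), np.ones(1000)])
scores = np.concatenate([rng.normal(0, 1, 1000), rng.normal(3, 1, 1000)])

auc = roc_auc_score(labels, scores)

# TPR @ 5% FPR: detection rate while flagging at most 5% of safe prompts.
fpr, tpr, _ = roc_curve(labels, scores)
tpr_at_5fpr = tpr[np.searchsorted(fpr, 0.05, side="right") - 1]

print(f"AUC={auc:.4f}  TPR@5%FPR={tpr_at_5fpr:.3f}")
```

TPR at a fixed false positive rate is usually the more operationally relevant number: it tells you the detection rate at the alert budget your safety team can actually tolerate.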
Fine-Tuning: Why Adapt Every Layer When Most Don't Matter?
Standard LoRA fine-tuning applies low-rank adapters across every layer of a model. It's elegant and effective, but it doesn't account for the fact that different behaviors are encoded in different layers. Format compliance, reasoning patterns, and domain knowledge don't live everywhere. They concentrate in specific parts of the network.
Starseer uses interpretability signals to identify which layers actually encode the target behavior, then applies fine-tuning only to those layers. We tested this on GSM8K, a standard mathematical reasoning benchmark, using Llama-3.1-8B.
Full LoRA across all 32 layers produced 59.2% exact match accuracy and 34.8% answer-found accuracy. Starseer's top 50% layer selection, just 16 layers, recovered 93% of full fine-tuning performance on exact match while actually exceeding it on answer-found (46.6% vs. 34.8%). Our combined interpretability signal, using 22 of 32 layers, beat full fine-tuning on both metrics: 60.7% exact match and 58.6% answer-found, a 68% improvement in raw reasoning ability with 31% fewer layers.
GSM8K Mathematical Reasoning Benchmark
Llama-3.1-8B · Interpretability-guided layer selection vs. full LoRA
| Method | Layers | Exact Match | Answer Found | vs. Full Fine-Tune |
|---|---|---|---|---|
| Full LoRA (all 32 layers) | 32 | 59.2% | 34.8% | Baseline |
| ✦ Starseer Top 50% | 16 | 54.8% | 46.6% | 93% exact / 134% answer |
| ✦ Starseer Combined Signal | 22 | 60.7% | 58.6% | 103% exact / 168% answer |
| Base model (no fine-tuning) | 0 | 0.0% | 17.7% | — |
Exact Match = answer in the precise expected format. Answer Found = correct answer anywhere in the response.
Fewer layers didn't just maintain performance; they improved it.
Our hypothesis is that irrelevant layers introduce noise during fine-tuning, and surgical selection eliminates that noise. We validated the same approach on Qwen2-7B, confirming that Starseer's layer-ranking signals transfer across model architectures.
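Once a per-layer signal exists, the selection step itself is simple. A minimal sketch, assuming a hypothetical per-layer score (e.g. held-out probe accuracy for the target behavior measured at each layer's activations):

```python
import numpy as np

rng = np.random.default_rng(2)
n_layers = 32

# Hypothetical per-layer interpretability signal; in a real pipeline
# this would come from probing each layer for the target behavior.
layer_scores = rng.uniform(0.5, 1.0, size=n_layers)

def select_layers(scores, fraction):
    """Return sorted indices of the top `fraction` of layers by score."""
    k = max(1, round(len(scores) * fraction))
    return sorted(np.argsort(scores)[-k:].tolist())

top_half = select_layers(layer_scores, 0.5)      # 16 of 32 layers
combined = select_layers(layer_scores, 22 / 32)  # 22 of 32 layers

# These indices then restrict where adapters are attached (e.g. via a
# LoRA config's per-layer selection) instead of adapting all 32 layers.
print(len(top_half), len(combined))
```

The resulting index list plugs into standard adapter tooling: most LoRA implementations accept an explicit set of layer indices to transform, so no custom training loop is needed.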
Three Layers. Seventeen Models. Eighty-Five Percent Accuracy.
Perhaps the most striking finding is how few layers are needed for safety classification across entirely different model families.
Using activation patterns to identify the three most informative layers for jailbreak detection, we tested on JailbreakBench (200 prompts) across models ranging from 0.5B to 32B parameters, spanning seven model families, including Mistral, Olmo, Qwen, and Llama.
The results were remarkably consistent. Accuracy ranged from 85.2% to 87.1%, with F1 scores between 84.4% and 86.2%, all using just three layers out of models with 24 to 64 total layers. In most cases, using all layers actually performed worse than Starseer's targeted selection, due to noise from irrelevant layers.
This isn't a single-model trick. The layer-ranking method is model-agnostic. The signal is in the structure, not the specific architecture.
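The ranking idea behind this result can be reconstructed in miniature. The sketch below is a toy, not Starseer's method: synthetic activations with signal planted in three hypothetical layers, ranked by single-layer probe accuracy, then classified using only the top three layers concatenated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_layers, hidden = 24, 64

def synth_acts(n, shift):
    # One activation vector per layer; only a few layers carry signal,
    # mimicking how safety-relevant structure concentrates in a network.
    acts = rng.normal(0, 1, size=(n, n_layers, hidden))
    acts[:, [5, 11, 17], :] += shift  # informative layers (hypothetical)
    return acts

X = np.vstack([synth_acts(300, 0.0), synth_acts(300, 0.6)])
y = np.concatenate([np.zeros(300), np.ones(300)])

# Rank layers by single-layer probe accuracy, keep the top three.
layer_acc = [
    LogisticRegression(max_iter=500).fit(X[:, i, :], y).score(X[:, i, :], y)
    for i in range(n_layers)
]
top3 = np.argsort(layer_acc)[-3:]

# Classify using only those three layers' activations, concatenated.
feats = X[:, top3, :].reshape(len(X), -1)
probe = LogisticRegression(max_iter=1000).fit(feats, y)
print(f"top-3 layers: {sorted(top3.tolist())}")
```

Note that the ranking procedure never consults the architecture, only the activations, which is why the same recipe applies unchanged across model families of very different depths.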
Interpretability Belongs in Your Ops Stack
Interpretability has spent years in the research lab. The results on this page make the case that it belongs somewhere else: in your production pipeline, alongside the monitoring, deployment, and optimization tools your team already relies on.
The commercial impact is concrete. Interpretability-guided probes replace dedicated guard models, cutting inference costs and latency without sacrificing detection quality. Targeted layer selection reduces fine-tuning compute and training time while delivering equal or better performance. Cross-architecture transferability means the same methodology works as your model stack evolves, no rewrite required.
These aren't research projections. They're operational results, measured on standard benchmarks, validated across model families, and ready to deploy. When you can see inside a model and identify exactly which components drive a given behavior, every downstream decision gets cheaper, faster, and more predictable.
Every number on this page comes from production research on real models. Interpretability isn't a lens for studying AI. It's a lever for running it.
If you're running guard models, managing fine-tuning pipelines, or scaling inference across model families, Starseer's interpretability platform fits directly into the stack you already have.
What Could Starseer Do for Your Models?
Precision optimization for guardrails, fine-tuning, and model efficiency, built on interpretability signals from production research.
Get in Touch →