Skip to content
R&D

Stop Running Entire Models When You Only Need a Few Layers

Starseer Labs
Starseer Labs
Stop Running Entire Models When You Only Need a Few Layers
8:33

Purpose-built guard models, full-layer fine-tuning, brute-force inference: the industry's default approach to model optimization assumes you need to run everything. Starseer's interpretability platform challenges that assumption. By reading the internal representations of neural networks, we isolate exactly which layers, neurons, and activation patterns drive specific behaviors, and discard the rest. The results: near-identical accuracy to leading 7B guard models at ~38ms latency, fine-tuning that exceeds full-layer performance with 31% fewer layers, and safety classification at 85%+ accuracy using just 3 layers across 17 models.

 

Guardrails Don't Need Their Own Model

The industry standard for jailbreak and content safety detection is a purpose-built guard model, typically 7 to 9 billion parameters, running alongside your production model. These work well, but they add latency, cost, and operational complexity.

Starseer takes a fundamentally different approach. Instead of running a separate model, we train lightweight probes that read the AI-native mathematical representations already present in the host model's activations during inference. We're not adding computation, we're extracting signal that's already there.

On a standardized jailbreak detection benchmark using a mixed corpus of benign and malicious prompts, our probe achieved a 0.9918 AUC and 96.3% accuracy, within striking distance of WildGuard-7B (0.9953 AUC, 98.0% accuracy), the leading open-source guard model. Our probe's true positive rate at 5% false positive rate hit 97.0%, compared to WildGuard's 98.8%.

~38ms
Starseer probe latency
vs 106–570ms guard models
~1,000x
Fewer parameters than
the best 7B guard model
99.2%
Detection accuracy
on jailbreak classification

The cost difference is where it gets interesting. Starseer's probe runs 2.7 times faster than WildGuard-7B and 15 times faster than ShieldGemma-9B. For models where you control the weights, probes run directly on the model's own internal representations during inference. For external APIs, we use a canary architecture that operates asynchronously.

The gap between interpretability-guided detection and conventional approaches isn't marginal. It's structural. Guard models in the lower tier, ShieldGemma-9B, Llama Guard 3-8B, and Granite Guardian, clustered around 70% accuracy with true positive rates in the 14–44% range.

Standardized Jailbreak Detection Benchmark

Mixed benign/malicious prompt corpus · Higher is better across all metrics

Model AUC Accuracy TPR @ 5% FPR Latency
WildGuard-7B 0.9953 98.0% 98.8% 106 ms
✦  Starseer 0.9918 96.3% 97.0% ~38 ms
ShieldGemma-9B 0.7806 71.0% 44.1% 570 ms
Llama Guard 3-8B 0.7761 71.8% 44.1% 121 ms
Granite Guardian 3.3-8B 0.7601 69.8% 22.4% 130 ms
Llama Prompt Guard 2-86M 0.6344 61.2% 14.3% 11 ms

AUC = Area Under Curve (overall detection quality, 0–1). TPR @ 5% FPR = True Positive Rate while misclassifying only 5% of safe content.

Fine-Tuning: Why Adapt Every Layer When Most Don't Matter?

Standard LoRA fine-tuning applies low-rank adapters across every layer of a model. It's elegant and effective, but it doesn't account for the fact that different behaviors are encoded in different layers. Format compliance, reasoning patterns, and domain knowledge don't live everywhere. They concentrate in specific parts of the network.

Starseer uses interpretability signals to identify which layers actually encode the target behavior, then applies fine-tuning only to those layers. We tested this on GSM8K, a standard mathematical reasoning benchmark, using Llama-3.1-8B.

Full LoRA across all 32 layers produced 59.2% exact match accuracy and 34.8% answer-found accuracy. Starseer's top 50% layer selection, just 16 layers, recovered 93% of full fine-tuning performance on exact match while actually exceeding it on answer-found (46.6% vs. 34.8%). Our combined interpretability signal, using 22 of 32 layers, beat full fine-tuning on both metrics: 60.7% exact match and 58.6% answer-found, a 68% improvement in raw reasoning ability with 31% fewer layers.

GSM8K Mathematical Reasoning Benchmark

Llama-3.1-8B · Interpretability-guided layer selection vs. full LoRA

Method Layers Exact Match Answer Found vs. Full Fine-Tune
Full LoRA (all 32 layers) 32 59.2% 34.8% Baseline
✦  Starseer Top 50% 16 54.8% 46.6% 93% exact / 134% answer
✦  Starseer Combined Signal 22 60.7% 58.6% 103% exact / 168% answer
Base model (no fine-tuning) 0 0.0% 17.7%

Exact Match = answer in the precise expected format. Answer Found = correct answer anywhere in the response.

Key Finding

Fewer layers didn't just maintain performance, it improved it.

Our hypothesis is that irrelevant layers introduce noise during fine-tuning, and surgical selection eliminates that noise. We validated the same approach on Qwen2-7B, confirming that Starseer's layer-ranking signals transfer across model architectures.

Three Layers. Seventeen Models. Eighty-Five Percent Accuracy.

Perhaps the most striking finding is how few layers are needed for safety classification across entirely different model families.

Using activation patterns to identify the three most informative layers for jailbreak detection, we tested on JailbreakBench (200 prompts) across models ranging from 0.5B to 32B parameters, spanning seven different model families: Mistral, Olmo, Qwen, Llama, and others.

3 layers
Out of 24–64 total layers
needed for classification
85–87%
Accuracy across 17 models
from 7 different families

The results were remarkably consistent. Accuracy ranged from 85.2% to 87.1%, with F1 scores between 84.4% and 86.2%, all using just three layers out of models with 24 to 64 total layers. In most cases, using all layers actually performed worse than Starseer's targeted selection, due to noise from irrelevant layers.

This isn't a single-model trick. The layer-ranking method is model-agnostic. The signal is in the structure, not the specific architecture.

Interpretability Belongs in Your Ops Stack

Interpretability has spent years in the research lab. The results on this page make the case that it belongs somewhere else: in your production pipeline, alongside the monitoring, deployment, and optimization tools your team already relies on.

The commercial impact is concrete. Interpretability-guided probes replace dedicated guard models, cutting inference costs and latency without sacrificing detection quality. Targeted layer selection reduces fine-tuning compute and training time while delivering equal or better performance. Cross-architecture transferability means the same methodology works as your model stack evolves, no rewrite required.

These aren't research projections. They're operational results, measured on standard benchmarks, validated across model families, and ready to deploy. When you can see inside a model and identify exactly which components drive a given behavior, every downstream decision gets cheaper, faster, and more predictable.

Every number on this page comes from production research on real models. Interpretability isn't a lens for studying AI. It's a lever for running it.

If you're running guard models, managing fine-tuning pipelines, or scaling inference across model families, Starseer's interpretability platform fits directly into the stack you already have.

What Could Starseer Do for Your Models?

Precision optimization for guardrails, fine-tuning, and model efficiency, built on interpretability signals from production research.

Get in Touch →

Share this post