No Security Meter for AI: Our Takeaways from BIML's Latest Paper
Last week, the Berryville Institute of Machine Learning (BIML) published No Security Meter for AI, a paper by Gary McGraw, Harold Figueroa, Katie McMahon, and Richie Bonett that should be required reading for every security leader deploying AI systems in production. The paper makes a case that practitioners have felt intuitively but haven't had the language to articulate: the way we measure AI security today is fundamentally broken, and the path forward requires looking inside the model, not just at its outputs.
We were honored that BIML cited Starseer's interpretability research as early evidence that whitebox analysis yields security insights external approaches cannot. But the paper's implications go well beyond any single vendor. It reframes how the entire industry should think about securing AI systems.
Here are our key takeaways, and what they mean for organizations deploying AI at scale.
AI Security Benchmarks Don't Measure What You Think They Measure
BIML dissects the major AI security benchmarks, including SECURE, CAIBench, and ExCyTIn, and reaches a conclusion that should concern every security buyer: these benchmarks measure how well a model performs on security tasks, not whether the model itself is secure.
This is a critical distinction. A model that scores 92% on a security benchmark might still harbor backdoors, respond to adversarial manipulation, or behave unpredictably when confronted with real-world conditions that deviate from the test scenarios. As BIML puts it, using a benchmark score as a security rating is like treating an annual penetration test as proof that your system is secure. It tells you about known badness, not about the presence of security.
The problem runs deeper than AI security specifically. BIML cites UC Berkeley research demonstrating that eight of the most prominent AI agent benchmarks can be exploited to achieve near-perfect scores without solving a single task. No reasoning, no capability, just exploitation of how the score is computed. If general-purpose benchmarks are this fragile, security benchmarks built on the same foundations are unlikely to fare better.
For practitioners, the takeaway is straightforward: don't use benchmark scores as a proxy for security posture. They are useful for comparing models against each other on defined tasks, but they are not security meters.
The "Strange Loop": Doing Security vs. Being Secure
One of the paper's most powerful concepts is what BIML calls the "strange loop" of AI security. There is a fundamental difference between an AI system that performs well on security tasks (using ML to do security) and an AI system that is itself secure (building ML systems that are secure).
McGraw draws the parallel from software security, where this same confusion persisted for years. Security software, like antivirus tools, is software used for security. Software security is the discipline of building software that is itself secure. Too often, security software suffered from terrible software security. The same pattern is now repeating with AI.
A model can excel at identifying vulnerabilities in code, detecting phishing attempts, or reasoning about threat scenarios, and still be vulnerable to supply chain tampering, prompt injection, or behavioral manipulation. The benchmark scores tell you about the first capability. They say nothing about the second.
This is precisely why Starseer built AI-Verify around interpretability-based model validation rather than behavioral benchmarking. When we examine a model's internal activations, weight distributions, and behavioral signatures, we're answering a fundamentally different question than "how does this model perform on a security quiz." We're asking: is this model structurally what you approved, or has something changed?
Output Monitoring Is Necessary. It Is Not Sufficient.
BIML introduces a concept from McGraw's earlier work that crystallizes the limitation of current AI security approaches: the "badness-ometer." A badness-ometer measures insecurity on a scale from "deep trouble" (you failed known tests) to "who knows" (you passed the tests we had). It can never tell you that a system is actually secure.
Every approach that monitors outputs, scans prompts, or applies guardrails at the boundary operates as a badness-ometer. These tools catch visible anomalies and known attack patterns. They are blind to threats that produce normal-looking outputs by design: backdoors that activate only under specific conditions, supply chain tampering that preserves benchmark performance, covert capabilities that evade behavioral detection.
This is where interpretability changes the equation. By examining what happens inside the model at inference, including activation patterns, circuit behavior, and structural signatures, you can detect threats that are invisible to any external observer. BIML's paper explicitly validates this approach, citing Starseer's research demonstrating that layerwise activation analysis can classify jailbreak attempts and prompt injection attacks, matching or exceeding the performance of safety classifiers explicitly fine-tuned for those tasks.
detection at 5% false positive rate
interpretability-guided fine-tuning
with targeted LoRA fine-tuning
The implication for security teams is clear: output monitoring remains a necessary layer of defense. But organizations that rely on it exclusively are flying blind to an entire category of threats that are designed to evade exactly that kind of detection.
BIML's Predicted Evolution Mirrors What Starseer Is Building
The paper maps out a four-stage evolution of AI security tooling based on patterns from software security's maturation over the past three decades:
The Four Stages of AI Security Tooling
BIML's predicted evolution based on software security history
| Stage 1 | AI red teams conducting bespoke penetration testing focused on prompt injection. This is where most of the industry is today. |
| Stage 2 | Black box controls treating models as opaque systems to be monitored and constrained externally, the AI firewall approach. |
| Stage 3 | Agentic AI controls adapting authentication and authorization frameworks to autonomous AI agents. |
| Stage 4 | Whitebox analysis: getting inside the model to understand internal behavior and move toward notions of intention. Starseer operates here. |
BIML positions Starseer's interpretability work at stage four while acknowledging that the industry broadly is still working through stages one and two. This isn't about being ahead for its own sake. It's about recognizing that each stage builds on the one before it, and organizations that invest in internal observability now will be better positioned to make credible security assurance claims as the discipline matures.
At Starseer, our platform spans this entire evolution with four integrated products:
AI-Verify addresses the supply chain integrity problem head-on. Using interpretability fingerprinting, it verifies that deployed models match their approved baselines and detects structural tampering that checksums and benchmarks miss. This is BIML's "get inside the model" thesis, operationalized as a pre-deployment validation gate.
AI-DE brings detection engineering discipline to the AI attack surface with YARA-based detection authoring mapped to MITRE ATLAS. It provides the adversarial simulation and coverage gap analysis that BIML calls for when they argue the industry needs purpose-built assurance practices for AI, not repurposed software security tools.
AI-EDR treats every deployed model and agent as a security endpoint, providing the runtime behavioral monitoring, automated containment, and forensic analysis that BIML describes as the AI equivalent of intrusion detection and incident response.
AI-SIR classifies every prompt by intent, routes to the optimal model, and enforces policy at the gateway, with every event logged into AI-EDR for full security visibility. It addresses the operational reality that organizations running multiple AI providers need a unified control plane, not just guardrails.
The Path Forward: Process, Observability, and Rigor
BIML closes with three recommendations for practitioners, and we agree with all of them.
First, use benchmarks only for what they're actually good for: comparing models against each other on defined tasks. Stop treating scores as security ratings.
Second, invest in process. Identifying which assurance activities, applied to which ML artifacts, reliably produce more secure systems is foundational work that needs to happen now. This is where Starseer's advisory practice meets organizations where they are, helping security teams stand up AI security programs grounded in the same rigor that software security matured into over three decades.
Third, treat internal observability as a priority. The paper's central argument is that we can't build a meaningful security measurement framework for AI without first understanding what's happening inside these models. Interpretability isn't the security meter itself. It's the precondition for eventually getting to one.
Independent validation matters.
The fact that BIML's independent research, led by the person who literally wrote the book on software security, arrived at the same conclusion Starseer was founded on gives us confidence that looking inside the model will define the next era of AI security.
The agentic AI wave is accelerating. Models and agents are becoming endpoints in critical enterprise workflows, making decisions, taking actions, and operating with increasing autonomy. The organizations that treat AI security with the same structural rigor that software security eventually demanded will be the ones that deploy with confidence.
Read the full BIML paper: No Security Meter for AI (PDF)
Ready to Move Beyond the Badness-ometer?
Model validation, detection engineering, runtime protection, and intelligent routing, built on interpretability from the inside out.
Get in Touch →