Opening the Hood: Detecting Architectural Backdoors in AI Models

Tim Schulz

May 21, 2026

At Cackalacky 2026, Starseer CEO Tim Schulz presented Opening the Hood: Detecting Architectural Backdoors in AI Models, a practitioner-level walkthrough of how AI models get compromised, why traditional security tools miss it, and what detection at each layer of the model stack actually looks like. The full slide deck is available as a free download at the end of this post.

This post summarizes the key ideas from the talk for teams who are deploying AI in production and want to understand what they're actually trusting when they download and run a model.

You are already running local models

The talk opens with a premise most security teams haven't fully internalized: you are already running AI models on your infrastructure, whether you chose to or not. Apple Intelligence ships a ~3B parameter model on every supported iPhone. Gemini Nano runs locally on Pixel keyboards and Samsung devices. Copilot autocomplete runs small completion models alongside your IDE. Your antivirus vendor ships classifier models as part of its EDR agent. None of these comes with a hash the user can verify against a vendor-published manifest.

For teams that are intentionally running local models, the supply chain picture is worse. Models don't ship from vendors. They ship from repositories. Hugging Face hosts over a million models, anyone can upload anything, and the "verified" badges added in 2024 mean less than most people assume. The model file you actually run was almost certainly not published by the original vendor. It was converted, quantized, and republished by a small number of community contributors, each of whom made decisions about how to transform the weights.

The gap between software security and AI security

Software has a mature security stack: SHA-256 for identity, SAST tools like Semgrep and CodeQL for static analysis, and sandboxes, fuzzers, and EDR for dynamic analysis. AI models have almost none of that. A hash of what, exactly? The file? The graph? The weights? Static analysis means "reading" 70 billion floating-point parameters. Dynamic analysis is limited to input/output filtering and, at the frontier, interpretability.
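
One way to make the "hash of what?" question concrete is to start with the narrowest answer, the file bytes. The sketch below streams each file through SHA-256 and compares it against a manifest. The manifest dictionary and both function names are hypothetical, since most model publishers do not ship a manifest in a standard form. A file hash only pins the container: re-serializing identical weights produces a different digest, and hashing one shard says nothing about the others.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large model files never sit fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_against_manifest(paths, manifest):
    """Return {file_name: 'ok' | 'mismatch' | 'unlisted'} for each path.

    `manifest` is a hypothetical {file_name: expected_sha256_hex} mapping.
    """
    results = {}
    for path in paths:
        name = Path(path).name
        expected = manifest.get(name)
        if expected is None:
            results[name] = "unlisted"   # a file the publisher never declared
        elif sha256_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Reporting "unlisted" separately from "mismatch" is a deliberate choice: an extra file that nobody declared is a different finding from a declared file that changed.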

This framing matters because it reveals where the industry's detection tooling actually is. We wrote about a similar structural gap in our analysis of BIML's latest research on AI security measurement. There is no security meter for AI. The tools that exist today are necessary but incomplete, and the gaps are precisely where the most dangerous threats hide.

Four layers, four attack surfaces

The core of the presentation is a four-layer model of the AI attack surface. Each layer has its own risks, and detection at each layer requires its own tools.

Layer 1: The Model File

The container itself: GGUF, safetensors, pickle, ONNX, Core ML. The first risk is the loader. PyTorch's torch.load() is built on pickle, which can execute arbitrary Python during deserialization unless loading is restricted with weights_only=True (the default since PyTorch 2.6). With an unrestricted loader, loading a model is running code. Beyond pickle RCE, risks include header smuggling (malformed fields, overlapping tensor offsets), tokenizer manipulation (custom merges that bias every prompt), hidden chat templates prepended before your input, extra files like modeling_xxx.py combined with trust_remote_code=True, and external data references in sharded formats where hashing one file misses the rest.
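
To illustrate why a pickle can be inspected without being loaded, here is a minimal sketch that walks the opcode stream with the standard library's pickletools and reports the opcodes that import names or call objects. The opcode set and the scan_pickle name are illustrative choices, not how any particular scanner works. Legitimate PyTorch checkpoints also use these opcodes to rebuild tensors, so real tools pair this kind of walk with an allowlist of expected imports.

```python
import pickle
import pickletools

# Opcodes that import a name or invoke a callable during unpickling.
SUSPICIOUS_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "INST", "OBJ",
    "NEWOBJ", "NEWOBJ_EX", "REDUCE",
}


def scan_pickle(data: bytes):
    """List (byte_offset, opcode_name, argument) for risky opcodes.

    genops() only parses the stream; nothing is unpickled or executed.
    """
    findings = []
    for opcode, arg, pos in pickletools.genops(data):
        if opcode.name in SUSPICIOUS_OPCODES:
            findings.append((pos, opcode.name, arg))
    return findings


class Payload:
    """Stand-in for a malicious object: unpickling it would call print()."""
    def __reduce__(self):
        return (print, ("code ran during load",))


if __name__ == "__main__":
    plain = pickle.dumps({"layer.weight": [0.1, 0.2]})
    rigged = pickle.dumps(Payload())   # dumps() serializes; it does not run the payload
    print("plain: ", scan_pickle(plain))
    print("rigged:", scan_pickle(rigged))
```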

Layer 2: The Computational Graph

The wiring diagram that determines what operations execute, in what order, on what tensors. This is where silent changes hide best. Quantization rewrites the graph. ONNX optimizers fuse, split, and replace operations automatically. A tiny adapter layer inserted near the output can re-route specific token patterns, invisible to eval suites that don't hit the trigger. These graph-level backdoors can persist through fine-tuning, because fine-tuning updates weights but leaves the graph's structure in place, which makes them particularly durable.
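
A rough sketch of the simplest graph-level check, an op-count diff against a reference model, is shown below. It assumes each graph has already been reduced to a list of operator type names; in practice that list might come from something like the op_type of each node in an ONNX graph, but the extraction step is left out here and the sample lists are made up.

```python
from collections import Counter


def op_count_diff(reference_ops, candidate_ops):
    """Return {op_type: candidate_count - reference_count} for ops whose count changed."""
    ref = Counter(reference_ops)
    cand = Counter(candidate_ops)
    return {
        op: cand[op] - ref[op]
        for op in sorted(set(ref) | set(cand))
        if cand[op] != ref[op]
    }


if __name__ == "__main__":
    # Hypothetical operator lists for a reference graph and a republished copy.
    reference = ["MatMul", "Add", "Softmax", "MatMul"]
    candidate = ["MatMul", "Add", "Softmax", "MatMul", "MatMul", "Add", "Where"]
    print(op_count_diff(reference, candidate))
```

A non-empty diff is not proof of tampering, since quantizers and optimizers legitimately change op counts. It is a prompt to ask which tool produced each change.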

Layer 3: The Weights

Billions of floating-point parameters that encode what the model learned. Fine-tuning is a cheap way to backdoor a model: a few hundred examples of trigger-to-bad-output, a LoRA fine-tune for under $50 of compute, and the model's behavior on every other prompt remains effectively unchanged. It passes its existing eval suite with no regressions. Then it ships as a "small accuracy improvement." This is the class of backdoor studied in Hubinger et al.'s "Sleeper Agents" research.
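
If the base checkpoint is available, one inexpensive check is to see where a republished model's weights differ from it. The sketch below is a toy version under strong assumptions: each checkpoint is a plain dict mapping tensor names to flat lists of floats, and the function names and threshold are hypothetical. It shows which tensors a fine-tune touched, not whether the change is malicious.

```python
import math


def relative_l2_change(base, candidate):
    """L2 norm of (candidate - base), relative to the L2 norm of base."""
    diff = math.sqrt(sum((c - b) ** 2 for b, c in zip(base, candidate)))
    norm = math.sqrt(sum(b * b for b in base))
    if norm == 0.0:
        return 0.0 if diff == 0.0 else math.inf
    return diff / norm


def changed_tensors(base_ckpt, cand_ckpt, threshold=1e-3):
    """Report tensors that were added, removed, resized, or moved past the threshold."""
    report = {}
    for name in sorted(set(base_ckpt) | set(cand_ckpt)):
        if name not in base_ckpt:
            report[name] = "added"
        elif name not in cand_ckpt:
            report[name] = "removed"
        elif len(base_ckpt[name]) != len(cand_ckpt[name]):
            report[name] = "size changed"
        else:
            change = relative_l2_change(base_ckpt[name], cand_ckpt[name])
            if change > threshold:
                report[name] = change
    return report
```

A claimed "small accuracy improvement" that modifies only a handful of projection matrices, or adds a tensor the base model never had, is worth a closer look.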

Layer 4: Inference

The model running and generating tokens. The sleeper agent problem surfaces here: a model trained to behave normally under standard conditions but produce harmful output when a specific trigger condition is met, such as a particular date stated in the prompt. In the Hubinger study, standard safety training (supervised fine-tuning, reinforcement learning, and adversarial training) didn't remove the backdoor. Adversarial training actually taught the model to recognize its trigger more precisely, which made the backdoor better at hiding. Output filters never see the trigger condition because the trigger isn't in the output. Activation-level detection can see it, because the internal representations shift measurably when the trigger fires, even when the output looks clean.
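
To make "the internal representations shift measurably" concrete, here is a toy difference-of-means probe. It assumes you can already extract one activation vector per prompt from some layer, and that you have a few examples labeled clean versus triggered. Both are significant assumptions, and the two-dimensional vectors in the usage section are invented for illustration. It is a sketch of the general idea, not the detection method from the talk.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_probe(clean_acts, triggered_acts):
    """Fit a direction (triggered mean minus clean mean) and a midpoint threshold."""
    mu_clean = mean_vector(clean_acts)
    mu_trig = mean_vector(triggered_acts)
    direction = [t - c for c, t in zip(mu_clean, mu_trig)]
    midpoint = [(t + c) / 2 for c, t in zip(mu_clean, mu_trig)]
    return direction, dot(direction, midpoint)


def looks_triggered(activation, direction, threshold):
    """True if the activation projects past the midpoint along the probe direction."""
    return dot(activation, direction) > threshold


if __name__ == "__main__":
    # Invented activations: the first coordinate shifts when the trigger fires.
    clean = [[0.1, 1.0], [-0.1, 0.9], [0.0, 1.1]]
    triggered = [[2.0, 1.0], [2.2, 0.9], [1.9, 1.1]]
    direction, threshold = fit_probe(clean, triggered)
    print(looks_triggered([2.1, 1.0], direction, threshold))
    print(looks_triggered([0.05, 1.0], direction, threshold))
```

The check never looks at generated text, which is why it can fire even when the output looks clean.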

What you can detect, and what you might not

The presentation closes with an honest map of detection capabilities at each layer. SHA mismatches, pickle calls, malformed headers, and unexpected files are detectable at the file layer. Op-count diffs and structural comparison against a reference model work at the graph layer. Weight-norm outliers and distribution shifts catch some parameter-level tampering. Output-level classifiers, perplexity spikes, and regression suites cover parts of the inference layer.
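
As one example of the "malformed headers" check at the file layer, the sketch below parses a safetensors header and looks for tensors whose byte ranges fall outside the data section or overlap each other. It relies on the published safetensors layout (an 8-byte little-endian header length, a JSON header, then the raw tensor bytes). The function names are hypothetical, and taking the whole file as bytes is a simplification; a real check would read only the header and use the file size.

```python
import json
import struct


def parse_safetensors_header(blob: bytes):
    """Return (header_dict, data_section_length) without touching tensor data."""
    if len(blob) < 8:
        raise ValueError("file too short to contain a header length")
    (header_len,) = struct.unpack("<Q", blob[:8])
    if 8 + header_len > len(blob):
        raise ValueError("declared header length exceeds file size")
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, len(blob) - 8 - header_len


def find_offset_problems(header, data_len):
    """List out-of-bounds and overlapping tensor byte ranges."""
    problems, spans = [], []
    for name, info in header.items():
        if name == "__metadata__":      # free-form metadata, not a tensor
            continue
        begin, end = info["data_offsets"]
        if not (0 <= begin <= end <= data_len):
            problems.append(f"{name}: offsets out of bounds")
        spans.append((begin, end, name))
    spans.sort()
    # After sorting by start offset, any overlap shows up between neighbours.
    for (_, end1, name1), (begin2, _, name2) in zip(spans, spans[1:]):
        if begin2 < end1:
            problems.append(f"{name1} overlaps {name2}")
    return problems
```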

What's harder to catch: models re-quantized with adversarial calibration, embedded chat templates that subtly shift behavior, LoRA-style trigger backdoors that are perplexity-neutral, and sleeper agents whose behavior only changes under conditions no eval suite tests for. These are the threats that require looking inside the model, not just observing its outputs.

Practical starting points

The talk includes several concrete recommendations for teams running models today:

Check the AI-BOM. Know the model name, base checkpoint, publisher, file hash, format, and quantization method. The OWASP AI-BOM Generator is a good starting point.

Learn to read model names. "Meta-Llama-3.1-8B-Instruct-Q4_K_M-GGUF" encodes family, version, size, post-training method, quantization recipe, and file format. Each segment carries security implications.

Know your republishers. The model you run was probably converted by a small number of community contributors. Understand who they are and what decisions they made.

Use inspection tools. Picklescan and fickling for pickle-based risks. ModelScan from Protect AI for broader file-level scanning. Netron for graph visualization. These won't catch everything, but they raise the floor.

Run a model yourself. If you've never downloaded and inspected a model file, the talk makes the case that tonight is the night: brew install ollama followed by ollama run llama3.2. Understanding starts with hands-on experience.
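
For the "learn to read model names" recommendation above, a small parser makes the segments explicit. The pattern below is a heuristic written for names shaped like the example in this post. Model naming is a community convention rather than a specification, so the regular expression, the field names, and the accepted values are all assumptions, and many real names will not match.

```python
import re

# Heuristic pattern for names like "Meta-Llama-3.1-8B-Instruct-Q4_K_M-GGUF".
NAME_PATTERN = re.compile(
    r"^(?P<family>.+?)"
    r"-(?P<version>\d+(?:\.\d+)*)"
    r"-(?P<size>\d+(?:\.\d+)?[BM])"
    r"(?:-(?P<post_training>Instruct|Chat|Base))?"
    r"(?:-(?P<quantization>Q\d\w*|F16|BF16|F32))?"
    r"(?:-(?P<file_format>GGUF|ONNX))?$"
)


def parse_model_name(name):
    """Split a model name into labeled segments, or return None if it doesn't fit."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for key, value in parse_model_name("Meta-Llama-3.1-8B-Instruct-Q4_K_M-GGUF").items():
        print(f"{key:>14}: {value}")
```

A name that fails to parse, or that parses with no quantization or format segment, is a cue to go find that information somewhere more trustworthy than the file name.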

The bottom line

You can't evaluate your way out of a backdoor you don't know exists. The AI model supply chain has structural gaps that traditional software security tooling was never designed to address. Closing those gaps requires detection at every layer of the stack, from the file container through the computational graph and weight space to inference-time activation analysis. The threats are real, the tooling is maturing, and the first step is understanding what you're actually running.

Get the full presentation

The complete 30-slide deck covers the AI model supply chain from Hugging Face to your hardware, all four layers of the attack surface with specific risk examples, the sleeper agent mechanism from Hubinger et al., and practical tools and frameworks for detection. No email required.

Download the slides (PDF)

This is the problem AI-Verify was built to address. If your team is deploying models and needs to understand what's inside them before they ship, learn how model validation works or request a demo.