Frontier AI Security Residency DOC · FASR-WHY

Why AI Security?

Model weights Infrastructure Hardware Cyber
38: attack vectors against frontier model weights catalogued by RAND, across nine categories

SL5: the security level that withstands top state attackers; no frontier lab meets it yet

<$200: cost to remove the safety training from an open-weight model (BadLlama)

8 weeks: fully funded, in Cambridge, on a real problem in cyber or hardware

Frontier AI is being built and deployed faster than it can be secured. Models are growing more capable month by month, and the infrastructure serving them expands with each data centre investment, presenting an ever-increasing attack surface.

Securing frontier AI is a question of whether the systems can be trusted: whether model weights stay secure, whether the hardware they run on stays intact and does what it claims, and whether any of it can be verified rather than taken on faith. These are ordinary expectations for critical infrastructure. For frontier AI, almost none of them are met today.

A frontier model costs hundreds of millions of dollars to train end-to-end, and the result is a few terabytes of weights: a digital artefact which can be copied, moved, stolen, or misused far more easily than the infrastructure that produced it. If these model weights leak, they cannot simply be patched, recalled, or rotated like ordinary software secrets. And yet, to serve consumers at scale, those weights have to run on networked machines, decrypted in memory, inside production systems built under intense pressure for speed and performance.

Through the AI Security Residency, we want to bring strong security talent to the hardest unsolved problems around frontier AI: protecting model weights, the infrastructure they run on, and the systems built on top of them. With a strong network across ERA and our technical partner orgs, we want to help incubate new start-ups or non-profits, facilitate interdisciplinary collaborations across hardware, cyber & policy, and help 10x the efforts to secure frontier AI.

AI security & verification is an emerging field, and progress within it is constrained above all by the number of talented people working on its problems. As frontier AI becomes a subject of international agreements, state actors will need credible means of confirming one another's activities, and much of that infrastructure has yet to be built.

Across both cyber and hardware focus areas, our residents will work closely with established mentors on problems of real urgency.

AI Capabilities Are Rapidly Expanding

Over the past few years, AI systems have made striking progress across a broad range of domains. Security is one of the domains where this shift is most clearly visible. Models can now find previously unknown, exploitable vulnerabilities in widely used software (Google's Big Sleep, which surfaced a live zero-day in SQLite), autonomously discover and patch real flaws across tens of millions of lines of code (the systems in DARPA's AI Cyber Challenge, won by Team Atlanta's ATLANTIS), and run most of an offensive operation end to end (GTG-1002, which Anthropic described as the first reported AI-orchestrated cyber-espionage campaign).

Independent evaluation is tracking the same curve: the UK AI Security Institute found Anthropic's restricted Claude Mythos Preview could discover and exploit vulnerabilities and carry out multi-stage network attacks on its own.

UK AISI, May 2026:

"The length of tasks frontier models can autonomously complete in our narrow cyber suite has been doubling every few months. This doubling rate has become faster over time, and recent models exceeded our previous trends."

As general-purpose AI systems improve, they become useful in more domains, and more autonomous within those domains. In particular, they become better at offensive cyber security and autonomous espionage. We need to significantly expand our security and verification efforts to meet this moment.

Security challenges for frontier AI

Confidentiality. A frontier model costs hundreds of millions of dollars to train end-to-end, and the result is a few terabytes of weights: a digital artefact which can be copied, moved, stolen, or misused far more easily than the infrastructure that produced it. Leaked weights cannot be patched, recalled, or rotated like ordinary software secrets. And yet, to be trained and then served to consumers at scale, those weights have to be read constantly: sharded across distributed training clusters, checkpointed to shared storage, and finally decrypted into GPU memory on networked inference servers, inside production systems built under intense pressure for latency and cost.

This makes confidentiality both unusually important and unusually hard to guarantee. The weights are at once the crown jewels and a working asset which must be used millions of times a day, so the surface we have to secure is vast: training and inference clusters, object storage and checkpoints, the firmware, drivers and hypervisors beneath them, the orchestration layer, and the hundreds of people with legitimate access to some point on that path. A defender has to close every route; an attacker needs one. And the cost of failure is unusually final: a leaked model cannot be recalled, and its safety fine-tuning can be stripped for the price of a weekend's compute, so a single exfiltration hands frontier capability, permanently, to whoever now holds the file.

Integrity. Keeping the weights confidential is only the first property, and on its own it is not enough. A system also needs integrity: assurance that the model actually executing is the one that was trained and evaluated, not a variant which has been swapped, tampered with, quietly fine-tuned, or degraded somewhere along a pipeline that stretches across many machines and many hands. A model whose weights never leak can still be the wrong model, and the operator may be the last to know.
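As a concrete illustration of the simplest form of this check, the sketch below records a SHA-256 digest for each weight shard at evaluation time and later reports any shard that is missing, unexpected, or changed. It is a minimal sketch rather than a description of any lab's pipeline: the `*.bin` shard layout and the function names are assumptions made for the example, and a file digest only shows that bytes on disk are unchanged, not that those bytes are what is actually loaded and executed.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a large shard never sits fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(shard_dir: Path) -> dict[str, str]:
    """Map each shard file name to its digest (hypothetical layout: *.bin files)."""
    return {p.name: sha256_file(p) for p in sorted(shard_dir.glob("*.bin"))}


def verify_manifest(shard_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return the sorted names of shards that are missing, unexpected, or changed."""
    current = build_manifest(shard_dir)
    problems = []
    for name in sorted(set(manifest) | set(current)):
        # .get() returns None for a name absent on one side, which also counts
        # as a mismatch.
        if manifest.get(name) != current.get(name):
            problems.append(name)
    return problems
```

The manifest is only as trustworthy as the place it is stored: if an attacker can rewrite both the shards and the manifest, the check passes, which is part of why integrity is harder in practice than this sketch suggests.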

Verifiability. The property that ties AI security together is verifiability: the ability for someone outside the system — an auditor, a regulator, a counterparty, another lab — to confirm what a model or a cluster is actually doing, rather than take the operator's word for it. What was trained, on which chips, running where, under what controls? Today these are answered mostly by trust, so the reach of every rule, commitment, and safety case depends on how much of it can be proven rather than promised.
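One narrow building block for "proven rather than promised" is a hash commitment that an outside party can spot-check. The sketch below is an illustration under stated assumptions, not a description of an existing verification scheme: an operator publishes a single Merkle root over its weight shards, and an auditor who is later shown one shard plus a short proof can confirm that the shard belongs to the committed set without seeing the others. The function names and the toy shard contents are hypothetical, and padding an odd level by duplicating its last node is a simplification that a production design would need to revisit.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(leaves: list[bytes]) -> list[list[bytes]]:
    """Build every level of the tree, from leaf hashes up to the single root."""
    if not leaves:
        raise ValueError("need at least one leaf")
    # Distinct prefixes keep leaf hashes and interior hashes from colliding.
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # odd count: pair the last node with itself
            level = level + [level[-1]]
        level = [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from the leaf up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # this node was paired with itself
        proof.append((level[sibling], index % 2 == 1))
        index //= 2
    return proof


def verify_proof(leaf: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path from one leaf and compare it with the published root."""
    node = _h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = _h(b"\x01" + pair)
    return node == root


# Toy usage: the operator commits to three shards, the auditor checks one.
shards = [b"shard-a", b"shard-b", b"shard-c"]
levels = merkle_levels(shards)
published_root = levels[-1][0]
assert verify_proof(shards[2], merkle_proof(levels, 2), published_root)
```

A commitment like this answers only "is this the artefact that was committed to?". The harder questions in the paragraph above (which chips, running where, under what controls) need evidence from the hardware and infrastructure themselves, which a hash alone cannot supply.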

Join the work

Applications opening soon. Register interest below to be notified when they do.