For years, people have argued that we should wait to act on AI until it starts demonstrating “dangerous capabilities.” Those capabilities may be arriving now.
Virology knowledge has been limited to a small number of experts. Expertise in dual-use fields like virology is difficult to attain: people complete multiple degrees and dedicate their careers to reaching the forefront of research. Where knowledge is publicly available, the jargon-heavy literature is largely indecipherable to anyone outside the field. To perform research involving biosafety level 3 (BSL-3) pathogens such as SARS, anthrax, or H5N1 influenza, researchers must clear a series of approvals, including facility certification, security clearances, specialized training, and ongoing medical surveillance. Only then can they gain access to these pathogens and begin acquiring the tacit skills needed to work with them.
These high barriers to entry have limited the pool of people with access to powerful dual-use knowledge, keeping the chances of misuse low. But rapid developments in publicly available AI systems now risk turning amateurs into capable threat actors.
LLMs outperform human virologists in their areas of expertise on a new benchmark. This week the Center for AI Safety published a report with SecureBio that details a new benchmark for virology capabilities in publicly available frontier models. Alarmingly, the research suggests that several advanced LLMs now outperform most human virology experts in troubleshooting practical work in wet labs.
The benchmark questions focus on practical lab work. Co-developed with 57 expert virologists, the benchmark, called the Virology Capabilities Test (VCT), comprises 322 multiple-choice questions, many of which contain both text and image components. These questions go beyond abstract theory—they mirror the real-world challenges virologists face when troubleshooting experiments at the lab bench.
Many of the questions were inspired by niche but practical issues that the contributing experts had faced in their own experiments. As illustrated by the multimodal example below, the questions were designed not to be easy to Google.
VCT performance reflects dual-use capability. Given the practical nature and specificity of these questions, we consider performance on the VCT to be a reasonable proxy for genuinely useful virological ability. That ability could be employed just as effectively by those pursuing harm as by researchers performing legitimate experiments for beneficial purposes.
Some LLMs scored close to or above the 90th percentile of human experts. Experts were given VCT questions relevant to their specific sub-fields, which they had not been involved in writing or reviewing; we then tested several widely available LLMs on the same questions. The results were striking. GPT-4o performed better than 53 percent of experts and Gemini 1.5 Pro better than 67 percent of them. Claude 3.5 Sonnet scored in the 75th percentile, o1 in the 89th, and o3 in the 95th.
LLMs are increasingly competent in STEM subjects. While concerning, the VCT results should not be surprising. Frontier models have steadily become more proficient in every STEM subject they are tested on, from mathematics and physics to the biological sciences. The biological sciences, in particular, have proven to be among the easiest areas for LLMs to improve in.
Virology is part of STEM; if models are reaching PhD-level competence in other STEM subjects, we should expect the same to be true in virology. The concern is that virology expertise is dual-use. A PhD-level virologist can apply their expertise towards creating bioweapons.
Across multiple biology benchmarks, LLMs are performing near expert level or higher. The VCT results do not arrive in a vacuum, but as another data point in a growing field of benchmarks. For instance, on the Weapons of Mass Destruction Proxy (WMDP), which tests conceptual knowledge required for hazardous uses including bioweapons development, o1 scores around 87 percent, against a human expert baseline of 60 percent. Since WMDP concentrates on theory, questions could still be raised about the practical applicability of LLMs that score highly on it. The VCT, with its complementary focus on troubleshooting in the wet lab, helps put those doubts to rest.
This is further backed up by recent LLMs’ performance on ProtocolQA and BioLP-bench, two other benchmarks that, while not specifically tailored to virology, also assess models’ ability to reason about and troubleshoot common protocols in biology labs. (Note: Dan advised WMDP and BioLP-bench.) On ProtocolQA, o3 recently scored 78 percent, compared with an average score among human experts of 79 percent. On the more difficult BioLP-bench, o4-mini has achieved 34 percent, not far off the average expert mark of 38 percent. Meanwhile, company evaluations are beginning to set their thresholds for concern so high that even actual virology experts, whose skills are sufficient to cause significant harm, might be incorrectly deemed “not concerning enough” because they score below today’s leading models.
All these results point to the same conclusion: widely available AI systems can make it easier for anyone doing work with viruses—including those trying to do harm—to overcome practical issues they might encounter.
Publicly available virologist LLMs increase biorisk by orders of magnitude. Bioweapon risk depends on several factors: the number of people with access to bioweapon skills, their intent to create a bioweapon, and the severity of harm a bioweapon could cause. Risk has so far been low: only a few hundred virologists have come through top virology programs, and they have shown little inclination to cause a pandemic. If those skills become available to hundreds of millions of people via LLMs, however, the probability of an intentional release grows by orders of magnitude, as a simple back-of-the-envelope calculation shows.
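To make the scaling intuition concrete, here is a toy expected-risk sketch. The numbers are hypothetical placeholders, not estimates from the report; the point is only how the arithmetic changes when the accessible pool grows from hundreds of experts to hundreds of millions of users.

```python
# Toy expected-risk model with illustrative placeholder numbers only.
def expected_attacks(pool_size: int, p_intent: float, p_success: float) -> float:
    """Expected number of successful attacks, assuming independence across individuals."""
    return pool_size * p_intent * p_success

# Status quo: expertise confined to a small pool of vetted professionals.
baseline = expected_attacks(pool_size=300, p_intent=1e-4, p_success=0.5)

# Hypothetical: expert-level troubleshooting available to anyone with a chatbot account.
with_llms = expected_attacks(pool_size=100_000_000, p_intent=1e-6, p_success=0.1)

print(f"baseline: {baseline:.3f}  with LLM access: {with_llms:.1f}")
# baseline: 0.015  with LLM access: 10.0
# Even with far lower per-person rates of intent and success, the vastly larger
# accessible pool pushes expected risk up by orders of magnitude.
```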
The primary lever for reducing biorisk is reducing accessibility. To reduce the likelihood of a malicious actor successfully creating a pathogen, we want to increase the friction they will face in attempting to do so. The main way to do this is by targeting access to bioweapon skills; we want to limit the number of people with access to expert-level virology capabilities to those with a genuine need for them.
Two-tier access for potentially catastrophic dual-use capabilities. A first step to increase friction for malicious users is to put default filters on models, preventing them from sharing detailed instructions for how to manipulate pathogens on the US select agents list, for example. If someone has just made an account to use an AI system, and the provider does not know who they are, they should not have access to these capabilities. Providers could have a vetting process whereby legitimate researchers contact sales and demonstrate that they are doing beneficial work in order to get the filters removed. This is a simple way to mitigate the risks and capture the benefits.
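As a minimal sketch of what such a two-tier gate might look like in an inference pipeline: the registry, classifier, and function names below are hypothetical, not an existing provider API, and a real deployment would use trained classifiers rather than keyword matching.

```python
# Hypothetical two-tier gate: unvetted accounts receive a refusal for restricted
# dual-use virology requests, while vetted researchers pass through to the model.

VETTED_USERS = {"lab-researcher-042"}  # placeholder registry filled via a provider vetting process

def is_restricted_virology_request(prompt: str) -> bool:
    """Stand-in for a trained classifier that flags detailed pathogen-manipulation requests."""
    flagged_phrases = ["enhance transmissibility", "select agent protocol"]  # illustrative only
    return any(phrase in prompt.lower() for phrase in flagged_phrases)

def handle_request(user_id: str, prompt: str, model_call) -> str:
    if is_restricted_virology_request(prompt) and user_id not in VETTED_USERS:
        return ("This request involves restricted dual-use content. "
                "Verified researchers can apply to have the filter lifted.")
    return model_call(prompt)
```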
Cloud labs should have measures to detect pandemic-potential pathogens. Another way to increase friction is to target cloud labs, facilities that perform experiments at the request of remote users. These labs widen access to bioweapon capabilities. Introducing measures such as screening submitted orders for known hazardous genomic sequences can increase friction for those seeking to do harm.
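A highly simplified sketch of what order screening could look like, assuming a curated hazard list is available; real biosecurity screening relies on access-controlled databases, fuzzier matching, and expert review of any hits, and everything named below is a placeholder.

```python
# Toy order-screening check for a cloud lab: flag submitted DNA sequences that
# share long exact substrings with entries on a curated hazard list.

HAZARD_DB: dict[str, str] = {}  # entry name -> reference sequence (placeholder, left empty here)

def shares_long_substring(order_seq: str, hazard_seq: str, k: int = 50) -> bool:
    """Return True if the order shares any k-length window with the hazard sequence."""
    windows = {order_seq[i:i + k] for i in range(len(order_seq) - k + 1)}
    return any(hazard_seq[i:i + k] in windows for i in range(len(hazard_seq) - k + 1))

def screen_order(order_seq: str) -> list[str]:
    """Return hazard-list entries the order matches, to be escalated for human review."""
    return [name for name, seq in HAZARD_DB.items() if shares_long_substring(order_seq, seq)]
```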
AI agents should not be created with expert-level virology capabilities. Future AI agent virologists could make it even easier to make a bioweapon: agents could perform steps on behalf of a malicious user, rather than giving instructions that the user must interpret and carry out. To avoid this scenario, which could greatly reduce friction, developers should not make expert-level AI virology agents easily available to unvetted individuals.
More advanced AI virologists can help create more severe bioweapons. While the primary way of reducing biorisk is to limit access, we should bear in mind that risk also increases with severity. The deadlier a pathogen, the greater the threat it poses. As AI virologists continue to improve beyond expert-level proficiency, they will be better able to design viruses with higher mortality rates, for example. To mitigate risks, we should therefore be placing tougher restrictions on models as their capabilities increase.
We are not at the stage where a freely available LLM could autonomously create and release a lethal virus at the behest of a user with no expertise at all. But we do not need to reach that level of capability to reduce the biorisks that AI chatbot virologists are already amplifying today.
Do not bottleneck safety work on evals. Numerous organizations have put huge efforts into evaluating dangerous capabilities in AI models. The rationale was that discovering AI-heightened risk in biotechnology or cybersecurity would be the canary in the coal mine, prompting action on safety. We have now demonstrated that LLMs amplify biorisk. The problem is that there are no adequate mitigations ready to implement. Work on safety measures should not wait until the threat is proven, because by then it is too late to prepare.
Developing and implementing safeguards can take months. Companies cannot safeguard dual-use capabilities overnight. To build robust safeguards, developers must first define a standard for the kinds of questions models should refuse to answer. They then need to create, with the help of experts, a refusal dataset used to train the model and to build input and output filters. Next, even more data must be created to test and refine these filters, which then need to be rolled out across all consumer and enterprise products. The whole process can take months.
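A minimal sketch of how the resulting input and output filters might wrap a model once the refusal standard and dataset exist; the classifiers are assumed to be supplied by the developer, and all names here are hypothetical.

```python
# Minimal input/output filtering wrapper around a model call.

REFUSAL = "I can't help with that request."

def guarded_call(prompt: str, model_call, input_filter, output_filter) -> str:
    # Input filter: refuse prompts the refusal standard says should not be answered.
    if input_filter(prompt):
        return REFUSAL
    response = model_call(prompt)
    # Output filter: catch hazardous content that slipped past the input check.
    if output_filter(prompt, response):
        return REFUSAL
    return response
```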
Any window of time in which dual-use capabilities are publicly available is a window in which hundreds of millions of people can leverage the technology. Some may well use it to pursue harm. That is why companies should develop reliable safeguards for current models and build safeguards into future models before evaluation results cross thresholds of concern.
Friction buys time to prepare for potential outbreaks. Long-term biodefenses, such as developing rapid-response protocols and stockpiling personal protective equipment, are achievable but will require time. Restricting access to virology capabilities reduces the probability of a successful attack, giving governments more time to build defenses. Time also allows for GDP growth, which is likely to accelerate through the widespread use of AI, increasing the wealth available to pay for defenses.
Work on mitigations for other potential future risks must start today. It is a sobering thought that the risks associated with existing, publicly available models were not discovered until now. Looking ahead, we will in all likelihood find that AIs have expert-level competence in dual-use cyber capabilities soon, given the efforts going into developing them. Learning from biorisk, companies should start working on cyber mitigations now, so that we do not have to play catch-up when heightened cyber risk is demonstrated, as we must for biorisk today.
To evoke a common metaphor, we are the proverbial frog in water. As the temperature rises, each additional degree might seem like a small increase, but at some point, we run the risk of boiling. The water is already steaming—the time to jump is now.
Read the full report from the Center for AI Safety and SecureBio here.