OpenAI’s GPT-5 launched in early August, after extensive internal testing. But another OpenAI model — one with math skills advanced enough to achieve “gold medal-level performance” on the world’s most prestigious math competition — will not be released for months. This isn’t unusual. Increasingly, AI systems with capabilities considerably ahead of what the public can access remain hidden inside corporate labs.
This hidden frontier represents America’s greatest technological advantage — and a serious, overlooked vulnerability. These internal models are the first to develop dual-use capabilities in areas like cyberoffense and bioweapon design. And they’re increasingly capable of performing the type of research-and-development tasks that go into building the next generation of AI systems — creating a recursive loop where any security failure could cascade through subsequent generations of technology. They’re the crown jewels that adversaries desperately want to steal. This makes their protection vital. Yet the dangers they may pose are invisible to the public, policymakers, and third-party auditors.
While policymakers debate chatbots, deepfakes, and other more visible concerns, the real frontier of AI is unfolding behind closed doors. Therefore, a central pillar of responsible AI strategy must be to enhance transparency into and oversight of these potent, privately held systems while still protecting them from rival AI companies, hackers, and America’s geopolitical adversaries.
Each of the models that power the major AI systems you've heard of — ChatGPT, Claude, Gemini — spends months as an internal model before public release. During this period, these systems undergo safety testing, capability evaluation, and refinement. To be clear, this is good!
Keeping frontier models under wraps has advantages. Companies keep models internal for compelling reasons beyond safety testing. As AI systems become capable of performing the work of software engineers and researchers, there’s a powerful incentive to deploy them internally rather than selling access. Why give competitors the same tools that could accelerate your own research? Google already generates over 25% of its new code with AI, and engineers are encouraged to use ‘Gemini for Google,’ an internal-only coding assistant trained on proprietary data.
This trend will only intensify. As AI systems approach human-level performance at technical tasks, the competitive advantage of keeping them internal grows. A company with exclusive access to an AI system that can meaningfully accelerate research and development has every reason to guard that advantage jealously.
But as AI capabilities accelerate, the gap between internal and public capabilities could widen, and some important systems may never be publicly released. In particular, the most capable AI systems (the ones that will shape our economy, our security, and our future) could become increasingly invisible both to the public and to policymakers.
The hidden frontier faces two fundamental threats that could undermine American technological leadership: 1) theft and 2) untrustworthiness — whether due to sabotage or inherent unreliability.
Internal AI models can be stolen. Advanced AI systems are tempting targets for foreign adversaries. Both China and Russia have explicitly identified AI as critical to their national competitiveness. With training runs for frontier models approaching $1 billion in cost and requiring hardware that export controls aim to keep out of our adversaries’ hands, stealing a ready-made American model could be far more attractive than building one from scratch.
Importantly, to upgrade from being a fast follower to being at the bleeding edge of AI, adversaries would need to steal the internal models hot off the GPU racks, rather than wait months for a model to be publicly released and only then exfiltrate it.
The vulnerability is real. A 2024 RAND framework established five “security levels” (SL1 through SL5) for frontier AI programs, with SL1 being sufficient to deter hobby hackers and SL5 secure against the world’s most elite attackers, incorporating measures comparable to those protecting nuclear weapons. It’s impossible to say exactly at which security level each of today’s frontier AI companies is operating, but Google’s recent model card for Gemini 2.5 states it has “been aligned with RAND SL2.”
The threat of a breach isn’t hypothetical. In 2023, a hacker with no known ties to a foreign government penetrated OpenAI’s internal communications and obtained information about how the company’s researchers design their models. There’s also the risk of internal slip-ups. In January 2025, security researchers discovered that one of DeepSeek’s databases had been left publicly accessible; then, in July, a Department of Government Efficiency (DOGE) staffer accidentally leaked access to at least 52 of xAI’s internal LLMs.
The consequences of successful theft extend far beyond the immediate loss of the company’s competitive advantage. If China steals an AI system capable of automating research and development, the country’s superior energy infrastructure and willingness to build at scale could flip the global balance of technological power in its favor.
Untrustworthy AI models bring additional threats. The second set of threats comes from the models themselves: they may engage in harmful behaviors due to external sabotage or inherent unreliability.
Saboteurs would gain access to the AI model in the same way as prospective thieves would, but they would have different goals. Such saboteurs would target internal models during their development and testing phase — when they’re frequently updated and modified — and use malicious code, prompting, or other techniques to force the model to break its safety guardrails.
In 2024, researchers demonstrated that it was possible to create “sleeper agent” models that pass all safety tests but misbehave when triggered by specific conditions. In a 2023 study, researchers found that it was possible to manipulate an instruction-tuned model’s output by inserting as few as 100 “poisoned examples” into its training dataset. If adversaries were to compromise the AI systems used to train future generations of AIs, the corruption could cascade through every subsequent model.
But saboteurs aren’t necessary to create untrustworthy AI. The same reinforcement learning techniques that have produced breakthrough language and reasoning capabilities also frequently trigger concerning behaviors. OpenAI’s o1 system exploited bugs in ways its creators never anticipated. Anthropic’s Claude has been found to “reward hack,” technically completing assigned tasks while subverting their intent. Testing 16 leading AI models, Anthropic also found that all of them engaged in deception and even blackmail when those behaviors helped achieve their goals.
A compromised internal AI poses threats to the external world. Whether caused by sabotage or emergent misbehavior, untrustworthy AI systems pose unique risks when deployed internally. These systems increasingly have access to company codebases and training infrastructure; they can also influence the next generation of models. A compromised or misaligned system could hijack company resources for unauthorized purposes, copy itself to external servers, or corrupt its successors with subtle biases that compound over time.
AI is increasingly aiding in AI R&D. Every trend described above is accelerating because of one development: AI systems are beginning to automate AI research itself. This compounds the threat of a single security failure cascading through generations of AI systems.
Increasingly automated AI R&D isn’t speculation about distant futures; it’s a realistic forecast for the next few years. According to METR, GPT-5 has about a 50% chance of autonomously completing software engineering tasks that would take a skilled human around two hours — and across models, the length of tasks AI systems can handle at this level has been doubling roughly every seven months. Leading labs and researchers are actively exploring ways for AI systems to meaningfully contribute to model development, from generating training data to designing reward models and improving training efficiency. Together, these and other techniques could soon enable AI systems to autonomously handle a substantial portion of AI research and development.
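To make the doubling arithmetic concrete, here is a minimal sketch that extrapolates the trend quoted above. It assumes a starting horizon of about two hours and a doubling time of about seven months, both taken from the figures in this article, and it assumes the trend simply continues unchanged, which is not guaranteed. The function names are hypothetical and chosen only for illustration.

```python
import math

# Figures quoted in the article; treat them as rough estimates.
START_HOURS = 2.0        # task length handled at ~50% success today
DOUBLING_MONTHS = 7.0    # approximate doubling time of that task length


def projected_horizon_hours(months_ahead, start_hours=START_HOURS,
                            doubling_months=DOUBLING_MONTHS):
    """Task length (hours) at ~50% success after `months_ahead` months,
    ASSUMING the exponential trend continues unchanged."""
    return start_hours * 2 ** (months_ahead / doubling_months)


def months_until(target_hours, start_hours=START_HOURS,
                 doubling_months=DOUBLING_MONTHS):
    """Months needed for the horizon to reach `target_hours`,
    under the same constant-doubling assumption."""
    return doubling_months * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    for months in (0, 7, 14, 28):
        hours = projected_horizon_hours(months)
        print(f"+{months:>2} months: ~{hours:.0f} hours")
    # A 40-hour task stands in for a full human work week.
    print(f"40-hour tasks after ~{months_until(40):.0f} months")
```

Under these assumptions, a two-hour horizon would reach a full 40-hour work week in roughly two and a half years. The projection is only as good as the assumption that the doubling rate holds.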
Self-improving AI could amplify risks from theft and sabotage. This automation creates a powerful feedback loop that amplifies every risk associated with frontier AI systems. For one, it makes internal models vastly more valuable to thieves: imagine the advantage of possessing an untiring AI researcher that works around the clock at superhuman speed and with the equivalent of millennia of work experience. Likewise, internal models become more attractive targets for sabotage. Corrupting a system that trains future AIs could introduce vulnerabilities that persist across future AI model generations, which would allow competitors to pull ahead. And these systems are more dangerous if misaligned: an AI system that can improve itself might also be able to preserve its flaws or hide them from human overseers.
Crucially, this dynamic intensifies the incentive for companies to keep models internal. Why release an automated AI research system that could help competitors catch up? The result is that the most capable systems — the ones that pose the biggest risks to society — are the most difficult to monitor and secure.
One might hope that market mechanisms would be sufficient to mitigate these risks. No company wants its models to reward hack or to be stolen by competitors. But the AI industry faces multiple market failures that prevent adequate security investment.
Security is expensive and imposes opportunity costs. First, implementing SL5 protections would be prohibitively expensive for any single company. The costs aren’t just up-front expenditures. Stringent security measures (like maintaining completely isolated, air-gapped networks) could slow development and make it harder to attract top talent accustomed to Silicon Valley’s open culture. Companies that “move fast and break things” might reach transformative capabilities first, even if their security is weaker.
Security falls prey to the tragedy of the commons. Second, some security work, such as fixing bugs in commonly used open-source Python libraries, benefits the whole industry, not just one AI company. This creates a “tragedy of the commons” problem, where companies would prefer to focus on racing to develop AI capabilities themselves, while benefiting from security improvements made by others. As competition intensifies, the incentive to free-ride increases, leading to systematic under-investment in security that leaves the whole industry at greater risk.
Good security takes time. Finally, by the time market forces prompt companies to invest in security — such as following a breach, regulatory shock, or reputational crisis — the window for action may already be closed. Good security can’t be bought overnight; instead, it must be painstakingly built from the ground up, ensuring every hardware component and software vendor in the tech stack meets rigorous requirements. Each additional month of delay makes it harder to achieve adequate security to protect advanced AI capabilities.
Congress has framed AI as critical to national security. Likewise, the AI Action Plan rightly stresses the importance of security to American AI leadership. There are several lightweight steps that the government can take to better address the security challenges posed by the hidden frontier. By treating security as a prerequisite for — rather than an obstacle to — innovation, the government can further its goal of “winning the AI race.”
Improve government understanding of the hidden frontier. At present, policymakers are flying blind, unable to track the AI capabilities emerging within private companies or verify the security measures protecting them from being stolen or sabotaged. The US government must require additional transparency from frontier companies about their most capable internal AI systems, internal deployment practices, and security plans. This need not be a significant imposition on industry; at least one leading company has called for mandatory disclosures. Additional insight could come from expanding the voluntary evaluations performed by the Center for AI Standards and Innovation (CAISI). CAISI currently works with companies to evaluate frontier models for various national security risks before deployment. These evaluations could be expanded to earlier stages of the development lifecycle, where there might still be dangers lurking in the hidden frontier.
Share expertise to secure the hidden frontier. No private company can match the government’s expertise in defending against nation-state actors. Programs like the Department of Energy’s CRISP initiative already share threat intelligence with critical infrastructure operators. The AI industry needs similar support, with the AI Action Plan calling for “sharing of known AI vulnerabilities from within Federal agencies to the private sector.” Such support could include real-time threat intelligence about adversary tactics, red-team exercises simulating state-level attacks, and assistance in implementing SL5 protections. For companies developing models with national security implications, requiring security clearances for key personnel might also be appropriate.
Leverage the hidden frontier to boost security. The period between when new capabilities emerge internally and when they’re released publicly also provides an opportunity. This time could be used as an “adaptation buffer,” allowing society to prepare for any new risks and opportunities. For example, cybersecurity firms could use cutting-edge models to identify and patch vulnerabilities before attackers can use public models to exploit them. AI companies could provide access to cyber defenders without any government involvement, but the government might have a role to play in facilitating and incentivizing this access.
The nuclear industry offers a cautionary tale. Throughout the 1960s and ’70s, the number of nuclear power plants around the globe grew steadily. However, in 1979, a partial meltdown at Three Mile Island spewed radioactive material into the surrounding environment — and helped spread antinuclear sentiment around the globe. The Chernobyl accident, seven years later, exacerbated the public backlash, leading to regulations so stringent that construction on new US nuclear power plants stalled until 2013. An AI-related incident — such as an AI system helping a terrorist develop a bioweapon — could inflame the public and lead to similarly crippling regulations.
In order to preempt this backlash, the US needs adaptive standards that scale with AI capabilities. Basic models would need minimal oversight, while systems whose capabilities approach human-level performance at sensitive tasks would require proportionally stronger safeguards. The key is to establish these frameworks now, before a crisis forces reactive overregulation.
Internal models would not be exempt from these frameworks. After all, biological labs dealing with dangerous pathogens are not given a free pass just because they aren’t marketing a product to the public. Likewise, for AI developers, government oversight is appropriate when risks arise, even at the internal development and testing stage.
The models developing in the hidden frontier today will shape tomorrow's economy, security, and technology. These systems — invisible to public scrutiny yet powerful enough to automate research, accelerate cyberattacks, or even improve themselves — represent both America's greatest technological advantage and a serious vulnerability. If we fail to secure this hidden frontier against theft or sabotage by adversaries, or the models' own emergent misbehavior, we risk not just losing the AI race but watching our own innovations become the instruments of our technological defeat. We must secure the hidden frontier.