We'll Be Arguing for Years Whether Large Language Models Can Make New Scientific Discoveries

Jun 13, 2025
Guest Commentary

This post originally appeared on RAND.

---

When OpenAI released its newest AI models o3 and o4-mini in April, its president Greg Brockman made an intriguing claim: “These are the first models where top scientists tell us they produce legitimately good and useful novel ideas.”

If AI can indeed make scientific discoveries, that would not only have practical impacts for society but would also provide evidence that we've achieved true digital intelligence. But reaching expert consensus on what counts as a “scientific discovery by an AI” may prove more elusive than expected.

Ever since OpenAI released ChatGPT in 2022, a public debate has raged about whether the leading large language models (LLMs) are showing “sparks of artificial general intelligence” or are merely “stochastic parrots” or “autocomplete on steroids.” This debate has become repetitive, in part because neither side has offered a compelling definition of “intelligence.” The leading LLMs can accomplish impressive tasks, but so far they aren't generating economic value from tasks other than computer programming (and even there, the story is controversial).

In fact, it has proven surprisingly difficult to effectively quantify the performance of LLMs at all. The latest models post ever-increasing scores on benchmarks like standardized exams, even with Ph.D.-level questions. But experts have questioned the practical relevance of these metrics; in some cases they may simply reflect that the AI models had already been trained on the test questions or had stumbled across a correct solution. (And LLMs still can't beat Pokémon.) Are these LLMs displaying intelligence, or just drawing on a repository of existing knowledge?

Looking at a different axis of capability, AI models have generated award-winning pieces of art and music that some would say are completely novel. This is impressive, but likely too subjective to establish proof of intelligence for most people. It's hard to argue against someone's position that “This AI-generated art is technically impressive, but it seems to me to lack real artistic creativity.”

Scientific discovery is an achievement that could thread the needle between narrow, game-able performance on standardized tests and subjective achievements in art. Suppose that an AI model were to make a truly original prediction of a fundamentally new and important scientific phenomenon—one that was later confirmed in the “real world” of a laboratory. That would make it difficult to deny that the model had at least some form of intelligence. Scientific discovery is one of the distinguishing features of human civilization. By definition, a novel prediction cannot simply be pulled from the model's training dataset. And the experimental confirmation would provide an objective and unambiguous demonstration of correctness. (The advent of AIs that are capable of scientific discovery could also raise important policy questions around the proper way to regulate them, since some forms of scientific discovery can be dangerous.)

AI already has made huge breakthroughs in applied science. The most famous example is DeepMind's AlphaFold, which revolutionized our understanding of protein folding and earned its creators a well-deserved Nobel Prize in Chemistry. AI has also led to the discovery of new materials and new math algorithms, among countless other discoveries. It has even been used to save lives.

But, importantly, these discoveries were essentially computational in nature. Most of these AI models solved (or at least made progress on) incredibly difficult but well-understood computational math problems, sometimes paired with a large amount of quantitative experimental data. Most of these models relied on a mathematical technique called “reinforcement learning”—the approach that allowed AIs to dominate humans at games like Go. (This approach is quite different from how LLMs like ChatGPT operate. There are other AI tools for scientific discovery that work somewhat more like LLMs and image generators, but cannot use human language.) Some might argue that the AIs that perform these impressive but narrow computations are just souped-up supercomputers.

A much more conclusive demonstration of “understanding” would be a new conceptual scientific idea. Not a black-box computation, but an idea that once explained in words would prompt a human to say, “I get it now,” and enable them to build on the idea with further insights. Some of these kinds of new ideas, like Darwin's theory of natural selection or Einstein's insight that gravity results from the curvature of spacetime, have launched scientific revolutions that forever changed the way we think about the field. These kinds of discoveries don't just advance the forefront of science; they make young children want to become scientists.

Large language models like ChatGPT have proven themselves useful for practical day-to-day science tasks, like performing literature reviews, writing computer code, or cleaning up a draft paper for an author who is not writing in their native language. But I have not found any examples of an LLM synthesizing its massive training dataset into a new conceptual discovery akin to August Kekulé's daydream of a snake eating its own tail, which led him to finally understand the ringlike chemical structure of benzene.

Greg Brockman's claim about o3 and o4-mini suggests that the leading LLMs may finally be crossing that milestone. There are some hints from practicing scientists that this may be the case, and some early attempts to automate more of the scientific discovery process. (Although other scientists are skeptical that those efforts will succeed.) The lowest-hanging fruit may be in pure mathematics research rather than in experimental science. New math theorems can, with effort, be expressed precisely enough to be rigorously verified entirely by computer. An AI model for math discovery could not only propose a new idea but also prove that it was correct.
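
To make that concrete, here is a deliberately trivial illustration (mine, not an example produced by any AI model) of what machine-checkable mathematics looks like in the Lean proof assistant. The statement and the proof are both written formally, and the computer verifies every inference step.

```lean
-- A textbook fact, not a new discovery: addition on the natural numbers
-- is commutative. Lean checks the proof mechanically, so a result written
-- in this form is either verified or rejected; there is no arguing over
-- whether the proof is correct.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The open question, of course, is not whether an AI can restate textbook facts like this one, but whether it can produce a formal statement and proof that mathematicians judge to be genuinely new and important.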

But unfortunately, I doubt that an AI will deliver an incontrovertible slam-dunk demonstration of “legitimately good and useful novel ideas” any time soon. The words “useful” and even “novel” turn out in practice to be unexpectedly fuzzy and subjective.

It isn't terribly difficult to produce a correct mathematical result or a scientific finding (for example, by combining several well-known facts in a simple logical chain to produce a new fact). The hard part is finding an interesting and important new result. And—especially in the less-applied branches of math and science—what the research community considers interesting and important is more arbitrary than its members would like to admit. Just like everyone else, researchers judge one another's work partly based on subjective notions of beauty or whether the topic happens to be in vogue.

For the next several years, I suspect we will find ourselves in a messy middle ground. AI models will continue to produce increasingly impressive discussions of scientific and mathematical concepts, along with some mistakes. Some experts will say that these discussions are groundbreaking, and others will dismiss them as obvious or unimportant. Nonexperts will have a hard time assessing which side is correct. Researchers will iterate ideas through LLMs more frequently, and it will be difficult to determine after the fact how much credit (if any) belongs to the LLM for a discovery.

I hope that I am wrong, and that an AI will soon make an indisputable scientific breakthrough. A researcher could soon devise (or stumble across) a clever LLM prompt that generates a response far beyond the researcher's own capacity to produce, and that response could significantly change the course of some field of science or math. But I suspect that instead, AI's contribution to scientific discoveries will remain ambiguous. And the interminable debates about whether LLMs demonstrate intelligence will continue.
