AI Can Pass Standardized Tests—But It Would Fail Preschool
Artificial Intelligence researchers have long dreamed of building a computer as knowledgeable and communicative as the one in Star Trek, which could interact with humans in natural (i.e., human) language. Last week, we seemed to boldly go toward that ideal. The New York Times reported that a team at the Allen Institute for Artificial Intelligence (AI2) had achieved “an artificial-intelligence milestone.” AI2’s program, Aristo, not only passed but also excelled on a standardized eighth-grade science test. The machine, the Times heralded, “is ready for high school science. Maybe even college.”
Melanie Mitchell is professor of computer science at Portland State University and External Professor at the Santa Fe Institute. Her book Artificial Intelligence: A Guide for Thinking Humans will be published in October by Farrar, Straus, and Giroux.
Or maybe not. Aristo isn’t the first AI system to shine on a test designed to gauge human knowledge and reasoning abilities. In 2015 one system matched a 4-year-old’s performance on an IQ test, prompting the BBC headline “AI had IQ of four-year-old child.” Another group reported their system could solve SAT geometry questions “as well as the average American 11th-grade student.” More recently, Stanford researchers created a question-answering test that prompted the New York Post to announce that “AI systems are beating humans in reading comprehension.” The truth is that while these systems perform well on specific language-processing tests, they can only take the test. None come anywhere close to matching humans in reading comprehension or other general abilities that the test was designed to measure.
The problem is that today’s machines, which excel at certain narrow tasks, still lack what we might call common sense. This includes the vast, and mostly unconscious, background knowledge that we use to understand the situations we encounter and the language we communicate with. Common sense also includes our ability to apply this knowledge quickly and flexibly to new circumstances.
The goal of endowing machines with common sense is as old as the field of AI itself, and is, I would venture, AI’s hardest open problem. Beginning in the 1990s, research on common sense took a back seat to statistical, data-driven AI approaches—especially in the form of neural networks and “deep learning.” But researchers have recently found that deep learning systems lack the robustness and generality of human learning, primarily because they lack our broad knowledge and flexible reasoning capabilities. Giving machines humanlike common sense is now at the top of AI’s to-do list.
Open-ended question-answering, like that of the Star Trek computer, is still too hard for current AI systems, so researchers make progress by creating programs that can perform well on “benchmarks”—particular data sets that represent a specific task. Aristo’s benchmark consists of a set of multiple-choice questions from the New York State Regents Exam in science. A sample question:
Which equipment will best separate a mixture of iron filings and black pepper?
(a) magnet (b) filter paper (c) triple-beam balance (d) voltmeter
Aristo’s creators believe that developing AI systems to answer such questions is one of the best ways to push the field forward. “While not a full test of machine intelligence,” they note, these questions “do explore several capabilities strongly associated with intelligence, including language understanding, reasoning, and use of common-sense knowledge.”
Aristo is a complicated system that combines several AI methods. However, the component that accounts for almost all of the system’s success is a deep neural network that has been trained to be a so-called language model—a mechanism that, given a sequence of words, can predict what the next word will be. “I was driving way too fast when I was stopped by the …” What’s the next word? Maybe “police.” Probably not “grapefruit.” Given a sequence of words, a language model computes the probability that each of the hundreds of thousands of words in its vocabulary will be the next one in the sequence.
Aristo’s language model was trained on word sequences from millions of documents (including all of English Wikipedia). After training with this vast collection of English, the neural network has presumably learned some useful things about language in general. At this point the network can be “fine-tuned” to learn to answer multiple-choice questions. When it takes the Regents exam, its input is the question plus the four possible answers; the output is the probability that each answer is correct. The network returns the highest-probability answer as its guess.
Aristo was tested on 119 questions from the eighth-grade exam and was correct on over 90 percent of them, a remarkable performance. It was also correct on over 83 percent of 12th-grade questions. While the Times reported that Aristo “passed the test,” the AI2 team noted that the actual tests New York students take include questions that refer to diagrams, as well as “direct answer” questions, neither of which Aristo was able to handle.
This is exciting progress, but we must keep in mind that a high score on a particular data set does not always mean that a machine has actually learned the task its human programmers intended. Sometimes the data used to train and test a learning system has subtle statistical patterns—I’ll call these giveaways—that allow the system to perform well without any real understanding or reasoning.
For example, one neural-network language model—similar to the one Aristo uses—was reported in 2019 to capably determine whether one sentence logically implies another. However, the reason for the high performance was not that the network understood the sentences or their connecting logic; rather, it relied on superficial syntactic properties such as how much the words in one sentence overlapped those in the second sentence. When the network was given sentences for which it could not take advantage of these syntactic properties, its performance plummeted.
Dozens of papers have been published over the past