The artificial intelligence language model GPT-3 performed on par with college students at solving logic problems from a standardized test. According to the researchers who conducted the experiment, this raises the question of whether the technology mimics human reasoning or relies on a new kind of cognitive process. Answering that question, however, would require access to the software that underlies GPT-3 and other AI systems.

People readily solve new problems without special training or practice by comparing them to familiar problems and extending the solution to the new case. This process, called analogical reasoning, has long been considered a uniquely human ability. Now, artificial intelligence may be able to match humans at it.

Research by UCLA psychologists shows that the artificial intelligence language model GPT-3 performs about as well as college students when asked to solve the kinds of reasoning problems that typically appear on intelligence tests and standardized tests like the SAT. The study was published in the journal Nature Human Behaviour.

Without access to the GPT-3 software — which is overseen by the company that created it, OpenAI — the UCLA researchers can’t say for sure how its reasoning powers work. They also write that while GPT-3 performs much better than expected on some reasoning tasks, the popular AI tool still fails spectacularly on others.

“As impressive as our results are, it’s important to emphasize that this system has major limitations,” said one of the study’s authors, Taylor Webb, a postdoctoral fellow in psychology at UCLA. “It can do analogical reasoning, but it can’t do things that are very easy for humans, such as using tools to solve a physical task.”

Webb and his colleagues tested GPT-3’s ability to solve problems inspired by a test known as Raven’s Progressive Matrices, which asks the subject to predict the next image in a complex arrangement of shapes. To allow GPT-3 to “see” the shapes, Webb converted the images into a text format that GPT-3 could process; this approach also guaranteed that the AI had never encountered those specific questions before.
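The study’s exact text encoding is not reproduced here, but the following minimal Python sketch illustrates the general idea of rendering a matrix-style shape puzzle as plain text, with the final cell left blank for a language model to predict. The function name, the prompt wording, and the example problem are all hypothetical choices for illustration, not the researchers’ actual materials.

```python
# Illustrative sketch only: a hypothetical way to turn an image-based
# matrix puzzle into a text prompt a language model can read.

def encode_matrix_problem(rows):
    """Render a grid of shape descriptions as text, replacing the
    final cell with '?' so the model must predict the missing panel."""
    lines = []
    for r, row in enumerate(rows):
        cells = []
        for c, cell in enumerate(row):
            if r == len(rows) - 1 and c == len(row) - 1:
                cells.append("?")  # the missing panel to be predicted
            else:
                cells.append(cell)
        lines.append(" | ".join(cells))
    return "Complete the pattern:\n" + "\n".join(lines) + "\nAnswer:"

# Hypothetical example: the number of shapes increases across each row.
problem = [
    ["1 circle", "2 circles", "3 circles"],
    ["1 square", "2 squares", "3 squares"],
    ["1 triangle", "2 triangles", "3 triangles"],  # last cell shown as "?"
]

print(encode_matrix_problem(problem))
```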

The researchers asked 40 UCLA undergraduates to solve the same problems.

“Surprisingly, GPT-3 not only performed as well as humans, but also made similar errors,” said UCLA psychology professor Hongjing Lu, senior author of the study.

GPT-3 answered 80 percent of the problems correctly, far better than the human subjects’ average score of just under 60 percent.

The researchers also prompted GPT-3 to solve a set of SAT analogy questions that they believe had never been published online, meaning the questions were unlikely to have been part of GPT-3’s training data. They compared GPT-3’s results with the published SAT scores of college applicants and found that the AI outperformed the average human score.

The researchers then asked GPT-3 and student volunteers to solve analogies based on short stories, prompting them to read one passage and then identify a different story that conveyed the same meaning. GPT-3 did less well than the students on these problems, although GPT-4, the latest iteration of OpenAI’s technology, performed better than GPT-3.

UCLA researchers hope to investigate whether large language models are actually beginning to “think” like humans, or whether they are doing something entirely different that merely mimics human thought.

Source: Science Daily