Popular AI tools like GPT-4 generate smooth, human-like text and perform so well on various language tasks that it’s becoming increasingly difficult to determine whether the person you’re communicating with is a human or a machine.
This scenario echoes Alan Turing’s famous thought experiment, in which he proposed a test to evaluate whether a machine could exhibit human-like behavior to the extent that a human judge could no longer reliably distinguish between human and machine based solely on their responses.
UC San Diego’s Department of Cognitive Sciences decided to see how modern AI systems fared and evaluated ELIZA (a simple rule-based chatbot from the 1960s, used as a baseline in the experiment), GPT-3.5 and GPT-4 in a controlled Turing test. Participants had a five-minute conversation with either a human or an AI and then had to decide whether their conversation partner was human.
No better than chance
GPT-4 and GPT-3.5 were given instructions on how to respond to messages. Researchers prompted the models to “take on the personality of a young person who was very terse, did not take the game very seriously, used slang and occasionally made spelling mistakes.” The prompt also included “general information about the game’s setup and information about recent news stories that occurred after the model’s training stopped.” The two models were accessed via the OpenAI API with the temperature set to 1, and to keep the AI from replying unrealistically fast, its messages were sent with a delay based on the number of characters in the message.
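For readers curious what such a setup might look like, here is a minimal sketch of that pattern: a persona system prompt, an OpenAI API call with temperature 1, and a reply delay scaled to message length. The model name, prompt wording, and delay constants are illustrative assumptions, not the study’s actual values.

```python
# Illustrative sketch only; model name, prompt text and delay constants are assumptions.
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONA_PROMPT = (
    "Take on the personality of a young person who is very terse, "
    "does not take the game very seriously, uses slang and "
    "occasionally makes spelling mistakes."
)

def reply_to(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",   # placeholder; the study used GPT-4 and GPT-3.5
        temperature=1,   # as described in the study
        messages=[
            {"role": "system", "content": PERSONA_PROMPT},
            {"role": "user", "content": message},
        ],
    )
    text = response.choices[0].message.content
    # Delay the reply in proportion to its character count so it is not
    # sent unrealistically fast (hypothetical per-character rate).
    time.sleep(1.0 + 0.03 * len(text))
    return text
```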
In the study, 500 participants were assigned to one of five groups and had a conversation with either a human or one of the three AI systems. The game interface resembled a typical messaging app. After five minutes, participants judged whether they believed their conversation partner was human or AI and gave reasons for their decision.
The results were interesting. GPT-4 was identified as human 54% of the time, ahead of GPT-3.5 (50%), with both significantly outperforming ELIZA (22%) but lagging behind real humans (67%). Participants were no better than chance at identifying GPT-4 as AI, indicating that current AI systems can trick people into believing they are human.
Analysis of the results showed that interrogators often relied on linguistic style, social-emotional factors and knowledge-based questions to decide whether they were talking to a human or a machine.
Details of the test and full results are available in the paper published on the arXiv preprint server.