A new study from the University of California, San Diego has found that OpenAI's GPT-4.5, the company's most advanced large language model, and Meta's Llama-3.1-405B have passed a version of the Turing Test, an evaluation long regarded as a key milestone in artificial intelligence.
In a test built around Alan Turing's original concept, in which a human judge chats with both a real person and a machine, GPT-4.5 convinced judges that it was the human more often than the actual person did. With the help of a strategic prompt called "PERSONA," GPT-4.5 was judged to be the human 73% of the time. Meta's Llama-3.1-405B also passed the threshold, with a 56% success rate. In contrast, the widely used GPT-4o model scored only 21% when given minimal instructions.
These results show just how far AI language models have come in mimicking human conversation. However, lead researcher Cameron Jones was quick to note that passing the Turing Test doesn't mean we've reached artificial general intelligence (AGI), the level at which machines truly understand and think like humans.
The PERSONA prompt played a big role in GPT-4.5's success. It asked the AI to behave like a young, introverted person familiar with internet culture and slang. Tailoring the model's behavior this way made it more believable as a human over the five-minute chat conversations.
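For readers curious how a persona like this is imposed in practice, the sketch below shows the general pattern: a system message describing the character the model should play is prepended to the chat history before each reply. This is a minimal illustration only; the prompt wording and the model name are assumptions made for this example, not the study's actual PERSONA prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative persona-style system prompt (assumption: not the study's
# actual PERSONA text, which is not reproduced in this article).
PERSONA_SYSTEM_PROMPT = (
    "You are a young, somewhat introverted person who spends a lot of time "
    "online. You know internet culture and slang, type casually, and keep "
    "your replies short."
)

def persona_reply(conversation: list[dict]) -> str:
    """Return the model's next chat message, conditioned on the persona."""
    response = client.chat.completions.create(
        model="gpt-4.5-preview",  # model name is an assumption for this sketch
        messages=[{"role": "system", "content": PERSONA_SYSTEM_PROMPT}]
                 + conversation,
    )
    return response.choices[0].message.content

print(persona_reply([{"role": "user", "content": "hey, how's your day going?"}]))
```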
The study also included a range of models for comparison, including GPT-4o, Meta's Llama-3.1-405B, and even the classic 1960s chatbot ELIZA. ELIZA, being far more primitive, served as a baseline, helping confirm that GPT-4.5's results reflected genuine conversational skill rather than chance.
The test itself followed Turing’s original three-way format: a judge, a real person, and an AI, all participating in chat conversations. The judges had to decide which participant was human. Not only did GPT‑4.5 regularly fool the judges, but in many cases, it was more convincing than the actual human participants.
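To make the three-party format concrete, here is a toy sketch of a single round under simplifying assumptions (a fixed list of judge questions, one guess at the end). Every name in it, including the stub witnesses and the random judge, is invented for illustration; the authors' actual test software, mentioned below, is separate.

```python
import random
from typing import Callable

# A witness is anything that maps the judge's message to a reply,
# whether a person at a keyboard or a language model behind an API.
Witness = Callable[[str], str]

def three_party_round(questions: list[str],
                      human: Witness,
                      ai: Witness,
                      pick_human: Callable[[list[str], list[str]], str]) -> bool:
    """Run one round: the judge questions both witnesses, then guesses
    which slot ("A" or "B") holds the human. Returns True if the AI
    wins, i.e. is mistaken for the human."""
    # Randomly assign witnesses to slots so position gives nothing away.
    slots = {"A": human, "B": ai} if random.random() < 0.5 else {"A": ai, "B": human}
    replies = {"A": [], "B": []}
    for q in questions:
        for slot, witness in slots.items():
            replies[slot].append(witness(q))
    verdict = pick_human(replies["A"], replies["B"])
    return slots[verdict] is ai  # AI wins if the judge picked its slot

# Stub witnesses and a judge that guesses at random, just to run the loop.
human_stub: Witness = lambda q: "honestly can't remember, why?"
ai_stub: Witness = lambda q: "lol good question, idk tbh"
random_judge = lambda a, b: random.choice(["A", "B"])

print("AI fooled the judge:",
      three_party_round(["what did you have for lunch?"],
                        human_stub, ai_stub, random_judge))
```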
This research, published on the arXiv preprint server, adds a new chapter to the decades-long debate about what it means for machines to "think." While some experts argue that language fluency is not the same as intelligence, these findings undeniably push the boundaries of what AI can do, and of how well it can imitate us.
The full Turing Test setup, developed by Jones and co-researcher Benjamin Bergen, is even available online for others to try. Whether this brings us closer to AGI or simply reflects our own assumptions about intelligence, one thing is clear: the line between human and machine is becoming harder to see.