GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2
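Summary
The article questions the approach of simply scaling up parameter counts and training data, arguing that large models have hit a bottleneck, or even regressed, in truthfulness and in expressing uncertainty. It compares "hallucination rate" figures across several models (for example, GLM-5.2 at about 28%, Fable 5 at about 48%, GPT-5.5 at about 86%, and DeepSeek V4 Pro at about 94%) and points out that some larger models are more inclined to give plausible-looking but wrong answers when they are uncertain. It also uses a programming question about a Python asyncio event loop to show that a smaller model recognized a logical impossibility more quickly, while a larger model may produce a well-structured but incorrect solution. The article closes by proposing a "trilemma of modern LLMs": the trade-off among raw capability, uncertainty calibration (avoiding hallucination), and computational efficiency. It argues that the industry cannot keep extending models through scale alone.
Why read this
Comparisons like these give you grounds to drop the default assumption that "the bigger the model, the more reliable it is" when choosing technology. They shift the focus of evaluation to whether a model's answers on uncertain questions can be trusted, so that you avoid picking a model on size or leaderboard scores alone.
Original article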
Bigger models are not the way
Jun 18, 2026
A shift is happening among major AI labs, which are becoming increasingly skeptical of endless parameter-count and training-data scaling. The limits of this paradigm were put on the world stage when Claude Fable 5 was restricted by the US government just three days after its release, marking the first US AI ban on national security grounds. One of the biggest models in the world was banned because a single jailbreak was too much of a risk.
Bigger is better
The above is true in almost all cases. The biggest models in the world clearly score the highest on the Artificial Analysis Intelligence Index. Yet Z.ai’s newest model, GLM-5.2 (753B parameters, roughly 40B active), comes within just 4 points of GPT-5.5 and 9 points of Fable 5. Opus 4.8 and GPT-5.5 are proprietary and conservatively estimated to be in the 1-2T parameter range. If an open-weight (MIT-licensed) LLM can come so close to a closed-weight model estimated to be 1.5 to 2 times bigger, it is clear that actual intelligence has plateaued significantly.
Bigger is not better
It’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an answer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) has a ludicrous 94% hallucination rate on the AA-Omniscience benchmark. That means that on questions it couldn’t figure out, it said it didn’t know only about 6% of the time and confidently hallucinated an answer the rest of the time. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and GPT-5.5 was 86%.
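To make the metric concrete, here is a minimal sketch of the rate as the paragraph above describes it: of the questions a model failed to answer correctly, the share where it guessed wrong instead of abstaining. The function name and the counts are assumptions for illustration, and the benchmark's exact formula may differ.

```python
def hallucination_rate(wrong_answers: int, abstentions: int) -> float:
    """Share of missed questions answered wrongly instead of with "I don't know"."""
    missed = wrong_answers + abstentions
    if missed == 0:
        raise ValueError("the model missed no questions, so the rate is undefined")
    return wrong_answers / missed


# Illustrative counts only (not benchmark data): 94 wrong guesses and
# 6 abstentions out of 100 missed questions reproduce the 94% figure.
print(f"{hallucination_rate(94, 6):.0%}")  # 94%

# Rates exactly as quoted in the article.
quoted = {
    "GLM-5.2": 0.28,
    "Opus 4.8": 0.36,
    "Fable 5": 0.48,
    "GPT-5.5": 0.86,
    "DeepSeek V4 Pro": 0.94,
}
print(f"{quoted['GPT-5.5'] / quoted['GLM-5.2']:.1f}x")  # 3.1x
```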
That seems incredibly rough for such a huge, popular model. Let’s test it with a relatively complex Python question with a clear architectural flaw. [1]
DeepSeek V4 Pro - Hallucinated
Design a custom asyncio event loop policy in Python that overrides get_child_watcher().
Reasoned for 3m 52s (7.7k tokens)
Python
import os
import fcntl
import threading
import struct
import asyncio
import time
from asyncio import AbstractChildWatcher
class StateManager:
GLM-5.2 - Correct
Design a custom asyncio event loop policy in Python that overrides get_child_watcher().
Reasoned for 12s (799 tokens)
Below is a complete, production-ready implementation. A few important engineering notes up front, because the constraints you describe are unusual and a literal interpretation would be unsound:
- "Atomic, non-yielding read loop ... without asyncio.create_task and without raw select/poll." A non-yielding loop executed on the event loop thread would block the loop and therefore deadlock any subprocess machinery...
DeepSeek V4 Pro used almost 10 times the reasoning tokens yet produced a confidently incorrect response. GLM-5.2, on the other hand, took just 12 seconds and about 800 reasoning tokens to recognize the technical impossibility of a single-threaded task executing multiplexed I/O without ever yielding or using system polling. (For the non-technical, this is like asking a delivery driver to drop off packages at 3 houses at the same time without ever stopping the truck.)
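The impossibility is easy to reproduce on a small scale. The sketch below is not the test prompt or either model's answer. It is a minimal demonstration, with hypothetical names (`ticker`, `read_loop`, `count_ticks`), that a coroutine which never yields starves everything else on a single-threaded asyncio event loop.

```python
import asyncio
import time


async def ticker(ticks: list, stop: asyncio.Event) -> None:
    # Stands in for any other work the loop must service,
    # such as child-process bookkeeping.
    while not stop.is_set():
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def read_loop(yielding: bool, duration: float = 0.2) -> None:
    # Spins for `duration` seconds. With yielding=False it never awaits,
    # so the event loop cannot run anything else in the meantime.
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        if yielding:
            await asyncio.sleep(0)  # hand control back to the event loop


async def count_ticks(yielding: bool) -> int:
    ticks: list = []
    stop = asyncio.Event()
    task = asyncio.create_task(ticker(ticks, stop))
    await asyncio.sleep(0)  # let the ticker record its first tick
    before = len(ticks)
    await read_loop(yielding)
    during = len(ticks) - before  # ticks recorded while read_loop ran
    stop.set()
    await task
    return during


if __name__ == "__main__":
    print("non-yielding:", asyncio.run(count_ticks(False)))  # 0
    print("yielding:", asyncio.run(count_ticks(True)))       # greater than 0
```

The non-yielding run records no ticks at all while the loop spins, which is the starvation GLM-5.2 flagged. The yielding run does make progress, but only because it gives up the "never yields" constraint.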
GPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size, they simply did not learn how to say “I don’t know” or to recognize intricate logical and technical fallacies. While it is true that a multi-trillion-parameter model will always beat a lightweight consumer model on paper (today at least), the commoditization of these huge models is blurring the line between benchmark performance and actual real-world truthfulness and accuracy.
The trilemma of modern AI
We should be very cautious about blindly increasing reasoning budget, corpus size, or parameter count. DeepSeek V4 Pro spent more than three minutes wasting compute in a reasoning loop just to generate a beautifully structured, confidently incorrect solution. Yet a model half its size identified the paradox almost instantaneously. Even in today’s era, as we near AGI, many of the biggest models will actively convince you that a solution is correct and that the problem was solvable as stated.
Moving forward, the industry cannot keep training bigger and bigger models, since their intelligence not only plateaus but often gets worse. This applies to consumers too: we cannot keep selecting models based on size or theoretical performance alone. Training and selection of AI need to be designed around the unsolved trilemma of modern LLMs: raw capability, uncertainty calibration (hallucination rate), and computational efficiency.
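One way to act on the trilemma when choosing a model is to compare candidates on all three axes at once rather than ranking by a single score. The sketch below is an assumption about how that could look, not a method from the article. It keeps only the candidates that no other candidate beats on every axis (a Pareto front). The `Candidate` fields, the model names, and all numbers are placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    capability: float          # higher is better, e.g. a benchmark index
    hallucination_rate: float  # lower is better, between 0 and 1
    cost: float                # lower is better, e.g. reasoning tokens per task


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on every axis and better on one."""
    no_worse = (
        a.capability >= b.capability
        and a.hallucination_rate <= b.hallucination_rate
        and a.cost <= b.cost
    )
    strictly_better = (
        a.capability > b.capability
        or a.hallucination_rate < b.hallucination_rate
        or a.cost < b.cost
    )
    return no_worse and strictly_better


def pareto_front(candidates: list) -> list:
    """Candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Placeholder values for illustration only.
candidates = [
    Candidate("model-a", capability=50.0, hallucination_rate=0.90, cost=7700.0),
    Candidate("model-b", capability=46.0, hallucination_rate=0.30, cost=800.0),
    Candidate("model-c", capability=44.0, hallucination_rate=0.95, cost=9000.0),
]

print([c.name for c in pareto_front(candidates)])  # ['model-a', 'model-b']
```

Here model-c drops out because model-a matches or beats it everywhere, while model-a and model-b both survive: one wins on capability, the other on calibration and cost, and the choice between them depends on the workload.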
Footnotes
- Both models were given “high” reasoning effort and temperature 1, and were tested on OpenRouter with the following system prompt: “You respond professionally. You are a highly capable coding assistant well-versed in Python.” GLM-5.2 was served by Z.ai (FP8 precision) and DeepSeek V4 Pro was served by Baidu Qianfan (FP8 precision).