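Comparison of Medical Image Recognition Capabilities of Cutting-Edge Domestic Large Language Models: Taking Intracranial Hemorrhage as an Example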

Liao Jianyong; Zhang Bin; Gou Zhenheng; Yao Yonggang; Zhang Jie

doi:10.15953/j.ctta.2025.447


Abstract

Objective: To compare the performance of four cutting-edge domestic large language models (LLMs) in identifying intracranial hemorrhage on head computed tomography (CT) and to explore the applicability of general-purpose LLMs in medical imaging. Materials and Methods: Head CT images of patients with intracranial hemorrhage were retrospectively collected at our hospital from July 2025 to September 2025. The images were uploaded to four LLMs: DeepSeek-V3.2 Yuanbao Edition (DeepSeek), Doubao-seed-1.6 (Doubao), Qwen-VL-Max (Qwen), and ERNIE-X1.1 (ERNIE). Three core questions were asked one at a time for each case: imaging technique (Q1), presence of hemorrhage (Q2), and hemorrhage type (Q3). Each case was submitted in a new conversation to avoid context interference, and the questions were repeated 1 week later. The answers were recorded and analyzed statistically. Results: The study included 102 patients with intracranial hemorrhage and 102 matched patients with normal head CT. All LLMs identified the imaging technique correctly in 100% of cases. For Q2 (presence of hemorrhage), Doubao performed best, with an accuracy of 91% and a sensitivity of 83%, significantly better than the other three LLMs (P < 0.001). All LLMs showed high specificity (98% to 99%). For Q3 (hemorrhage type), Doubao again performed best, with an overall accuracy of 67% and recognition rates of 19% for epidural hematoma, 98% for intracerebral hemorrhage, 33% for subarachnoid hemorrhage, and 71% for subdural hematoma. All LLMs recognized intracerebral hemorrhage at a relatively high rate but recognized epidural hematoma, subarachnoid hemorrhage, and subdural hematoma at relatively low rates. In consistency testing, Doubao achieved the highest agreement for both Q2 (Kappa = 0.87) and Q3 (Kappa = 0.71), whereas the remaining LLMs performed poorly on Q3.
Conclusion: Domestic cutting-edge LLMs already have some capability for image recognition in medical imaging, but their ability to identify intracranial hemorrhage and the stability of their answers differ between models. Among the four LLMs studied, Doubao performed best.
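The metrics reported in the abstract (accuracy, sensitivity, specificity, and the Kappa coefficient) can be illustrated with a minimal Python sketch. It assumes that the consistency test is Cohen's kappa between the first answers and the answers repeated one week later, which the abstract does not state explicitly. The confusion-matrix counts below are hypothetical: they were chosen only to be consistent with the rounded percentages quoted for Doubao on Q2 and are not taken from the study data. The function names and the two answer lists are likewise illustrative.

```python
from collections import Counter


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # hemorrhage cases correctly flagged
        "specificity": tn / (tn + fp),   # normal scans correctly cleared
    }


def cohen_kappa(first, second):
    """Cohen's kappa for two equally long sequences of categorical answers."""
    if len(first) != len(second) or not first:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(first)
    # Observed agreement: fraction of cases with the same answer both times.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal label frequencies.
    counts_first, counts_second = Counter(first), Counter(second)
    expected = sum(counts_first[k] * counts_second[k] for k in counts_first) / (n * n)
    if expected == 1.0:
        return 1.0  # both rounds gave one identical label for every case
    return (observed - expected) / (1 - expected)


# Hypothetical counts for 102 hemorrhage cases and 102 normal scans.
metrics = binary_metrics(tp=85, fn=17, tn=101, fp=1)

# Hypothetical answers to Q2 in two rounds (1 = hemorrhage, 0 = no hemorrhage).
round_one = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
round_two = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
kappa = cohen_kappa(round_one, round_two)
```

The same `cohen_kappa` function works unchanged for Q3 because it accepts any hashable labels, for example the four hemorrhage subtypes as strings.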
