Please use this identifier to cite or link to this item:
https://hdl.handle.net/11000/37213
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Ros Arlanzón, Pablo | - |
dc.contributor.author | Pérez Sempere, Ángel | - |
dc.contributor.other | Departamentos de la UMH::Medicina Clínica | es_ES |
dc.date.accessioned | 2025-09-05T07:49:34Z | - |
dc.date.available | 2025-09-05T07:49:34Z | - |
dc.date.created | 2024-11 | - |
dc.identifier.citation | JMIR Med Educ. 2024 Nov 14;10:e56762 | es_ES |
dc.identifier.issn | 2369-3762 | - |
dc.identifier.uri | https://hdl.handle.net/11000/37213 | - |
dc.description.abstract | Background: With the rapid advancement of artificial intelligence (AI) in various fields, evaluating its application in specialized medical contexts becomes crucial. ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine. Methods: We conducted a comparative analysis using the 2022 neurology specialist examination results from 120 neurologists and responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions focused on clinical neurology and health legislation. Questions were classified according to Bloom's Taxonomy. Performance was analyzed statistically, including the κ coefficient for response consistency. Results: Human participants achieved a median score of 5.91 (IQR: 4.93-6.76), with 32 neurologists failing to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th and answering 81.8% of questions correctly (score 7.57), surpassing several human specialists. No significant differences in performance were observed between lower-order and higher-order questions. Additionally, ChatGPT-4 showed greater response consistency, with a κ coefficient of 0.73 compared with 0.69 for ChatGPT-3.5. Conclusions: This study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4's performance, which exceeded the median score of human participants in a rigorous neurology examination, represents a significant milestone in AI development, suggesting its potential as an effective tool in specialized medical education and assessment. | es_ES |
dc.format | application/pdf | es_ES |
dc.format.extent | 8 | es_ES |
dc.language.iso | eng | es_ES |
dc.publisher | JMIR Publications | es_ES |
dc.rights | info:eu-repo/semantics/openAccess | es_ES |
dc.rights | Attribution-NonCommercial-NoDerivatives 4.0 International | * |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | * |
dc.subject | artificial intelligence | es_ES |
dc.subject | ChatGPT | es_ES |
dc.subject | clinical decision-making | es_ES |
dc.subject | OpenAI | es_ES |
dc.title | Evaluating AI Competence in Specialized Medicine: Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain | es_ES |
dc.type | info:eu-repo/semantics/article | es_ES |
dc.relation.publisherversion | 10.2196/56762 | es_ES |
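
The abstract reports response consistency as a κ coefficient (0.69 for ChatGPT-3.5, 0.73 for ChatGPT-4). As a minimal, illustrative sketch of how such a statistic can be computed for two runs of a model on the same multiple-choice questions, the Python snippet below implements Cohen's κ; the function name and example data are hypothetical and are not drawn from the article.

```python
from collections import Counter

def cohen_kappa(run1, run2):
    """Cohen's kappa between two answer sequences, e.g. two runs of the
    same model on the same multiple-choice questions (illustrative only)."""
    assert len(run1) == len(run2) and len(run1) > 0
    n = len(run1)
    # Observed agreement: fraction of questions answered identically.
    p_o = sum(a == b for a, b in zip(run1, run2)) / n
    # Chance agreement: product of each option's marginal frequency in the two runs.
    c1, c2 = Counter(run1), Counter(run2)
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in set(run1) | set(run2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two runs over ten questions with options A-D.
run_a = list("ABCDABCDAB")
run_b = list("ABCDABCCAB")
print(round(cohen_kappa(run_a, run_b), 2))  # ~0.86
```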

View/Open:
Evaluating AI Competence in Specialized Medicine Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain.pdf
312.87 kB
Adobe PDF