February 1, 2024

ChatGPT study prompts questions about clinical applications for large-language-model AI

Editor's Note

Although ChatGPT has shown human-level performance on several professional and academic benchmarks, a recent study of its potential for clinical applications raised questions among surgeon evaluators. Findings were reported in the journal Surgery on January 20.

Specifically, researchers tested OpenAI’s general-purpose large language model on questions from the Surgical Council on Resident Education (SCORE) question bank. They also fed the AI a second commonly used surgical knowledge assessment, referred to in the study as Data-B. Questions were entered in two formats: open-ended and multiple-choice. Surgeon evaluators scored the answers for accuracy, categorized the reasons for model errors, and assessed the stability of performance on repeat queries.

The tool performed better on multiple-choice questions, correctly answering 71.3% of the SCORE questions and 67.9% of the Data-B questions, versus 47.9% and 66.1%, respectively, of the open-ended questions. Common reasons for incorrect responses included inaccurate information and accurate information that did not apply to the circumstances of the question. When asked the same question a second time, ChatGPT changed its answer for 36.4% of the questions it had initially answered incorrectly.

The model’s better performance on multiple-choice than on open-ended questions, together with its inconsistent responses, raises questions about whether it offers the safety and consistency required for clinical application, the researchers conclude. “Despite near or above human-level performance on question banks and given these observations, it is unclear whether large language models such as ChatGPT are able to safely assist clinicians in providing care.”
