Editor's Note
Although ChatGPT has shown human-level performance on several professional and academic benchmarks, a recent study of its potential for clinical applications raised questions among surgeon evaluators. Findings were reported in the journal Surgery on January 20.
Specifically, researchers tested OpenAI’s general-purpose large language model on questions from the Surgical Council on Resident Education (SCORE) question bank. They also fed the AI a second commonly used surgical knowledge assessment, referred to in the study as Data-B. Questions were entered in 2 formats: open-ended and multiple-choice. Surgeon evaluators rated answers for accuracy, categorized reasons for model errors, and assessed the stability of performance on repeat queries.
The tool performed better on the multiple-choice questions, correctly answering 71.3% of those from SCORE and 67.9% of those from Data-B, versus 47.9% and 66.1%, respectively, of the open-ended versions. Common reasons for incorrect responses included inaccurate information in the question and accurate information with a circumstantial discrepancy. When asked the same question a second time, ChatGPT gave a different answer for 36.4% of the questions it had initially answered incorrectly.
Better performance on multiple-choice than on open-ended questions, together with inconsistency across repeat responses, raises questions about whether the model offers the safety and consistency required for clinical application, the researchers conclude. “Despite near or above human-level performance on question banks and given these observations, it is unclear whether large language models such as ChatGPT are able to safely assist clinicians in providing care.”