Editor’s Note
Large language models (LLMs) outperformed traditional methods in predicting postoperative complications, according to a study on artificial intelligence (AI) in perioperative risk assessment published February 11 in the journal Nature. The results indicate that AI-driven models could enhance patient safety and streamline clinical workflows by identifying patients at risk of complications earlier.
Researchers analyzed nearly 85,000 preoperative clinical notes from Barnes-Jewish Hospital, comparing LLMs against conventional word embedding techniques. The LLMs, including bioGPT, ClinicalBERT, and bioClinicalBERT, were trained on clinical notes recorded before surgery to predict six postoperative risks: 30-day mortality, acute kidney injury (AKI), pulmonary embolism (PE), pneumonia, deep vein thrombosis (DVT), and delirium. Compared to traditional word embedding models such as word2vec and GloVe, the pretrained LLMs improved the area under the receiver operating characteristic curve (AUROC) by up to 38.3% and the area under the precision-recall curve (AUPRC) by up to 33.2%, demonstrating superior predictive accuracy, the researchers wrote.
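For readers less familiar with these two metrics, the sketch below shows how AUROC and AUPRC comparisons of this kind are typically computed with scikit-learn. The labels and risk scores here are invented for illustration and are not the study's data; the relative-improvement calculation at the end is one plausible reading of how percentage gains like "up to 38.3%" can be derived.

```python
# Minimal sketch of an AUROC/AUPRC model comparison, assuming scikit-learn.
# All numbers below are made up for illustration, not the study's data.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical binary labels (1 = complication occurred) and risk scores
# produced by two models for the same ten patients.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
scores_word2vec = np.array([0.2, 0.4, 0.5, 0.3, 0.6, 0.1, 0.5, 0.4, 0.2, 0.3])
scores_llm      = np.array([0.1, 0.3, 0.8, 0.2, 0.9, 0.1, 0.4, 0.7, 0.2, 0.3])

for name, scores in [("word2vec", scores_word2vec), ("LLM", scores_llm)]:
    auroc = roc_auc_score(y_true, scores)            # area under ROC curve
    auprc = average_precision_score(y_true, scores)  # area under PR curve
    print(f"{name}: AUROC={auroc:.3f}, AUPRC={auprc:.3f}")

# One way a relative improvement percentage can be computed:
base = roc_auc_score(y_true, scores_word2vec)
gain = 100 * (roc_auc_score(y_true, scores_llm) - base) / base
print(f"Relative AUROC improvement: {gain:.1f}%")
```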
Refining the models through self-supervised fine-tuning, a process in which the LLMs were further adapted to perioperative care notes, increased AUROC by up to 3.2% and AUPRC by up to 1.5%. Incorporating additional outcome labels into training (semi-supervised fine-tuning) raised AUROC by 1.8% and AUPRC by 2% over the self-supervised models. The highest performance came from a foundation model that leveraged multi-task learning across all six surgical risks, yielding further improvements of 3.6% in AUROC and 2.6% in AUPRC over the self-supervised models.
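To make the multi-task idea concrete, here is a hedged sketch of one common way it is implemented: a shared text encoder feeding six binary prediction heads, one per outcome, trained under a joint loss. The architecture details (hidden size, head design, equal loss weighting) are illustrative assumptions, not the authors' code.

```python
# Sketch of a multi-task risk model: shared encoder, six outcome heads.
# Details are assumptions for illustration, not the study's implementation.
import torch
import torch.nn as nn

OUTCOMES = ["mortality_30d", "aki", "pe", "pneumonia", "dvt", "delirium"]

class MultiTaskRiskModel(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder  # stand-in for a pretrained clinical LLM backbone
        # One logistic head per postoperative outcome.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden_size, 1) for name in OUTCOMES}
        )

    def forward(self, note_embedding: torch.Tensor) -> dict:
        h = self.encoder(note_embedding)
        return {name: head(h).squeeze(-1) for name, head in self.heads.items()}

def multi_task_loss(logits: dict, labels: dict) -> torch.Tensor:
    # Sum of per-task binary cross-entropies, so each note contributes
    # gradient signal to all six risks at once.
    bce = nn.BCEWithLogitsLoss()
    return sum(bce(logits[name], labels[name].float()) for name in OUTCOMES)

# Toy usage with a stand-in encoder and fake note embeddings.
encoder = nn.Sequential(nn.Linear(768, 768), nn.ReLU())
model = MultiTaskRiskModel(encoder)
x = torch.randn(4, 768)  # four fake preoperative note embeddings
labels = {name: torch.randint(0, 2, (4,)) for name in OUTCOMES}
loss = multi_task_loss(model(x), labels)
loss.backward()
```

The appeal of this setup is that signal shared across outcomes (for example, features predictive of both PE and DVT) is learned once in the encoder rather than six separate times.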
The study found that, for every 100 patients who experienced a postoperative complication, the foundation models correctly identified up to 39 additional high-risk patients compared to traditional word embeddings. The models also demonstrated strong generalizability when tested on external datasets, suggesting they could be applied beyond perioperative care. Notably, incorporating structured clinical data, such as demographics and lab results, alongside text-based notes further enhanced predictions, particularly for rare conditions like PE and DVT.
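One way to read the "39 additional patients per 100" figure is as a difference in sensitivity, i.e., recall among patients who truly had a complication. The baseline and improved recall values below are illustrative, not from the paper; only their difference matches the reported figure.

```python
# Interpreting "39 additional high-risk patients per 100 with a complication"
# as a sensitivity (recall) gap. The recall values are illustrative.
n_with_complication = 100
recall_word2vec = 0.40    # assumed baseline recall, not from the paper
recall_foundation = 0.79  # assumed foundation-model recall, not from the paper

extra_caught = round((recall_foundation - recall_word2vec) * n_with_complication)
print(extra_caught)  # -> 39 additional high-risk patients flagged
```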
Comparisons with the NSQIP Surgical Risk Calculator, a widely used clinical tool, revealed that the foundation models achieved higher accuracy and precision, though NSQIP retained better sensitivity. The study also employed SHapley Additive exPlanations (SHAP) to interpret model predictions, helping to address transparency concerns about AI-driven decision-making in clinical practice.
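SHAP attributes each individual prediction to the model's input features, so a clinician can see which factors pushed a patient's estimated risk up or down. The sketch below applies the shap library to a stand-in tabular classifier on synthetic data; the study applied SHAP to its own models, so this is an illustration of the technique, not a reproduction of the paper's analysis.

```python
# Hedged sketch of SHAP-based interpretation on a stand-in risk model.
# Features and labels are synthetic; only the SHAP workflow is the point.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                  # fake preoperative features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # fake complication labels

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer computes per-feature attributions for each prediction.
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X[:5])
# Depending on the shap version this is a list (one array per class) or a
# single array; either way, each entry attributes risk to input features.
print(np.asarray(shap_values).shape)
```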