MedUncertainVLM: Multi-Modal Uncertainty Quantification in Vision-Language Models for Clinical Documentation

Presentation Type

Article

Location

Kennesaw, Georgia

Start Date

4-1-2026 10:15 AM

End Date

4-1-2026 11:30 AM

Description

Vision-language models are increasingly deployed for clinical documentation tasks in radiology, most commonly to extract diagnostic labels from chest radiographs and their accompanying reports. To avoid improperly influencing patient care, these models must communicate calibrated uncertainty: overconfident predictions from poorly calibrated models are a documented patient safety risk. Yet existing medical vision-language models produce only point estimates without confidence bounds, and standard post-hoc calibration methods such as temperature scaling omit cross-modal signals entirely, so they cannot detect cases where the image and text branches make genuinely inconsistent predictions. This paper presents MedUncertainVLM, a framework that addresses this gap directly. The system applies deep ensembles to a fine-tuned BioViL-T model and adds a novel cross-modal disagreement component: when image-branch and text-branch predictions diverge beyond a learned threshold, the case is flagged as uncertain and routed to a human radiologist for review before any decision is finalized. We evaluate the framework on the MIMIC-CXR benchmark across all 14 diagnostic labels, using a held-out test set of 5,000 samples drawn from the official evaluation split. MedUncertainVLM reduces Expected Calibration Error by 52.8% relative to the uncalibrated baseline while maintaining a micro-averaged F1 of 84.1% across the 14 label categories. A selective prediction protocol that abstains on the 19.4% highest-uncertainty cases raises micro-F1 on the retained predictions to 90.3%. Cross-modal disagreement also predicts true classification error better than ensemble variance alone, achieving an AUROC of 0.81 versus 0.69 for variance-only uncertainty. Finally, we present a lightweight MC Dropout approximation that recovers 89.5% of the ensemble's calibration benefit at one-fifth the inference cost, a threshold sensitivity analysis demonstrating robustness across operating points, and a comparative evaluation against additional medical VLM architectures, including MedCLIP and PubMedCLIP.
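As a concrete illustration of the routing rule described above, here is a minimal Python sketch of cross-modal disagreement flagging over ensemble-averaged branch outputs. The function names, the use of Jensen-Shannon divergence as the disagreement measure, and the fixed threshold value are illustrative assumptions, not the paper's implementation (which learns its threshold).

    import numpy as np

    def js_divergence(p, q, eps=1e-12):
        # Jensen-Shannon divergence between two vectors of per-label
        # Bernoulli probabilities (an assumed disagreement measure).
        p = np.clip(p, eps, 1 - eps)
        q = np.clip(q, eps, 1 - eps)
        m = 0.5 * (p + q)

        def bernoulli_kl(a, b):
            return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

        return 0.5 * (bernoulli_kl(p, m) + bernoulli_kl(q, m))

    def route_case(image_probs, text_probs, threshold=0.15):
        # image_probs / text_probs: per-label sigmoid outputs, shape (14,),
        # each averaged over that branch's ensemble members.
        # The threshold is fixed here for clarity only.
        disagreement = js_divergence(image_probs, text_probs).mean()
        if disagreement > threshold:
            return "flag_for_radiologist_review", disagreement
        return "auto_accept", disagreement

Under this rule, flagged studies would enter the human-review queue rather than being written to the report automatically.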

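Similarly, the two evaluation quantities reported above, Expected Calibration Error (ECE) and selective prediction at a fixed abstention rate, can be sketched as follows. The 15-bin ECE variant and the helper names are assumptions made for illustration; only the 19.4% abstention rate is taken from the abstract.

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=15):
        # Standard binned ECE: |accuracy - mean confidence| per bin,
        # weighted by the fraction of samples falling in that bin.
        # correct: boolean array marking each prediction right or wrong.
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                ece += in_bin.mean() * abs(correct[in_bin].mean()
                                           - confidences[in_bin].mean())
        return ece

    def retained_mask(uncertainties, abstain_frac=0.194):
        # Abstain on the abstain_frac most-uncertain cases; metrics such
        # as micro-F1 are then computed on the retained subset only.
        cutoff = np.quantile(uncertainties, 1.0 - abstain_frac)
        return uncertainties <= cutoff

With abstain_frac=0.194 the mask retains the 80.6% lowest-uncertainty predictions, the subset on which the abstract reports micro-F1 rising from 84.1% to 90.3%.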