Semester of Graduation

Spring 2026

Degree Type

Dissertation/Thesis

Degree Name

Master of Science in Information Technology

Department

Department of Information Technology

Committee Chair/First Advisor

Honghui Xu

Abstract

Multimodal large language models (MLLMs) increasingly process screenshots, scanned documents, diagrams, and other visually grounded inputs. This capability creates a safety risk because, in many multimodal jailbreaks, neither the prompt nor the image is harmful on its own. Harm emerges only after the model binds a benign-looking operation, such as summarizing or translating, to a localized visual target. This thesis studies that reference-dependent failure mode and argues that the security-relevant unit is the grounded operation–target pair rather than the whole prompt–image pair. To address the problem, it proposes COMIC, a reference-aware pre-generation safety gate for MLLMs. COMIC infers the requested operation and reference type, constructs candidate targets from OCR and open-vocabulary visual proposals, grounds plausible referents, and evaluates safety before generation. Its routing rule combines conservative maximum-risk aggregation with evidence-quality checks, so ambiguous or weakly grounded requests are not treated as automatically safe. Evaluation across representative open-source MLLMs, localized jailbreak benchmarks, broader multimodal attacks, and benign reference-sensitive settings shows that COMIC substantially reduces attack success while preserving practical benign utility and runtime efficiency. These findings indicate that multimodal safety mechanisms should intervene at the point where user intent becomes grounded action, before unsafe generation can occur.
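The routing rule described in the abstract can be illustrated with a minimal sketch. This is not COMIC's implementation: the `Candidate` structure, the `route` function, the three outcome labels, and both threshold values are hypothetical, chosen only to show how maximum-risk aggregation over candidate targets might be combined with an evidence-quality check so that weak grounding is never read as safety.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One plausible referent for the request (hypothetical structure)."""
    label: str
    risk: float       # assumed risk of applying the operation to this target, in [0, 1]
    grounding: float  # assumed confidence that the request refers to this target, in [0, 1]


def route(candidates: List[Candidate],
          risk_threshold: float = 0.5,   # assumed value, not from the thesis
          min_grounding: float = 0.6     # assumed value, not from the thesis
          ) -> str:
    """Return 'block', 'escalate', or 'allow' before any generation happens."""
    # No candidate targets at all: there is no evidence, so do not default to safe.
    if not candidates:
        return "escalate"

    # Conservative aggregation: the riskiest plausible target decides.
    max_risk = max(c.risk for c in candidates)
    if max_risk >= risk_threshold:
        return "block"

    # Evidence-quality check: low risk only counts if some target is well grounded.
    best_grounding = max(c.grounding for c in candidates)
    if best_grounding < min_grounding:
        return "escalate"

    return "allow"


if __name__ == "__main__":
    request = [
        Candidate("figure caption", risk=0.1, grounding=0.9),
        Candidate("small text in corner", risk=0.8, grounding=0.3),
    ]
    print(route(request))
```

In this sketch a single high-risk candidate blocks the request even when a better-grounded benign candidate exists, which is what makes the aggregation conservative; a request whose candidates are all low-risk but poorly grounded is escalated rather than allowed.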

Available for download on Friday, October 29, 2027
