Breaking the System: Tricking Generative AI/LLMs

Abstract


This research examines various ways to “trick” generative artificial intelligence (AI) and large language models (LLMs) into revealing that they are AI/LLMs rather than humans responding. The goal is to develop methods for detecting AI-generated content in human-requested tasks and interactions with webpage elements through techniques known as “red teaming” and “adversarial prompting.” Red teaming is defined as “a way of interactively testing AI models to protect against...toxic, biased, or factually inaccurate generation,” and adversarial prompting is defined as inputting prompts that “exploit weaknesses in LLMs, leading them to produce harmful, misleading, or unintended outputs” (Martineau, 2025; Kumar, 2024). We tested multiple LLMs, including ChatGPT, Grok, LLaMA, Claude, and Microsoft Copilot, by presenting identical prompts to each and analyzing similarities and differences in their response patterns. To do so, we first familiarized ourselves with the various platforms to determine which methods were likely to succeed. We piloted prompts in various languages such as Pig Latin, Telugu, and German. We then experimented with alternative character sets and encodings such as Wingdings, Morse code, and ASCII codes. Emojis placed within a string of randomized characters were used to test how quickly each LLM could identify the emoji specified in the prompt. Minor adjustments made to an image, such as cropping or inverting it, resulted in incorrect categorization of the image by ChatGPT, LLaMA, and Microsoft Copilot. We also found that, when given indefinite text prompts, each model produced a different number of string repetitions. These findings suggest that red teaming and adversarial prompting can elicit differences in AI/LLM responses that may enable successful detection. However, because this technology advances rapidly, these approaches will need to be continuously refined and tested. Through further research, we plan to explore additional specific, targetable prompting methods that could elicit identifiable LLM response patterns.
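
For illustration, the minimal sketch below shows how one of the probes described above, an emoji hidden inside a string of randomized characters, could be constructed as an identical prompt for each model. The function name, noise length, and emoji choice are our own illustrative assumptions, not the study's actual test harness.

import random
import string

def build_emoji_probe(target_emoji="🙂", noise_length=500, seed=42):
    """Hide a single emoji inside a run of random characters and wrap it in a prompt."""
    rng = random.Random(seed)
    noise = [rng.choice(string.ascii_letters + string.digits) for _ in range(noise_length)]
    # Insert the target emoji at a random position within the noise.
    noise.insert(rng.randrange(noise_length), target_emoji)
    haystack = "".join(noise)
    return ("Find the single emoji hidden in the text below and report it and its position:\n"
            + haystack)

if __name__ == "__main__":
    # The same prompt string would be given to each model (ChatGPT, Grok, LLaMA, Claude,
    # Microsoft Copilot) so that response content and latency can be compared across platforms.
    print(build_emoji_probe())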

Academic department under which the project should be listed

CCSE - Data Science and Analytics

Primary Investigator (PI) Name

Dr. Kevin Gittner

