Location
https://www.kennesaw.edu/ccse/events/computing-showcase/sp26-cday-program.php
Document Type
Event
Start Date
April 22, 2026, 4:00 PM
Description
AI-generated code is increasingly prevalent in software engineering practice, yet its reliability in preserving backward compatibility remains underexplored. This paper presents a unified study of (i) how often AI-generated code introduces breaking changes and (ii) whether large language models (LLMs) can detect such changes from commit-level diffs and explain them. We analyze 7,191 agent-generated and 1,402 human-authored pull requests from Python repositories using an AST-based approach to identify potential breaking changes. Our results show that AI agents introduce fewer breaking changes overall than humans (3.45% vs. 7.40%) in code generation tasks. However, agents show higher risk in maintenance tasks, where refactoring and chore changes introduce breaking changes at rates of 6.72% and 9.35%, respectively. To mitigate this risk and evaluate the effectiveness of LLM-based detection, we developed PyCoReX, an AI agent that detects breaking changes from code commits. Our agent achieves a baseline F1-score of 0.82. Our findings show that commit-level LLM-based detection can support earlier and more reliable identification of breaking changes, improving the safety of agent-assisted software development.
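As an illustrative sketch only (not the paper's implementation), an AST-based check for breaking changes in Python might compare public function signatures between the old and new versions of a module and flag removals or parameter changes; the function names and structure below are hypothetical:

```python
import ast

def public_signatures(source: str) -> dict:
    """Map each public function name in a module to its positional parameter names."""
    tree = ast.parse(source)
    sigs = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("_"):
            sigs[node.name] = [arg.arg for arg in node.args.args]
    return sigs

def breaking_changes(old_src: str, new_src: str) -> list:
    """Flag removed public functions and changed parameter lists between two versions."""
    old, new = public_signatures(old_src), public_signatures(new_src)
    issues = []
    for name, params in old.items():
        if name not in new:
            issues.append(f"removed: {name}")
        elif new[name] != params:
            issues.append(f"signature changed: {name}")
    return issues

# Example: renaming a parameter list breaks callers using positional/keyword arguments.
old = "def greet(name): return 'hi ' + name"
new = "def greet(first, last): return 'hi ' + first"
print(breaking_changes(old, new))  # ['signature changed: greet']
```

A real detector of the kind the abstract describes would also need to handle classes, methods, default values, decorators, and re-exports; this sketch only shows the core idea of diffing ASTs across versions.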
Included in
GRM-012-173 Can You Trust AI Code? Understanding and Detecting Breaking Changes using LLMs