DigitalCommons@Kennesaw State University - C-Day Computing Showcase: GRM-012 (TCC) Transformer Embedded Synthetic Source Code Multiclass Classification

 

Presenter Information

Rene Lisasi
Patrick Wu

Location

https://www.kennesaw.edu/ccse/events/computing-showcase/sp25-cday-program.php

Document Type

Event

Start Date

15-4-2025 4:00 PM

Description

Recent advances in large language models have significantly increased their ability to write code. While tools such as ChatGPT are useful and boost efficiency for many programmers, they pose a serious problem when used in academically dishonest ways. To address the problem of identifying code written by language models, we offer a novel, lightweight classification solution based on a transformer architecture. We compare three transformer models (GraphCodeBERT, PLBART, and CodeBERT) for tokenization and embedding, and then perform classification with a random forest classifier. Preliminary results indicate that the GraphCodeBERT-based model achieves 100% train and test accuracy in distinguishing human-written from AI-generated code, and that PLBART achieves a 100% train and 95% test F1-score when classifying code by source category (chatbot, model, IDE extension, or human).
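The two-stage pipeline described above (transformer embedding followed by a random forest classifier) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the transformer embedding step (GraphCodeBERT, PLBART, or CodeBERT in the abstract) is stood in for here by a toy bag-of-tokens featurizer so the example is self-contained, and the snippets and labels are hypothetical.

```python
# Sketch of the two-stage approach: (1) turn source code into feature
# vectors, (2) classify the vectors with a random forest.
# NOTE: in the actual work the features would be transformer embeddings
# (GraphCodeBERT / PLBART / CodeBERT); a CountVectorizer stands in here.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tiny corpus: label 0 = human-written, 1 = AI-generated.
snippets = [
    "def add(a, b): return a + b",
    "x=1\ny=2\nprint(x+y)",
    "def add_numbers(first_number: int, second_number: int) -> int:\n"
    "    return first_number + second_number",
    "def multiply_values(value_one: int, value_two: int) -> int:\n"
    "    return value_one * value_two",
]
labels = [0, 0, 1, 1]

# Stand-in embedding: token-count vectors over identifier/keyword tokens.
vectorizer = CountVectorizer(token_pattern=r"\w+")
X = vectorizer.fit_transform(snippets)

# Second stage: random forest classifier over the feature vectors.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# Classify an unseen snippet.
query = "def sum_values(first: int, second: int) -> int: return first + second"
pred = clf.predict(vectorizer.transform([query]))
```

For the multiclass setting in the abstract, `labels` would instead range over the source categories (chatbot, model, IDE extension, human); the random forest handles multiclass targets without modification.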
