Examining the Capabilities of GhidraMCP and Claude LLM for Reverse Engineering Malware

Abstract

Can Large Language Models (LLMs) enhance the efficiency of reverse engineering by assisting malware analysts in the de-obfuscation of ransomware and other forms of malicious software? This research explores the integration of LLMs into reverse engineering workflows through the use of GhidraMCP, a plugin designed for the Ghidra open-source software reverse engineering suite. GhidraMCP leverages the capabilities of Claude’s Sonnet 4 model (as well as other LLMs) to rename decompiled variables and functions, generate descriptive annotations for disassembled code, and highlight potentially relevant strings or routines. These features are intended to reduce the cognitive load on analysts and accelerate the identification of critical components such as encryption routines, embedded URLs, command-and-control (C2) indicators, and external library calls within malware samples.

This study compares traditional reverse engineering workflows with LLM-augmented workflows using GhidraMCP. Multiple pseudo-ransomware samples were analyzed to assess differences in discovery efficiency, accuracy of function labeling, and qualitative analytical quality. Although no formal timing metrics were recorded, the research team observed that the LLM-augmented process consistently reached insights more quickly and with fewer manual steps. In several instances, Claude Sonnet 4 identified static relationships and artifacts that human analysts initially overlooked, demonstrating its potential to enhance traditional workflows through contextual inference and advanced pattern recognition.

The combination of GhidraMCP and Claude Sonnet 4 effectively leveraged static analysis to identify the hidden flags for ESCALATE challenges one through seven. However, while the research team was ultimately able to solve all challenges, several required dynamic analysis and binary patching—tasks that the current LLM-augmented setup could not perform due to the lack of patching capabilities within GhidraMCP. It remains unclear whether this limitation stems from Ghidra, the plugin, or the integration framework itself. During testing, Claude Sonnet 4 occasionally exhibited hallucinations, producing inaccurate or speculative annotations that required human correction and additional prompting, particularly during challenges three and four. These occurrences emphasize the ongoing need for human oversight and iterative validation when employing generative AI in critical cybersecurity tasks.

Despite these limitations, the findings indicate that LLM-augmented reverse engineering can meaningfully improve analytical comprehension, efficiency, and context awareness. Claude Sonnet 4’s linguistic reasoning and ability to infer code intent proved especially valuable for de-obfuscating complex binaries. Future work will focus on enabling dynamic capabilities within GhidraMCP to support patching and execution-based testing, as well as refining prompt strategies and hallucination detection. This research establishes a foundation for the continued development of intelligent, LLM-assisted tooling designed to augment human expertise in malware analysis and reverse engineering.
