Leidos Is Securing the Agentic AI Future with LLM Trust and Safety
Three Points to Remember
- The Model Context Protocol (MCP) is a fast-rising open framework allowing universal plug and play between large language models (LLMs) and external information sources, tools and services.
- Bad actors can set up MCP-based cyberattacks by posting seemingly innocent content with malicious code and execution instructions for LLM agents to retrieve, infiltrating users' systems and networks.
- Leidos is applying "refusal training" to LLM agents that helps them distinguish between benign and attack data when searching the internet, accessing databases and interacting with tools.
Agentic AI has the potential to fundamentally change how organizations operate. The excitement of AI accomplishing work tasks with little to no supervision has many AI providers prioritizing ease of adoption and seamless operation, which may leave security gaps for bad actors to exploit.
Interoperability frameworks like Model Context Protocol (MCP) are fueling agentic AI’s growth. Introduced in late 2024, MCP is an open standard that allows anyone to create databases and applications for agentic large language models (LLMs) to discover and use.
While enabling AI models to pursue information sources and tools on their own to achieve outcomes is a powerful force multiplier, it also poses serious security risks. Leidos is proactively addressing these potential cybersecurity and governance challenges, with research and solutions that make the agentic AI frontier a more protected and trusted one.
“With AI agents becoming integral in daily operations and reshaping interactions across business processes, enterprises need to rethink and evolve their strategies to secure them,” says Bobby Scharmann, vice president of Leidos’ Cyber Accelerator. “We are providing insights and solutions on how accelerants such as MCP for AI agents and LLMs can be made more trustworthy.”
Attackers can infiltrate covertly through MCP
Leidos research has uncovered how MCP-enabled LLM agents can be vulnerable to a new set of cyberattacks. These MCP retrieval-based attacks take advantage of the LLMs’ web-scraping, information-retrieval nature to execute malicious code, gain system access and steal data.
For example, a bad actor can add malicious code and execution instructions onto a webpage with otherwise innocuous content about a topic — say, a viral recipe for vegan black bean chili. When an LLM agent retrieves that webpage, the exploit is brought into the user’s system.
What makes MCP-based cyberattacks insidious is that there’s no need for downloads.
A new defensive approach for safer agentic retrieval
Through a “refusal training” algorithm, Leidos is creating a method to protect LLM agents from being deceived by attackers. It can be compared to building an environment that teaches a child to identify and choose healthy snacks over junk food.
The algorithm separates benign and attack data to help LLMs better discern safe choices from harmful ones when seeking information sources. Dubbed RAG-Pref (for retrieval augmented generation for preference alignment), it has shown to significantly enhance the ability of several popular LLMs to recognize retrieval-based attacks, with an average 247% “strict refusal” improvement.
Leidos has also created an automated pipeline to collect MCP attack data as work on fine-tuning RAG-Pref continues.
Thinking smarter on trusted mission agentic AI
“Agentic AI changes the narrative because it’s not just text in and out; now you’re allowing the LLM to perform actions,” says John Halloran, a principal research scientist in the Cyber Accelerator leading Leidos’ MCP safety research. “Our work shows how attackers can hijack agentic AI systems by exploiting the retrieval of documents, so we need to add guardrails for AI models in this new frontier.”
Agentic AI governance is still in its early stages but imagine autonomous agents in the military that control weapons taking catastrophic actions before they can be stopped. Or picture a scenario where attackers could manipulate AI models into publishing people’s private financial and health data.
It’s critical, therefore, to teach models how to assess the trustworthiness of a source and when to refuse to take an action.
There could come a time when “we allow AI to replace a drone operator and make high-stakes decisions,” says Halloran. “The AI needs to have the ability to say ‘no’ to a request that’s in conflict with what it knows to be right.”
As agentic AI promises great gains in efficiency and productivity, Leidos is committed to helping government and commercial organizations navigate its complexities and harness the benefits of autonomous work agents at the point where the technology and cybersecurity meet.
“The AI paradigm of text in and actions out is not yet well understood,” says Halloran, “so now is the perfect time to get ahead of these things.”
For more on MCP safety and technical details on the RAG-Pref refusal-training algorithm, read John Halloran's research paper available at Cornell Univ.'s arXiv open-access archive.