The Cybersecurity of Gen-AI and LLMs: Current Issues and Concerns

OVERVIEW

The explosive growth and prevalence of Generative Artificial Intelligence (Gen-AI) and Large Language Model (LLM) systems[1] have been accompanied by various security and privacy concerns. Nonetheless, since 2023, organisations have been increasingly integrating Gen-AI platforms into their business processes. Smartphone manufacturers, as well as app developers, have also begun to incorporate AI-powered digital assistants into their products.[1] Given that every individual with a smartphone can now carry AI capabilities with them everywhere (and into everything they do), it has become imperative to understand the nature of AI-related threats and risks.

This issue of CyberSense looks at several risks of using generative AI systems, such as accidental data leaks and insecure AI-generated code; the misuse of AI for malicious purposes; and what tech firms have done to mitigate data privacy concerns.

Risks of Using AI Systems

Accidental Data Leaks

As AI continues to be integrated into the systems and devices that we use every day (such as personal computers and mobile devices), the risk of accidental data leakage will grow in tandem.

First, LLM applications possess the inherent potential to reveal sensitive information and other confidential details through their output, resulting in unauthorised access to proprietary data, leakage of intellectual property, and privacy violations. This usually happens when sensitive data is ‘overfitted’ or memorised during the LLM’s training process (which could occur simply by uploading documents containing sensitive data), and/or when data is not properly sanitised before being fed into Gen-AI systems. One example of such a leak occurred in April 2023, when employees at Samsung reportedly used ChatGPT to assist with coding tasks and inadvertently input confidential data, including proprietary source code and internal meeting notes, into the AI model. The firm subsequently restricted the use of Gen-AI platforms for work, concerned that sensitive data shared with LLMs could be provided to other users. Other companies have similarly curtailed the use of Gen-AI platforms for official work purposes.[2]

Second, even without issues in the training process, LLM applications are increasingly being integrated into devices that hold personal and sensitive data. For example, iPhones equipped with Apple Intelligence would be able to analyse text messages on the device and provide a more personalised experience to the user. Whilst this no doubt improves user experience, it would, theoretically, also increase risk, since private data is being worked on in real time. Further, requests that cannot be handled on the device may result in such data being sent to the cloud for processing, increasing the likelihood of accidental data exposure.

GPT-4o by OpenAI is not immune to similar privacy concerns. Although ChatGPT would only have access to photos, videos and audio recordings explicitly provided by the user, users might accidentally upload images (such as screenshots) containing sensitive data.[3] The chance of accidental data exposure is, hence, arguably greater compared to ChatGPT 3, which restricted users to text-based input only.

Risks in Gen-AI-Developed Code

Another risk posed by AI platforms comes from their expanding role in code development, which gives rise to potential supply chain risks. Academic research on code output from ChatGPT 3 and ChatGPT 4 found that, unless explicitly prompted, the code generated could be insecure and might contain multiple flaws that could be easily exploited. Given that LLM-assisted code development is becoming increasingly common (especially with the release of GPT-4o to the public), this is likely to become a serious concern. The risks associated with the use of LLM-assisted code need to be properly mitigated, and in the meantime, human supervision and code review are still very much required.[4]
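To illustrate the kind of flaw such research describes, the sketch below is a hypothetical Python example (not drawn from any actual LLM output). It contrasts an injection-prone database query, of the sort a model might produce when security is not explicitly requested, with the parameterised version that a human code review should insist on.

```python
import sqlite3

# Hypothetical illustration only: function names and schema are invented for this sketch.

def find_user_insecure(conn: sqlite3.Connection, username: str):
    # Insecure pattern sometimes seen when security is not explicitly requested:
    # the user-supplied value is concatenated straight into the SQL statement,
    # allowing injection such as username = "x' OR '1'='1".
    query = f"SELECT id, username FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()

def find_user_reviewed(conn: sqlite3.Connection, username: str):
    # What a human reviewer (or an explicit "write secure code" prompt) should
    # insist on: a parameterised query, so the input is treated purely as data.
    query = "SELECT id, username FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.execute("INSERT INTO users (username) VALUES ('alice'), ('bob')")

    # The injected input returns every row from the insecure version...
    print(find_user_insecure(conn, "x' OR '1'='1"))
    # ...but nothing from the reviewed, parameterised version.
    print(find_user_reviewed(conn, "x' OR '1'='1"))
```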

Artificial Intelligence Misuse

As the computational ability of these platforms continues to develop, so does the risk that they could be misused by malicious actors with nefarious intent. For example, research by the University of Illinois Urbana-Champaign (UIUC) found that OpenAI’s ChatGPT 4 was highly capable of exploiting vulnerabilities described in Common Vulnerabilities and Exposures (CVE) reports written by cybersecurity vendors or researchers, successfully exploiting as many as 87% of the vulnerabilities ingested in UIUC’s proof-of-concept.[5] However, it is also worth noting that when the CVE descriptions were withheld from the model, ChatGPT 4’s ability to exploit these vulnerabilities was greatly curtailed, dropping from 87% to a mere 7% of all the tested vulnerabilities. Based on this observation, one can surmise that AI systems are currently still not capable of fully automating the attack kill chain, especially when there are gaps in the information available to them.

Mitigating Data Privacy Concerns

To their credit, tech firms have rolled out several initiatives to prevent data leaks or exposure, a perennial concern with Gen-AI models. Previously, AI platforms were known to leverage user input to train and refine the models that power their LLMs. Many researchers have noted that users of these platforms generally do not have much control over how their data is stored, or whether it is used for training purposes.


OpenAI has rolled out a function that allows users to expressly disallow the use of their conversation data with ChatGPT for training purposes. The function also includes an option for users to delete all data, including prompts and queries, stored on the platform. Similarly, Gemini includes functions that allow users to opt out of having their conversations used to train the underlying LLM, or to delete their chats altogether.[6]

Other major platforms have also rolled out similar privacy functions within their AI-powered services, but depending on the platform, the opt-outs might not apply to data that has already been collected and used for training. Hence, the key point to note here is one that tech firms themselves have always reminded their users of: never share confidential or sensitive data with AI platforms.
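As a complement to that advice, some organisations add a simple pre-submission screen that flags or masks obviously sensitive patterns before a prompt ever reaches an external AI platform. The sketch below is a minimal, hypothetical Python illustration of such a check; the patterns and function names are assumptions for this example, and it is no substitute for a proper DLP solution.

```python
import re

# Hypothetical pre-submission check: flag and mask obviously sensitive
# patterns before a prompt is sent to an external Gen-AI platform.
# The patterns below are illustrative only and far from exhaustive.
SENSITIVE_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "api_key": re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{16,}\b"),
}

def redact(prompt: str):
    """Return the prompt with sensitive matches masked, plus the labels found."""
    findings = []
    for label, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(prompt):
            findings.append(label)
            prompt = pattern.sub(f"[REDACTED {label.upper()}]", prompt)
    return prompt, findings

if __name__ == "__main__":
    raw = "Summarise this: contact jane.doe@example.com, card 4111 1111 1111 1111"
    cleaned, found = redact(raw)
    print(cleaned)   # sensitive values masked before anything leaves the organisation
    print(found)     # ['email', 'credit_card'] - a cue to stop and review the prompt
```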

Conclusion

As Gen-AI advances and becomes more integrated into everyday workflows, it is important to understand the risks that come with it. Only by doing so can we fully leverage the efficiency and productivity improvements that these systems bring.

It is heartening to see that the major players in the industry have invested, and are continuing to invest, heavily in improving the security of these systems. Over the course of 2023 and 2024, tech firms have introduced measures to mitigate several risks associated with Gen-AI platforms. Nonetheless, as with all technology, users also play an important role in ensuring that their use of Gen-AI is safe and secure. Several best practices come to mind:

First, for organisations integrating Gen-AI and LLMs into their business processes, there is a need to increase employee awareness and training on the risks and dangers associated with Gen-AI. Second, existing IT and Data Loss Prevention (DLP) policies should be reviewed and updated to take related data leak or exposure risks into account. Third, and perhaps most importantly, decision makers and IT teams should continue to keep abreast of the latest developments in the Gen-AI scene, and understand their potential risks. Ultimately, AI is a powerful tool with immense potential. The challenge lies in ensuring that its benefits outweigh its risks, and that it is used safely and securely.

FOOTNOTES

[1] Gen-AI refers to a broader category encompassing AI systems that can create various types of content, including text, images, and videos. LLMs, on the other hand, are a specific type of Gen-AI focusing exclusively on language tasks, such as text generation and language translation.

REFERENCES:

[1] What are generative AI Large Language Models & Foundation Models

https://cset.georgetown.edu/article/what-are-generative-ai-large-language-models-and-foundation-models/

[2] Samsung Bans ChatGPT Among Employees After Sensitive Code Leak

https://www.forbes.com/sites/siladityaray/2023/05/02/samsung-bans-chatgpt-and-other-chatbots-for-employees-after-sensitive-code-leak/

[3] ChatGPT-4o could be a privacy nightmare

https://www.forbes.com/sites/kateoflahertyuk/2024/05/17/chatgpt-4o-is-wildly-capable-but-it-could-be-a-privacy-nightmare/

[4] Generative AI Security: Challenges And Countermeasures

https://arxiv.org/html/2402.12617v1

[5] LLM Agents can Autonomously Exploit One-Day Vulnerabilities

https://arxiv.org/abs/2404.08144

[6] OpenAI Data Usage For Consumer Services

https://help.openai.com/en/articles/7039943-data-usage-for-consumer-services-faq