Shadow AI: Who's Watching Your Data Shared with Public AI Tools
Aloke Guha
Serial Entrepreneur | Innovator | Data-Driven Distributed Systems | Data Science
A Concerning Trend
A recent survey[1] on a trend called ‘shadow AI’ caught my attention. If you have not been watching this trend in the security world, shadow AI, as the name indicates, is the growing use of generative AI tools by enterprise employees who bypass traditional IT purchasing decisions. It parallels the rise of shadow IT in the early days of cloud adoption. The bigger concern now is not cost, but that users of public AI tools are unwittingly disclosing sensitive data.
In a previous post[2], I mentioned how inadvertent data sharing via application-to-application integration is causing large-scale security breaches. We are now facing a similar issue, but one caused by humans rather than by application integrations.
LLMs are ushering in the age of Data Agents
Within the last two years, since the launch of ChatGPT in November 2022, the average Internet user has become very familiar with, and an occasional if not frequent user of, AI tools such as OpenAI's ChatGPT, Google Gemini (previously Bard) and Microsoft Copilot. What we have not realized is the extent of data sharing created by shadow AI that escapes corporate approval.
Sensitive Data Shared with Public AI Tools
This is not new. A well-known case was the set of incidents[3] from 2023 when, on three separate occasions, Samsung employees unintentionally leaked trade secret information to ChatGPT. However, the recent survey by Cyberhaven highlights why this is a more concerning issue than shadow IT: 23.6% of tech workers have uploaded corporate data into public AI platforms, even as usage of these tools by business users has grown exponentially. The corporate-sensitive data being sent to these tools includes intellectual property (IP), trade secrets (as with Samsung), source code and R&D data, as well as legal and financial documents that would clearly violate regulatory policies.
Why is this Happening?
It's not that difficult to understand why the use of public AI tools is expanding:
Ease of Use and Convenience: all public AI tools have simple, easy-to-use interfaces and APIs, so any user, whether a novice or an expert in AI technologies, can use them with ease (a minimal example follows this list). ChatGPT usage is a good indicator: it reached 1 million users five days after launch (November 2022), grew to 152 million visitors in its first month, and today averages 600 million per month.
Cost of Building and Running LLMs: both training and running large LLMs are resource-intensive and expensive, so most organizations, other than the large cloud companies (typically those valued at over a billion dollars), can ill afford to operate at the same levels as the foundational LLMs[4].
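To illustrate the ease-of-use point, here is a minimal sketch, assuming the OpenAI Python SDK (v1.x); the model name and the `internal_design_doc` variable are illustrative placeholders, not anything from the original survey. Pasting an internal document into a prompt like this is all it takes for corporate data to leave the building.

```python
from openai import OpenAI

# Minimal sketch using the OpenAI Python SDK (v1.x); the model name and
# `internal_design_doc` are illustrative placeholders, not real data.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

internal_design_doc = "..."  # imagine a proprietary design document pasted here

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": f"Summarize this design doc:\n{internal_design_doc}"},
    ],
)
print(response.choices[0].message.content)
```

A handful of lines and no security review: that convenience is exactly what drives shadow AI.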
It’s not just the familiar generative AI based public tools: data lake and warehouse companies, such as Snowflake, now also provide cloud platforms into which users can upload their data to run analytics jobs. Unfortunately, these too are subject to data breaches[5].
There are non-cloud options but . . .
One option to reduce dependence on public AI tools, specifically LLMs, is to run your own model, especially a small language model (SLM). SLMs are trained on a much smaller data corpus and have garnered recent interest; they can be custom built for specific tasks or domains and, depending on the context, can be quite effective. Open-source models such as Llama and Mistral[6], as well as Microsoft's recently released Phi-3[7], provide many options to run your own model without sharing data with public AI tools.
Like many other AI DIYers, I have experimented with SLMs for precisely this reason: avoiding data exposure to external entities. Getting robust models and results, as one might expect, requires sufficient, appropriate data and time for training and fine-tuning. Given that, we should expect that most organizations will not have the resources, in skills or in budget, to build effective SLMs.
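For comparison, here is a minimal sketch of running an open SLM locally, assuming the Hugging Face transformers library; the Phi-3-mini model identifier, the pipeline settings and the prompt are illustrative, and hardware requirements will vary. The point is that the prompt, and any data in it, never leaves your own machine.

```python
from transformers import pipeline

# Sketch of a local SLM run with Hugging Face transformers; model id and prompt
# are illustrative. The model is downloaded once and then runs entirely locally.
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",          # use a GPU if one is available
    trust_remote_code=True,     # Phi-3 ships custom model code on the Hub
)

prompt = "Summarize the key risks in the following internal report:\n..."
result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])
```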
Another growing trend is the emergence of private AI cloud platforms from a number of infrastructure companies[8],[9]. These give organizations a privacy shield for their data, but at this point the proverbial cat is out of the bag: the readily available, easy-to-use public AI platforms are simply too tempting for many AI practitioners.
How to Mind the Store: Distributed Edge Intelligence
What we need are mechanisms to ensure that data shared by well-meaning AI tool users does not violate corporate security and data privacy policies.
Solutions have been proposed[10], such as identifying unauthorized AI tools that pose a risk of unintended data sharing, or manually checking for shadow AI instances. While these are useful short-term measures, they require continuous vetting of AI tools that change over time, as well as manual inspection (a rough sketch of the first measure follows).
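As a rough illustration of that first measure, here is a hypothetical sketch that scans outbound proxy logs for destinations on a denylist of public AI endpoints; the domain list and the log format are assumptions, and in practice both would need continuous curation.

```python
import csv

# Hypothetical denylist of public AI endpoints; a real list needs continuous curation.
PUBLIC_AI_DOMAINS = {
    "api.openai.com",
    "chat.openai.com",
    "gemini.google.com",
    "copilot.microsoft.com",
}

def flag_shadow_ai(proxy_log_csv: str) -> list[dict]:
    """Return proxy-log rows whose destination matches a known public AI endpoint.

    Assumes a CSV export with 'user', 'dest_host' and 'timestamp' columns;
    adjust the field names to your proxy's format.
    """
    hits = []
    with open(proxy_log_csv, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("dest_host", "").lower() in PUBLIC_AI_DOMAINS:
                hits.append(row)
    return hits

if __name__ == "__main__":
    for hit in flag_shadow_ai("outbound_proxy.csv"):
        print(f"{hit['timestamp']}  {hit['user']} -> {hit['dest_host']}")
```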
A more scalable, automated approach is to use distributed edge analytics, where a non-intrusive intelligent edge service enforces corporate security policies without exposing proprietary or private data to public clouds. The policies have to be defined centrally at the enterprise level, since the definition of what is sensitive or proprietary will be specific to each organization.
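Below is a minimal sketch of what such an edge policy check could look like, assuming centrally defined regex policies; the pattern names, the "Project Aurora" codename and the gateway function are all hypothetical, and a production service would use far richer classifiers than regular expressions.

```python
import re

# Hypothetical, centrally managed policy: each organization defines its own
# patterns for what counts as sensitive (keys, identifiers, codenames, ...).
SENSITIVE_PATTERNS = {
    "api_key": re.compile(r"\b(sk|AKIA)[A-Za-z0-9_-]{16,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "project_codename": re.compile(r"\bProject\s+Aurora\b", re.IGNORECASE),
}

def check_outbound_prompt(prompt: str) -> list[str]:
    """Return the names of policies a prompt destined for a public AI tool violates."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(prompt)]

def edge_gateway(prompt: str, forward):
    """Block or forward a prompt; `forward` is the call to the public AI API."""
    violations = check_outbound_prompt(prompt)
    if violations:
        # In a real deployment this would log centrally and notify the user.
        raise PermissionError(f"Blocked by data-sharing policy: {violations}")
    return forward(prompt)
```

Because the check runs at the edge, the prompt itself never has to be disclosed to a third party for the policy to be enforced.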