Shadow AI: Who's Watching Your Data Shared with Public AI Tools
Aloke Guha
Serial Entrepreneur | Innovator | Data-Driven Distributed Systems | Data Science
A Concerning Trend
A recent survey[1] on a trend called ‘shadow AI’ caught my attention. If you have not been watching this trend in the security world, shadow AI, as the name indicates, is the growing use of generative AI tools by enterprise employees who bypass traditional IT purchasing decisions. It parallels the rise of shadow IT in the early days of cloud adoption. The bigger concern now is not cost, but that users of public AI tools are unwittingly disclosing sensitive data.
In a previous post[2], I mentioned how inadvertent data sharing via application-to-application integration is causing large-scale security breaches. We are now facing a similar issue, but one caused by humans rather than by application integrations.
LLMs are ushering in the age of Data Agents
Within the last two years, since the launch of ChatGPT in November 2022, the average Internet user has become very familiar with, and an occasional if not frequent user of, AI tools such as OpenAI's ChatGPT, Google Gemini (previously Bard) and Microsoft Copilot. What we have not realized is the extent of data sharing created by shadow AI that escapes corporate approval.
Sensitive Data Shared with Public AI Tools
This is not new. A well-known case was the set of incidents[3] from 2023 when, on three separate occasions, Samsung employees unintentionally leaked trade secret information to ChatGPT. However, the recent survey by Cyberhaven highlights why this is a more concerning issue than shadow IT: 23.6% of tech workers have uploaded corporate data into public AI platforms, even as usage of these tools by business users has grown exponentially. The corporate-sensitive data being sent to these tools includes intellectual property (IP), trade secrets (as with Samsung), source code and R&D data, as well as legal and financial documents that would clearly violate regulatory policies.
Why is this Happening?
It's not that difficult to understand why the use of public AI tools is expanding:
Ease of Use and Convenience: all public AI tools have simple, easy-to-use interfaces and APIs, so any user, whether a novice or an expert in AI technologies, can use them with ease (a minimal example follows this list). ChatGPT usage is a good indicator: it reached 1 million users five days after launch (November 2022), grew to 152 million visitors in its first month, and today averages 600 million per month.
Cost of Building and Running LLMs: both training and running large LLMs are resource-intensive and expensive, so most organizations, other than the large cloud companies (typically those valued at over a billion dollars), can ill afford to operate at the same levels as the foundational LLMs[4].
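To illustrate the ease-of-use point, here is a minimal sketch, assuming the OpenAI Python SDK (v1.x); the model name and the `internal_design_doc` variable are illustrative placeholders, not anything from the original survey. Pasting an internal document into a prompt like this is all it takes for corporate data to leave the building.

```python
from openai import OpenAI

# Minimal sketch using the OpenAI Python SDK (v1.x); the model name and
# `internal_design_doc` are illustrative placeholders, not real data.
client = OpenAI()  # reads OPENAI_API_KEY from the environment

internal_design_doc = "..."  # imagine a proprietary design document pasted here

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": f"Summarize this design doc:\n{internal_design_doc}"},
    ],
)
print(response.choices[0].message.content)
```

A handful of lines and no security review: that convenience is exactly what drives shadow AI.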
It’s not just the familiar generative AI based public tools: data lake and warehouse companies, such as Snowflake, now also provide cloud platforms into which users can upload their data to run analytics jobs. Unfortunately, these too are subject to data breaches[5].
There are non-cloud options but . . .
One option to reduce dependence on public AI tools, specifically LLMs, is to run your own model, especially a small language model (SLM). SLMs are trained on a much smaller data corpus and have garnered recent interest; they can be custom built for specific tasks or domains and, depending on the context, can be quite effective. Open-source models such as Llama and Mistral[6], as well as Microsoft's recently released Phi-3[7], provide many options to run your own model without sharing data with public AI tools.
Like many other AI DIYers, I have experimented with SLMs for precisely this reason: avoiding data exposure to external entities. Getting robust models and results, as one might expect, requires sufficient, appropriate data and time for training and fine-tuning. Given that, we should expect that most organizations will not have the resources, in skills or in budget, to build effective SLMs.
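For comparison, here is a minimal sketch of running an open SLM locally, assuming the Hugging Face transformers library; the Phi-3-mini model identifier, the pipeline settings and the prompt are illustrative, and hardware requirements will vary. The point is that the prompt, and any data in it, never leaves your own machine.

```python
from transformers import pipeline

# Sketch of a local SLM run with Hugging Face transformers; model id and prompt
# are illustrative. The model is downloaded once and then runs entirely locally.
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",          # use a GPU if one is available
    trust_remote_code=True,     # Phi-3 ships custom model code on the Hub
)

prompt = "Summarize the key risks in the following internal report:\n..."
result = generator(prompt, max_new_tokens=200, do_sample=False)
print(result[0]["generated_text"])
```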
Another growing trend is the emergence of private AI cloud platforms from a number of infrastructure companies[8],[9]. These give organizations a privacy shield for their data, but at this point the proverbial cat is out of the bag: the readily available, easy-to-use public AI platforms are simply too tempting for many AI practitioners.
How to Mind the Store: Distributed Edge Intelligence
What we need are mechanisms to ensure that data shared by well-meaning AI tool users does not violate corporate security and data privacy policies.
Solutions have been proposed[10], such as identifying unauthorized AI tools that pose a risk of unintended data sharing, or manually checking for shadow AI instances. While these are useful short-term measures, they require continuous vetting of AI tools that change over time, as well as manual inspection (a rough sketch of the first measure follows).
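As a rough illustration of that first measure, here is a hypothetical sketch that scans outbound proxy logs for destinations on a denylist of public AI endpoints; the domain list and the log format are assumptions, and in practice both would need continuous curation.

```python
import csv

# Hypothetical denylist of public AI endpoints; a real list needs continuous curation.
PUBLIC_AI_DOMAINS = {
    "api.openai.com",
    "chat.openai.com",
    "gemini.google.com",
    "copilot.microsoft.com",
}

def flag_shadow_ai(proxy_log_csv: str) -> list[dict]:
    """Return proxy-log rows whose destination matches a known public AI endpoint.

    Assumes a CSV export with 'user', 'dest_host' and 'timestamp' columns;
    adjust the field names to your proxy's format.
    """
    hits = []
    with open(proxy_log_csv, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("dest_host", "").lower() in PUBLIC_AI_DOMAINS:
                hits.append(row)
    return hits

if __name__ == "__main__":
    for hit in flag_shadow_ai("outbound_proxy.csv"):
        print(f"{hit['timestamp']}  {hit['user']} -> {hit['dest_host']}")
```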
A more scalable, automated approach is to use distributed edge analytics, where a non-intrusive intelligent edge service enforces corporate security policies without exposing proprietary or private data to public clouds. The policies have to be defined centrally at the enterprise level, since the definition of what is sensitive or proprietary will be specific to each organization.
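Below is a minimal sketch of what such an edge policy check could look like, assuming centrally defined regex policies; the pattern names, the "Project Aurora" codename and the gateway function are all hypothetical, and a production service would use far richer classifiers than regular expressions.

```python
import re

# Hypothetical, centrally managed policy: each organization defines its own
# patterns for what counts as sensitive (keys, identifiers, codenames, ...).
SENSITIVE_PATTERNS = {
    "api_key": re.compile(r"\b(sk|AKIA)[A-Za-z0-9_-]{16,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "project_codename": re.compile(r"\bProject\s+Aurora\b", re.IGNORECASE),
}

def check_outbound_prompt(prompt: str) -> list[str]:
    """Return the names of policies a prompt destined for a public AI tool violates."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(prompt)]

def edge_gateway(prompt: str, forward):
    """Block or forward a prompt; `forward` is the call to the public AI API."""
    violations = check_outbound_prompt(prompt)
    if violations:
        # In a real deployment this would log centrally and notify the user.
        raise PermissionError(f"Blocked by data-sharing policy: {violations}")
    return forward(prompt)
```

Because the check runs at the edge, the prompt itself never has to be disclosed to a third party for the policy to be enforced.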