How Private and Enterprise data can help accelerate Generative-AI

By Phane Mane

As we count down to the second anniversary of OpenAI’s launch of ChatGPT, which made words like “LLMs,” “prompt,” “models,” and “Generative AI” household terms, there has been an explosion of AI/ML models covering a seemingly unlimited range of tasks.

At the time of this writing, there are about 900,000 models on Hugging Face (the largest repository of ML models), including popular categories such as the following:

· Multimodal models (e.g., Image-Text-to-Text, Visual Question Answering, Document Question Answering, etc.)

· Computer Vision models (e.g., Image Classification/Segmentation, Text-to-Image/3D, Image-to-Text/Image/Video, etc.)

· NLP models (e.g., Text Classification/Generation, Table Question Answering, Translation, Summarization, etc.)

Models are usually trained on vast amounts of publicly available internet data and on datasets compiled from various sources. Per research published on arXiv (a curated research-sharing platform maintained and operated by Cornell University), “…if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained.”

In other words, future models may face a shrinking supply and diversity of fresh data for training and testing, inhibiting the progress of generative AI technology. While synthetic data, transfer learning from data-rich domains, and data-efficiency improvements might support further progress, it is not just the volume but the closeness to real-world data (i.e., quality) that is crucial for sustained progress.

If these findings make you anxious about the future of Generative AI, do not worry: the amount of “text” data that the frontier models (e.g., GPT-4o, Gemini, Claude, etc.) were trained on is only a subset of the total useful text that humans have created to date.

Per the research done by Educating Silicon and shared in the article “How much LLM training data is there, in the limit?”: “…Including non-English data might get you to the 100–200T range.”

And, per the same analysis, “…private data is much larger. Facebook posts alone likely come to 140T, Google has around 400T tokens in Gmail, and with all text everywhere, you could maybe reach 2,000 trillion tokens.”


These estimates do not include the enormous amount of data held within large enterprises, which major models have likely never seen during training because of security and access restrictions.

In other words, there is still a great deal of data that could be used to train and test Generative AI models, but that can only happen if we can access privately held (by users) and enterprise-managed data.

Possible Solutions

One of the most effective ways to overcome this data-access limitation is Federated Model Training/Learning: a collaborative machine-learning approach in which AI models are trained across multiple data sources and devices (data lakes and warehouses housed within an enterprise, smartphones, medical instruments, or even autonomous vehicles) without transferring the raw data to a central location, which enhances data privacy and security.

In this approach, instead of sending data to a central server, each participating device trains on its own data and sends only model updates (such as changes in weights) back to the server. For instance, your handheld device can improve predictive text by learning from your typing patterns, while an MRI machine can refine image-analysis models with local scans. In a hospital setting, models can learn from diagnostic imaging across subsidiaries that share a trusted infrastructure, without breaching patient confidentiality. These updates are then aggregated (typically averaged) on a central server to improve the shared model without exposing PII/PHI data, as sketched below.
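To make the aggregation step concrete, here is a minimal sketch of federated averaging (FedAvg) on a toy linear model. Everything in it (the three simulated clients, their data, the learning rate) is an illustrative placeholder rather than a production federated-learning framework; real deployments would add secure aggregation, differential privacy, and a proper deep-learning stack.

```python
# Minimal federated-averaging (FedAvg) sketch on a toy linear model.
import numpy as np

def local_update(global_weights, local_data, lr=0.1, epochs=1):
    """A client starts from the global weights, trains on its own data,
    and returns only the weight delta; the raw data never leaves the client."""
    w = global_weights.copy()
    X, y = local_data
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w - global_weights

def federated_round(global_weights, clients):
    """The server averages client updates, weighted by local dataset size."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    updates = [local_update(global_weights, c) for c in clients]
    avg_update = sum(s * u for s, u in zip(sizes, updates)) / sizes.sum()
    return global_weights + avg_update

# Three simulated "devices", each holding private (X, y) pairs that stay local.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

weights = np.zeros(3)
for _ in range(20):
    weights = federated_round(weights, clients)
print(weights)   # converges toward [1.0, -2.0, 0.5] without pooling any raw data
```

In a real system the “weights” would be the parameters of a neural network and the averaging would run on a coordination server, but the pattern is the same: raw data stays put, and only model updates travel.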

Such federated learning can enable models to train on diverse, context-rich data from real-world scenarios while safeguarding privacy. For instance, text-to-speech and speech-recognition models can be refined using data from millions of smartphones across different languages and dialects, without the audio files ever leaving users’ devices.

Data efficiency techniques allow models to learn more effectively from limited data. Few-shot learning, for example, allows a model to generalize from just a handful of examples, significantly reducing the data requirements for training. Self-supervised learning also plays a critical role by enabling models to learn representations from unlabeled data, which is more abundant than labeled datasets.
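As a small illustration of the few-shot idea in the generative setting, the sketch below assembles an in-context prompt from a handful of labeled examples. The support-ticket categories are invented, and call_llm is a hypothetical placeholder for whatever completion endpoint or local model your stack exposes; it is not a real library call.

```python
# Illustrative few-shot (in-context) prompt construction.
FEW_SHOT_EXAMPLES = [
    ("Invoice INV-2291 was charged twice; please refund the duplicate.", "billing"),
    ("The mobile app crashes whenever I open the settings screen.", "technical"),
    ("Can we upgrade our plan to include more user seats?", "sales"),
]

def build_few_shot_prompt(ticket_text: str) -> str:
    """Embed a handful of labeled examples directly in the prompt so the model
    can generalize to a new ticket without any gradient updates."""
    lines = ["Classify each support ticket as billing, technical, or sales.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Ticket: {text}\nCategory: {label}\n")
    lines.append(f"Ticket: {ticket_text}\nCategory:")
    return "\n".join(lines)

prompt = build_few_shot_prompt("I was billed in USD but my contract is in EUR.")
# category = call_llm(prompt).strip()   # hypothetical call to your model of choice
print(prompt)
```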

Transfer learning enables models to apply what they’ve learned from large, data-rich domains to other, more specialized areas with limited data. By fine-tuning pre-trained models on smaller datasets, AI can adapt to new tasks efficiently. For instance, models trained on vast general text datasets (e.g., Wikipedia, Common Crawl, etc.) can be fine-tuned on niche datasets housed within enterprise data centers.
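Here is a minimal sketch of that fine-tuning pattern, assuming the Hugging Face Transformers and Datasets libraries are available; the checkpoint, label scheme, and two-row dataset are purely illustrative stand-ins for an in-house corpus.

```python
# Sketch of transfer learning via fine-tuning a pre-trained checkpoint
# on a tiny stand-in for enterprise data that stays inside your environment.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

records = {"text": ["Reset my VPN token, it expired this morning",
                    "Requesting a quote for 500 additional licenses"],
           "label": [0, 1]}          # 0 = IT support, 1 = sales (illustrative)
dataset = Dataset.from_dict(records)

checkpoint = "distilbert-base-uncased"   # pre-trained on large public text corpora
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="finetune-out", num_train_epochs=1,
                         per_device_train_batch_size=2, logging_steps=1)
Trainer(model=model, args=args, train_dataset=tokenized).train()
```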

Before you get started…

Now that we have discussed a few different ways Generative AI can continue to thrive despite the shrinking availability of public data, here are three key things to remember when training AI/ML models on private and enterprise data.

Data Readiness and Structuring— an AI model’s effectiveness depends largely on the availability of well-prepared, well-structured data. Enterprises must invest in robust data-management systems that can handle both structured and unstructured data. Properly curated datasets enable AI models to generate more accurate and relevant outputs, which is essential for real-world applications.

Alignment with Business Value— align the data strategy with the business’s goals to maximize value. This means identifying specific AI use cases that can drive business outcomes and ensuring that the data needed for those use cases is prioritized. By focusing on data that directly supports business objectives, enterprises can accelerate the deployment of AI solutions that deliver tangible value.

Data Architecture and Infrastructure— ensure that your enterprise has the data architecture needed to support technologies such as federated integrations with systems like vector databases optimized for large-scale, high-dimensional data, and that your infrastructure can run data pipelines that efficiently preprocess and serve data to AI models for training, testing, inference, API access, and so on.
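As a toy illustration of the retrieval piece, the sketch below embeds a few “documents” and searches them by cosine similarity. The hashing-based embed function is deliberately non-semantic (so the snippet runs without downloading a model); in practice you would use a real embedding model and a dedicated vector database, but the index-then-search mechanics are the same.

```python
# Toy retrieval step against a handful of "enterprise" documents.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: a unit-length vector derived from the string's hash.
    It is NOT semantic; swap in a real embedding model in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

documents = ["Q3 supplier contract terms",
             "MRI scanner maintenance schedule",
             "VPN onboarding guide for new hires"]
index = np.stack([embed(d) for d in documents])   # each row is one document vector

def search(query: str, k: int = 2):
    """Rank documents by cosine similarity to the query (all vectors are unit length)."""
    scores = index @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [(documents[i], round(float(scores[i]), 3)) for i in top]

print(search("How do new hires get VPN access?"))
```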

Conclusion

In conclusion, data is naturally distributed across organizations and devices. The norm is to bring data to a central location, closer to where models are trained, but this is not always practical or even feasible. So, instead of moving the data closer to the models, enable the models to access the data where it lives.

Even though synthetic data can help supplement a data shortfall, it does not always reflect real-world scenarios. Federated learning with data-rich industries, through well-architected data-sharing frameworks and abstracted private data, can help address the data deficit.

Private and enterprise data are fundamental to future model training. Still, proper governance around data readiness, quality, architecture, and security, together with the right talent, is needed to ensure such data is used appropriately to accelerate the growth of Generative AI.


I hope you find this blog insightful. Please like, share, and comment with your feedback, and let me know which other topics you would like me to discuss in future articles.



Mihir Busa, MBA

Innovative, Strategic and Tactical Digital Leader

6 months ago

Great article Phane! Federated Model is a good option from a guardrailing perspective as we know they can even hallucinate too :)

Deirdre Peters

Head of Digital Commerce @ Boston Scientific | MIT MBA, ex Converse/Nike

6 months ago

Great article. Very insightful and helpful as we think through the AI opportunities in front of us and the problems we are trying to solve!

Rohit Teja M.

Frontend Engineer. Much love for Web Accessibility. A huge tech enthusiast! Big foodie, enjoys cooking.

6 months ago

Great insights Phane! Thirst for data will never quench. I bet other consumer-oriented folks like Apple are already using privacy-focused protocols for their own training with consumer data. It’s inevitable and not too long before consumer and enterprise data must be employed for training.

Very informative Phane!
