How Private and Enterprise data can help accelerate Generative-AI

By Phane Mane

As we count down to the second anniversary of OpenAI’s launch of ChatGPT, which made words like “LLMs,” “prompt,” “models,” and “Generative AI” household terms, there has been an explosion of AI/ML models covering a seemingly unlimited range of tasks.

At the time of this writing, there are about 900,000 models on Hugging Face (the largest repository of ML models), including popular categories such as the following:

· Multimodal models (e.g., Image-Text-to-Text, Visual Question Answering, Document Question Answering, etc.)

· Computer Vision models (e.g., Image Classification/Segmentation, Text-to-Image/3D, Image-to-Text/Image/Video, etc.)

· NLP models (e.g., Text Classification/Generation, Table Question Answering, Translation, Summarization, etc.)

Models are usually trained on vast amounts of publicly available internet data and on datasets compiled from various sources. Per research published on arXiv (a curated research-sharing platform maintained and operated by Cornell University), “…if current LLM development trends continue, models will be trained on datasets roughly equal in size to the available stock of public human text data between 2026 and 2032, or slightly earlier if models are overtrained.”

In other words, future models may face a shrinking supply and diversity of fresh data for training and testing, inhibiting the progress of generative AI technology. While synthetic data, transfer learning from data-rich domains, and data-efficiency improvements might support further progress, it is not just the volume but the closeness to real-world data (i.e., quality) that is crucial for sustained progress.

If these findings make you anxious about the future of Generative AI, do not worry: the amount of “text” data that the frontier models (e.g., GPT-4o, Gemini, Claude, etc.) were trained on is only a subset of the total useful text that humans have created to date.

Per the research done by Educating Silicon and shared in the article “How much LLM training data is there, in the limit?”: “…Including non-English data might get you to the 100–200T range.”

And, per the same analysis, “…private data is much larger. Facebook posts alone likely come to 140T, Google has around 400T tokens in Gmail, and with all text everywhere, you could maybe reach 2,000 trillion tokens.”


These estimates do not include the enormous amount of data held within large enterprises, which major models have likely never seen during training because of security and access restrictions.

In other words, there is still a great deal of data that could be used to train and test Generative AI models, but that can only happen if we can access privately held (by users) and enterprise-managed data.

Possible Solutions

One of the most effective ways to overcome this data-access limitation is Federated Model Training/Learning: a collaborative machine-learning approach in which AI models are trained across multiple data sources and devices (data lakes and warehouses housed within an enterprise, smartphones, medical instruments, or even autonomous vehicles) without transferring the raw data to a central location, which enhances data privacy and security.

In this approach, instead of sending data to a central server, each participating device trains on its own data and sends only model updates (such as changes in weights) back to the server. For instance, your handheld device can improve predictive text by learning from your typing patterns, while an MRI machine can refine image-analysis models with local scans. In a hospital setting, models can learn from diagnostic imaging across subsidiaries that share a trusted infrastructure, without breaching patient confidentiality. These updates are then aggregated (typically averaged) on a central server to improve the shared model without exposing PII/PHI data, as sketched below.
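To make the aggregation step concrete, here is a minimal sketch of federated averaging (FedAvg) on a toy linear model. Everything in it (the three simulated clients, their data, the learning rate) is an illustrative placeholder rather than a production federated-learning framework; real deployments would add secure aggregation, differential privacy, and a proper deep-learning stack.

```python
# Minimal federated-averaging (FedAvg) sketch on a toy linear model.
import numpy as np

def local_update(global_weights, local_data, lr=0.1, epochs=1):
    """A client starts from the global weights, trains on its own data,
    and returns only the weight delta; the raw data never leaves the client."""
    w = global_weights.copy()
    X, y = local_data
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w - global_weights

def federated_round(global_weights, clients):
    """The server averages client updates, weighted by local dataset size."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    updates = [local_update(global_weights, c) for c in clients]
    avg_update = sum(s * u for s, u in zip(sizes, updates)) / sizes.sum()
    return global_weights + avg_update

# Three simulated "devices", each holding private (X, y) pairs that stay local.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

weights = np.zeros(3)
for _ in range(20):
    weights = federated_round(weights, clients)
print(weights)   # converges toward [1.0, -2.0, 0.5] without pooling any raw data
```

In a real system the “weights” would be the parameters of a neural network and the averaging would run on a coordination server, but the pattern is the same: raw data stays put, and only model updates travel.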

Such federated learning can enable models to train on diverse, context-rich data from real-world scenarios while safeguarding privacy. For instance, text-to-speech and speech-recognition models can be refined using data from millions of smartphones across different languages and dialects, without the audio files ever leaving users’ devices.

Data efficiency techniques allow models to learn more effectively from limited data. Few-shot learning, for example, allows a model to generalize from just a handful of examples, significantly reducing the data requirements for training. Self-supervised learning also plays a critical role by enabling models to learn representations from unlabeled data, which is more abundant than labeled datasets.
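As a small illustration of the few-shot idea in the generative setting, the sketch below assembles an in-context prompt from a handful of labeled examples. The support-ticket categories are invented, and call_llm is a hypothetical placeholder for whatever completion endpoint or local model your stack exposes; it is not a real library call.

```python
# Illustrative few-shot (in-context) prompt construction.
FEW_SHOT_EXAMPLES = [
    ("Invoice INV-2291 was charged twice; please refund the duplicate.", "billing"),
    ("The mobile app crashes whenever I open the settings screen.", "technical"),
    ("Can we upgrade our plan to include more user seats?", "sales"),
]

def build_few_shot_prompt(ticket_text: str) -> str:
    """Embed a handful of labeled examples directly in the prompt so the model
    can generalize to a new ticket without any gradient updates."""
    lines = ["Classify each support ticket as billing, technical, or sales.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Ticket: {text}\nCategory: {label}\n")
    lines.append(f"Ticket: {ticket_text}\nCategory:")
    return "\n".join(lines)

prompt = build_few_shot_prompt("I was billed in USD but my contract is in EUR.")
# category = call_llm(prompt).strip()   # hypothetical call to your model of choice
print(prompt)
```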

Transfer learning enables models to apply what they’ve learned from large, data-rich domains to other, more specialized areas with limited data. By fine-tuning pre-trained models on smaller datasets, AI can adapt to new tasks efficiently. For instance, models trained on vast general text datasets (e.g., Wikipedia, Common Crawl, etc.) can be fine-tuned on niche datasets housed within enterprise data centers.
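Here is a minimal sketch of that fine-tuning pattern, assuming the Hugging Face Transformers and Datasets libraries are available; the checkpoint, label scheme, and two-row dataset are purely illustrative stand-ins for an in-house corpus.

```python
# Sketch of transfer learning via fine-tuning a pre-trained checkpoint
# on a tiny stand-in for enterprise data that stays inside your environment.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

records = {"text": ["Reset my VPN token, it expired this morning",
                    "Requesting a quote for 500 additional licenses"],
           "label": [0, 1]}          # 0 = IT support, 1 = sales (illustrative)
dataset = Dataset.from_dict(records)

checkpoint = "distilbert-base-uncased"   # pre-trained on large public text corpora
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="finetune-out", num_train_epochs=1,
                         per_device_train_batch_size=2, logging_steps=1)
Trainer(model=model, args=args, train_dataset=tokenized).train()
```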

Before you get started…

Now that we have discussed a few different ways Generative AI can continue to thrive despite the shrinking availability of public data, here are three key things to remember when training AI/ML models on private and enterprise data.

Data Readiness and Structuring— an AI model’s effectiveness depends largely on the availability of well-prepared, well-structured data. Enterprises must invest in robust data-management systems that can handle both structured and unstructured data. Properly curated datasets enable AI models to generate more accurate and relevant outputs, which is essential for real-world applications.

Alignment with Business Value— align the data strategy with the business’s goals to maximize value. This means identifying specific AI use cases that can drive business outcomes and ensuring that the data needed for those use cases is prioritized. By focusing on data that directly supports business objectives, enterprises can accelerate the deployment of AI solutions that deliver tangible value.

Data Architecture and Infrastructure— ensure that your enterprise has the data architecture needed to support technologies such as federated integrations with systems like vector databases optimized for large-scale, high-dimensional data, and that your infrastructure can run data pipelines that efficiently preprocess and serve data to AI models for training, testing, inference, API access, and so on.
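As a toy illustration of the retrieval piece, the sketch below embeds a few “documents” and searches them by cosine similarity. The hashing-based embed function is deliberately non-semantic (so the snippet runs without downloading a model); in practice you would use a real embedding model and a dedicated vector database, but the index-then-search mechanics are the same.

```python
# Toy retrieval step against a handful of "enterprise" documents.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding: a unit-length vector derived from the string's hash.
    It is NOT semantic; swap in a real embedding model in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

documents = ["Q3 supplier contract terms",
             "MRI scanner maintenance schedule",
             "VPN onboarding guide for new hires"]
index = np.stack([embed(d) for d in documents])   # each row is one document vector

def search(query: str, k: int = 2):
    """Rank documents by cosine similarity to the query (all vectors are unit length)."""
    scores = index @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [(documents[i], round(float(scores[i]), 3)) for i in top]

print(search("How do new hires get VPN access?"))
```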

Conclusion

In conclusion, data is naturally distributed across organizations and devices. The norm is to bring data to a central location, closer to where models are trained, but this is not always practical or even feasible. So, instead of moving the data closer to the models, enable the models to access the data where it lives.

Even though synthetic data can help supplement a data shortfall, it does not always reflect real-world scenarios. Federated learning with data-rich industries, through well-architected data-sharing frameworks and abstracted private data, can help address the data deficit.

Private and enterprise data are fundamental to future model training. Still, proper governance around data readiness, quality, architecture, and security, together with the right talent, is needed to ensure such data is used appropriately to accelerate the growth of Generative AI.


I hope you find this blog insightful. Please like, share, and comment with your feedback, and let me know which other topics you would like me to discuss in future articles.



Mihir Busa, MBA

Innovative, Strategic and Tactical Digital Leader

6 months ago

Great article Phane! Federated Model is a good option from a guardrailing perspective as we know they can even hallucinate too :)

Deirdre Peters

Head of Digital Commerce @ Boston Scientific | MIT MBA, ex Converse/Nike

6 months ago

Great article. Very insightful and helpful as we think through the AI opportunities in front of us and the problems we are trying to solve!

Rohit Teja M.

Frontend Engineer. Much love for Web Accessibility. A huge tech enthusiast! Big foodie, enjoys cooking.

6 months ago

Great insights Phane! Thirst for data will never quench. I bet other consumer-oriented folks like Apple are already using privacy-focused protocols for their own training with consumer data. It’s inevitable and not too long before consumer and enterprise data must be employed for training.

Very informative Phane!
