The Hidden Truth About AI Data: Who Really Holds the Power?
ChandraKumar R Pillai
Board Member | AI & Tech Speaker | Author | Entrepreneur | Enterprise Architect | Top AI Voice
Whose Data Is It Anyway? The Cultural Blind Spots of AI
Artificial Intelligence (AI) is often perceived as a black box, and the origins of its training data are largely left unexamined. Data is the lifeblood of AI: it determines what models learn and how they perform. A closer look, however, reveals concerning trends in where this data comes from and how it is used.
The Roots of AI Data
The way AI training data is gathered has shifted dramatically over the last decade. Initially, data sets were curated with precision, sourced from encyclopedias, parliamentary transcripts, and other specific domains. With the introduction of transformer models in 2017, however, bigger became better. Today, indiscriminate scraping of the web dominates, prioritizing quantity over quality.
Key questions to ponder:
The Data Concentration Problem
The findings of the Data Provenance Initiative, a collaboration of over 50 researchers, reveal a worrisome trend. The dominance of tech giants like Google, which owns YouTube, is reshaping AI. Over 70% of video and audio data used for multimodal models comes from YouTube alone, concentrating power and resources in one entity.
Critical Thought:
Moreover, exclusive data-sharing deals between AI companies and platforms further partition the internet, creating “zones of access” that favor corporations with vast resources. This asymmetry limits the ability of smaller players, nonprofits, and academic researchers to contribute to the AI ecosystem.
Cultural Blind Spots in AI Data
Another significant issue is the cultural imbalance in data. Over 90% of AI data originates from Europe and North America, with fewer than 4% coming from Africa. This Western skew extends to language dominance, as the internet remains heavily English-centric.
Consequences of this disparity include:
Discussion Point:
Hidden Constraints and Ethical Concerns
AI companies often obscure the specifics of their training data, citing competitive advantage or their own lack of clarity about data provenance. Many data sets come with restrictive licenses, which complicates their use and often leads to unintentional breaches, such as training on copyrighted material.
Consideration:
The implications are significant. Exclusive contracts and unclear licensing make it harder for developers to choose ethical data sources. This fosters a system where only the largest corporations can afford to play by the rules—an advantage that stifles innovation from smaller, resource-strapped entities.
Synthetic Data and Its Role
As the hunger for data grows, synthetic data—data generated artificially—has emerged as a key player. While it can fill gaps, synthetic data raises its own set of questions:
The Future of AI Data Practices
The issues with AI data practices aren’t just technical—they’re deeply societal. The concentration of power, cultural bias, and opaque practices all demand attention. AI is shaping the infrastructures of our world, and the way data is sourced and used will determine whether it serves humanity broadly or just a select few.
Let's Discuss
As we consider the challenges surrounding AI data, here are some questions to engage with:
Let’s continue this conversation to ensure AI is inclusive, ethical, and representative of our diverse world.
Join me and my incredible LinkedIn friends as we embark on a journey of innovation, AI, and EA, always keeping climate action at the forefront of our minds. Follow me for more exciting updates: https://lnkd.in/epE3SCni
#AIData #TechEthics #FutureOfAI #DiversityInAI #AIInnovation #InclusiveTech #DigitalEthics #DataTransparency
Reference: MIT Technology Review
OK Boštjan Dolinšek
Info Systems Coordinator, Technologist and Futurist, Thinkers360 Thought Leader and CSI Group Founder. Manage The Intelligence Community and The Dept of Homeland Security LinkedIn Groups. Advisor
3 months ago: Thanks for the information, as always, ChandraKumar R Pillai.
Visionary Thought Leader | Top Voice 2024 Overall | Awarded Top Global Leader 2024 | CEO | Board Member | Executive Coach | Keynote Speaker | 21 X Top Leadership Voice LinkedIn | Relationship Builder | Integrity | Accountability
3 months ago: An insightful perspective, ChandraKumar. Your expertise as a leading voice in AI truly adds depth to this complex conversation about data and inclusivity.
Life Transformation Coach | Helping Working Professionals with Self-Love, Manifestation, and NLP Techniques | Self-Empowerment and Mindset Strategist | Career Growth, Emotional Wellness | Speaker
3 months ago: Thank you, ChandraKumar, for shedding light on such an important topic in AI. Your insights into the power dynamics of data are crucial for fostering a more equitable tech landscape. Your leadership in this conversation inspires us all to engage more thoughtfully in these discussions.
CEO @ Truthpass Digital Wallet | Business Innovation, Problem Solving
3 months ago: Data is the new asset of the future trust economies. Data integrity is critical, and verified, truthful data is essential. AI is mimicking human behaviours, good and bad. While metadata has long been collected by governments to feed algorithms and build user histories, AI now requires input from real humans. We all have a collective responsibility in building AI and maintaining balance as we input data into it. Garbage in will equal garbage out if we allow it. Trust is the new currency; don’t let someone else decide that for you. https://detective.nz/news/07-12-2024/building-trustworthy-systems/