Proposal for a Healthy AI Data Ecosystem

This was my project submission for the BlueDot AI Safety Fundamentals: Governance Course, Winter 2024. It was an excellent experience! Check out https://bluedot.org/.

Introduction

First AI supercharged the Fourth Industrial Revolution; then it galvanized governments into reactionism. They are sclerotically blocking access to data in an effort to protect their constituents from privacy violations, without addressing the stifling effects of doing so.

It’s true that AI developers, mainly US AI giants, have scoured the earth for online data in ways that leave creators and data subjects feeling used and uncompensated. But that is a zero-sum perspective: if one side wins, the other side necessarily loses; you took my data and gained, and I lost. Another perspective would be to acknowledge that the data has new value and all parties can benefit from a richer data-sharing, benefits-accruing ecosystem. Multinational treaties and entity-to-entity agreements, even personal information management, are mechanisms for healthy sharing of data in return for the scientific, economic, technosocial, and personal benefits of the AI revolution.

Background: Everything is data.

We have crossed over into a new world that will be AI driven and AI dominated because of the triad of new techniques, amped-up compute, and mega doses of data. Artificial intelligence solutions such as foundation models, which have been trained on massive quantities of data, are the technologies that will define and control the future. And it appears that nearly everything can be quantified, from our thoughts and sentiments, to democracy, to financial markets data, to ions at the farthest reaches of the earth. And whereas in the past, data needed to be structured for the machines to know what to do with it, we now have techniques such as deep learning with artificial neural networks that can absorb anything that can be coded. Text, images, the minutiae of digital behavior, anything that is tracked and measured, from the stock market to geological activity … if you can get an observation into code, then an AI system will be able to absorb it. Machine learning is getting closer to human learning with its multitudinous uncanny ways of absorbing linguistic, sensory, kinesthetic, psychic, social and environmental inputs.

Everyone has data.

These explosive capabilities are built on an imbalance that is provoking a global reaction. It is well known that the American AI giants developed their solutions by foraging for data from all corners of the Internet. That which was public, or quasi-public, or just publicly exposed has been used to develop large language models, image generators and foundation models. People’s literature, product reviews, and Reddit chatter have been “data-ified” into input for large language models. Their YouTube videos and social media pictures were fed into models and products for image, speech and video generation. There have long been vast stores of readily available data for which AI developers have unearthed a new purpose. What was once mere gold now has the value of diamonds, except holders of the gold are still only holding gold and wondering why someone else is enjoying diamonds.

Territoriality over data is unproductive.

And now the world is reacting to this imbalance. First, the commercial sector is locking down its content with paywalls and technical controls, buttressed by licensing terms and litigation. In the two years since ChatGPT exploded onto the scene, creators and publishers have decried these new uses and monetization of their content.

“For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.”

See The Data That Powers A.I. Is Disappearing Fast | NY Times

Second, governments are exerting themselves to block AI developers from accessing data of their constituents. These are non-U.S. governments which believe that AI data uses are invasive of the rights of their people and have risen to be a matter of national interest. So they are exercising legal mechanisms such as data protection and intellectual property law to inhibit those activities.

  • “Brazil has blocked Meta from using Brazilians' Instagram and Facebook posts to train its artificial intelligence (AI) models.

It comes weeks after the company abandoned similar plans to use UK and European users' posts for the same purpose.

… Meta has a significant market in Brazil. There are 102 million Facebook users and more than 113 million Instagram users in the country.

… Brazil's data protection regulator also found that personal data found in children and teenagers' posts could be collected and used to train Meta's AI systems, which could be in breach of the country's data protection law.” See Brazil suspends Meta from using Instagram posts to train AI | BBC

  • From data protection authorities of Australia, Canada, UK, Hong Kong, Switzerland, Norway, New Zealand, Colombia, Jersey, Morocco, Argentina, Mexico, Guernsey, Spain, Monaco & Israel:

“The co-signatories also want to emphasise their expectation that all companies, not just SMCs, protect the publicly accessible personal information that they host against unlawful scraping. Failure to implement adequate safeguards in compliance with applicable laws could result in regulatory intervention, including enforcement action.” See Privacy watchdog signs global joint statement calling on social media firms to guard against mass scraping of data | Hong Kong Free Press

Initially, these are bans on access to social media content, the most incendiary of data sources, but they may be extended to all manner of sensitive and public data. In addition to its role as regulator and enforcer, government is a vast source of data on affairs of public interest, including demographic, health, education, economic, environmental, judicial, and other information. See e.g. https://catalog.data.gov/dataset/. Hence, we need to pay attention to the consequences of blocking developers from accessing the vast insights that could inform the development of beneficial, aligned AI:

  • Representative performance. When training data is not representative, the resultant model or technology will not be representative in its performance. For example, without access to communications content of certain countries or cultures, LLMs will not be as capable or conversant in those languages, dialects or other modes of communication.
  • Innovation for the people. Even worse, new modalities or use cases may not be pursued, for lack of training data. For example, developers of personal finance AI agents may pivot away from Brazilian use cases given the challenges of accessing data of and about Brazilians. Or, without access to infrastructure and …
  • Assurance & fairness. Without broad and representative training data, it is much harder to perform fairness and inclusivity validation and testing. For example, without access to facial images across Southeast Asia, computer vision developers will not be able to validate if their tech works evenly across all types and shades of Southeast Asian faces.

And those are just the outcomes of participating in training data. There are far greater benefits to being part of the full AI ecosystem, with its economic and social drivers. The United Nations starkly states that countries and peoples which withdraw from this ecosystem due to a lack of trust will ultimately miss out on the AI transformation.

Left ungoverned, however, AI’s opportunities may not manifest or be distributed equitably. Widening digital divides could limit the benefits of AI to a handful of States, companies and individuals. Missed uses – failing to take advantage of and share AI-related benefits because of lack of trust or missing enablers such as capacity gaps and ineffective governance – could limit the opportunity envelope. See Governing AI for Humanity – Final Report | United Nations at p.7

AI goes beyond privacy.

This reactivity is not invalid. It has hit upon real issues: there’s value in data; there’s risk to individuals from having it used in unexpected ways; and data subjects are disassociated from the benefits of those data uses. Notably, data protection authorities are taking the lead, empowered by the strong arm of laws like the GDPR in Europe and its counterparts throughout the world. It is not surprising that authorities with the mandate to safeguard the privacy of individuals would react in this manner. But this mindset and these legal frameworks leave little room to factor in the public, economic, and long-term benefits of allowing data uses. Individuals do care about medical breakthroughs on the diseases that impact them, inventions that are tailored to their languages and interests, jobs in the AI economy, and generally participating in the AI boom.

Conceiving of a healthy AI-data ecosystem.

From the United States to Nigeria, every country is keenly aware that its ability to lead, or even participate in, the AI boom will determine its future prosperity. Just as there is an imbalance in how data and its benefits flow, there is an imbalance in access to AI tech and possibilities around the world. Leading AI development is dominated by the United States and China. Barriers to entry are significant, and the necessary assets are gated by a few countries: (i) compute and infrastructure, (ii) know-how and people, (iii) innovation-friendly economies and regulatory environments.

“AI is, therefore, a developmental equaliser at a scale similar to the internet. It will also be the great differentiator, and the nations that become the leaders in its application will rule the emerging world.

… Nigeria, Africa’s largest economy, stands at a massive opportunity in leveraging AI for high-impact transformation. On the one hand, its youthful population and growing tech scene present fertile ground for the development and adoption of AI. On the other hand, infrastructure limitations and a nascent AI ecosystem pose significant challenges.” See Nigeria’s National AI Strategy at pp.10, 14

Let’s put these problems together. The AI giants want more data to press forward in the innovation race. Publishers and platforms have seen the opportunity to charge for access to content developed by professionals and members of their platforms. Regulators, in an attempt to protect privacy interests, are locking down access to social media data. Governments are rethinking intellectual property frameworks to accommodate data scraping for AI uses, exploring compensation models and other means to foster innovation. This is not a healthy ecosystem, but it has the potential to be one. An ecosystem is an interconnected environment in which diverse organisms and inanimate components interact with and nourish each other. In the AI economy, we have these components. Technology developers need the nourishment of data. Individuals, society, organizations, sectors, governments – they all want to partake of the nourishment that comes from being able to use leading AI solutions and the vast economic benefits of a thriving, homegrown AI industry. Why not let data flow in return for:

  • Access to chips and related know-how
  • Access to infrastructure & other hardware – emerging AI countries could use cloud-based access to servers and data centers now, while developing their own infrastructure, to save valuable time and jump start R&D
  • Knowledge-sharing on a wide range of matters, from setting up infrastructure to how to best incentivize innovation, to access to researchers, teachers and thought leaders
  • R&D partnerships, employment and scholarship opportunities, educational support, and other people investments
  • Personal monetization of data

Streams of mutual benefit, by design.

There are many players in the burgeoning AI-driven world, and hence many “fish” here to nourish. If key relationships can establish healthy flows of nutrition (see the list above), benefits will flow to other members of the ecosystem. At the risk of being corny, let’s say the AI developers are the sharks, the nation-states are the whales, and we people are the minnows. What do they each want from each other, and how can they get it?

  1. Government-to-Developer Flow of Data & AI Benefits

  2. Individuals-to-Developer Flow of Data & AI Benefits

Conclusion. Please react! What do you think?

There is no denying that the future will develop upon an AI foundation that will either enable humans to participate in progress or restrict them from doing so. Thus, it behooves us to rethink outmoded concepts of data ownership and participation in the AI data ecosystem. While we don’t want developers to unscrupulously hoover up data from around the world, we also don’t want to inhibit the evolution of a healthy data-AI ecosystem where all can participate both as inputs and as outcomes. Let the fish feed each other.
