Is Big Tech wrong to train AI models on 'messy' public data?

The rapid adoption of Artificial Intelligence (AI) across many parts of a business is complicating contractual agreements with customers and third parties, as well as mergers and acquisitions (M&A) transactions. Risks related to data ownership and licensing may not be addressed in existing contract reviews, legal reviews, and other due diligence activities, including:

  • Training Data: AI systems rely heavily on training data, which is often scraped or open-source material whose ownership and usage rights are unclear. Understanding and documenting training data flows is the first step.
  • Data Quality: AI data pollution occurs when the data used to train or operate AI models is flawed, incomplete, or biased, potentially leading to biased predictions, unreliable recommendations, and inaccurate insights.
  • Clarity of Ownership: Determining ownership of training data can be complex and uncertain. It might be subject to claims by third parties, infringement claims, privacy issues or other legal restrictions. This uncertainty could impact not only the use of training data, but also ownership of algorithms built using that data and any synthetic data created.
  • Use Limitations: If training data has use limitations, it can restrict how a company commercializes and licenses the data, develops technology, and applies algorithms.
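One practical way to act on the first step above — documenting training data flows — is to keep a structured provenance record per dataset. The sketch below is a minimal, hypothetical illustration in Python; the field names and the risk-flagging rule are assumptions for illustration, not any established standard.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetProvenance:
    """Hypothetical provenance record for one training dataset."""
    name: str
    source: str                    # where the data was obtained (URL, vendor, scrape)
    license: str                   # e.g. "CC-BY-4.0", "proprietary", "unknown"
    ownership_verified: bool       # has legal review confirmed ownership claims?
    use_restrictions: List[str] = field(default_factory=list)

    def has_open_risks(self) -> bool:
        # Flag datasets that need legal review before commercialization:
        # unverified ownership, unknown licensing, or any use restriction.
        return (not self.ownership_verified
                or self.license == "unknown"
                or bool(self.use_restrictions))
```

A record like `DatasetProvenance("web-crawl-2024", "public scrape", "unknown", False)` would be flagged by `has_open_risks()`, surfacing it for the due diligence activities described above.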

Synthetic Data

Relying on public data to train AI models exposes companies to significant copyright and privacy risks. Synthetic data is a promising alternative, though it can be limited in scope and accuracy when derived from insufficient original data. The push for AI models to handle large-scale data also creates operational challenges, including data freshness, regulatory pressure, and the need for real-time insights, all of which require robust, secure technology infrastructure to manage and mitigate these risks effectively.

Ali Golshan is CEO and cofounder of Gretel, a platform that lets companies experiment and build with synthetic data. Golshan says synthetic data is a safer, more private alternative to "messy" public data, and that it can shepherd most companies into the next era of generative AI development.
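To make the idea of synthetic data concrete, here is a deliberately simple sketch: generating synthetic numeric values that match the mean and standard deviation of a real column. This is a toy illustration only; production tools such as Gretel's platform use far richer generative models and add privacy guarantees, neither of which this sketch attempts.

```python
import random
import statistics

def synthesize_column(real_values, n, seed=0):
    """Generate n synthetic values mimicking the distribution of real_values.

    Toy approach: fit a Gaussian to the real column (mean and standard
    deviation) and sample from it. The synthetic values resemble the
    original statistically without copying any individual record.
    """
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    rng = random.Random(seed)  # seeded for reproducibility
    return [rng.gauss(mu, sigma) for _ in range(n)]
```

Even this crude version shows the appeal: the output carries the shape of the original data, not the original records themselves, which is what reduces the privacy exposure described above.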


More articles by Mark Carey