DATA vs DATA

Introduction

With the advent of Generative AI and the explosion of data generated by large language models (LLMs), it is time to revisit the data frameworks that the industry has established to ensure data quality, privacy, compliance, and ownership.

For the sake of argument, let us refer to:

  • User-generated data as UGD
  • Generative AI-generated data as GGD

UGD World

With UGD, data management is a complex process involving the handling of data from various sources to multiple destinations, incorporating both simple and complex transformations along the way. We ensure data quality through continuous observability, maintaining completeness, comprehensiveness, consistency, and accuracy - at a minimum.
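As a rough illustration of the quality dimensions mentioned above, here is a minimal Python sketch that scores a batch of records for completeness and key uniqueness. The function name `quality_report`, the field names, and the sample rows are all hypothetical; a real observability pipeline would track many more dimensions and run continuously.

```python
from typing import Any


def quality_report(
    rows: list[dict[str, Any]], required: list[str], key: str
) -> dict[str, float]:
    """Return simple completeness and uniqueness ratios for a batch of records."""
    total = len(rows)
    if total == 0:
        return {"completeness": 0.0, "uniqueness": 0.0}
    # A row is complete when every required field is present and non-empty.
    complete = sum(
        all(row.get(field) not in (None, "") for field in required) for row in rows
    )
    # Uniqueness is the share of distinct key values in the batch.
    unique_keys = len({row.get(key) for row in rows})
    return {"completeness": complete / total, "uniqueness": unique_keys / total}


# Hypothetical sample batch: one empty email and one duplicated user_id.
rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": ""},
    {"user_id": 2, "email": "b@example.com"},
    {"user_id": 3, "email": "c@example.com"},
]
report = quality_report(rows, required=["user_id", "email"], key="user_id")
```

Ratios like these could be tracked over time and alerted on when they drop below an agreed threshold.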

We collaborate with privacy and security teams to understand data classifications and the relevant regulations for Personally Identifiable Information (PII), confidential data, and internal data. These regulations and policies are enforced on both data in motion and data at rest. Role-based access control (RBAC) regulates data access based on user roles.
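To make the link between data classification and RBAC concrete, here is a minimal Python sketch. The role names, the `ROLE_CLEARANCE` mapping, and the strict ordering of classifications are assumptions for illustration only; real policies are rarely this linear and usually live in a policy engine, not in application code.

```python
from enum import IntEnum


class Classification(IntEnum):
    # Assumed ordering: a higher value means more sensitive data.
    INTERNAL = 1
    CONFIDENTIAL = 2
    PII = 3


# Hypothetical mapping from role to the most sensitive class it may read.
ROLE_CLEARANCE = {
    "analyst": Classification.INTERNAL,
    "data_engineer": Classification.CONFIDENTIAL,
    "privacy_officer": Classification.PII,
}


def can_read(role: str, classification: Classification) -> bool:
    """Allow access only when the role's clearance covers the data class."""
    clearance = ROLE_CLEARANCE.get(role)
    # Unknown roles are denied by default.
    return clearance is not None and clearance >= classification
```

For example, `can_read("analyst", Classification.PII)` is denied, while `can_read("privacy_officer", Classification.PII)` is allowed.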

In all enterprises dealing with UGD, emphasis is placed on training data teams, consumers, and producers of data in data security and data privacy. This training is supplemented with regular internal audits and scheduled external audits to verify responsible data management. Ultimately, data ownership means managing UGD diligently and securely, in compliance with GDPR and other data security and privacy regulations.

GGD World

Now, let’s explore the world of Generative AI-generated Data (GGD). This domain resembles the Wild West, full of untapped potential and emerging challenges. I categorize GGD into four main categories.

  1. Training Data - This is the raw data used to train large language models (LLMs). It is typically obtained by scraping the public internet and is stored in huge data repositories owned by the organizations that build the LLMs. Presumably these same organizations are taking the necessary steps to comply with the General Data Protection Regulation (GDPR) and other privacy laws.
  2. Tokens - Tokens are the processed units derived from the raw training data described above. Each LLM has tens to hundreds of thousands of unique tokens, which collectively form its vocabulary. These tokens can be words, subwords, or characters. Regardless of their form, tokens are data, in this case derived data. How do we guarantee that these tokens comply with the data security and privacy laws discussed above?
  3. Generated responses by LLMs - This is the multimodal data that LLMs generate in response to prompts. Many questions come to mind. Who owns the data and the data management responsibility? If an enterprise uses generated data (say, an image), does that violate copyright law?
  4. Synthetic Data - Because annotated data is limited, synthetic data is essential for training, simulating rare "black swan" events, adding diversity, and removing biases. Its cost-effectiveness and efficiency make it ideal for testing and for experimenting with future scenarios. Soon, enterprises will routinely use synthetic data to supplement training datasets, conduct A/B testing, and create fair and balanced datasets. However, this raises complex questions about privacy and regulatory compliance. I am personally a strong advocate for synthetic data, but its use must be carefully managed to address these concerns. For synthetic data, how do we explain data ownership, define data lineage, classify data, and take measures so that synthetic data does not inadvertently reveal real personal information?
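One small, concrete piece of the last question in the synthetic data category can be sketched in code: checking whether any synthetic record reproduces a real person's identifying fields. The Python sketch below is a hedged illustration, not a complete privacy test. The helper names, field names, and sample records are hypothetical, and exact matching after normalization will not catch near-duplicates or re-identification through combinations of other attributes.

```python
from typing import Iterable

Record = dict[str, str]


def _key(record: Record, fields: list[str]) -> tuple[str, ...]:
    # Normalize case and whitespace so trivial formatting differences
    # do not hide a match.
    return tuple(record.get(f, "").strip().lower() for f in fields)


def find_leaks(
    real: Iterable[Record], synthetic: Iterable[Record], fields: list[str]
) -> list[Record]:
    """Return synthetic records whose identifying fields match a real record."""
    real_keys = {_key(r, fields) for r in real}
    return [s for s in synthetic if _key(s, fields) in real_keys]


# Hypothetical records, made up for illustration.
real = [{"name": "Alex Sample", "zip": "94105"}]
synthetic = [
    {"name": "alex sample ", "zip": "94105"},  # same person, different formatting
    {"name": "Jordan Example", "zip": "10001"},
]
leaks = find_leaks(real, synthetic, fields=["name", "zip"])
```

Any record returned by `find_leaks` would need to be dropped or regenerated before the synthetic dataset is shared.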

As a Data Leader with many years of experience handling user-generated data (UGD), I am now trying to understand the new world of Generative AI-generated Data (GGD) with the same data governance mindset. I realize that this comparison isn't entirely straightforward, even though UGD and GGD may appear similar in look and feel. If any of you reading this are also seeking answers or have already solved part of the puzzle, please DM me. I would love to learn from your experiences and contribute to the discussion.

Fun Fact: The article is UGD and the image above is GGD

#data #security #data privacy #data compliance

Sumit Chakraborty

Director - Customer Success, Tech Advisory, Online and Platform SBU | Brillio - A Bain Company | Ex Oracle

8 months ago

A great and very pertinent discussion, Ishita Majumdar. Privacy and security of GGD is a concern that almost all enterprises carry as they continue to invest in it. The fundamental difference between approaching privacy and security for non-GenAI and GenAI apps is that the former is a design-time process/framework, whereas the latter is a run-time process in which token weights are modeled largely on the data used in training. Once PII data is used to train an LLM, there is no way anyone (even OpenAI) can tell how it impacts the outputs. With PII data used in prompt engineering, the challenges and the biases are generally more pronounced. In the industry we have started seeing the impact of these undesired issues, and obfuscating too much also takes away the data lineage and utility of the training data. So I am not sure I am providing any solutions here, but this is something that's very worth discussing. Any other views?
