Generative AI & Ethical Usage of Synthetic Data

With great power comes great responsibility. With a lot of attention these days focused on Generative AI, organizations likewise need to think through the ethical aspects of Responsible AI and ensure there is proper governance around data creation and usage. A framework called RAFT (Reliable, Accountable, Fair & Transparent) has evolved and plays a critical role in Generative AI and data creation.


Image credit: Fulli.org

Before we go any further, let's first understand what "synthetic" data means. Synthetic data is data created by AI models that have been trained on real-world data samples, through pattern recognition algorithms and statistical sampling of the data. It closely resembles real data in meaning and representation; the difference is that it has been generated by Generative AI.

Who uses synthetic data? Typically Data Scientists, AI Analysts, and Data Engineers work with synthetic data, which cuts the downtime in producing algorithms and large language models (#llm). Simply put, #syntheticdata is revolutionary from a privacy standpoint, as it is de-identified data that offers more privacy and anonymity.



Synthetic data generation can be categorized into two distinct classes: process-driven methods and data-driven methods. Process-driven methods derive synthetic data from computational / mathematical models of an underlying physical process. Data-driven methods, on the other hand, derive synthetic data from generative models that have been trained on observed data.
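
To make the distinction concrete, here is a minimal Python sketch of both classes using only the standard library. The free-fall model, the noise level, the function names, and the "observed" balances are all illustrative assumptions, and a fitted normal distribution stands in for a trained generative model.

```python
import random
import statistics


def process_driven_samples(n, rng, g=9.81, noise_sd=0.05):
    """Process-driven: simulate free-fall distance d = 0.5 * g * t^2 plus sensor noise."""
    rows = []
    for _ in range(n):
        t = rng.uniform(0.0, 3.0)  # seconds since release
        d = 0.5 * g * t ** 2 + rng.gauss(0.0, noise_sd)
        rows.append((t, d))
    return rows


def data_driven_samples(observed, n, rng):
    """Data-driven: fit a normal distribution to observed values, then sample from it."""
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


rng = random.Random(42)  # fixed seed so runs are repeatable
simulated = process_driven_samples(5, rng)

# Hypothetical "observed" account balances standing in for real data.
observed_balances = [1200.0, 950.0, 1430.0, 1100.0, 1010.0, 1280.0]
synthetic_balances = data_driven_samples(observed_balances, 1000, rng)

print(round(statistics.mean(observed_balances)), round(statistics.mean(synthetic_balances)))
```

The process-driven function never looks at real data; it only needs the physical model. The data-driven function knows nothing about the process and learns everything it can from the observations, which is why its quality depends on how representative those observations are.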

For this article, we will focus on two types of synthetic data from data-driven methods:

  • Structured synthetic data, i.e., tabular data
  • Unstructured synthetic data, i.e., image & video

With the growing popularity and adoption of #generativeai, there will be more usage of synthetic data. AI developers can get started in minutes with open-source reference examples and simple APIs for generating unlimited amounts of synthetic data, labeling personally identifiable information, or anonymizing data and removing biases from it. With the cloud providing virtually unlimited, scalable storage, synthetic data can be fully managed in the cloud or deployed on-premises through a simple GUI / web interface. The Python ecosystem also has packages such as pydbgen, a wrapper around Faker, that make it very easy to generate synthetic data that looks like real-world data.

Let's pivot and look at Generative AI and low-code software used by AI developers: a game changer that puts innovation on a freeway, as long as organizations don't compromise on the responsibility factor. In this day and age, speed of innovation is a must-have to stay competitive. Consider #bardai, Google's offering (with an Adobe Firefly integration announced), which is set to compete with OpenAI's #chatgpt in the generative AI space.

Organizations are still grappling with applying proper governance and with keeping key stakeholders and leaders up to date on all the possibilities of Generative AI.

Here are some key criteria to consider while generating synthetic data:

  • Computational horsepower: how much compute is needed to generate data or to build a model.
  • Manual time and labor cost: how much human expertise and labor goes into the generation process.
  • Biased behavior: the flexible nature of synthetic data makes it prone to potentially biased results.
  • Privacy concerns: care must be taken to ensure synthetic data does not reveal sensitive information.
  • Environment & system complexity: how difficult it is to build such a data generation system.
  • Type of content: how much information is present in the synthetic data.
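
The privacy and bias criteria in particular lend themselves to simple automated checks. The sketch below is one possible starting point, not a complete audit: the helper names, the row layout, and the sample rows are hypothetical. It flags synthetic rows that copy a real row verbatim and measures how far a category's share drifts between the real and synthetic sets.

```python
from collections import Counter


def exact_match_leaks(real_rows, synthetic_rows):
    """Return synthetic rows that are verbatim copies of a real row (a privacy red flag)."""
    real_set = set(real_rows)
    return [row for row in synthetic_rows if row in real_set]


def category_share_gap(real_rows, synthetic_rows, column):
    """Largest absolute difference in category share between real and synthetic data."""
    real_counts = Counter(row[column] for row in real_rows)
    synth_counts = Counter(row[column] for row in synthetic_rows)
    categories = set(real_counts) | set(synth_counts)
    return max(
        abs(real_counts[c] / len(real_rows) - synth_counts[c] / len(synthetic_rows))
        for c in categories
    )


# Hypothetical rows: (age_band, gender, city)
real = [("30-39", "F", "Austin"), ("40-49", "M", "Boston"),
        ("30-39", "M", "Denver"), ("20-29", "F", "Austin")]
synthetic = [("30-39", "F", "Boston"), ("40-49", "M", "Boston"),
             ("20-29", "M", "Denver"), ("30-39", "M", "Austin")]

leaks = exact_match_leaks(real, synthetic)
gender_gap = category_share_gap(real, synthetic, column=1)
print(leaks)       # synthetic rows copied verbatim from the real data
print(gender_gap)  # 0.0 would mean identical gender shares in both sets
```

An empty leak list does not prove the data is private, since near-duplicates and rare attribute combinations can still identify a person. Treat both numbers as early warning signals rather than guarantees.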


At a high level, here are some quick steps for creating tabular synthetic data:

We will use a Large Language model (LLM) to create tabular synthetic data. In this example, we will train our model using an. You can synthesize tabular data using any interface.

1) Create Project & Environment with Python SDK / CLI

2) Get Training Data

3) Redact PII (personally identifiable information)

  • Create configuration with transform policy
  • Use Faker to make training and test data sets
  • Create Model
  • Generate redacted data and view results

4) Discover PII

  • Create configuration with classify policy
  • Use Faker to make training and test data sets
  • Create Model
  • Classify test data using trained model
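
To give a feel for what steps 3 and 4 accomplish, here is a deliberately simplified Python sketch. It is not the workflow above: it swaps the trained model and its transform / classify policies for a few hand-written regular expressions. The pattern set, the function names, and the sample record are all assumptions made for illustration.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def discover_pii(text):
    """Step 4 analogue: return (label, matched_text) pairs found in the text."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        found.extend((label, match) for match in pattern.findall(text))
    return found


def redact_pii(text):
    """Step 3 analogue: replace each match with a placeholder such as <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


record = "Contact Jane at jane.doe@example.com or 555-123-4567, SSN 123-45-6789."
print(discover_pii(record))
print(redact_pii(record))
```

Note that the name "Jane" passes through untouched. Fixed patterns cannot recognize free-form PII such as names or addresses, which is one reason the steps above rely on a trained classification model rather than rules alone.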

The push for Responsible AI has been accelerated by the release of different generative models. Here are some important things we need to understand with regard to Generative AI's decision-making ability while using synthetic data:

1) Explainability: Understanding the rationale/reasoning behind a decision.

2) Provability: Making sense of the mathematical certainty behind decisions.

3) Transparency: Knowledge and usage of Generative AI decisions.

Let's look at OpenAI's release of ChatGPT and its growing popularity, and how this has accelerated the concerns and questions around the need to apply ethical governance layers to Generative AI.

  • Deepfakes are a Generative AI approach to generating synthetic media that can be nearly impossible to distinguish from real media. This is a grave concern, as it might spread misinformation and defame people.
  • From a truthfulness and accuracy perspective, it is up to the individual to verify information rendered by Generative AI. ChatGPT is not dynamic: it relays information as of the time the model was last trained or updated.
  • The biased nature of large language models: it all depends on how much social bias the training data, including synthetic data, carries. The key is to remove the bias, stay balanced, and not succumb to misuse of the data.
  • Copyright protection of the data that gets generated, as there is growing ambiguity about who owns synthetic data.

Case in Point: Generative AI in Banking and the Usage of Synthetic Data

Artificial intelligence (AI) is a rapidly evolving field with a cornucopia of uses across many industries. Let's look at how banking and financial institutions are driving innovation as they move towards AI. This requires banking on synthetic financial data. To compete effectively in survival-of-the-fittest mode, banks and fintechs will need realistic synthetic financial data. There is a big push towards investing in secure, privacy-protected, AI-generated synthetic data.

Image credit: McKinsey & Company

According to McKinsey, economies that embrace financial data sharing could see GDP gains of 1-5% by 2030, with benefits flowing to consumers and financial institutions. More data equals better operational performance, better AI models, more powerful analytics, and enhanced customer-centric digital banking products.

Gartner identified Generative AI as a top technology trend for 2022. One use case is detecting fraudulent transactions: a Generative AI model, a Generative Adversarial Network (GAN), produced synthetic fraudulent transactions, and in comparison the synthetic data outperformed the original, unprocessed data set.


From a data privacy perspective, synthetic data continues to have the potential to overcome banking industry challenges by easing data privacy concerns. From a risk management perspective, Generative AI can also substantially reduce the losses that result from a lack of adequate risk management.

In the banking sector, generative AI offers several transformative applications. It can generate synthetic financial data for training predictive models, assisting in risk analysis, fraud detection, and credit scoring.

Summary:

To ensure sustainable progress of Generative AI and the usage of synthetic data, my recommendation is that we need new practices like Responsible AI that will enable insights from personal data while reliably protecting individuals' privacy. Not all synthetic data generators are created equal; it is important to select the right synthetic data vendor, one who can match the organization's needs.

About the Author:

For further queries on how to use Generative AI with ethical usage of synthetic data, please feel free to reach out to Azmath Pasha via LinkedIn. Azmath is a consummate Chief Technology Officer at Metawave Digital with over 25 years of real-world tech leadership and consulting experience. As an advisor to the tech community through board memberships including the DevNetwork and Forbes Technology Council, Azmath brings a wealth of experience in Generative AI, MLOps, and Advanced Analytics, and has been featured in many keynote addresses and Absolute AI podcasts.



