Unlocking Generative AI Success: Strategies to Overcome Common Challenges and Maximise Impact

Unlocking Generative AI Success: Strategies to Overcome Common Challenges and Maximise Impact

Bhargav Mitra, Debmalya Biswas and Deepak Kumar Dash

By now, we are all familiar with the buzz around Generative AI (GenAI) and the unprecedented opportunities it offers. From automating mundane tasks to creating personalised content at scale, GenAI is poised to revolutionise industries. But here’s the catch: GenAI is slipping into the ‘trough of disillusionment’ [1] and Gartner predicts that 30% of GenAI projects will be axed after Proof-of-Concept (PoC) by the end of 2025. Why? Poor data quality, inadequate risk controls and escalating costs or unclear business value are the factors Gartner has attributed to this alarming forecast [2].?

We have been practising AI for over 15 years now, and based on our experience there are other aspects, often interlinked, causing the gap between expectations and reality to widen in the field of GenAI. Issues such as poor project scoping, stringent proof-of-concept (PoC) budgets, improper solution design and non-use of evaluation strategies leading to suboptimal user experience further exacerbate the challenges organisations face when implementing and adopting GenAI systems.?

The path to GenAI success, therefore, may not be as smooth as it might seem. Let’s dive into these challenges and explore some strategies for making sure your GenAI initiatives do not end up on the chopping block.

1.0 Data Quality: The Bedrock of GenAI Success?

When it comes to GenAI systems, including those that use the Retrieval Augmented Generation (RAG) architecture [3], the quality of data isn’t just important --- it’s everything. These systems rely on high-quality data to generate meaningful and grounded responses. If the data feeding into these systems is flawed, outdated or incomplete, you can expect the results to be equally unreliable. Garbage in, garbage out. So, ensuring superior data quality is essential if you want your GenAI systems to deliver tangible business value.?

Whilst structured data is the focus of most enterprise data quality initiatives, unstructured data—such as text, images, audio and video which make almost 80% of the enterprise data and is a key ingredient to GenAI models --- often sits outside the ambit of such ventures.

Traditional enterprise data quality frameworks are typically designed for structured data, where accuracy, consistency, completeness and timeliness can be defined and managed. These frameworks, however, are less equipped to handle the nuanced complexities of unstructured data, which requires different methodologies for validation, cleaning, and enrichment to ensure its suitability for GenAI applications.

1.1? What does quality means for unstructured data?

When working with unstructured data in GenAI, the dimensions of data quality—accuracy, completeness, timeliness, and consistency—are critical to how well your AI models perform. Here's why each of these matters specifically for GenAI:

  • Accuracy: GenAI models need to work with accurate data to produce reliable outputs. If your GenAI is generating product descriptions or designs based on incorrect facts or mislabelled images, it might create content that’s misleading or completely off. Think of an AI writing news articles based on inaccurate information—it could end up generating misinformation instead of valuable insights.
  • Completeness: Missing parts in your unstructured data can mess with your GenAI system’s ability to generate complete and meaningful content. For instance, a GenAI generating visual designs from incomplete datasets might output products with missing features. Similarly, if key sections of user reviews or documents are missing, the AI might generate irrelevant or incomplete summaries, reducing the value of the content it produces.
  • Timeliness: If the documents in your VectorStore, which provides context relevant to prompts in a RAG schema, are outdated, then GenAI systems could result in irrelevant outputs. ?A GenAI enabled chatbot answering questions related to policy if fed from obsolete policy documents, it could provide inaccurate and misleading response to questions. For GenAI models used in real-time applications like marketing or media, timeliness of data is everything.
  • Consistency: In GenAI, inconsistency in data can lead to confused or disjointed outputs. For example, if an AI is generating audio based on recordings of varying quality, the result might be noticeably uneven. In text generation, inconsistent grammar or tone in the data can lead to awkward or disjointed content.

1.2 You need not wait to solve all data quality problems before you get started with your GenAI journey?

Data quality is a continuous journey, and it doesn’t impact all GenAI use cases equally.

Waiting to resolve every data quality issue before using generative GenAI can slow progress. Instead, businesses should prioritise identifying specific data topics for each use case [4]. Historical data can be cleaned as a one-time effort and appropriate processes and governance can be established for incremental data. For example, in a use case where GenAI generates product images for e-commerce, the focus should be on cleaning historical image data (resolution, format, labelling accuracy) and setting governance for new images. This ensures relevant data is clean without overhauling all datasets at once.

Setting KPIs specific to the use case helps maintain ongoing data quality. For image data quality, KPIs might include image resolution standards, consistency in labelling, and proper metadata (e.g., product category or attributes). This ensures that GenAI outputs high-quality product images.

GenAI, itself, can assist in identifying and resolving data-quality issues pertaining to unstructured data.

For example, GenAI can help in flagging low-resolution images or inconsistent metadata in real time. By establishing such targeted KPIs and governance, businesses can effectively start using GenAI without waiting for perfect data.

2.0 Adequate Risk Controls: Mitigating GenAI specific risks?

Successful adoption of GenAI isn’t possible without appreciating the risks it may pose. However, many companies dive straight into it without putting appropriate controls in place.

The fast-evolving landscape of AI and GenAI can easily outpace the establishment of effective risk management frameworks, leading to ethical issues, legal challenges and potential damage to an organisation’s reputation.

So, what can you do to protect your organisation from the risks posed by GenAI systems and utilise the potential of this technology responsibly??

2.1 Understand security measures from LLM providers: Your first step should be to get a clear picture of the security measures provided by your LLM vendors. Make sure you understand how they secure your data --- both in transit and at rest. Remember that even if you are using a RAG architecture, you are exposing sections of your data and the query to an LLM model. In the case where finetuning of an LLM model is considered, the frozen section of your model may reside outside the perimeters of your cloud-tenancy. Ask about their encryption protocols, anonymisation practices, and how they manage sensitive information. Do not forget to dig into their data retention policies and how they handle breaches or security incidents.

2.2 Prioritise data privacy: Data privacy is a big deal, especially when dealing with AI systems that process sensitive information. For GenAI systems, we need to consider the following novel LLM privacy risks:

2.2.1 Membership and property leakage from pre-training data [5]: For example, studies have shown that?GPT models can leak privacy-sensitive training data, e.g. email addresses from the standard Enron email dataset, implying that the Enron dataset is very likely included in the training data of GPT-4 and GPT-3.5.

2.2.2 Privacy leakage from conversations (history) with LLMs: With traditional AI solutions, we are primarily talking about a one-way inference reg. a prediction or classification task. In contrast, LLMs enable a two-way conversation, so we need to consider?conversation related privacy risks in addition, where e.g. GPT models can leak the user private information provided in a conversation (history).

2.2.3 Compliance with privacy intent of users: LLMs today allow users to be a lot more prescriptive with respect to processing their prompts / queries, e.g. chain-of-thought (CoT) prompting. CoT can be extended to allow the user to explicitly specify their privacy intent in prompts using keywords e.g., “in confidence”, “confidentially”, “privately”, “in private”, “in secret”, etc. So, we also need to assess the LLM effectiveness in complying with these user privacy requests. For example, studies have shown that GPT-4 will leak private information when told “confidentially” but will not when prompted with “in confidence”.

Overall, the recommendation is to adopt privacy-by-design / data minimisation strategies ---only collect and process that data that you absolutely need for a valid use-case. Make sure you are compliant with data regulations like GDPR and CCPA and conduct regular audits to ensure your privacy practices stay up to-date with the latest threats and laws.

2.3 Defend against prompt injection attacks: Prompt injection attacks, where attackers use cleverly crafted prompts to manipulate GenAI system outputs or to extract sensitive information from the system, are a growing threat.

To counter these, set-up strong input validation processes to filter out harmful or misleading prompts before it reaches the GenAI systems. Additionally, consider prompt engineering techniques to design inputs that are less vulnerable to exploitation.

In general, GenAI system architectures should always be designed with a strong emphasis on security and access control, ensuring that only authorised users can access the system and the model outputs. A robust architecture is crucial to safeguard against unauthorised access, data breaches, and potential exploitation of the system. For example, stakeholders from the procurement department should not be able to access an LLM-powered chatbot deployed for a certain group of stakeholders from the HR department.

2.4 Explainability and Transparency: AI systems can’t be black boxes if you want people to trust them. Your GenAI solutions should offer clear explanations for their outputs (outputting the document sources based on which the output is generated in a RAG architecture, for example), especially in high-stakes applications.

Provide evidence or justifications for AI decisions, making them understandable to non-experts.

Transparency is key—always label GenAI-generated content and decisions, so it’s clear when AI is at work.

To boost transparency further, consider publishing reports that detail how your GenAI systems function, what data they use, and the steps you’re taking to ensure fairness and accuracy. This can help build trust with customers, regulators, and the public and safeguard against legal and reputation risks.

2.5 Tackling Unwanted Bias, Hallucination, and Content Moderation: Unwanted bias and hallucination are two of the trickiest issues with GenAI. Harmful bias can creep in through skewed training data, leading to unfair or discriminatory outputs. Hallucination, on the other hand, occurs when AI generates outputs that are factually incorrect or entirely fabricated.

To combat these, use diverse and representative datasets during training, detect if the data is biased against pre-defined protected classes as early as possible in the build workflow (Exploratory Data Analysis phase) and consider mitigation plans jointly with the business if any harmful bias is detected, and regularly audit AI outputs to spot and correct biases. Hallucination can be addressed largely through RAG architectures, use of metrics like faithfulness (RAGAS framework [6]) for certain use-cases can greatly help in assessing if and to what extent a GenAI system is hallucinating.

Content moderation is another must-have. Your AI systems should have mechanisms in place to flag and filter out harmful or inappropriate content before it reaches users. Prompt designing and filtering can significantly improve content moderation by guiding LLMs in generating non-toxic, non-harmful, appropriate and polite responses. And don’t rely entirely on automation—human oversight is crucial for catching the nuances that AI might miss.

3.0 Balancing costs and demonstrating clear business value?

Let’s talk money. GenAI isn’t cheap—between the computational resources, specialised personnel, and ongoing maintenance, costs can quickly spiral. Add to that the fact that ROI might not be immediately clear, and it’s easy to see why some organisations hesitate to go all in.

The solution? Start small and scale up and initiate the business case preparation activity as early as the project scoping phase.

Forming an idea how the system will be used, how many users and at what frequency will be using the system, the approximate size of the inputs and outputs to and from the system and how much time end-users are spending on the activities pertaining to the use-case based on the existing process are the primary elements that constitute the groundwork required not only to prepare the business case but also for proper solution design.?

Starting with smaller projects that have clear, measurable outcomes will yield additional benefits. These early wins can build internal support essential for driving adoption and management of change and justify further investment in more ambitious initiatives.?

3.1 Why many GenAI PoCs fail? Many GenAI Proof of Concepts (PoCs) fail in the real world because they are often done on a shoestring budget (a balance between driving innovation and pragmatism must be struck --- see Section 5.0), which severely limits the scope of experimentation. With tight financial constraints, teams tend to follow the path of minimal customisation of standard solution architectures rather than design them specifically for the use cases.? This leads to improper solution design which doesn’t meet end-user expectations, and as a result, many PoCs never make it to production causing a loss of investment.

Another issue is that limited budgets mean teams can’t replicate complex data and user scenarios, leaving non-functional requirements and proper result validation neglected. For instance, using a basic RAG architecture where documents are processed on the fly might work for simple use-cases but isn’t suitable for more complex ones. In some cases, advanced RAG architectures are necessary to handle both high-level and low-level question answering, which can’t be explored properly due to time and budget constraints.

Additionally, PoCs often work with small, clean datasets, ignoring the messy, unstructured documents that will likely appear in production. This can lead to issues like unwanted biases, poor scalability, and failure to meet real-world demands.

To avoid these pitfalls, organisations should invest adequately into PoCs and start working on the business case—the GenAI system that will be used in production—as early as possible in the design and development workflow (during scoping and PoCs).

Proper scoping, solution design and validation (with end-users) need to prioritised throughout the workflow to ensure a smooth transition from PoC to production.

4.0 Ensuring Proper Solution & Evaluation Design

Poor scoping is a classic mistake. Too often, organisations treat GenAI like a magic wand, expecting it to solve problems it was never meant to address. This approach usually leads to disappointing results and can waste time and resources. To avoid this, spend time understanding the specific challenges you want to tackle. Is GenAI really the best tool for the job? Predicting target classes from structured data is not directly a suitable application of LLM-systems; of course, appropriate LLMs can assist a developer in writing code to build a classifier for an appropriate use-case. ?

Selecting clear success criteria and evaluation strategy is crucial for GenAI projects. The success criteria need to be specific to the use-case should encompass considerations around both functional and non-functional requirements consisting of at least the below 4 overlapping (and sometimes conflicting) criteria:

·?????? Response accuracy and relevance

·?????? User experience: improving user satisfaction

·?????? Cost containment and energy efficiency

·?????? Adherence to responsible AI guidelines and regulatory compliance?

There are primarily 3 types of LLM evaluation methodologies prevalent today:

·?????? Generic benchmarks and datasets

·?????? LLM-as-a-Judge

·?????? Manual evaluation ?

Publicly available LLM leaderboards, e.g.?Hugging Face Open LLM Leaderboard [7] while useful, primarily focus on testing pre-trained LLMs on generic NLP tasks (e.g., Q&A, reasoning, sentence completion) using public datasets. So the point is that if the enterprise use-case is finance, legal, etc. related, we need to design an evaluation strategy considering the underlying domain data, (sub-)topics, user queries, performance metrics, and regulatory requirements, etc. of the underlying use-case.?

The challenge seems analogous to that of the seminal MLOps paper?Hidden Technical Debt in Machine Learning Systems [8]?where researchers highlighted that training machine learning (ML) models forms only a small part of the overall ML training-to-deployment lifecycle. In the same way, assessing capabilities of the foundational LLMs is only a small part of performing use-case specific LLM evaluation of enterprise use-cases (see Figure 1).


Figure 1.

For example, in a?contact centre context?(one of the areas with highest GenAI adoption today),

  • 'Summarisation' use-cases can vary widely, from condensing customer complaints, to outlining the outcomes of sales calls, to extracting the values of consumption mentioned in the call.
  • Call centre transcripts also suffer from incomplete calls and conversations spanning multiple topics.
  • From a conversation perspective, summarising a technical support call requires a different understanding and focus, as compared to summarising a product inquiry call.

Given this, there is a need to design a contact centre use-case-specific LLM evaluation strategy considering the semantic context and distribution of the generated responses.?

Broadly, LLM use-case accuracy can be measured in terms of:

  • Correctness: refers to the factual accuracy of the LLM’s response
  • Groundedness: refers to the relationship between the LLM’s response and its underlying Knowledge-Base.?

Studies have shown how a response could be correct, but still improperly grounded. This might happen when retrieval results are irrelevant, but the solution somehow manages to produce the correct answer, falsely asserting that an unrelated document supports its conclusion.?

RAGs have been widely promoted as a solution to improve retrieval accuracy. In a recent?study, Magesh, V, et. al. [9] highlighted the limitations of RAGs for legal use-cases. In a legal setting, there are primarily 3 ways a model can hallucinate:

  • it can be unfaithful to its training data,
  • unfaithful to its prompt input,
  • or unfaithful to the true facts of the world.?

They focus on factual hallucination and highlight many retrieval challenges specific to the legal domain, e.g.,

  • Legal queries often do not have a single, clear-cut answer — the response is spread over multiple documents across time and location.
  • Document relevance in the legal context is not based on text similarity alone. In different jurisdictions and in different time periods, the applicable rule or the relevant jurisprudence may differ.

In short, they show that while RAGs can help in reducing the hallucinations of state-of-the-art pre-trained GPT models, they still hallucinate 17% and 33% of the time.

Overall, we recommend taking an Evaluation Driven Development (EDD) approach for building LLM systems, similar to the Test Driven Development (TDD) approach followed for software development. EDD can help in spotting issues pertaining to the different sections of the solutions early on, thereby, ensuring better system performance (both from a functional as well as a non-functional point of view) and alignment with end-user needs.

5.0 Balancing Innovation with Pragmatism

While it’s important to experiment with cutting-edge AI technologies, it’s equally crucial to keep your feet on the ground. Set realistic expectations for your GenAI projects and ensure they are grounded in practical business needs.

A balanced approach that combines innovation with pragmatism will help you make the most of GenAI while avoiding costly missteps.

Consider establishing a dedicated AI team or centre of excellence within your organisation. This team can serve as a hub for AI knowledge, skills, and best practices, ensuring that your AI projects are aligned with business objectives and executed with the necessary expertise.

6.0 Conclusion?

Adopting GenAI is a complex but rewarding journey. By focusing on improving data quality on a case by case basis, implementing robust risk controls, balancing costs with business value, ensuring proper scoping and solution design, and adopting an evaluation driven development approach, organisations can increase their chances of success. Incorporating frameworks like RAGAS and utilising established evaluation metrics will further enhance the quality and reliability of your GenAI systems. In a world where AI is becoming increasingly integral to business operations, those who approach GenAI thoughtfully and strategically will be best positioned to leverage its full potential. With the right strategies in place, your organisation can harness the power of GenAI to drive innovation, improve efficiency, and create transformative value across all aspects of your business.


References

1.???? https://www.computerworld.com/article/3489912/generative-ai-is-sliding-into-the-trough-of-disillusionment.html

2.???? https://www.gartner.com/en/newsroom/press-releases/2024-07-29-gartner-predicts-30-percent-of-generative-ai-projects-will-be-abandoned-after-proof-of-concept-by-end-of-2025

3.???? Mitra, Bhargav. Unlocking superior system performance with Long Context LLMs (#LC-LLMs) and #RAG: The Best of Both Breeds. https://www.dhirubhai.net/feed/update/urn:li:activity:7221091644182249472/

4. Mckinsey & Co. A data leader’s technical guide to scaling GenAI https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/a-data-leaders-technical-guide-to-scaling-gen-ai

5. Biswas, Debmalya. Privacy preserving chatbot conversations. AIKE 2020: 179-182.

6. Es, Shahul, et. al. RAGAS: Automated Evaluation of Retrieval Augmented Generation. https://arxiv.org/abs/2309.15217

7. Korlov Maria, 4 reasons why GenAI projects fail. https://www.cio.com/article/220445/6-reasons-why-ai-projects-fail.html

8.????https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

9.????Sculley, D, et. al. Hidden Technical Debt in Machine Learning Systems. https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

10.????Magesh, V, et. al. Hallucination-Free? Assessing the reliability of leading AI legal research tools. https://arxiv.org/abs/2405.20362.





Dippyoman Dutt

Machine Learning | Artificial Intelligence | Automation | MLOPS | NLP | Python | Google Cloud | Auto ML | AI Platform | Generative AI |

3 周

Very informative

Godwin Josh

Co-Founder of Altrosyn and DIrector at CDTECH | Inventor | Manufacturer

3 周

Unlocking the potential of Generative AI is crucial for driving innovation and progress across industries. The ability to generate novel content and automate complex tasks has the power to revolutionize fields like healthcare, education, and creative industries. Given the rapid advancements in LLM capabilities, how can we ensure that these powerful tools are used ethically and responsibly to benefit society as a whole?

回复

要查看或添加评论,请登录

Bhargav Mitra的更多文章

社区洞察

其他会员也浏览了