Generative AI might revolutionize Data Science!
Kent Thorén, PhD
Executive Advisor | Foresight, Strategy and Innovation Management | ex-eBay
The launch of ChatGPT Code Interpreter signals that generative AI may be used to generate code that solves analytical tasks. This could revolutionize data science by making it accessible to a much wider range of people. Generative AI is a type of artificial intelligence that can create new content, such as text, images, or code, in contrast to traditional AI, which is mainly used to analyze existing data. In the context of data science, generative AI could automate many of the tasks currently done by human data scientists: cleaning and preparing data, identifying patterns and trends, developing models to predict future behavior, and generating reports and visualizations.
This could lead to data science at scale, giving people all over the organization a much better basis for making decisions and solving problems. By making data science more accessible, generative AI could democratize the field and help organizations make better decisions. However, generative AI does not always produce accurate results, so its output must be evaluated carefully before it is used to make decisions. This post outlines how it could work and covers some of the benefits and pitfalls, to a large extent based on a much longer article by AI expert Lance Eliot in Forbes (Eliot, 2023).
Why is this interesting?
The combination of generative AI and data science is interesting because it has the potential to democratize data science. Today, data science is a specialized field that requires a lot of training and expertise. However, generative AI could make it possible for anyone to use data science to solve problems. This could have a major impact on businesses and organizations of all sizes. By making data science more accessible, generative AI could help businesses to make better decisions, improve their operations, and gain a competitive edge.
The potential of generative AI for data science is still being explored. However, the early results are promising. As generative AI continues to develop, it is likely to play an increasingly important role in data science. But “the question arises whether we will be awash in nonsensical data science or at least amateurish data science that is replete with errors, falsehoods, biases, glitches, AI hallucinations, and other undesirable maladies” (Eliot, 2023). So on the downside, there is a risk of a flood of poor data science “that misleads, misguides, and furthers the tidal wave of societal misinformation and disinformation” (ibid.).
The key to getting this right seems to be to understand how it works so it can be implemented properly and to carefully update and communicate the organization’s data strategy with this in mind. Data strategies are intended to leverage and maximize their beneficial use of data, so they also ought to give guidance on how to avoid mishandling the data and misreporting its implications.
What is Data Strategy?
A data strategy is a critical framework employed by organizations to effectively acquire, organize, analyze, and deliver data in alignment with their business objectives. It has become increasingly essential due to the emergence of Big Data and the growing importance of data in decision-making. It should support the organization in achieving its mission and objectives while also providing a competitive advantage in the market. However, the relationship is two-way; the top-level organizational strategy shapes the data strategy, and vice versa.
A comprehensive data strategy outlines how data will be collected, processed, and utilized to meet business goals. If used well, some argue that it can disrupt management theories by enabling real-time decision-making based on collected data. For example, organizations can leverage data to explore new markets, identify consumer trends, and redefine competitive forces. Personally, I doubt it can fully replace human thinking, innovation, and forward-looking strategizing because for much of what matters in the world, there is no data, and the data that do exist are always about the past.
Nevertheless, without a cohesive data strategy, organizations often struggle with decision-making processes, suffering from having "multiple versions of the truth" (MVOT). As departments or teams create and interpret data independently, conflicting narratives and confusion occur due to contradictions. One of the main goals of a data strategy is, therefore, to establish a "single source of truth" (SSOT), where high-quality, standardized data are readily available, and MVOTs are carefully controlled for.
To support the organization effectively, a data strategy involves both “offense” and “defense”. Data defense focuses on minimizing risks and ensuring data integrity, while data offense aims to support business objectives by generating actionable insights about customer patterns or growth opportunities. This is how a data strategy can become an invaluable asset for winning in the marketplace in today's data-driven world. It enables fast decisions without the confusion and conflicts that arise from multiple versions of the truth. But how can generative AI be used in this context?
The Coupling of Generative AI and Data Science
First, let's clarify what data science is. Data science is a multidisciplinary field that employs science-based approaches to extract insights from data. Through the systematic application of principles, mathematical methods, and processes, it extracts meaningful and non-obvious patterns from large datasets. Relying on a set of principles and problem-definition methods, data science helps organizations process real-world problems into clear options and answers. Usually, this involves, as a first step, an exploration and cleaning of the data, possibly also inspecting it through data visualization, followed by advanced multivariate statistical methods applied for drawing conclusions from the underlying patterns and trends that answer a problem the organization struggles with. The results are typically disseminated in appropriate formats like commented graphs to support decision-making.
How then can Generative AI be integrated into data science?
What amazes us about generative AI is that it is seemingly fluent in interactive dialogue and capable of producing texts virtually indistinguishable from those composed by humans. It does so through computational pattern-matching that mimics human ways of expression. Essentially, it puts pieces of text together without any understanding of what they mean, based on patterns of how similar pieces are typically combined, which it acquired through training on a massive volume of text and other content found on the Internet.
The analytical work can often be iterative, as the application of one tool, like clustering, may show something that warrants the addition of analytical steps to the set originally intended. After finding a good resolution to the original question, the data scientist usually summarizes the data and makes various displays or portrayals, sometimes also interpreting and describing what the visualizations indicate. In the myriad of tasks needed for extracting insight from the data, there are many points where generative AI can give support.
Obviously, it can help generate the texts and visualizations the data scientist wants to put in the final report. However, if the generative AI is able to produce programming code, the kind that software engineers write, things become much more interesting. If it can recognize the type of analysis needed to move the process along, something it can learn through training, it could potentially generate code as needed for whatever problem it faces. It thereby becomes able to deal with analytical tasks as they come up, even those for which no program existed in the first place. For example, on noticing that the data set has outliers, it could write code that judges which ones to eliminate and then code that removes them from the data set if the user wishes. If it finds that there are too many gaps in the data for an analytical approach that appears promising, it could devise a way to estimate the missing values, then write and run code that enters them in a new data set to be used in the subsequent steps.
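To make the outlier and missing-value example concrete, here is a minimal sketch of the kind of code a generative AI might write for those two steps. It is an illustration under stated assumptions, not what ChatGPT Code Interpreter actually produces: the function names are hypothetical, the interquartile-range rule is just one common way to judge outliers, and filling gaps with the median is just one simple way to estimate missing values.

```python
import statistics


def flag_outliers_iqr(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def remove_outliers(values, outliers, user_confirms):
    """Drop the flagged values, but only if the user has agreed to it."""
    if not user_confirms:
        return list(values)
    flagged = set(outliers)
    return [v for v in values if v not in flagged]


def fill_missing_with_median(values):
    """Return a new list where None entries are replaced by the median
    of the known values (a deliberately simple estimate)."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]


# Hypothetical weekly sales figures with one suspicious value.
sales = [10, 12, 11, 13, 12, 95, 11]
suspects = flag_outliers_iqr(sales)
cleaned = remove_outliers(sales, suspects, user_confirms=True)

# Hypothetical series with gaps (None marks a missing value).
filled = fill_missing_with_median([10, None, 12, None, 14])
```

With these made-up numbers, `suspects` is `[95]` and `filled` is `[10, 12, 12, 12, 14]`. Note that the original list is never modified: each step returns a new data set, mirroring the idea of entering estimated values "in a new data set" rather than overwriting the source.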
Seemingly, such a generative AI can produce programming code to perform whatever functions or calculations might be needed. In other words, “the sky is the limit as to whatever programming code can be concocted” (Eliot, 2023). This is revolutionary because such a general-purpose programming capability opens immense possibilities, effectively removing many constraints on what generative AI can do. Using the Code Interpreter plug-in, ChatGPT Plus subscribers can already generate code that is then executed in a sandboxed and firewalled environment.
The good news here is that the generative AI takes care of the programming specifics for you. You do not need to know how to design programs or any programming language. All that you need to do is use your everyday language to express what you want the generative AI to do when it comes to tackling data science tasks.
An example: Without any training, you can ask the AI app in everyday language to examine the data and tell you what it finds. The output is a description, in everyday terms, of what the data is about and how it is structured. Then you can ask the AI app to investigate whether the data contains questionable values or other issues. If it does, you can ask it to generate a new, cleaned data set before moving on, without needing to know how to do so yourself. Building on the initial description of what the data set contains, you can ask the AI app the questions you need answered. For example, “What are the most profitable customer groups?” Or, “What would happen if we increased the price by 2%?” The AI app can suggest several statistical methods and tell you which of them is the most promising. You could then select which to run or ask it to decide for you. The AI app might also generate convincing descriptive and interpretive text about the results of the analysis, along with supporting visualizations such as plots and diagrams. You simply interact naturally with the generative AI; it generates the needed code, executes it, and shows you the results in one seamless interaction.
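Behind such a conversation, the generated code could look something like the sketch below. Everything in it is assumed for illustration: the records, the field names (`customer_group`, `revenue`, `cost`), and the rule for what counts as a questionable value are hypothetical, and a real tool would likely choose different methods. It covers three of the requests above: describe the data, clean it, and rank customer groups by profit.

```python
from collections import defaultdict

# Hypothetical order records; field names are illustrative only.
ORDERS = [
    {"customer_group": "retail", "revenue": 120.0, "cost": 90.0},
    {"customer_group": "retail", "revenue": 80.0, "cost": 70.0},
    {"customer_group": "enterprise", "revenue": 500.0, "cost": 300.0},
    {"customer_group": "enterprise", "revenue": -40.0, "cost": 10.0},  # negative revenue
    {"customer_group": "public", "revenue": 200.0, "cost": None},      # missing cost
]


def describe(rows):
    """'Tell me what is in the data': row count, columns, missing values."""
    columns = sorted({key for row in rows for key in row})
    missing = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    return {"rows": len(rows), "columns": columns, "missing": missing}


def is_questionable(row):
    """Assumed rule: a missing or negative amount makes a row questionable."""
    revenue, cost = row.get("revenue"), row.get("cost")
    return revenue is None or cost is None or revenue < 0 or cost < 0


def clean(rows):
    """'Give me a cleaned data set': a new list without questionable rows."""
    return [r for r in rows if not is_questionable(r)]


def profit_by_group(rows):
    """'Which customer groups are most profitable?': total profit per
    group, sorted from highest to lowest."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer_group"]] += r["revenue"] - r["cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


summary = describe(ORDERS)
ranking = profit_by_group(clean(ORDERS))
```

On this toy data, `ranking` is `[("enterprise", 200.0), ("retail", 40.0)]`. The "public" group disappears entirely because its only row was dropped as questionable, which is exactly the kind of silent side effect a user should check before trusting the answer.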
Risks and Concerns
There is a risk that with data science being conveniently put at everyone's fingertips, some people may become too complacent and fail to double-check and ascertain whether the results make sense, are viable, and usable in practice. Eliot (2023) proposes that this unfortunately is both plausible and likely. Some specific risks are:
Some believe (hope) that state-of-the-art generative AI will be good enough to largely avoid these risks. Given that we can expect a massive increase in data science, this would be highly desirable. However, it will be each organization’s responsibility to ensure the proper use of generative AI for data science in accordance with its own objectives and standards. They will need to update their data strategy with guidelines and principles regarding when, how, and by whom AI can be applied for data science, and how the outcome should be verified for relevance and accuracy. They must also make sure that the employees concerned are aware of the AI Ethics and AI Law concerns that pertain to generative AI.
In essence, the potential is enormous, but there are also many traps on the way!
References
Eliot, L (2023), Generative AI And Data Science Have Mightily Paired Up To Reinvent Data Strategies, Exemplified Via Release Of OpenAI’s ChatGPT Code Interpreter, Forbes, July 17. https://www.forbes.com/sites/lanceeliot/2023/07/17/generative-ai-and-data-science-have-mightily-paired-up-to-reinvent-data-strategies-exemplified-via-release-of-openais-chatgpt-code-interpreter/?sh=193cd725287e
Mazzei, M. & Noble, D. (2017), Big Data Dreams: A Framework For Corporate Strategy, Business Horizons.
DalleMule, L. & Davenport, T. (2017), What’s Your Data Strategy?, Harvard Business Review, May-June. https://hbr.org/webinar/2017/04/whats-your-data-strategy
Ozsu, T. (2023), Data Science – A Systematic Treatment, Communications of the ACM (CACM), July. https://dl.acm.org/doi/10.1145/3582491