Part 1: Automatic Exploratory Data Analysis of Tabular Data Using Large Language Models and LIDA
Data Labs Analytics (Indonesia) - datalabs.id
Your trusted partner in Data Analytics and Enterprise Business Solution
As information technology keeps evolving, tabular data analysis has become a key ingredient in sound decision-making. One emerging approach is automated Exploratory Data Analysis (EDA) using Large Language Models. With the help of artificial intelligence, much of the analysis of a table can be performed automatically, surfacing patterns, trends, and other insights with far less manual effort. This article explains how Large Language Models can change the way we approach EDA on tabular data.
Large Language Models
Large Language Models (LLMs) are artificial intelligence models designed to understand and generate natural-language text. Their main advantage is the ability to process and make use of the context of human language, which lets them answer questions, generate text, and handle more complex language tasks such as translating text into other languages and interpreting text in context. LLMs have become an important milestone in natural language technology, supporting a wide range of applications, from language processing to the development of more advanced artificial intelligence systems.
Because they are trained on large amounts of human language data, LLMs can also be used to analyze language trends, track changes in word usage, and detect nuances and implicit meanings in text. Their success at these tasks shows their potential to improve how people interact with technology, opening the door to more innovative and integrated applications in fields such as education, customer service, and product development. The use of LLMs also raises several challenges, such as ethical and privacy considerations, which need to be managed carefully when developing and applying this technology.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a statistical approach that is very useful in digging deeper and understanding the structure of a dataset. With EDA, researchers or data analysts can perform analysis without relying on rigid assumptions regarding data distribution or specific statistical models. The main focus of EDA is to uncover interesting patterns, possible trends, and relationships and anomalies that may be hidden in the data.
The main goal of EDA is not only to generate insights about the data, but also to help form initial hypotheses. By understanding the key characteristics of the dataset, the researcher or data analyst can decide on the next steps of the analysis. One of the most common stages in EDA is data visualization, where graphs and plots are used to show the patterns and trends contained in the dataset clearly.
Various types of graphs such as histograms, scatter plots, box plots, and bar charts are invaluable tools for visualizing variable distributions, relationships between variables, and even detecting outliers. With the help of EDA, researchers or analysts can more easily interpret information and make decisions based on the insights gained.
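As a small illustration of the outlier detection mentioned above, the sketch below computes the five-number summary that a box plot draws and flags points beyond the usual 1.5 x IQR whiskers. It uses only the Python standard library; the function names and the sample prices are hypothetical and chosen purely for illustration.

```python
import statistics


def five_number_summary(values):
    """Return the min, quartiles, and max that a box plot displays."""
    ordered = sorted(values)
    # "inclusive" treats the data as the whole population of interest.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median,
            "q3": q3, "max": ordered[-1]}


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot whisker rule)."""
    summary = five_number_summary(values)
    iqr = summary["q3"] - summary["q1"]
    low = summary["q1"] - k * iqr
    high = summary["q3"] + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample: nine ordinary prices and one suspicious value.
prices = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]

if __name__ == "__main__":
    print(five_number_summary(prices))
    print("outliers:", iqr_outliers(prices))
```

The multiplier k=1.5 is only the conventional default; a larger value flags fewer points.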
The importance of EDA is not limited to the initial stages of data analysis. EDA also helps in designing follow-up experiments or collecting additional data if needed. This process not only enriches the initial understanding of the data but can also steer the researcher or analyst in the right direction for the research or project at hand.
By conducting EDA, we are not only building an initial picture of the data, but also opening up opportunities to make more informed and efficient decisions in taking the next steps. As a critical step in data analysis, EDA helps create a solid foundation for more in-depth and quality research.
LIDA
LIDA is a library for generating data visualizations and data-faithful infographics. LIDA is grammar agnostic: it works with any programming language and visualization library (e.g., matplotlib, seaborn, altair, d3) and with multiple large language model providers (GPT-3, PaLM, etc.). LIDA has four main modules: Summarizer, Goal Explorer, Vis Generator, and Infographic. All of these modules are implemented as Python functions and are ready to be used as a library.
Summarizer – LLMs can act as zero-shot or few-shot predictors, solving a variety of tasks with little or no example guidance. Despite this, LLMs can suffer from hallucinations, such as generating text with no basis in the training data or the current task. One way to address this issue is to give the LLM grounding context. The goal of the Summarizer is therefore to generate an informative and concise summary of a dataset that serves as the base context for the visualization task. Summarization consists of two stages. The first stage generates a base summary by applying rules that extract dataset properties, including the atomic type of each column, general statistics computed with the pandas library, and a random sample of values for each column. The second stage, which is optional, enriches the summary with the help of an LLM or of the user through the LIDA interface. This enrichment adds a semantic description of the dataset and its fields, as well as a prediction of the semantic type of each field, giving the analyst useful context for understanding the dataset and the tasks that can be performed on it.
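The first, rule-based stage described above can be sketched as follows. This is a minimal illustration and not LIDA's own implementation: it uses Python's standard library in place of pandas so that it stays self-contained, and the function names, the shape of the summary dictionary, and the sample data are all assumptions made for this example.

```python
import csv
import io
import random
import statistics


def infer_atomic_type(values):
    """Call a column 'number' if every non-empty value parses as a float."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "string"
    return "number"


def base_summary(csv_text, n_samples=2, seed=0):
    """Build a rule-based summary: type, statistics, and samples per column."""
    rng = random.Random(seed)  # seeded so the samples are reproducible
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fields = []
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        dtype = infer_atomic_type(values)
        props = {
            "dtype": dtype,
            "num_unique_values": len(set(values)),
            "samples": rng.sample(values, min(n_samples, len(values))),
        }
        if dtype == "number":
            numbers = [float(v) for v in values if v != ""]
            props.update(min=min(numbers), max=max(numbers),
                         mean=statistics.fmean(numbers))
        fields.append({"column": name, "properties": props})
    return {"n_rows": len(rows), "fields": fields}


# Hypothetical three-row dataset used only for illustration.
CSV_TEXT = """name,horsepower,origin
Civic,158,Japan
Mustang,310,USA
Golf,147,Germany
"""

if __name__ == "__main__":
    for field in base_summary(CSV_TEXT)["fields"]:
        print(field)
```

A dictionary like this is compact enough to be placed in a prompt, which is what makes it usable as grounding context for the later modules.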
Goal Explorer – This module creates data exploration goals based on the summaries generated by the Summarizer. Goal generation is framed as a multi-task generation problem in which the LLM must produce a question (hypothesis), a visualization that answers the question, and a rationale. The LIDA authors found that requiring the LLM to produce a rationale leads to more semantically meaningful goals.
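To make the three-part structure of a goal concrete, here is a minimal sketch of how a prompt could be built from a summary and how the answer could be parsed. The prompt wording, the Goal class, and the fake_llm stand-in are assumptions made for this example; they are not LIDA's actual prompt or API, and no real model is called.

```python
import json
from dataclasses import dataclass


@dataclass
class Goal:
    question: str        # the hypothesis to explore
    visualization: str   # a chart that would answer the question
    rationale: str       # why the goal is worth exploring


def build_goal_prompt(summary, n):
    """Ask for n goals, each with a question, a visualization, and a rationale."""
    return (
        f"Given the dataset summary below, propose {n} data exploration goals.\n"
        "Answer with a JSON list of objects with the keys "
        '"question", "visualization" and "rationale".\n\n'
        f"Summary:\n{json.dumps(summary, indent=2)}"
    )


def parse_goals(llm_response):
    """Parse the JSON answer; a KeyError is raised if any of the three parts is missing."""
    return [
        Goal(item["question"], item["visualization"], item["rationale"])
        for item in json.loads(llm_response)
    ]


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call; returns a canned answer."""
    return json.dumps([{
        "question": "How does horsepower differ by origin?",
        "visualization": "bar chart of mean horsepower by origin",
        "rationale": "Shows whether engine power varies across regions.",
    }])


if __name__ == "__main__":
    summary = {"fields": [{"column": "horsepower"}, {"column": "origin"}]}
    for goal in parse_goals(fake_llm(build_goal_prompt(summary, n=1))):
        print(goal)
```

Making the rationale a required field means an answer without one is rejected at parse time instead of silently producing a weaker goal.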
Vis Generator – This module generates visualization specifications and consists of three submodules: a code skeleton builder, a code generator, and a code executor. The code skeleton builder implements a library of code skeletons corresponding to programming languages and visualization grammars such as Matplotlib, GGPlot, Plotly, Altair, Seaborn, and Bokeh. Each skeleton is an executable program that imports the relevant dependencies and defines an empty function that returns a visualization specification. The code generator takes the skeleton, the dataset summary, and a visualization goal, and builds a prompt from them. An LLM is then used to generate n candidate visualization code specifications. The code executor processes and executes the code specifications and filters the results using several filtering mechanisms implemented by LIDA to detect errors. The final output is a list of visualization specifications (code) and their associated raster images.
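The skeleton, generate, execute, and filter flow described above can be sketched in a few lines. This is a simplified illustration under several assumptions: the candidate bodies are hard-coded to stand in for LLM output, the returned specification is a plain dictionary instead of a chart object or raster image, and the only filter is whether the code runs. None of the names below come from LIDA.

```python
import textwrap

# A skeleton is a runnable program with an empty function to be filled in.
SKELETON = """\
def plot(data):
    spec = None
    # <body>
    return spec
"""


def fill_skeleton(body):
    """Insert a generated body into the skeleton at the placeholder line."""
    return SKELETON.replace("    # <body>", textwrap.indent(body, "    "))


def execute_candidates(candidates, data):
    """Run each candidate and keep only those that execute and return a dict."""
    results = []
    for code in candidates:
        namespace = {}
        try:
            # WARNING: exec runs arbitrary code; sandbox it in real use.
            exec(code, namespace)
            spec = namespace["plot"](data)
        except Exception:
            continue  # drop candidates with syntax or runtime errors
        if isinstance(spec, dict):
            results.append({"code": code, "spec": spec})
    return results


# Hypothetical data and candidate bodies standing in for LLM output.
data = {"origin": ["Japan", "USA"], "horsepower": [158, 310]}
candidates = [
    fill_skeleton('spec = {"mark": "bar", "x": data["origin"], "y": data["horsepower"]}'),
    fill_skeleton('spec = {"mark": "bar", "x": data["country"]}'),  # KeyError at run time
    fill_skeleton('spec = {"mark": "bar"'),                         # syntax error
]

if __name__ == "__main__":
    for result in execute_candidates(candidates, data):
        print(result["spec"])
```

Generating several candidates and discarding the ones that fail is what lets the pipeline tolerate occasional faulty code from the model.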
Infographic – This module generates stylized graphics based on the output of the Vis Generator module. It implements a library of visual styles, described in natural language, that are applied directly to the visualization images. The style library can be edited by the user. These styles are used to generate infographics through the text-conditioned image-to-image generation capabilities of diffusion models, implemented using the Peacasso library API. Optional post-processing steps are then applied to improve the generated image (e.g., replacing the axes with the correct values from the visualization, removing grid lines, and sharpening edges).
This article is the first part of a series. If you are interested in reading more about the implementation of Automatic Exploratory Data Analysis using Streamlit web apps, please follow this link: [link to the Part 2 article to be added]
If you have any questions about Generative AI or other analytics use cases, please contact datalabs.id for further discussion.