Data Analytics and Generative AI: An Epicyclic Approach to Insight
Today, inspired by the shortcomings I have observed in some Data Science and Data Analytics projects, I will talk a little more about these topics and how GenAI can help.
Data analytics is often seen as a straightforward, linear process—a sequence of steps that, when followed, lead to the extraction of insights from data. However, as those deeply embedded in the field understand, this couldn't be further from the truth. The reality is that data analysis is a complex, non-linear process that requires constant iteration, refinement, and reconsideration. The core activities of data analysis—stating the question, exploring the data, building models, interpreting results, and communicating findings—do not follow a simple, one-step-after-the-other path. Instead, they resemble a series of "epicycles," where each step informs and reshapes the others in an ongoing cycle of learning and adjustment.
The notion of epicycles in data analysis, as described by Roger D. Peng and Elizabeth Matsui in "The Art of Data Science", provides a valuable framework for understanding this iterative process. At each stage of the analysis, one must set expectations, collect relevant data, and then compare this data to the initial expectations. If the data does not match what was expected, it becomes necessary to either adjust the expectations or correct the data, repeating this cycle until a satisfactory alignment is achieved. This continuous loop of hypothesis and verification ensures that the analysis evolves in response to the data, leading to more robust and reliable outcomes. Given the exploratory nature of data analytics, I realized that I had been following this epicycle all along without knowing it, and I want to share it with you. The following image shows the epicycle of data analytics.
The world of Generative AI (GenAI) intersects with this iterative process in different ways. GenAI models, which include advanced algorithms capable of creating new content (including source code and data analytics pipelines), are built to "understand" context and tackle complex problems by following a previously defined step-by-step process. Thus, GenAI applications can approach a data analytics or data science problem by following the epicycle of data analytics, and can support data scientists and AI practitioners in refining and adjusting their solutions through more refined (and profitable) cycles of experimentation and analysis.
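To make the epicycle concrete, here is a minimal Python sketch of the loop described above: set an expectation, compare the data to it, and on a mismatch either correct the data or revise the expectation. It is only an illustration, not code from Peng and Matsui's book. The names `Expectation` and `run_epicycle`, the sample readings, and the rule that negative readings are entry errors are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Expectation:
    """A stated belief about the data: the mean should fall in [low, high]."""
    description: str
    low: float
    high: float

    def matches(self, observed: float) -> bool:
        return self.low <= observed <= self.high


def run_epicycle(values, expectation, max_rounds=3):
    """Compare data to an expectation; on a mismatch, revise and repeat.

    Returns a list of (round number, observed mean, matched?) tuples.
    """
    data = list(values)
    history = []
    for round_no in range(1, max_rounds + 1):
        observed = mean(data)
        ok = expectation.matches(observed)
        history.append((round_no, round(observed, 2), ok))
        if ok:
            break
        # Mismatch: either the data or the expectation is wrong.
        # Assumption for this example: negative readings are entry errors.
        cleaned = [v for v in data if v >= 0]
        if cleaned != data:
            data = cleaned  # correct the data
        else:
            # Nothing to clean, so widen the expectation to include what we saw.
            expectation = Expectation(
                expectation.description,
                min(expectation.low, observed),
                max(expectation.high, observed),
            )
    return history


# Hypothetical temperature readings with one obvious entry error (-999).
readings = [20, 22, -999, 21, 23]
expected = Expectation("mean temperature in Celsius", low=15, high=30)
history = run_epicycle(readings, expected)
for step in history:
    print(step)
```

In this run the first round fails because the entry error drags the mean far below the expected range, and the second round succeeds once that reading is dropped. In a real project, choosing between "fix the data" and "revise the expectation" is a judgment call for the analyst, which is exactly the step the epicycle asks us to make explicit.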
Large Language Models (LLMs) have become incredibly valuable tools in the field of data analytics programming and, in some ways, offer ways to tackle an analytical data problem in a similar way to Peng and Matsui's proposal. One of the leading applications is Codex, which powers GitHub Copilot. Codex is designed to assist developers by generating code snippets, entire functions, and even helping to debug scripts. In the context of data analytics, it can suggest code in various languages like Python, SQL, or R, streamlining the process of data manipulation, analysis, and visualization.
Another notable application is Google's PaLM (Pathways Language Model). PaLM is capable of understanding complex queries and generating detailed, accurate code to solve data analytics problems. It can help in writing scripts for data preprocessing, building machine learning models, or conducting exploratory data analysis. The strength of PaLM lies in its ability to handle nuanced and sophisticated programming tasks, making it an invaluable asset for data scientists.
No list would be complete without ChatGPT, which has proven to be a versatile tool for data analytics programming. While not as specialized as Codex, ChatGPT can still provide substantial support by answering technical questions, explaining coding concepts, and offering guidance on best practices in data analysis. It can interactively help users troubleshoot issues, optimize code, and suggest improvements, making it a powerful tool for both beginners and experienced professionals in data analytics.
There are also tools like LIDA, which act as facilitators or consultants specialized in data analytics, sitting between the programmer and an LLM. For example, LIDA implements the data analytics pipeline automatically, which makes it easier for users who are not familiar with this pipeline to generate high-quality solutions, and lets those who already know the pipeline speed up their development process. In fact, I have personally used LIDA and GPT to compare how students and professionals with little or no computational thinking skills, and no familiarity with the data analytics pipeline, learn data analytics skills. Interesting results were published by Frontiers in a scientific article that I share with you HERE.
Moreover, the iterative process highlighted in the epicycle of analysis is essential for dealing with the complexities and uncertainties inherent in AI. As models generate new data, code, or other types of outputs, these must be continuously scrutinized against the expectations set by the developers. Any deviation from expected results could indicate a need to revisit the assumptions made at the start, leading to further iterations of model refinement. I also invite you to learn more about the data science epicycle by reading the book "The Art of Data Science" here.
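As a rough illustration of scrutinizing generated output against expectations, the Python sketch below runs a piece of model-written code and reports every case where the result deviates from what the developer expected. The helper `check_generated_code` and the `generated` string are hypothetical stand-ins: no model or API is actually called, and `exec` should only be used on trusted or sandboxed code.

```python
def check_generated_code(source, func_name, cases):
    """Run generated code in a scratch namespace and list failed expectations.

    `cases` is a list of (args tuple, expected result) pairs.
    Returns a list of (args, expected, actual) tuples for each mismatch.
    """
    namespace = {}
    exec(source, namespace)  # assumption: the source is trusted or sandboxed
    func = namespace[func_name]
    failures = []
    for args, expected in cases:
        actual = func(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures


# Stand-in for text returned by a model (hypothetical; no API call is made).
generated = """
def average(xs):
    return sum(xs) / len(xs)
"""

# Expectations set by the developer before looking at the output.
cases = [(([1, 2, 3],), 2.0), (([10],), 10.0)]
failures = check_generated_code(generated, "average", cases)
print("deviations from expectations:", failures)
```

An empty list means the output matched every expectation. A non-empty list is the signal to go around the epicycle again, by revising the prompt, the code, or the expectations themselves.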
In conclusion, the integration of data analytics with GenAI illustrates the critical importance of iterative processes in developing advanced AI systems as well as computational thinking skills. By embracing the epicyclic nature of data analysis, practitioners can ensure that their models are not only powerful but also adaptable and aligned with the evolving requirements of real-world applications. Whether it’s refining the questions we ask, exploring new data, or interpreting complex outputs, the iterative approach ensures that we remain agile and responsive in the face of the ever-changing landscape of AI and data science.
What do you think: can the data analytics epicycle be fully automated with Generative Artificial Intelligence?
Innovation Expert Amplifying Business Growth
7 months ago: Jorge, thanks for sharing!
Eighth-semester industrial and systems engineering student at Tecnológico de Monterrey
7 months ago: From the text I understand that data analysis is not a simple, linear process, and that even the most advanced practitioners may end up using AI. I also understand that there are stages in which expectations must be set and data collected. This is where AI comes into play, since generative AI models are capable of creating new content, such as code and new pipelines for analyzing different data. These models are designed to understand context and provide a step-by-step path to a solution. LLMs are a great tool for data analysis, since they can help us write code in Python, SQL, and R. The post also gives us a better understanding of ChatGPT, which can be used to support this kind of data analysis. What I liked most about this post is that ChatGPT can be used to make data analysis comparisons, and I agree that adopting AI can bring significant progress for programmers in the real world.
Minor in Data Analysis & AI Tools | B.A in Business Intelligence
7 months ago: The article explains how GenAI can help make the repeated process of data analysis faster and more flexible. While GenAI can automate some parts, people are still needed to interpret the results, make decisions, and adjust the process. It may not be possible to fully automate the data analysis process because human insight and flexibility are important for understanding complex data and making sure the results fit real-world needs.
Student at Tecnológico de Monterrey
7 months ago: The concept of the "epicycle" in data analysis really shifted my perspective. I used to think of data analysis as more straightforward, but Peng and Matsui's idea of continuous feedback and adjustment is spot on. The integration of Generative AI into this process, especially with tools, is fascinating. While full automation is still a challenge, I agree that a blend of technology and human skill is key. It makes me wonder how much this could reshape our approach to data analysis in the future.
Software Engineer @ Zillow | Ex-Cisco, Ex-Tripadvisor | Good at building what you didn't know you needed
7 months ago: Great to see all these students in your comment section! It's interesting to see you're discussing this during class!