Redash chatbot add-on: LLM-based chatbot for Data Analytics, Visualisation, and Automated Insight Extraction

Introduction

In a bold and visionary quest to revolutionize its data analysis capabilities, our company is strategically immersing itself in the expansive landscape of YouTube data exploration. At the heart of this transformative initiative lies the ambitious goal of developing an avant-garde Redash chat add-on. This innovative tool empowers our organization to extract valuable insights effortlessly from a myriad of interconnected Redash dashboards and databases through a natural language interface.

At its core, this groundbreaking chat add-on creates a dynamic and interactive space, fostering conversations in a fluid question-and-answer format. It unlocks the potential for autonomous knowledge discovery and provides users with a seamless experience. Queries span a spectrum, ranging from inquiries about information displayed on dashboards to those requiring the generation of SQL queries using sophisticated Large Language Models (LLMs); these queries are then executed against our interconnected databases. The comprehensive end-to-end system we are building is poised to empower our company to extract profound, meaningful, and actionable insights from our Business Intelligence (BI) platforms.

Our company's BI dashboards serve a dual purpose; they act as powerful monitors of our business processes and transformative tools that convert data collected from YouTube into actionable insights. These insights are pivotal in steering strategic decisions and providing a competitive edge in understanding digital content consumption trends. As we embark on this data-driven journey, we envision a future where our organization effortlessly navigates the intricate landscape of data analytics, unlocking new dimensions of understanding and strategic advantage.

Tech-stack used

Data source

Initially, raw data is gathered from YouTube, meticulously curated, and stored in CSV format. Before its integration into the database, a crucial phase unfolds in the data analytics journey: the data is thoroughly cleansed and refined within a Python script. This meticulous data-cleaning process ensures that the dataset is pristine and well-prepared, setting the stage for a robust and reliable foundation for subsequent analytical endeavors.

import pandas as pd

class DataFrameManipulator:
    def __init__(self, df):
        self.df = df

    def rename_columns(self, new_column_names):
        """
        Rename columns in the Pandas DataFrame.

        Parameters:
        - new_column_names: Dictionary of old-to-new column names

        Returns:
        - Modified DataFrame
        """
        return self.df.rename(columns=new_column_names)

    def drop_rows(self, rows_to_drop, by_index=True):
        """
        Drop rows from the Pandas DataFrame.

        Parameters:
        - rows_to_drop: List of index labels or row numbers to drop
        - by_index: If True, drop rows by index labels; if False, drop rows by row numbers

        Returns:
        - Modified DataFrame
        """
        if by_index:
            return self.df.drop(rows_to_drop)
        else:
            return self.df.drop(self.df.index[rows_to_drop])

    def drop_columns(self, columns_to_drop):
        """
        Drop columns from the Pandas DataFrame.

        Parameters:
        - columns_to_drop: List of column names to drop

        Returns:
        - Modified DataFrame
        """
        return self.df.drop(columns=columns_to_drop)        
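
As a rough illustration, here is how the class might be used on the raw YouTube export before loading it into the database. The file name and column names below are placeholders, not the actual dataset:

raw_df = pd.read_csv("youtube_raw.csv")  # hypothetical file name, for illustration only

manipulator = DataFrameManipulator(raw_df)

# Standardize column names before they become database columns
clean_df = manipulator.rename_columns({"Date": "date", "Views": "views"})

# Remove an unneeded export column, then drop the first row by position
clean_df = DataFrameManipulator(clean_df).drop_columns(["Unnamed: 0"])
clean_df = DataFrameManipulator(clean_df).drop_rows([0], by_index=False)

Note that each method returns a new DataFrame rather than mutating the wrapped one, so the result is re-wrapped before the next step.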

Designing the tables in a schema with a Python script is a precise process of configuring a structured database. The script efficiently constructs the schema, giving life to well-crafted tables tailored to the specific requirements of the data model.

from sqlalchemy import create_engine, Column, Integer, String, Float, Date
from sqlalchemy.orm import declarative_base
from sqlalchemy.orm import sessionmaker
from postgres_conn import ConnectToPostgres

Base = declarative_base()

class Views(Base):

    __tablename__ = 'view_chart'

    date = Column('date', Date)
    city_id = Column('city_id', String, primary_key=True)
    city_name = Column('city_name', String)
    views = Column('views', Integer)

    def __init__(self, date, city_id, city_name, views):
        self.date = date
        self.city_id = city_id
        self.city_name = city_name
        self.views = views


class Device_type(Base):

    __tablename__ = "device_type"

    date = Column('date', Date)
    device_type = Column("device_type", String, primary_key=True)  # SQLAlchemy models need a primary key
    Views = Column('Views', Integer)

    def __init__(self, date, device_type, Views):
        self.date = date
        self.device_type = device_type
        self.Views = Views
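
A minimal sketch of how these models could be materialized as tables. The connection string below is a placeholder; in the project the connection would come from the ConnectToPostgres helper imported above:

# Placeholder connection string; substitute real credentials
engine = create_engine("postgresql://user:password@localhost:5432/youtube_db")

# Create the view_chart and device_type tables declared on Base
Base.metadata.create_all(engine)

# Open a session for loading the cleaned rows
Session = sessionmaker(bind=engine)
session = Session()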

Generating SQL Query with OpenAI

We employ OpenAI's text-davinci-003 model to generate SQL queries in response to user questions. By providing the question as a prompt to the model, we extract the resulting SQL query. The model's operation is rooted in the Transformer architecture, a deep learning model utilizing attention mechanisms to capture relationships between different words or tokens in a given text. Leveraging self-attention, the model focuses on various parts of the input text throughout the generation process, enabling a nuanced understanding of context and the generation of accurate, contextually relevant responses.

For SQL query responses based on user questions, we utilize OpenAI's text completion API, specifying parameters in the openai.Completion.create() function.

  1. engine: The engine parameter specifies the language model to use for text completion. In this case, "text-davinci-003" is the engine being used. It is a specific version of the GPT-3 model.
  2. prompt: The prompt parameter is the starting text or context given to the language model. It describes the desired SQL query based on the user's question. In this code, the prompt is dynamically generated using the user's input.
  3. max_tokens: The max_tokens parameter determines the maximum number of tokens in the generated completion. It limits the length of the generated SQL query. In this code, it is set to 50 tokens.
  4. n: The n parameter specifies the number of completions to generate. Here, it is set to 1, so only one completion will be generated.
  5. stop: The stop parameter can be used to specify a stopping condition for the completion. If a specific text pattern is provided, the completion will stop when that pattern is encountered. In this code, None is used, meaning the completion will continue until the max_tokens limit is reached.
  6. temperature: The temperature parameter controls the randomness of the generated completion. A higher value, such as 1.0, will result in more diverse and creative completions, while a lower value, such as 0.5, will make the completions more focused and deterministic. Here, it is set to 0.5.
  7. top_p: The top_p parameter sets a threshold for the cumulative probability distribution of the generated completion. Tokens with cumulative probabilities up to the top_p value are considered. A value of 1.0 means all tokens are considered.
  8. frequency_penalty: The frequency_penalty parameter adjusts the penalty for frequently occurring tokens in the completion. A higher value increases the penalty and reduces the likelihood of repetitive completions. Here, it is set to 0.0.
  9. presence_penalty: The presence_penalty parameter adjusts the penalty for tokens that are already present in the prompt. A higher value discourages the repetition of tokens from the prompt in the generated completion. Here, it is set to 0.0.

import openai

# The API key must be configured before calling the completion endpoint
openai.api_key = "your_openai_api_key"

prompt = "Generate a SQL query to retrieve data from the youtube_data table."
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=prompt,
    max_tokens=50,
    n=1,
    stop=None,
    temperature=0.5,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0
)
sql_query = response.choices[0].text.strip()

Upon acquiring the SQL query, we execute it with the cursor object, specifically crafted to fetch data from the device_type_chart table in response to the user's inquiry. Subsequently, we collect the query's result using the fetchall() method.
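
A minimal sketch of this execution step, assuming a psycopg2 connection to the same Postgres database (connection details are placeholders):

import psycopg2

# Placeholder credentials, for illustration only
conn = psycopg2.connect(host="localhost", dbname="youtube_db",
                        user="user", password="password")
cursor = conn.cursor()

# Execute the SQL generated by the model and collect every returned row
cursor.execute(sql_query)
sql_result = cursor.fetchall()

cursor.close()
conn.close()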

In the concluding phase, we present the user's question alongside the generated SQL query, the obtained SQL result, and the response generated by OpenAI. This comprehensive display provides insight into the interaction among the user's question, the generated query, and the corresponding result.

Question: the ttal number of device
SQL Query: SELECT COUNT(*) FROM device_type_chart;
SQL Result: [(5116,)]
Answer: SELECT COUNT(*) FROM device_type        

  1. Redash

Redash installation

Redash boasts a robust API, offering users a programmatic means to interact with their data. Following RESTful principles, the API facilitates operations like query execution, dashboard management, and integration with external tools. To streamline data exploration, Python has been employed to automate the conversion of SQL queries into Redash queries. Below is a simple example demonstrating how you can effortlessly retrieve the list of devices from the "device_chart_table" using the query "SELECT * from device_chart_table."

  1. Query Creation: We've streamlined the creation of Redash queries using a straightforward Python script. This script leverages the Redash API to seamlessly send the SQL query text and data source ID in a single step. Witness the simplicity in action:

import requests

api_url = "https://your-redash-url/api/queries"
headers = {"Authorization": "Key your_api_key"}

query_data = {
    "query": "SELECT * from device_chart_table",
    "data_source_id": your_data_source_id,
}

response = requests.post(api_url, headers=headers, json=query_data)
query_id = response.json()["id"]


        

2. Querying

This Python script takes care of the complex tasks, generating a new Redash query and assigning a unique ID for future use. It's a quick and efficient process that makes our interaction with Redash smoother.
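As a follow-up sketch, the stored query can also be executed programmatically. The example below assumes Redash's refresh endpoint (POST /api/queries/<id>/refresh), which returns a background job rather than the result itself:

# Trigger execution of the query created above (refresh endpoint assumed)
refresh_url = f"https://your-redash-url/api/queries/{query_id}/refresh"
response = requests.post(refresh_url, headers=headers)

# Redash answers with a job object; the query result can be fetched once the job completes
job = response.json()["job"]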

3. Query on Redash

Upon visiting your Redash dashboard, you'll discover our freshly generated query all set for execution. Redash provides a comprehensive display, showcasing the query text, its current status, and the pivotal result set, including details like the list of devices sourced from the device_type table.

4. Creating the visual

With a Python script, we seamlessly transform query results into engaging visualizations. The script leverages the POST endpoint of the Redash API, specifically targeting the /api/visualizations path.

api_url = "https://your-redash-url/api/visualizations"
headers = {"Authorization": "Key your_api_key"}

visualization_data = {
    "name": "Your Visualization",
    "type": "TABLE",  # or "CHART"
    "query_id": query_id,
    # other necessary details...
}

response = requests.post(api_url, headers=headers, json=visualization_data)
visualization_id = response.json()["id"]
Pie chart

Challenge

As we embark on the journey of automating our data exploration endeavors, an intriguing challenge presents itself: the creation of automatic visualizations through the Redash API. This endeavor requires the adept utilization of Python to seamlessly generate visual representations derived from our queries. The intricacy of this task resides in the nuanced interpretation of pivotal elements such as the x-axis, y-axis, and various other parameters, all extracted from our English instructions.
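
One possible direction, sketched under the assumption that Redash chart visualizations accept a columnMapping entry in their options payload (the option names below are illustrative rather than verified against a specific Redash version): an LLM could translate an English instruction such as "plot views by date" into the axis mapping shown here.

# Hypothetical payload: the LLM picks the chart type and the x/y columns
visualization_data = {
    "name": "Views over time",
    "type": "CHART",
    "query_id": query_id,
    "options": {
        "globalSeriesType": "line",  # chart style inferred from the instruction
        "columnMapping": {"date": "x", "views": "y"},  # axes extracted from the English request
    },
}

response = requests.post(api_url, headers=headers, json=visualization_data)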

Future Work

The project was completed on schedule, and I'm eager to pursue further enhancements in the future. Plans involve refining automatic visualizations on Redash. Contributions from the GitHub community are greatly appreciated; feel free to explore the repository through the provided link. GitHub

#DataScience #MachineLearning #Python #SQL #DeepLearning #TechSkills #CareerDevelopment #DataEngineering #DataAnalyst

