Redash chatbot add-on: LLM-based chatbot for Data Analytics, Visualisation, and Automated Insight Extraction

Introduction

In a bold and visionary quest to revolutionize its data analysis capabilities, our company is strategically immersing itself in the expansive landscape of YouTube data exploration. At the heart of this transformative initiative lies the ambitious goal of developing an avant-garde Redash chat add-on. This innovative tool empowers our organization to extract valuable insights effortlessly from a myriad of interconnected Redash dashboards and databases through a natural language interface.

At its core, this groundbreaking chat add-on creates a dynamic and interactive space, fostering conversations in a fluid question-and-answer format. It unlocks the potential for autonomous knowledge discovery and provides users with a seamless experience. Queries span a spectrum, ranging from inquiries about information displayed on dashboards to those requiring the generation of SQL queries using sophisticated Large Language Models (LLMs); these queries are then executed against our interconnected databases. The comprehensive end-to-end system we are building is poised to empower our company to extract profound, meaningful, and actionable insights from our Business Intelligence (BI) platforms.

Our company's BI dashboards serve a dual purpose; they act as powerful monitors of our business processes and transformative tools that convert data collected from YouTube into actionable insights. These insights are pivotal in steering strategic decisions and providing a competitive edge in understanding digital content consumption trends. As we embark on this data-driven journey, we envision a future where our organization effortlessly navigates the intricate landscape of data analytics, unlocking new dimensions of understanding and strategic advantage.

Tech-stack used

Data source

Initially, raw data is gathered from YouTube, meticulously curated, and stored in CSV format. Before its integration into the database, a crucial phase unfolds in the data analytics journey: the data is thoroughly cleansed and refined within a Python script. This meticulous data-cleaning process ensures that the dataset is pristine and well-prepared, setting the stage for a robust and reliable foundation for subsequent analytical endeavors.

import pandas as pd

class DataFrameManipulator:
    def __init__(self, df):
        self.df = df

    def rename_columns(self, new_column_names):
        """
        Rename columns in the Pandas DataFrame.

        Parameters:
        - new_column_names: Dictionary of old-to-new column names

        Returns:
        - Modified DataFrame
        """
        return self.df.rename(columns=new_column_names)

    def drop_rows(self, rows_to_drop, by_index=True):
        """
        Drop rows from the Pandas DataFrame.

        Parameters:
        - rows_to_drop: List of index labels or row numbers to drop
        - by_index: If True, drop rows by index labels; if False, drop rows by row numbers

        Returns:
        - Modified DataFrame
        """
        if by_index:
            return self.df.drop(rows_to_drop)
        else:
            return self.df.drop(self.df.index[rows_to_drop])

    def drop_columns(self, columns_to_drop):
        """
        Drop columns from the Pandas DataFrame.

        Parameters:
        - columns_to_drop: List of column names to drop

        Returns:
        - Modified DataFrame
        """
        return self.df.drop(columns=columns_to_drop)        
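
As a rough illustration, here is how the class might be used on the raw YouTube export before loading it into the database. The file name and column names below are placeholders, not the actual dataset:

raw_df = pd.read_csv("youtube_raw.csv")  # hypothetical file name, for illustration only

manipulator = DataFrameManipulator(raw_df)

# Standardize column names before they become database columns
clean_df = manipulator.rename_columns({"Date": "date", "Views": "views"})

# Remove an unneeded export column, then drop the first row by position
clean_df = DataFrameManipulator(clean_df).drop_columns(["Unnamed: 0"])
clean_df = DataFrameManipulator(clean_df).drop_rows([0], by_index=False)

Note that each method returns a new DataFrame rather than mutating the wrapped one, so the result is re-wrapped before the next step.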

Designing the tables in a schema with a Python script is a precise process of configuring a structured database. The script efficiently constructs the schema, giving life to well-crafted tables tailored to the specific requirements of the data model.

from sqlalchemy import create_engine, Column, Integer, String, Float, Date
from sqlalchemy.orm import declarative_base
from sqlalchemy.orm import sessionmaker
from postgres_conn import ConnectToPostgres

Base = declarative_base()

class Views(Base):

    __tablename__ = 'view_chart'

    date = Column('date', Date)
    city_id = Column('city_id', String, primary_key=True)
    city_name = Column('city_name', String)
    views = Column('views', Integer)

    def __init__(self, date, city_id, city_name, views):
        self.date = date
        self.city_id = city_id
        self.city_name = city_name
        self.views = views


class Device_type(Base):

    __tablename__ = "device_type"

    date = Column('date', Date)
    device_type = Column("device_type", String, primary_key=True)  # SQLAlchemy models need a primary key
    Views = Column('Views', Integer)

    def __init__(self, date, device_type, Views):
        self.date = date
        self.device_type = device_type
        self.Views = Views
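
A minimal sketch of how these models could be materialized as tables. The connection string below is a placeholder; in the project the connection would come from the ConnectToPostgres helper imported above:

# Placeholder connection string; substitute real credentials
engine = create_engine("postgresql://user:password@localhost:5432/youtube_db")

# Create the view_chart and device_type tables declared on Base
Base.metadata.create_all(engine)

# Open a session for loading the cleaned rows
Session = sessionmaker(bind=engine)
session = Session()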

Generating SQL Query with OpenAI

We employ OpenAI's text-davinci-003 model to generate SQL queries in response to user questions. By providing the question as a prompt to the model, we extract the resulting SQL query. The model's operation is rooted in the Transformer architecture, a deep learning model utilizing attention mechanisms to capture relationships between different words or tokens in a given text. Leveraging self-attention, the model focuses on various parts of the input text throughout the generation process, enabling a nuanced understanding of context and the generation of accurate, contextually relevant responses.

For SQL query responses based on user questions, we utilize OpenAI's text completion API, specifying parameters in the openai.Completion.create() function.

  1. engine: The engine parameter specifies the language model to use for text completion. In this case, "text-davinci-003" is the engine being used. It is a specific version of the GPT-3 model.
  2. prompt: The prompt parameter is the starting text or context given to the language model. It describes the desired SQL query based on the user's question. In this code, the prompt is dynamically generated using the user's input.
  3. max_tokens: The max_tokens parameter determines the maximum number of tokens in the generated completion. It limits the length of the generated SQL query. In this code, it is set to 50 tokens.
  4. n: The n parameter specifies the number of completions to generate. Here, it is set to 1, so only one completion will be generated.
  5. stop: The stop parameter can be used to specify a stopping condition for the completion. If a specific text pattern is provided, the completion will stop when that pattern is encountered. In this code, None is used, meaning the completion will continue until the max_tokens limit is reached.
  6. temperature: The temperature parameter controls the randomness of the generated completion. A higher value, such as 1.0, will result in more diverse and creative completions, while a lower value, such as 0.5, will make the completions more focused and deterministic. Here, it is set to 0.5.
  7. top_p: The top_p parameter sets a threshold for the cumulative probability distribution of the generated completion. Tokens with cumulative probabilities up to the top_p value are considered. A value of 1.0 means all tokens are considered.
  8. frequency_penalty: The frequency_penalty parameter adjusts the penalty for frequently occurring tokens in the completion. A higher value increases the penalty and reduces the likelihood of repetitive completions. Here, it is set to 0.0.
  9. presence_penalty: The presence_penalty parameter adjusts the penalty for tokens that are already present in the prompt. A higher value discourages the repetition of tokens from the prompt in the generated completion. Here, it is set to 0.0.

import openai

# The API key must be configured before calling the completion endpoint
openai.api_key = "your_openai_api_key"

prompt = "Generate a SQL query to retrieve data from the youtube_data table."
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=prompt,
    max_tokens=50,
    n=1,
    stop=None,
    temperature=0.5,
    top_p=1.0,
    frequency_penalty=0.0,
    presence_penalty=0.0
)
sql_query = response.choices[0].text.strip()

Upon acquiring the SQL query, we execute it with the cursor object, specifically crafted to fetch data from the device_type_chart table in response to the user's inquiry. Subsequently, we collect the query's result using the fetchall() method.
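
A minimal sketch of this execution step, assuming a psycopg2 connection to the same Postgres database (connection details are placeholders):

import psycopg2

# Placeholder credentials, for illustration only
conn = psycopg2.connect(host="localhost", dbname="youtube_db",
                        user="user", password="password")
cursor = conn.cursor()

# Execute the SQL generated by the model and collect every returned row
cursor.execute(sql_query)
sql_result = cursor.fetchall()

cursor.close()
conn.close()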

In the concluding phase, we present the user's question alongside the generated SQL query, the obtained SQL result, and the response generated by OpenAI. This comprehensive display provides insight into the interaction among the user's question, the generated query, and the corresponding result.

Question: the ttal number of device
SQL Query: SELECT COUNT(*) FROM device_type_chart;
SQL Result: [(5116,)]
Answer: SELECT COUNT(*) FROM device_type        

  1. Redash

Redash installation

Redash boasts a robust API, offering users a programmatic means to interact with their data. Following RESTful principles, the API facilitates operations like query execution, dashboard management, and integration with external tools. To streamline data exploration, Python has been employed to automate the conversion of SQL queries into Redash queries. Below is a simple example demonstrating how you can effortlessly retrieve the list of devices from the "device_chart_table" using the query "SELECT * from device_chart_table."

  1. Query Creation: We've streamlined the creation of Redash queries using a straightforward Python script. This script leverages the Redash API to seamlessly send the SQL query text and data source ID in a single step. Witness the simplicity in action:

import requests

api_url = "https://your-redash-url/api/queries"
headers = {"Authorization": "Key your_api_key"}

query_data = {
    "query": "SELECT * from device_chart_table",
    "data_source_id": your_data_source_id,
}

response = requests.post(api_url, headers=headers, json=query_data)
query_id = response.json()["id"]


        

2. Querying

This Python script takes care of the complex tasks, generating a new Redash query and assigning a unique ID for future use. It's a quick and efficient process that makes our interaction with Redash smoother.
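As a follow-up sketch, the stored query can also be executed programmatically. The example below assumes Redash's refresh endpoint (POST /api/queries/<id>/refresh), which returns a background job rather than the result itself:

# Trigger execution of the query created above (refresh endpoint assumed)
refresh_url = f"https://your-redash-url/api/queries/{query_id}/refresh"
response = requests.post(refresh_url, headers=headers)

# Redash answers with a job object; the query result can be fetched once the job completes
job = response.json()["job"]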

3. Query on Redash

Upon visiting your Redash dashboard, you'll discover our freshly generated query all set for execution. Redash provides a comprehensive display, showcasing the query text, its current status, and the pivotal result set, including details like the list of devices sourced from the device_type table.

4. Creating the visual

With a Python script, we seamlessly transform query results into engaging visualizations. The script leverages the POST endpoint of the Redash API, specifically targeting the /api/visualizations path.

api_url = "https://your-redash-url/api/visualizations"
headers = {"Authorization": "Key your_api_key"}

visualization_data = {
    "name": "Your Visualization",
    "type": "TABLE",  # or "CHART"
    "query_id": query_id,
    # other necessary details...
}

response = requests.post(api_url, headers=headers, json=visualization_data)
visualization_id = response.json()["id"]
Pie chart

Challenge

As we embark on the journey of automating our data exploration endeavors, an intriguing challenge presents itself: the creation of automatic visualizations through the Redash API. This endeavor requires the adept utilization of Python to seamlessly generate visual representations derived from our queries. The intricacy of this task resides in the nuanced interpretation of pivotal elements such as the x-axis, y-axis, and various other parameters, all extracted from our English instructions.
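
One possible direction, sketched under the assumption that Redash chart visualizations accept a columnMapping entry in their options payload (the option names below are illustrative rather than verified against a specific Redash version): an LLM could translate an English instruction such as "plot views by date" into the axis mapping shown here.

# Hypothetical payload: the LLM picks the chart type and the x/y columns
visualization_data = {
    "name": "Views over time",
    "type": "CHART",
    "query_id": query_id,
    "options": {
        "globalSeriesType": "line",  # chart style inferred from the instruction
        "columnMapping": {"date": "x", "views": "y"},  # axes extracted from the English request
    },
}

response = requests.post(api_url, headers=headers, json=visualization_data)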

Future Work

The project was completed on schedule, and I'm eager to pursue further enhancements in the future. Plans involve refining automatic visualizations on Redash. Contributions from the GitHub community are greatly appreciated; feel free to explore the repository through the provided link. GitHub

#DataScience #MachineLearning #Python #SQL #DeepLearning #TechSkills #CareerDevelopment #DataEngineering #DataAnalyst

