SQL with LLMs: Chat with Your Data with Francesco Puppini
Howard Diesel
Chief Data Officer @ Modelware Systems | CDMP Master | Data Management Advisor
Executive Summary
This webinar explores the integration of Artificial Intelligence (AI) and Large Language Models (LLMs) in business intelligence and SQL query writing. It highlights the role of SQL queries in data analytics, data warehousing, and database interactions. Francesco Puppini addresses the challenges faced by text-to-SQL models, the significance of context in question answering, and the importance of understanding constraints, joins, and semantic layers in database management.
The webinar discusses conceptual modelling, fan traps, and the potential of AI in enhancing vendor sales and task-solving. The role of data engineering and user experience in data analysis is emphasised, along with challenges and strategies in AI development, while also outlining the benefits of platforms like Snowflake and Cortex in data management and analysis.
Artificial Intelligence in Business Intelligence
Howard Diesel opens the webinar and highlights the impact of Francesco Puppini's book on the unified star schema, co-authored by Bill Inmon. The book has transformed modelling approaches in regions like South Africa, particularly at Medi Clinic. Attendees are encouraged to engage with Francesco, the originator of the idea, through questions during the session, which has a relaxed and friendly atmosphere.
Francesco acknowledges that many participants may have more expertise in artificial intelligence than he does. He goes on to share his insights, which are rooted in traditional business intelligence and SQL. He also notes that the collaborative format underlines the importance of merging knowledge from various backgrounds to tackle the challenges of large language models and AI.
Integration of Large Language Models and AI in SQL Writing
The focus of the webinar is on the potential of large language models (LLMs) to assist in writing SQL and code in general. Francesco acknowledges the hype surrounding LLMs, with some companies speculating about job reductions due to AI capabilities. However, Francesco states that he has a balanced perspective of both optimism and scepticism, asserting that while LLMs can be effective tools, collaboration and effort are essential to realising AI's full potential in the workplace.
The Role of Data Analytics in Business Intelligence and Language Models
Francesco shares that he has become known for his frequent presentations. He adds that he specialises in data analytics with a focus on the "last mile," which refers to the crucial steps taken after data is stored in a warehouse.
Unlike data modellers who concentrate on the initial data storage processes, Francesco emphasises that accessing and utilising data effectively requires additional effort beyond the warehouse, often involving further coding and tools like Alteryx designed for business users. Recognising the gap in data readiness, Francesco advocates for applying the same enabling strategies used in traditional business intelligence to large language models, highlighting the growing necessity for this approach.
Figure 2 "About Me," Francesco Puppini?
The Potential of Data Warehousing and Database Interactions
The focus moves to the potential for interacting with databases through language models (LMs) and large language models (LLMs), particularly in the context of data warehousing. The attendees share their experiences in this regard. Francesco notes that these experiences highlight the evolution from tools like Power BI's Q&A feature to the current ability to leverage LLMs for generating visualisations and developing data models or ontologies.
While some organisations have yet to fully integrate LLMs with their data, they express optimism about the possibilities, emphasising the importance of data privacy and the careful selection of LLMs to avoid exposing sensitive information. Francesco expresses a blend of curiosity and caution regarding the future of data interaction using advanced language technologies.
Use of LLMs in Business Analysis
To enhance the interaction with LLMs for business analysis, there's a need for a more natural flow in queries. For instance, instead of asking for specific SQL commands, users should be able to pose high-level questions, such as analysing sales trends over the past year. Ideally, an LLM agent would autonomously identify factors contributing to revenue decline and propose actionable insights based on the data. While current LLMs can assist by suggesting relevant analyses and generating SQL queries, users often have to manage significant intermediate steps themselves. Streamlining this process would allow for a more intuitive and efficient experience in data analysis.
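As a minimal sketch, assuming a hypothetical sales table with order_date and net_amount columns, a high-level question like "analyse sales trends over the past year" might be translated by an LLM into Snowflake-flavoured SQL along these lines:

```sql
-- Monthly revenue for the trailing twelve months
-- (table and column names are hypothetical).
SELECT
    DATE_TRUNC('month', order_date) AS sales_month,
    SUM(net_amount)                 AS monthly_revenue
FROM sales
WHERE order_date >= DATEADD('year', -1, CURRENT_DATE)
GROUP BY sales_month
ORDER BY sales_month;
```

Even when an LLM produces a query like this correctly, the user still has to decide what "sales trends" means, which is exactly the intermediate work described above.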
Challenges of Text-to-SQL Models
Ndivhuwo, an attendee and a research scientist at IBM Research in South Africa, is working on developing text-to-SQL models using both large models like ChatGPT and Gemini, as well as small specialised models from universities and IBM. Ndivhuwo shares that while academic benchmarks show high scores for these models, real-world datasets reveal their limitations, particularly with complex SQL queries.
Francesco shares that interactively refining questions can improve the accuracy of the generated SQL. However, Ndivhuwo remains sceptical about the reliability of these models, citing issues like 'hallucinations' and incorrect table or column names. Despite the potential in other domains, like generative AI for images and videos, Ndivhuwo finds that current SQL generation capabilities are not yet dependable.
Case Study on Database Efficiency and Time Saving in Uber's Environment
Francesco shares a case study published by Uber, which highlights the time savings achieved through their operational strategies. While recognising the advances in large language models (LLMs) that enable summarising text from entire books, Francesco counters that they can still occasionally fail at simple queries that require joining two tables.
The Importance of Context in Question Answering
Context is crucial for effectively engaging with a language model (LLM), similar to the process of onboarding a new employee in a company. A prompt engineer must frame questions in a way that provides substantial context, thereby significantly increasing the chances of success in obtaining meaningful responses. For instance, instead of asking a generic question like, "What was our yearly revenue in the past 10 years?" which may leave the LLM confused, one should supply additional relevant information and metadata. This not only helps the LLM understand the specifics of the inquiry, but also mirrors the learning curve of a new employee who needs to become familiar with company culture, processes, and data. Ultimately, supplying the right context, such as company jargon, relevant metrics, and client specifics, can dramatically improve interaction outcomes.
Data Management and Analytics in Context
To ensure a language model (LLM) effectively assists in a workplace setting, key considerations must be in place. These include understanding the relevant language, country, business context, and industry terminology, all of which can differ significantly across cultures and companies. The challenge arises from the diverse interpretations of the same words and the complexity of cleaning historical data, which may have been input inconsistently over time. To enhance the LLM's capacity to retrieve and make sense of data, a focus on collecting and enriching metadata about the database structure is essential. This will ultimately improve the accuracy and relevance of the model's outputs.
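One lightweight way to capture this kind of structural metadata is with database comments, which several platforms (including Snowflake and PostgreSQL) support and which can later be fed to an LLM as context. This is a sketch only; the table and column names below are hypothetical:

```sql
-- Attach business definitions directly to the schema
-- (fact_sales and its columns are illustrative, not from the webinar).
COMMENT ON TABLE fact_sales IS
    'One row per order line, loaded nightly from the ERP system';
COMMENT ON COLUMN fact_sales.net_amount IS
    'Line revenue after discount, excluding tax';
COMMENT ON COLUMN fact_sales.churn_flag IS
    'Churn as defined by the sales department: no order in the last 12 months';
```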
To accurately interpret data captured from various sources, it is imperative to understand the context of data within business processes. This is why, an attendee notes, there is a need for robust metadata that describes the origins and processes influencing the data. An example of this is distinguishing between "churn" metrics derived from sales versus marketing.
Challenges lie in the evolving nature of data storage across different systems, which leads to potential misinterpretations when data is consolidated. Because companies are aiming to facilitate self-service analytics, it is essential to ensure clarity about data sources and lineage, requiring well-structured data warehouses that reflect the nuances of business processes. Continued improvement in managing this complexity is necessary to enable meaningful insights and avoid confusion in analytics.
Understanding Constraints and Joins in Database Management
Francesco moves on to focus on the importance of Data Definition Language (DDL) in creating and maintaining a data warehouse. He shares that the schema evolves over time rather than being established by a single script, and highlights the necessity of distinguishing between metadata, which can be shared, and the actual data. The actual data should remain secure to mitigate risks of data leakage, particularly in light of external threats such as those from social media and international data policies.
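As an illustration of sharing schema without data, Snowflake's GET_DDL function returns an object's definition as text, which can be handed to an LLM while the rows themselves stay private. The object name here is hypothetical:

```sql
-- Extract the CREATE statement for a table without exposing any row data
-- (the fully qualified name is illustrative).
SELECT GET_DDL('table', 'analytics_db.public.fact_sales');
```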
The use of constraints in databases is explored, and Francesco notes that while they may define relationships between tables, they are often not enforced in modern data lakes, where schemas are more fluid. He clarifies that constraints are not equivalent to joins, as there can be more constraints than actual joins in a dataset.
When working with data models, it's important to understand that there can be more constraints than available joins, which can lead to decisions that may not align with your intended analysis. For instance, if you have order headers from SAP and customer data, you might want to join these tables using different keys, such as sold-to, bill-to, or ship-to customer IDs, depending on the department's needs. The sales department might favour the sold-to join, while finance might prefer bill-to, and logistics may opt for ship-to.
It's crucial to control how tables are joined, as different join choices can yield significantly different results. While LLMs can enable dialogue when multiple options exist, there's a risk they may default to the most probable option rather than consulting the user, which underscores the need for careful oversight in join decisions, particularly when dealing with inactive relationships in the data model.
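A minimal sketch of the SAP-style scenario above, with hypothetical table and column names, shows how the same two tables support three different joins, each answering a different department's question:

```sql
-- Sales view: join the order header to the customer it was sold to.
SELECT o.order_id, o.net_value, c.customer_name
FROM order_header AS o
JOIN customer     AS c
  ON c.customer_id = o.sold_to_id;

-- Finance might instead join ON c.customer_id = o.bill_to_id,
-- and logistics ON c.customer_id = o.ship_to_id;
-- same tables, three materially different result sets.
```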
Understanding the Use of Aliases and Multi-Paths in Business Intelligence
In business intelligence, it's common to join the same physical table, such as a "Countries" table, multiple times for different purposes, like linking to clients or suppliers. To handle this, different aliases are assigned to the same table, allowing for multiple joins within a query. For instance, one alias might be "countries_CL" for clients and another "countries_Supplier" for suppliers. However, while traditional business intelligence tools handle this aliasing automatically, SQL does not allow the same table to appear twice under the same alias, posing a challenge. Therefore, there's a need for language models (LLMs) to mimic this alias creation process as BI tools do, following guidance from data experts.
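A minimal sketch of this pattern, using the alias names mentioned above and otherwise hypothetical tables and columns:

```sql
-- The same physical "countries" table joined twice under two aliases:
-- once through the client, once through the supplier.
SELECT o.order_id,
       countries_CL.country_name       AS client_country,
       countries_Supplier.country_name AS supplier_country
FROM orders    AS o
JOIN clients   AS cl                 ON cl.client_id = o.client_id
JOIN countries AS countries_CL       ON countries_CL.country_id = cl.country_id
JOIN suppliers AS s                  ON s.supplier_id = o.supplier_id
JOIN countries AS countries_Supplier ON countries_Supplier.country_id = s.country_id;
```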
Challenges of Semantic Layers in Database Management
The definition of measures is crucial in working with semantic layers, particularly when it comes to calculating specific metrics like sales amount in a database such as Snowflake's TPCH. Since the database lacks a direct "Sales Amount" column, users must combine quantity, price, and discount into a defined expression, which can be complex due to varying interpretations of revenue. For instance, gross revenue could be calculated as quantity multiplied by price minus discount, while net revenue might require additional considerations like tax at the row level. It's essential to establish clear definitions within the semantic layer, as discrepancies about which revenue to report can lead to confusion when querying data.
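A sketch of two such measure definitions on Snowflake's TPC-H sample data (in TPC-H, L_EXTENDEDPRICE is already quantity times unit price, so the expressions below are one common reading, not the only possible one):

```sql
-- Two competing "revenue" definitions on the same LINEITEM rows.
SELECT
    SUM(l_extendedprice * (1 - l_discount))               AS gross_revenue,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS revenue_incl_tax
FROM snowflake_sample_data.tpch_sf1.lineitem;
```

Unless the semantic layer pins down which expression "sales amount" means, two users (or an LLM) can report different numbers from identical data.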
Understanding Conceptual Modelling and Context in Data Warehousing
The definition of an employee can vary significantly depending on the context, often requiring conceptual modelling during data warehouse creation and reporting processes. Important criteria, such as a valid status or employment type, define an employee, and these definitions must be documented in a data dictionary to ensure clarity. Additionally, metadata about table and column names is crucial for systems like LLMs to interpret business data accurately, especially when dealing with less common databases or customised fields. Understanding the specifics of a business, including unique offerings such as a company that writes messages on potatoes, becomes essential for effective data management, as it provides the necessary context for interpreting information correctly.
When querying sales data, it's important to consider both sold and unsold products. A typical query would use an inner join to display sold items, but a right join may be necessary to include those with zero sales. Business users may not always communicate specific conditions, as seen in a scenario where a retail employee wanted sales broken down by resale while neglecting wholesale. This highlights the need for clarity and context in data requests. Miscommunication can lead to reliance on potentially inaccurate figures generated by LLMs (Large Language Models), emphasising the importance of thorough understanding and communication in business intelligence to ensure data reliability. Furthermore, it's essential to assess whether LLMs can construct complex SQL queries, such as those using a "between" statement.
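A minimal sketch of the sold-and-unsold case, with hypothetical table and column names; the right join keeps products that have no matching sales rows, which an inner join would silently drop:

```sql
-- Total sales per product, including products that never sold.
SELECT p.product_name,
       COALESCE(SUM(s.amount), 0) AS total_sales
FROM sales AS s
RIGHT JOIN products AS p
  ON p.product_id = s.product_id
GROUP BY p.product_name;
```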
Fan Traps in Database Management
The concept of a "fan trap" describes a scenario where a numeric value that should only appear once becomes duplicated due to improper joining of tables in a database query. For instance, if an invoice of $100 is linked to a client with multiple email addresses, the same invoice amount can erroneously appear multiple times in the results. This issue arises in SQL queries, particularly when there are one-to-many relationships, such as between sales and shipments. While the syntax of a query may be correct, it can yield misleading sums due to these duplications, leading to inflated total amounts. This phenomenon highlights the risk of relying on language models that mimic human patterns, as they may propagate these common errors in data handling.
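A minimal sketch of the fan trap with hypothetical tables: the query below is syntactically valid, yet a $100 invoice for a client with two email addresses contributes $200 to the total because the join fans out:

```sql
-- Looks correct, but inflates the total wherever a client has
-- more than one email address (one-to-many fan-out).
SELECT SUM(i.amount) AS total_invoiced
FROM invoices AS i
JOIN client_emails AS e
  ON e.client_id = i.client_id;

-- Aggregating at the invoice grain, before any one-to-many join,
-- gives the correct figure.
SELECT SUM(amount) AS total_invoiced
FROM invoices;
```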
Informing Business Questions for Large Language Models (LLMs)
To enhance the effectiveness of Large Language Model (LLM) systems in business contexts, it's crucial to provide them with comprehensive metadata supplied by the IT department, which includes definitions of measures, dimensions, and table joins, ensuring clarity and consistency in data interpretation. For example, at Potato Parcel, which specialises in sending love messages along with potatoes, it's essential to prepare and upload a text file containing this metadata.
When questions arise, such as differentiating between the country of clients and suppliers in sales queries, the LLM system should seek clarification from users to avoid ambiguity. Ultimately, this process must incorporate a validation phase where business users collaborate with data experts from IT to verify the accuracy of any SQL generated by the LLM system, ensuring reliable data-driven decision-making.
Potential of Artificial Intelligence in Vendor Sales and Task Solving
Francesco opens the floor to discussion, which quickly lands on the limitations of vendors claiming to integrate artificial intelligence into their systems without delivering tangible results. An attendee expresses the opinion that there is a need for accountability among vendors, as they often misrepresent capabilities.
Another attendee highlights the concept of agentic AI, advocating for the use of multiple smaller, specialised language models tailored for specific tasks rather than relying on a single large model. They share that tasks such as writing SQL queries require decomposition into subtasks, including schema linking, to ensure relevant tables and columns are identified efficiently, thereby enhancing performance and minimising errors.
Challenges and Future of AI
The attendees and Francesco discuss the challenges and potential of generating SQL queries using LLMs, emphasising the need to verify both syntax and semantics to align with user intent. The discussion highlights the distinction between different user types, such as data engineers and non-technical business users, stressing the importance of translating technical queries into accessible language.
Francesco moves on to note that while LLMs can provide significant assistance, they cannot replace data engineers or analysts, particularly in complex scenarios that require critical thinking and nuanced understanding. Furthermore, the potential of using multiple specialised agents to collaborate on query components is suggested, along with the possibility of learning from diverse perspectives to enhance problem-solving capabilities. Overall, the sentiment reflects a balanced view on leveraging AI while recognising the irreplaceable value of skilled professionals in data management.
Role of Data Engineering and Business User Experience in Data Analysis
To effectively develop a data pipeline that transfers data from source A to business intelligence (BI), it is helpful to adopt a methodical approach akin to a Q&A process that gradually clarifies user queries. Business users typically need to articulate specific questions about data, such as sales statistics in their region, leading to a sequence of prompts that delve deeper into their requirements. This iterative process ensures comprehensive analysis, as users specify what they want to visualise and how. While standard dashboards and automated reporting can simplify access to recurring data, the reality is that achieving meaningful insights requires thoughtful consideration and gradual exploration of questions rather than expecting instantaneous answers with a simple click.
When analysing data for business cases, it's crucial to take incremental steps rather than attempting to tackle everything at once. A well-structured and organised dataset, ideally presented in a single table format or one large CSV, can enhance performance and avoid the issues associated with joins.
Such flat files often yield better results than traditional dimensional or relational models because they directly address a specific question, although each new question may require its own tailored extract. Therefore, it's essential to determine what information is needed to support business decisions, gather that data, and then methodically combine it for comprehensive analysis, emphasising the importance of metadata in ensuring effective outcomes.
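A sketch of this pre-joining approach, with hypothetical table and column names: all the join decisions are made once, up front, so the LLM only ever queries one wide table:

```sql
-- Materialise a single flat table so downstream questions need no joins.
CREATE TABLE sales_flat AS
SELECT o.order_id,
       o.order_date,
       c.customer_name,
       c.region,
       p.product_name,
       l.quantity,
       l.net_amount
FROM order_lines AS l
JOIN orders      AS o ON o.order_id    = l.order_id
JOIN customers   AS c ON c.customer_id = o.customer_id
JOIN products    AS p ON p.product_id  = l.product_id;
```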
Challenges and Strategies in AI Development
Francesco highlights the challenges in implementing self-service solutions for product owners and experts, emphasising that without foundational data management, such as consistent terminology, well-defined databases, and proper lineage, expecting AI systems like LLMs to navigate varied definitions is unrealistic.
The discussion moves on to suggest a gradual approach: integrating LLMs with self-service features and allowing the technology to learn from real user interactions. An attendee stresses the importance of establishing machine-readable ontologies and knowledge graphs to guide these systems, advocating for the development of specialised agents that can improve in targeted areas. Francesco notes that the key takeaway is to start small, ensuring that foundational elements are in place before scaling up AI capabilities.
Benefits of Snowflake's Cortex in Data Management
Attendees were intrigued by the accessibility and functionality of Snowflake's Cortex. Nora confirmed that Cortex is available to all users, though region restrictions apply, particularly for accounts based in Ireland. Users can create semantic rules through YAML and provide complex SQL queries for better interpretation by the language model.
Snowflake's Cortex is actively being utilised by customers in South Africa and is set for implementation with additional clients. The platform also allows the integration of unstructured files to enhance AI-driven insights. Francesco goes on to note the importance of a unified star schema to streamline data queries, acknowledging that while this approach addresses the majority of business questions, it may not fully satisfy all user interactions. Connections and collaboration were encouraged, with participants invited to connect on LinkedIn to share insights and experiences.
The discussion then moves on to the integration of data orders with clients, emphasising the necessity of selecting a default tool to streamline operations. It highlights the limitations of a unified schema, which can handle basic queries but struggles with more complex tasks like machine learning model scoring.
Nora notes that Snowflake is committed to empowering business users by providing essential data frameworks, drawing parallels between current challenges faced by human consultants and those likely to arise with new technologies. Additionally, a demo of the Cortex Analyst tool is proposed to illustrate its capabilities, with the assurance that it can be up and running quickly using Snowflake's tutorials.
Data Analysis with Snowflake and Cortex
The webinar wraps up with a demonstration of getting started with Cortex and Augment AI. A Streamlit app has been developed to showcase daily revenue data, incorporating dimensions such as product and region. By setting up a semantic layer through a YAML file, key attributes like revenue subject area, time dimensions, and synonyms for terms like "daily revenue" are defined, enhancing data context. Interactive features allow users to query for insights, such as identifying the lowest daily revenue for each month, and to view results in both table and chart formats while checking the generated SQL queries. The effectiveness of this setup is tied to the quality of the data structure and ontology, emphasising the need for clear descriptions of data components to optimise performance. Notably, the app's use of a windowing function in SQL demonstrates an advanced capability for automated data analysis.
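A sketch of the kind of windowed query the demo describes, assuming a hypothetical daily_revenue table with revenue_date and daily_revenue columns:

```sql
-- Lowest daily revenue within each month, via a window function.
SELECT revenue_date, daily_revenue
FROM (
    SELECT revenue_date,
           daily_revenue,
           ROW_NUMBER() OVER (
               PARTITION BY DATE_TRUNC('month', revenue_date)
               ORDER BY daily_revenue ASC
           ) AS rn
    FROM daily_revenue
) AS ranked
WHERE rn = 1
ORDER BY revenue_date;
```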
A huge thank you to everyone who contributed to this webinar!