Evaluating Snowflake for Generative AI Solutions: A Journey from Novice to Practitioner

Starting with zero knowledge of Snowflake, I approached the evaluation with a background deeply rooted in NoSQL (Cloudant, ArangoDB). For years, I had been using XQuery for XML databases (Berkeley DB XML) and AQL for ArangoDB, with my SQL skills lying dormant for over 15 years. However, my extensive experience in Generative AI, creating bots, automation tools, and other advanced systems leveraging large language models (LLMs), provided a solid foundation to build on. My proficiency in Python and numerous other programming languages was an added advantage.

Andreas Schubert from Snowflake provided an invaluable overview of the platform’s capabilities. As we conversed, I began mentally mapping the building blocks for a possible solution. It felt like unwrapping a new, exciting toy—eager to dive in and explore its potential.

Diving Into Snowflake

As mentioned, my SQL skills were rusty after more than 15 years of disuse, but my Generative AI background and solid foundation in Python gave me a head start. I had also used Streamlit before, developing custom applications with WebSockets and heavy data backends.

Armed with a basic sample file to avoid spending too much time on the initial evaluation, I consulted Snowflake documentation and sought briefings from my multiple AI assistants—used for various tasks, including coding. The anticipation of exploring a new tool that could potentially revolutionize my project was palpable.

To evaluate Snowflake, I began with a basic sample file with a flat structure containing fields such as Error Code, Issue, Reason, and Solution. These fields were duplicated in several languages. The typical process involved looking up the error code and then reading the corresponding descriptions. However, this approach was limiting if the user did not speak one of the predefined languages.
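For illustration, a row of such a file might look like the following (the error code and texts are my own invented example, not the actual client data, and the German cause and solution column names are assumptions):

ErrorCode,IssueEN,CauseEN,SolutionEN,IssueDE,CauseDE,SolutionDE
E042,Outdoor sensor not responding,Sensor battery depleted,Replace the sensor battery,Außensensor reagiert nicht,Sensorbatterie leer,Sensorbatterie austauschen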

Leveraging Snowflake's Capabilities

I loaded the sample data into a Snowflake table ERROR_CODES without modifying its content or structure. The next step was to extend the table with vector-type columns for the issue, reason, and solution fields in English, ignoring the other languages for simplicity. I chose a dimension size of 768 for the vectors, matching the output size of the embedding model I used.
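The load itself was unremarkable, but for completeness, a minimal sketch of the table and load might look like this, assuming a CSV file on a hypothetical internal stage named @error_codes_stage:

CREATE TABLE IF NOT EXISTS ERROR_CODES (
    ERRORCODE  VARCHAR,
    ISSUEEN    VARCHAR,
    CAUSEEN    VARCHAR,
    SOLUTIONEN VARCHAR,
    ISSUEDE    VARCHAR
    -- ...plus the remaining language columns from the source file
);

COPY INTO ERROR_CODES
FROM @error_codes_stage/error_codes.csv
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');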

To perform free-text searches for error codes and solutions in any language, I needed to enhance the table with vector-type columns for the ISSUEEN, CAUSEEN, and SOLUTIONEN fields. This enabled me to store text embeddings and perform similarity searches. Here's the SQL to alter the table:

ALTER TABLE ERROR_CODES
    ADD COLUMN IF NOT EXISTS ISSUE_VEC VECTOR(FLOAT, 768);

ALTER TABLE ERROR_CODES
    ADD COLUMN IF NOT EXISTS CAUSE_VEC VECTOR(FLOAT, 768);

ALTER TABLE ERROR_CODES
    ADD COLUMN IF NOT EXISTS SOLUTION_VEC VECTOR(FLOAT, 768);        

Next, I generated embeddings for my text content using the following query:

UPDATE ERROR_CODES
SET ISSUE_VEC = SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', ISSUEEN),
    CAUSE_VEC = SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', CAUSEEN),
    SOLUTION_VEC = SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', SOLUTIONEN);        
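As a quick sanity check (my own addition, not part of the original walkthrough), it is worth confirming that every row actually received its embeddings:

SELECT COUNT(*) AS missing_vectors
FROM ERROR_CODES
WHERE ISSUE_VEC IS NULL OR CAUSE_VEC IS NULL OR SOLUTION_VEC IS NULL;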

Brief Introduction to Embeddings and Dimensionality

Embeddings are a powerful technique in machine learning and natural language processing (NLP) that transform text into numerical representations. These representations capture the semantic meaning of words or phrases, enabling machines to understand and process language more effectively. In the context of Snowflake's Cortex AI, embeddings are used to convert textual data into vectors—arrays of numbers—that can be compared for similarity.

The choice of dimensionality in embeddings—essentially the size of these vectors—is crucial. Higher dimensionality can capture more nuanced details and complex relationships within the data, but it also requires more computational resources and storage space. Conversely, lower dimensionality is more efficient but might lose some of the finer-grained details.
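To make this concrete, here is a small illustrative query of my own (the phrases are made up) showing that two semantically related sentences yield a higher cosine similarity than an unrelated pair:

SELECT
  VECTOR_COSINE_SIMILARITY(
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'The outdoor sensor is not responding'),
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'My outside sensor stopped working')
  ) AS related_similarity,
  VECTOR_COSINE_SIMILARITY(
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'The outdoor sensor is not responding'),
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'The invoice was paid on time')
  ) AS unrelated_similarity;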

When deciding on the dimensionality for my embeddings, I chose 768 dimensions for several reasons:

  1. Balance of Detail and Efficiency: 768 dimensions strike a balance between capturing detailed semantic relationships and maintaining computational efficiency. While higher dimensions like 1024 or 2048 could capture even more nuances, they would also significantly increase the computational load and storage requirements.
  2. Complexity of the Data: The data I was working with, including fields like issue, reason, and solution, required a level of detail that could be effectively captured with 768 dimensions. This choice ensured that the embeddings were rich enough to differentiate between various issues and solutions without overwhelming the system.
  3. Performance Considerations: During my evaluation, I experimented with different dimensionalities. I found that 384 dimensions, while faster and less resource-intensive, did not capture enough detail for accurate similarity searches in my specific use case. 768 dimensions provided a noticeable improvement in the quality of the results, justifying the additional computational cost.
  4. Future Scalability: Choosing 768 dimensions also allows for future scalability. As the dataset grows and the complexity of queries increases, the higher dimensionality ensures that the system can handle more intricate relationships and still provide accurate results.

Querying by Vectors

To test the vector search functionality, I converted my free-text query "There is something wrong with my outside sensor" to a vector using the snowflake-arctic-embed-m model. Here's the query:

SELECT
  ERRORCODE,
  ISSUEEN,
  VECTOR_COSINE_SIMILARITY(
    ISSUE_VEC,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'There is something wrong with my outside sensor')
  ) AS similarity
FROM ERROR_CODES
ORDER BY similarity DESC
LIMIT 5;        

Multilingual Querying

I extended the query to German using Google Translate, comparing the results with searches in the German language field ISSUEDE:

SELECT
  ERRORCODE,
  ISSUEEN,
  VECTOR_COSINE_SIMILARITY(
    ISSUE_VEC,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'Mit meinem Außensensor stimmt etwas nicht')
  ) AS similarity
FROM ERROR_CODES
ORDER BY similarity DESC
LIMIT 5;        

Now let's compare the results when searching within the German-language field ISSUEDE.
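A sketch of that comparison, assuming the German text in ISSUEDE is embedded on the fly (I had created vector columns only for the English fields), could look like this:

SELECT
  ERRORCODE,
  ISSUEDE,
  VECTOR_COSINE_SIMILARITY(
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', ISSUEDE),
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'Mit meinem Außensensor stimmt etwas nicht')
  ) AS similarity
FROM ERROR_CODES
ORDER BY similarity DESC
LIMIT 5;

Note that this embeds ISSUEDE for every row at query time, which is fine for a small sample but would be slow at scale.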

The results looked the same to me, but not the same as when I searched within the ISSUEEN field. Okay, I would deal with that later.

I also had to satisfy my curiosity about how translation works via the Cortex AI functions available within Snowflake:

SELECT SNOWFLAKE.CORTEX.TRANSLATE('Mit meinem Außensensor stimmt etwas nicht', '', 'en')

It produced "Something is wrong with my outdoor sensor". Interesting, so it does work. The empty second argument, by the way, tells TRANSLATE to auto-detect the source language. Let's now try combining embeddings, vector search, and translation to see whether we get any closer to my original search over the ISSUEEN field.

Enhancing Search Accuracy

I refined the process by combining embeddings, vector search, and translations:

SELECT
  ERRORCODE,
  ISSUEEN,
  VECTOR_COSINE_SIMILARITY(
    ISSUE_VEC,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768(
        'snowflake-arctic-embed-m', 
        SNOWFLAKE.CORTEX.TRANSLATE('Mit meinem Außensensor stimmt etwas nicht', '', 'en')
    )
  ) AS similarity
FROM ERROR_CODES
ORDER BY similarity DESC
LIMIT 5;        

To avoid duplicate values, I ensured unique results with this query:

SELECT 
    ERRORCODE, 
    MAX(ISSUEEN) AS ISSUEEN, -- duplicate rows per error code share the same text, so MAX picks one value
    MAX(similarity) AS similarity
FROM (
    SELECT 
        ERRORCODE, 
        ISSUEEN,
        VECTOR_COSINE_SIMILARITY(
            ISSUE_VEC, 
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', SNOWFLAKE.CORTEX.TRANSLATE('There is something wrong with my outside sensor', '', 'en'))
        ) AS similarity
    FROM ERROR_CODES
)
GROUP BY ERRORCODE
ORDER BY similarity DESC
LIMIT 5;        

The query returned relevant results with high similarity scores, demonstrating Snowflake's powerful search capabilities.

By this point I had my data, my embedding queries, and translations working. How could I wrap it all up as an app and show it to others?

Okay, let's build the app. I wanted a simple chat UI where the user types a question or describes an issue, and the app responds with the matching error code, the issue associated with it, the reason, and the solution. But first I needed to tweak my SQL query a bit to use a common table expression (CTE).

WITH Similarity_CTE AS (
    SELECT 
        ERRORCODE, 
        ISSUEEN,
        CAUSEEN,
        SOLUTIONEN,
        VECTOR_COSINE_SIMILARITY(
            ISSUE_VEC, 
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', SNOWFLAKE.CORTEX.TRANSLATE('There is something wrong with my outside sensor', '', 'en'))
        ) AS similarity
    FROM ERROR_CODES
)
SELECT 
    ERRORCODE, 
    MAX(ISSUEEN) AS ISSUEEN, -- duplicate rows per error code share the same text, so MAX picks one value
    MAX(CAUSEEN) AS CAUSEEN,
    MAX(SOLUTIONEN) AS SOLUTIONEN,
    MAX(similarity) AS similarity
FROM Similarity_CTE
GROUP BY ERRORCODE
ORDER BY similarity DESC
LIMIT 5;        

Running this returns the top five error codes ranked by similarity, each with its issue, cause, and solution.

Building the Application with Streamlit

Now, let’s talk about Streamlit. Imagine taking all that power of Snowflake and wrapping it in a sleek, interactive web app that anyone can use. Streamlit is an open-source Python library that makes it incredibly easy to create and share custom web apps for machine learning and data science. Think of it as the bridge between complex backend processes and a user-friendly interface.

Streamlit’s integration with Snowflake allows developers to securely build, deploy, and share Streamlit apps on Snowflake’s data cloud. This means you can create applications that process and utilize data in Snowflake without worrying about moving data or application code to an external system. It’s all seamless, efficient, and incredibly powerful.
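To give a sense of how little code a Streamlit-in-Snowflake app needs, here is an illustrative hello-world sketch (my own example, not part of the original project):

# Minimal Streamlit-in-Snowflake sketch
import streamlit as st
from snowflake.snowpark.context import get_active_session

st.title("Hello from Snowflake")
session = get_active_session()  # session of the logged-in Snowflake user
st.dataframe(session.sql("SELECT CURRENT_USER() AS ME").to_pandas())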

Streamlit is available in the Snowflake Console (https://app.snowflake.com/) under the Projects section.


I wanted to create an interactive application with a simple Chat UI, where the user could type a question or describe an issue, and the app would display the error code, issue, reason, and solution. Here’s how I did it:

First, I designed a function to convert similarity scores into color codes—green for high probability, yellow for moderate probability, and red for lower probability. This color-coding helps users quickly gauge the relevance of the results.

Then, I wrote the main Streamlit app code. This included setting up the user interface, handling user input, and executing SQL queries in Snowflake to fetch the relevant data.

Here’s the complete Python code for the Streamlit app:

# Import python packages
import streamlit as st
import matplotlib.colors as mcolors
from snowflake.snowpark.context import get_active_session


# Function to convert similarity score to a color
def get_color(score):
    return mcolors.to_hex((1 - score, score, 0))

# Function to generate a colored text representation based on similarity
def get_color_text(score):
    if score > 0.90:
        return "🟢"  # Green
    elif score > 0.78:
        return "🟡"  # Yellow
    else:
        return "🔴"  # Red

# Write directly to the app
st.title("Issues and Error codes Lookup")
st.write(
    """
    The app demonstrates how to search for error codes by free text
    """
)

# Get the current credentials
session = get_active_session()

prompt = st.chat_input("Write your question")
if prompt:
    st.caption(f"Issue: {prompt}")
    st.divider()

    with st.spinner('Looking for answers...'):
        if prompt.startswith("/"):
            if prompt == "/help":
                with st.chat_message("assistant"):
                    st.markdown("**AI:** I can answer questions about Error codes.")
        else:
            reply_placeholder = st.empty()

            # Escape single quotes so user input cannot break the SQL string literal
            safe_prompt = prompt.replace("'", "''")
            sql = f"""
            WITH Similarity_CTE AS (
                SELECT 
                    ERRORCODE, 
                    ISSUEEN,
                    CAUSEEN,
                    SOLUTIONEN,
                    VECTOR_COSINE_SIMILARITY(
                        ISSUE_VEC, 
                        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 
                        SNOWFLAKE.CORTEX.TRANSLATE('{safe_prompt}', '', 'en'))
                    ) AS similarity
                FROM ERROR_CODES
            )
            SELECT 
                ERRORCODE, 
                MAX(ISSUEEN) AS ISSUEEN, -- duplicate rows per error code share the same text, so MAX picks one value
                MAX(CAUSEEN) AS CAUSEEN,
                MAX(SOLUTIONEN) AS SOLUTIONEN,
                MAX(similarity) AS similarity
            FROM Similarity_CTE
            GROUP BY ERRORCODE
            ORDER BY similarity DESC
            LIMIT 5;
            """
            data = session.sql(sql).to_pandas()
            
            for row in data.itertuples(index=False):
                color_text = get_color_text(row.SIMILARITY)
                st.markdown(
                    (
                        f"{color_text} **ERR {row.ERRORCODE}** - {row.ISSUEEN}"
                        "\n\n**Possible Cause:**"
                        f"\n\n{row.CAUSEEN}"
                        "\n\n**Possible resolution:**"
                        f"\n\n{row.SOLUTIONEN}"
                        "\n\n---"
                    )
                )        

When I typed `There is something wrong with my outside sensor`, the app returned the matching error codes with their causes and resolutions, each prefixed with its confidence color.


In this code, I utilized the powerful capabilities of Snowflake and Cortex AI to handle complex queries and vector embeddings. Streamlit provided the perfect platform to build an interactive and user-friendly application. The app takes user input, processes it using Snowflake’s advanced query functionalities, and displays the results in a clean, intuitive interface.

The Power of Simplicity

The simplicity of accessing data from Snowflake in Streamlit is remarkable. With just one line of code, you can access your data:

data = session.sql(my_sql_query)        

By replacing my_sql_query with your SQL query, you have immediate access to your data, ready to be used in your application. This seamless integration between Snowflake and Streamlit allows for efficient and powerful data processing, making it an ideal choice for developing sophisticated AI solutions.
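One nuance worth noting (my own observation): session.sql returns a lazy Snowpark DataFrame, so to work with the rows in pandas style, as the app above does, you materialize them with to_pandas():

from snowflake.snowpark.context import get_active_session

session = get_active_session()
df = session.sql("SELECT ERRORCODE, ISSUEEN FROM ERROR_CODES LIMIT 5").to_pandas()
print(df.head())  # or st.dataframe(df) inside a Streamlit app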

Wrapping up

Snowflake's intuitive interface and robust capabilities made it remarkably easy for me to dive into a completely new environment and achieve significant results in a very short time. Despite my lack of prior knowledge about Snowflake, the platform’s comprehensive documentation and powerful features enabled me to start building a functional prototype within an hour.

From initially having no idea how to use Snowflake to loading data, manipulating it with SQL, leveraging the Cortex AI layer for advanced queries, and integrating everything into a Streamlit app, the journey was both swift and enlightening. The ease with which I could implement vector-based searches, handle multilingual data, and generate meaningful insights was impressive. This experience highlighted Snowflake's potential as a versatile tool for data-driven AI solutions.

Snowflake proved to be an excellent choice for developing my Generative AI product. Its seamless integration of data storage, SQL manipulation, and AI functionalities allowed me to quickly create a prototype that met my client's needs. The platform’s user-friendly nature and powerful capabilities make it an ideal solution for developers looking to build sophisticated data applications with minimal friction.
