Unifying Data & Gen AI / LLM platforms

AI / Gen AI challenges for a Data platform

As a Data and AI/ML practitioner, I have always wondered why there is such a big disconnect between the business intelligence (BI) and AI/ML worlds.

Data is a key ingredient for both BI and AI/ML, and enterprise data provides the strategic differentiation for most use-cases. Given this, why do we still need separate platforms and tooling, managed by separate DataOps and MLOps pipelines, respectively?

The ideal world should look something like the reference architecture below:


Fig. 1: Unified BI & AI/ML pipeline

Following the medallion architecture, source data (both structured and unstructured) is ingested into the Bronze layer, cleansed and standardized into the Silver layer, and further modeled and transformed into the Gold layer. The data is then ready for consumption by both BI / dashboarding tools and machine learning (ML) pipelines.

In reality, however, we see that this curated / processed data is moved to another location, e.g., cloud storage buckets, or another data lake, where it is further transformed as part of ML training (LLM fine-tuning) and deployment.

So, in an enterprise landscape, Fig. 1 looks more like Fig. 2 (below). The data (pre-)processing part of an ML pipeline focuses on moving data from the source to the ML model, without necessarily covering how the model executes on that data.


Fig. 2: Data processing with separate BI-DataOps & ML-MLOps pipelines

Needless to say, this results in redundancy and fragmentation across the BI and AI/ML pipelines. Snowflake has been leading the way in unifying the two worlds. In the rest of this article, we deep-dive into how Snowflake brings large language models (LLMs) to the data, rather than the other way around, which is the prevalent pattern in most enterprise data and AI ecosystems today.

Snowflake's Gen AI capabilities: bringing LLMs to governed data

Continuing its tradition of providing a user-friendly platform with state-of-the-art data processing and governance capabilities, Snowflake has rolled out its integrated Data & AI / Gen AI platform, illustrated in Fig. 3. Cortex AI is Snowflake's Gen AI/LLM platform, with Snowflake ML catering to the more traditional AI (data science / predictive analytics) capabilities.

Fig. 3: Snowflake's unified Data & AI platform (Source: Snowflake)

We focus on the Gen AI capabilities in this article, and show how easy it has become to build state-of-the-art LLM-based use-cases on well-governed and modeled enterprise data already present in Snowflake repositories.

Snowflake provides the full set of natural language processing (NLP) capabilities:

  • Readily available LLM functions to perform routine NLP tasks, e.g., Summarize, Translate.
  • The COMPLETE function to perform user-specified custom NLP tasks, leveraging a wide choice of LLMs.
  • Finally, the ability to fine-tune LLMs using the Snowflake AI & ML Studio in just a few clicks.

We deep-dive into these three LLM capabilities in the sections below.

LLM functions for routine NLP tasks

Snowflake provides the following LLM functions for routine NLP tasks (see the Snowflake documentation for the full list). The functions are available as SQL functions and can also be invoked in Python.

  • CLASSIFY_TEXT: Given a prompt, classifies it into one of the classes that you define.
  • EXTRACT_ANSWER: Given a question and unstructured data, returns the answer to the question if it can be found in the data.
  • PARSE_DOCUMENT: Given an internal or external stage with documents, returns an object containing a JSON-formatted string with the extracted text content (OCR mode), or the extracted text and layout elements (LAYOUT mode).
  • SENTIMENT: Returns a sentiment score, from -1 to 1, representing the detected positive or negative sentiment of the given text.
  • SUMMARIZE: Returns a summary of the given text.
  • TRANSLATE: Translates given text from any supported language to any other.
  • EMBED_TEXT_768: Given a piece of text, returns a vector embedding of 768 dimensions that represents that text.
  • EMBED_TEXT_1024: Given a piece of text, returns a vector embedding of 1024 dimensions that represents that text.
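
As a quick illustration, several of these functions can be combined in a single SQL query. A minimal sketch is shown below; the database, table, and column names are hypothetical:

-- Summarize, score the sentiment of, and translate each review in one pass
-- (hypothetical database/table/column names)
select
    CONTENT,
    SNOWFLAKE.CORTEX.SUMMARIZE(CONTENT)            as review_summary,
    SNOWFLAKE.CORTEX.SENTIMENT(CONTENT)            as sentiment_score,
    SNOWFLAKE.CORTEX.TRANSLATE(CONTENT, 'en', 'de') as content_de
from CORTEX_DB.REVIEW_TEST_DATASET.REVIEW_DATA;

Because these are ordinary SQL functions, they inherit the access controls and governance already applied to the underlying tables.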

COMPLETE function for user-specified NLP tasks

The COMPLETE function is a general-purpose LLM function to perform user-specified tasks. Users can choose from a wide range of LLMs (Fig. 4), and the function generates responses based on a given prompt.

Fig. 4: LLMs supported by the COMPLETE function

Below is an example of a COMPLETE function call in SQL that analyzes the sentiment of product reviews stored in the CONTENT column of the REVIEW_DATA table, and benchmarks them against manually assessed sentiments:

select
    CONTENT,
    SENTIMENT as original_sentiment,
    SNOWFLAKE.CORTEX.COMPLETE(
        'llama2-70b-chat',
        CONCAT('Check the column CONTENT and answer if the review is "positive" or "negative". Here is the product review: ', CONTENT)
    ) as llm_sentiment
from CORTEX_DB.REVIEW_TEST_DATASET.REVIEW_DATA;

This shows how easy it is to build custom NLP functions leveraging state-of-the-art LLMs using only SQL.
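
COMPLETE also accepts an optional options object to control generation. As a sketch (the prompt text and parameter values here are illustrative): when options are passed, the prompt is supplied as an array of role/content objects and the response is returned as a JSON object rather than plain text:

-- COMPLETE with an options object to control generation behavior;
-- with options, the prompt is an array of role/content messages
-- and the result is a JSON object containing the model's choices.
select SNOWFLAKE.CORTEX.COMPLETE(
    'llama2-70b-chat',
    [{'role': 'user', 'content': 'Summarize the key themes in recent product reviews.'}],
    {'temperature': 0.2, 'max_tokens': 100}
) as response;

A lower temperature makes the output more deterministic, which is usually what you want for classification-style tasks like the sentiment benchmark above.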

Fine-tune LLMs using the Snowflake AI & ML Studio

Finally, we discuss fine-tuning LLMs using enterprise data to build task specific LLMs. We know that foundational LLMs are pre-trained on public data. Fine-tuning provides the ability to contextualize LLMs with (and restrict their responses to) enterprise knowledge captured in the form of documents, wikis, business processes, etc.

Fine-tuning entails taking a pre-trained LLM and re-training it on (smaller) enterprise data. Technically, this implies updating the weights of the last layer(s) of the trained neural network to reflect the enterprise data and task. As such, fine-tuning has traditionally been a complex process restricted to technical and engineering teams.

Thankfully, Snowflake has democratized this process: it is now possible to fine-tune state-of-the-art LLMs using its AI & ML Studio in a few clicks. Fig. 5 shows the Snowflake AI & ML Studio LLM fine-tuning entry screen:

Fig. 5: Snowflake AI & ML LLM Fine-tuning screen (Source: Snowflake)

followed by a guided set of steps to:

  • Select the base LLM

Fig. 6a: AI & ML Studio 'Base model selection' screenshot (Source: Snowflake)

  • Select the table that you would like to use for training

Fig. 6b: AI & ML Studio 'Select training data' screenshot (Source: Snowflake)

  • Select the right input and output columns to use for fine-tuning

Fig. 6c: AI & ML Studio 'Identify prompt and completion columns' screenshot (Source: Snowflake)

  • Finally, select the validation data

Fig. 6d: AI & ML Studio 'Select validation data' screenshot (Source: Snowflake)

And, that's it - your LLM fine-tuning is in progress!

Fig. 6e: AI & ML Studio 'Fine-tuning job created' screenshot (Source: Snowflake)
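
For teams that prefer SQL over the Studio UI, the same kind of fine-tuning job can also be created programmatically with the SNOWFLAKE.CORTEX.FINETUNE function. The sketch below uses hypothetical model, database, and table names; the training and validation queries are expected to return prompt and completion columns:

-- Kick off a fine-tuning job in SQL (hypothetical names throughout);
-- the training/validation queries must return 'prompt' and 'completion' columns.
select SNOWFLAKE.CORTEX.FINETUNE(
    'CREATE',
    'CORTEX_DB.PUBLIC.MY_TUNED_MODEL',   -- name for the fine-tuned model
    'mistral-7b',                        -- base LLM to fine-tune
    'select INPUT as prompt, OUTPUT as completion from CORTEX_DB.TRAIN.TICKETS',
    'select INPUT as prompt, OUTPUT as completion from CORTEX_DB.TRAIN.TICKETS_VAL'
);

The call returns a job identifier, which can then be used to track the job's progress.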

To conclude, the lines between Data and AI / Gen AI platforms are blurring, and companies like Snowflake are making it ever easier to leverage Gen AI / LLM capabilities on secure and governed data already stored in Snowflake. It is high time to give Snowflake's Cortex AI a shot for your strategic Gen AI use-cases.
