登录查看更多内容

Tackling complex vector functions in SQL, Our Way or the Highway!

Sukki S. Sandhar

VP International Presales | Technical Sales and Customer Success | Databases, Real-time Solutions, Transactional and Analytical Machine Learning | Integration and Opensource | Technical GTM | Expertise

发布日期: 2023年2月28日

The Power of Vector Functions

Database vector functions are critical for processing large datasets and extracting meaningful insights from them. By using vector functions, databases can perform complex mathematical operations on large arrays of data quickly and efficiently, allowing for faster and more accurate analysis.?

These functions also enable database systems to support advanced machine learning and artificial intelligence algorithms, providing businesses with valuable tools for data-driven decision-making in various fields, including finance, healthcare, and e-commerce.

But can this be done in SQL? Let’s explore this idea…

The Perceived Limitations of Maths Algorithms in SQL

Performing complex SQL functions for maths, stats, and ML analysis on big data can be challenging, depending on several factors, such as the size of the data, the complexity of the algorithms, and the hardware and software infrastructure available for analysis.

SQL is a powerful tool for data analysis, but it has some limitations when it comes to handling large datasets and complex algorithms. For instance, some SQL implementations may not support certain mathematical or statistical functions or may have limited computational capabilities. Additionally, SQL may not be the best tool for certain types of ML analysis, such as deep learning, which require specialised frameworks like TensorFlow or PyTorch.

However, several approaches can help overcome these challenges, but this means deploying yet other technologies and involve data movement to fill the gap:

One option is to use distributed SQL engines like Apache Hive, Spark SQL, or Presto, which allow SQL queries to be executed in a distributed environment across multiple nodes, enabling large-scale data analysis - BUT it comes with drawbacks in terms of price-performance & latency-poor business and customer experience.?
Another option is to use SQL extensions or libraries that add extra functionality to SQL, such as window functions, user-defined functions (UDFs), or machine learning libraries.? I’m not going to speak about SingleStoreDB’s Code Engine - Powered by Wasm, that's for another day! Today’s post is about using highly optimised Vector functions within SQL.??

Ultimately, the ease of performing complex SQL functions on big data depends on several factors, such as the complexity of the analysis, the size and structure of the data, the infrastructure available, and the expertise of the data analysts and engineers involved in the analysis.

What if you challenged yourself to do this in SQL anyway?

You could try and perform complex SQL functions for maths, stats, and ML analysis on big data, but it requires a combination of distributed computing, specialised libraries, and expert knowledge.

It means that you must invest in the infrastructure necessary to support large-scale data analysis, including distributed computing platforms, high-performance hardware, and specialised software libraries. It also means that you must recruit and retain highly skilled data analysts and engineers who are familiar with the latest technologies and techniques for big data analysis.

You may wonder if it’s really worth the cost/investment.

A cost effective and beautifully simple choice

The way to go around all of these challenges is to think about SQL differently, and most importantly the database architecture you’re selecting to use. In this instance one of the ways that SingleStore achieves this high performance is through the use of SIMD (Single Instruction Multiple Data) instruction set optimisations.

SIMD is a type of parallel processing that allows multiple operations to be performed simultaneously on multiple data elements. This can be especially useful for database operations that involve large data sets, such as aggregations or joins.

SingleStore has implemented SIMD optimisations in several key areas of its database engine, including:

Vectorized query execution: SingleStore's query execution engine is designed to take advantage of SIMD instructions by processing multiple rows of data in parallel. This can significantly improve query performance, especially for operations that involve aggregations or other computations across large data sets.
Compression and decompression: SingleStore uses a compressed columnar storage format to minimise disk I/O and improve query performance. SIMD instructions are used to accelerate the compression and decompression of data during read and write operations.
Hash table operations: Hash tables are a fundamental data structure used in many database operations, such as joins and group-bys. SingleStore has optimised its hash table implementation to take advantage of SIMD instructions for faster lookups and updates.

领英推荐

Mastering SQL in BigQuery: From Zero to Hero!

Free Online Courses With Certificates 1 年前

Parquet file format – everything you need to know!

Nikola Ilic 1 年前

SQL Insights: In Conversation With Ignacio Spreafico

LearnSQL.com 9 个月前

Some of the key vector functions offered by SingleStoreDB include:

vectorized aggregation, which allows for the processing of large arrays of data in parallel
vectorized maths functions, which can be used for everything from simple arithmetic to more complex mathematical operations like trigonometric functions and logarithms.
vectorized string functions, which can be used to perform operations like substring extraction, string concatenation, and pattern matching on large sets of text data
vectorized date and time functions, making it easy to perform calculations and manipulations on time-series data.

Overall, SingleStoreDB's vector functions provide a powerful set of tools for data scientists and analysts who need to process and analyse large datasets quickly and efficiently. By leveraging the latest advancements in CPU and GPU technology, SingleStoreDB is able to deliver a high-performance database management system that is ideal for a wide range of use cases relying on complex vector functions.

SingleStoreDB’s unique and simple architecture enables you to deliver this performance and speed at a fraction of the cost of what you’d need to deploy yourself to achieve the same results.

Some examples of vector functions in SingleStoreDB

Let’s now look at vectorized functions in action in SingleStoreDB. In this example, we create a table to store sales data and insert some sample data. We then use vectorized aggregation to calculate the total sales for each product, a vectorized maths function to calculate the natural logarithm of the total sales for each product, and a vectorized date function to extract the day of the week from each sales date.?

-- Create a table to store sales dat

CREATE TABLE sales (

??id INT PRIMARY KEY,

??product VARCHAR(255),

??sales_date DATE,

??amount DECIMAL(10, 2)

);

-- Insert some sample data

INSERT INTO sales VALUES

??(1, 'Product A', '2022-01-01', 100.00),

??(2, 'Product B', '2022-01-02', 200.00),

??(3, 'Product C', '2022-01-03', 150.00),

??(4, 'Product A', '2022-01-04', 175.00),

??(5, 'Product B', '2022-01-05', 225.00),

??(6, 'Product C', '2022-01-06', 250.00);

-- Calculate the total sales for each product using vectorized aggregation

SELECT

??product,

??SUM(amount) AS total_sales

FROM sales

GROUP BY product;

-- Calculate the natural logarithm of the total sales for each product using a vectorized maths function

SELECT

??product,

??LN(SUM(amount)) AS ln_total_sales

FROM sales

GROUP BY product;

-- Extract the day of the week from each sales date using a vectorized date function

SELECT

??sales_date,

??DAYOFWEEK(sales_date) AS day_of_week

FROM sales;

These queries demonstrate the power and versatility of SingleStoreDB's vector functions in SQL.

Here's an example of how you could use the DOT_PRODUCT vector function in SingleStoreDB SQL:

-- Create two tables to store vector

CREATE TABLE vector1 (

??id INT PRIMARY KEY,

??x FLOAT,

??y FLOAT,

??z FLOAT

);

CREATE TABLE vector2 (

??id INT PRIMARY KEY,

??x FLOAT,

??y FLOAT,

??z FLOAT

);

-- Insert some sample data into the tables

INSERT INTO vector1 VALUES

??(1, 1.0, 2.0, 3.0),

??(2, 4.0, 5.0, 6.0),

??(3, 7.0, 8.0, 9.0);

INSERT INTO vector2 VALUES

??(1, 10.0, 11.0, 12.0),

??(2, 13.0, 14.0, 15.0),

??(3, 16.0, 17.0, 18.0);

-- Calculate the dot product of the two vectors

SELECT DOT_PRODUCT(vector1.x, vector1.y, vector1.z, vector2.x, vector2.y, vector2.z) AS dot_product

FROM vector1

INNER JOIN vector2 ON vector1.id = vector2.id;

In this example, we create two tables to store vectors (vector1 and vector2) and insert some sample data into them. We then use the DOT_PRODUCT vector function to calculate the dot product of the two vectors. This function takes six arguments: the x, y, and z components of each vector. We join the two tables on their respective IDs to ensure that we calculate the dot product for the corresponding vectors in each table.

This example demonstrates how you can use SingleStore's vector functions to perform complex mathematical operations on large datasets. By leveraging the power of modern CPUs and GPUs, SingleStore is able to deliver high-performance vectorized functions that make it easy to work with large sets of numerical data.

If you’d like to dig in further, here is the full technical documentation on SingleStoreDB’s vector functions and another example Using Vector Functions for Image Matching in SQL with SingleStoreDB.

Taking excitement and complexity to the next level….?

In the case of using SQL to drive facial recognition / image matching capabilities, I’d recommend watching a recent webinar on DZone for a deep dive. In it, you’ll learn how to use the dot_product function (cosine similarity) to achieve facial image matching from 16 million records in just 5 milliseconds, using SQL within SingleStoreDB.

Watch the recording here. It’s a fantastic demo that goes into a great depth of details.

I hope this gives you an in-depth understanding of how you can actually perform complex vector maths functions on big data using SQL. If you’re interested in putting it to the test yourself, just give it a go here.

#sql #developer #dataanalytics #database #datascience #machinelearning #ML #AI #artificialintelligence

Abhinav Anand Srivastava

Technical Product Marketing and GTM | Data & AI

2 年

Nice to see this Sukki!

要查看或添加评论，请登录

Sukki S. Sandhar的更多文章

External Tables vs. Data Copying

2023年11月25日

External Tables vs. Data Copying

The Pros and Cons of Data Integration: External Tables vs. Data Copying Introduction: I've been at it for a while ?? In…
?? Unlocking the Full Potential ????

2023年7月28日

?? Unlocking the Full Potential ????

?? As we delve into industry-specific applications, we realise that the generalisation of foundational models are not…

6 条评论
Navigating the Fundamental Shift: Permanence of LLM Adoption in Business

2023年6月23日

Navigating the Fundamental Shift: Permanence of LLM Adoption in Business

Introduction: In the ongoing discussions surrounding the adoption of Large Language Models (LLMs) within enterprise…
"Graph or Vector: Contrasting Approaches for Data Modelling, Querying and let’s not forget Business Value

2023年5月31日

"Graph or Vector: Contrasting Approaches for Data Modelling, Querying and let’s not forget Business Value

"What are the key considerations and trade-offs in choosing between a vectored database and a graph database for a…

5 条评论
The Need for Guard Rails…The Model and the Data

2023年5月19日

The Need for Guard Rails…The Model and the Data

Introduction In the ever-evolving landscape of technology, the rapid progress of AI-ML has been both exhilarating and…

2 条评论
JSON is no Beast: Transforming Data Handling for High-Performance & Real-Time Applications

2023年3月31日

JSON is no Beast: Transforming Data Handling for High-Performance & Real-Time Applications

I have the pleasure of interacting with talented DBAs and developers on a daily basis. One of the common topics we…

1 条评论
Maximising the Value of Your Data: A Deep Dive into streaming Geospatial and Timeseries Analysis

2023年2月17日

Maximising the Value of Your Data: A Deep Dive into streaming Geospatial and Timeseries Analysis

As a technical expert and evangelist, I often receive enquiries about the advantages of using SingleStore to manage…

2 条评论
Insights from the CDAO Discussion in London: unlocking the Power of Real-Time Analytics

2023年2月10日

Insights from the CDAO Discussion in London: unlocking the Power of Real-Time Analytics

Earlier this week, I animated a discussion group with Dizem O. from BNP Paribas at the #CDAOUK event in London.

1 条评论

See all articles

Tackling complex vector functions in SQL, Our Way or the Highway!

Sukki S. Sandhar

VP International Presales | Technical Sales and Customer Success | Databases, Real-time Solutions, Transactional and Analytical Machine Learning | Integration and Opensource | Technical GTM | Expertise

领英推荐

Sukki S. Sandhar的更多文章

社区洞察

其他会员也浏览了

Mastering SQL?Joins: A Comprehensive Guide for Data?Science

Unleashing the Power of AI in SQL Server: A Practical Guide for Data-Driven Insights

WHY SQL IS YOUR FOUNDATION FOR PRACTICING DATA SCIENCE

SQL in Data Science: Why It’s Still Essential in 2025

Graph Database and Query Language 101: Speed & Simplicity

SQL: The Basics for Data Science Newbies | Learnbay

Mastering SQL: IN vs. EXISTS – Which Should You Use and When?

Meet Ultipa Manager: Toolkits for Data Scientists

How Structured Is Your Data?

领英推荐

Sukki S. Sandhar的更多文章

External Tables vs. Data Copying

?? Unlocking the Full Potential ????

Navigating the Fundamental Shift: Permanence of LLM Adoption in Business

"Graph or Vector: Contrasting Approaches for Data Modelling, Querying and let’s not forget Business Value

The Need for Guard Rails…The Model and the Data

JSON is no Beast: Transforming Data Handling for High-Performance & Real-Time Applications

Maximising the Value of Your Data: A Deep Dive into streaming Geospatial and Timeseries Analysis

Insights from the CDAO Discussion in London: unlocking the Power of Real-Time Analytics

社区洞察

其他会员也浏览了

Mastering SQL?Joins: A Comprehensive Guide for Data?Science

Unleashing the Power of AI in SQL Server: A Practical Guide for Data-Driven Insights

WHY SQL IS YOUR FOUNDATION FOR PRACTICING DATA SCIENCE

SQL in Data Science: Why It’s Still Essential in 2025

Graph Database and Query Language 101: Speed & Simplicity

SQL: The Basics for Data Science Newbies | Learnbay

Mastering SQL: IN vs. EXISTS – Which Should You Use and When?

Meet Ultipa Manager: Toolkits for Data Scientists

How Structured Is Your Data?