Tackling complex vector functions in SQL, Our Way or the Highway!
Sukki S. Sandhar
VP International Presales | Technical Sales and Customer Success | Databases, Real-time Solutions, Transactional and Analytical Machine Learning | Integration and Opensource | Technical GTM | Expertise
The Power of Vector Functions
Database vector functions are critical for processing large datasets and extracting meaningful insights from them. By using vector functions, databases can perform complex mathematical operations on large arrays of data quickly and efficiently, allowing for faster and more accurate analysis.?
These functions also enable database systems to support advanced machine learning and artificial intelligence algorithms, providing businesses with valuable tools for data-driven decision-making in various fields, including finance, healthcare, and e-commerce.
But can this be done in SQL? Let’s explore this idea…
The Perceived Limitations of Maths Algorithms in SQL
Performing complex SQL functions for maths, stats, and ML analysis on big data can be challenging, depending on several factors, such as the size of the data, the complexity of the algorithms, and the hardware and software infrastructure available for analysis.
SQL is a powerful tool for data analysis, but it has some limitations when it comes to handling large datasets and complex algorithms. For instance, some SQL implementations may not support certain mathematical or statistical functions or may have limited computational capabilities. Additionally, SQL may not be the best tool for certain types of ML analysis, such as deep learning, which require specialised frameworks like TensorFlow or PyTorch.
However, several approaches can help overcome these challenges, but this means deploying yet other technologies and involve data movement to fill the gap:
Ultimately, the ease of performing complex SQL functions on big data depends on several factors, such as the complexity of the analysis, the size and structure of the data, the infrastructure available, and the expertise of the data analysts and engineers involved in the analysis.
What if you challenged yourself to do this in SQL anyway?
You could try and perform complex SQL functions for maths, stats, and ML analysis on big data, but it requires a combination of distributed computing, specialised libraries, and expert knowledge.
It means that you must invest in the infrastructure necessary to support large-scale data analysis, including distributed computing platforms, high-performance hardware, and specialised software libraries. It also means that you must recruit and retain highly skilled data analysts and engineers who are familiar with the latest technologies and techniques for big data analysis.
You may wonder if it’s really worth the cost/investment.
A cost effective and beautifully simple choice
The way to go around all of these challenges is to think about SQL differently, and most importantly the database architecture you’re selecting to use. In this instance one of the ways that SingleStore achieves this high performance is through the use of SIMD (Single Instruction Multiple Data) instruction set optimisations.
SIMD is a type of parallel processing that allows multiple operations to be performed simultaneously on multiple data elements. This can be especially useful for database operations that involve large data sets, such as aggregations or joins.
SingleStore has implemented SIMD optimisations in several key areas of its database engine, including:
领英推荐
Some of the key vector functions offered by SingleStoreDB include:
Overall, SingleStoreDB's vector functions provide a powerful set of tools for data scientists and analysts who need to process and analyse large datasets quickly and efficiently. By leveraging the latest advancements in CPU and GPU technology, SingleStoreDB is able to deliver a high-performance database management system that is ideal for a wide range of use cases relying on complex vector functions.
SingleStoreDB’s unique and simple architecture enables you to deliver this performance and speed at a fraction of the cost of what you’d need to deploy yourself to achieve the same results.
Some examples of vector functions in SingleStoreDB
Let’s now look at vectorized functions in action in SingleStoreDB. In this example, we create a table to store sales data and insert some sample data. We then use vectorized aggregation to calculate the total sales for each product, a vectorized maths function to calculate the natural logarithm of the total sales for each product, and a vectorized date function to extract the day of the week from each sales date.?
-- Create a table to store sales dat
CREATE TABLE sales (
??id INT PRIMARY KEY,
??product VARCHAR(255),
??sales_date DATE,
??amount DECIMAL(10, 2)
);
-- Insert some sample data
INSERT INTO sales VALUES
??(1, 'Product A', '2022-01-01', 100.00),
??(2, 'Product B', '2022-01-02', 200.00),
??(3, 'Product C', '2022-01-03', 150.00),
??(4, 'Product A', '2022-01-04', 175.00),
??(5, 'Product B', '2022-01-05', 225.00),
??(6, 'Product C', '2022-01-06', 250.00);
-- Calculate the total sales for each product using vectorized aggregation
SELECT
??product,
??SUM(amount) AS total_sales
FROM sales
GROUP BY product;
-- Calculate the natural logarithm of the total sales for each product using a vectorized maths function
SELECT
??product,
??LN(SUM(amount)) AS ln_total_sales
FROM sales
GROUP BY product;
-- Extract the day of the week from each sales date using a vectorized date function
SELECT
??sales_date,
??DAYOFWEEK(sales_date) AS day_of_week
FROM sales;
These queries demonstrate the power and versatility of SingleStoreDB's vector functions in SQL.
Here's an example of how you could use the DOT_PRODUCT vector function in SingleStoreDB SQL:
-- Create two tables to store vector
CREATE TABLE vector1 (
??id INT PRIMARY KEY,
??x FLOAT,
??y FLOAT,
??z FLOAT
);
CREATE TABLE vector2 (
??id INT PRIMARY KEY,
??x FLOAT,
??y FLOAT,
??z FLOAT
);
-- Insert some sample data into the tables
INSERT INTO vector1 VALUES
??(1, 1.0, 2.0, 3.0),
??(2, 4.0, 5.0, 6.0),
??(3, 7.0, 8.0, 9.0);
INSERT INTO vector2 VALUES
??(1, 10.0, 11.0, 12.0),
??(2, 13.0, 14.0, 15.0),
??(3, 16.0, 17.0, 18.0);
-- Calculate the dot product of the two vectors
SELECT DOT_PRODUCT(vector1.x, vector1.y, vector1.z, vector2.x, vector2.y, vector2.z) AS dot_product
FROM vector1
INNER JOIN vector2 ON vector1.id = vector2.id;
In this example, we create two tables to store vectors (vector1 and vector2) and insert some sample data into them. We then use the DOT_PRODUCT vector function to calculate the dot product of the two vectors. This function takes six arguments: the x, y, and z components of each vector. We join the two tables on their respective IDs to ensure that we calculate the dot product for the corresponding vectors in each table.
This example demonstrates how you can use SingleStore's vector functions to perform complex mathematical operations on large datasets. By leveraging the power of modern CPUs and GPUs, SingleStore is able to deliver high-performance vectorized functions that make it easy to work with large sets of numerical data.
If you’d like to dig in further, here is the full technical documentation on SingleStoreDB’s vector functions and another example Using Vector Functions for Image Matching in SQL with SingleStoreDB.
Taking excitement and complexity to the next level….?
In the case of using SQL to drive facial recognition / image matching capabilities, I’d recommend watching a recent webinar on DZone for a deep dive. In it, you’ll learn how to use the dot_product function (cosine similarity) to achieve facial image matching from 16 million records in just 5 milliseconds, using SQL within SingleStoreDB.
Watch the recording here. It’s a fantastic demo that goes into a great depth of details.
I hope this gives you an in-depth understanding of how you can actually perform complex vector maths functions on big data using SQL. If you’re interested in putting it to the test yourself, just give it a go here.
Technical Product Marketing and GTM | Data & AI
2 年Nice to see this Sukki!