Decoding the Cost and Performance of Modern Data Warehouses: BigQuery, Snowflake, Redshift, and ClickHouse
Mohamed Ali
Data Engineering & Analytics | Team builder | Passionate about designing & implementing effective data strategies that align with business goals and drive value.
Introduction
In today's data-driven world, the selection of a data warehouse that balances cost and performance is critical for businesses striving to optimize their data pipelines. This document delves into a standardized approach to compare four prominent data warehouses—BigQuery, Snowflake, Redshift, and ClickHouse—focusing on their computational costs and performance for complex SQL queries. Storage costs, while relevant, are generally minimal, making computation the key cost driver. By running a complex SQL query with intensive mathematical functions, we benchmark the performance and cost across these platforms.
Note: this document does not test caching performance or complex joins, as some other benchmarks do.
The Players: Claims and Key Features
1. BigQuery:
BigQuery, offered by Google Cloud, is renowned for its serverless, highly scalable, and cost-efficient architecture. It operates on a pay-as-you-go model, charging for the compute resources used during query execution. BigQuery’s reputation lies in its simplicity—users do not need to manage infrastructure or scale systems manually. The pricing is transparent and linked directly to query runtime, making it attractive for businesses looking for low operational overhead.
2. Snowflake:
Snowflake provides a cloud-native, fully managed SaaS data warehouse, known for its ability to separate storage and compute. This allows users to scale both independently based on their needs. Snowflake excels in ease of use, multi-cloud support, and concurrency, boasting near-infinite scalability. However, these advantages come with a higher price point, especially in computational-heavy workloads. Snowflake uses credits as the unit of consumption, with the smallest computational tier (XS) used in this comparison.
3. Redshift:
Amazon Redshift offers two deployment models: Redshift Serverless and provisioned Redshift clusters. Redshift provides high-performance querying capabilities, tightly integrated with the AWS ecosystem. However, it can become complex to manage and scale, especially when dealing with varied workloads. Its pricing structure is based on node types for clusters and RPUs (Redshift Processing Units) for serverless. Redshift clusters require careful optimization to get the best performance at the lowest cost.
4. ClickHouse:
ClickHouse is an open-source columnar database designed for real-time analytics, often praised for its high throughput on OLAP (Online Analytical Processing) queries. Unlike managed services such as Snowflake and BigQuery, ClickHouse requires users to manage their own infrastructure, which typically makes it more cost-efficient in raw compute terms. In this comparison, ClickHouse was run on an AWS m6i.xlarge instance, closely mirroring Redshift's hardware configuration.
Benchmarking SQL Query:
The SQL query used to benchmark computation across all platforms includes complex mathematical operations like sin(), power(), and log(), along with string manipulations, designed to simulate a real-world, computation-heavy workload. The following is the pseudo-SQL:
-- Generate a series of numbers to simulate a dataset
WITH numbers AS (
  SELECT
    ROW_NUMBER() OVER () AS id -- Generate a unique ID for each row
  FROM
    UNNEST(GENERATE_ARRAY(1, 1000000))
)
-- Perform calculations on the generated dataset
SELECT
  CAST(id AS string) AS id, -- Convert the ID to a string
  CONCAT('user_', LPAD(CAST(id AS string), 6, '0')) AS user_id, -- Create a user ID by padding the ID
  ROUND(POWER(RANDOM() * 1000, 2), 2) AS random_square, -- Square a random number between 0 and 1000
  ROUND(LOG(RANDOM() * 100 + 1), 2) AS random_log, -- Take the log of a random number between 1 and 100
  ROUND(SIN(RANDOM() * 2 * 3.14), 2) AS sin_value, -- Calculate the sine of a random angle
  CASE
    WHEN MOD(id, 2) = 0 THEN 'even' -- Label even IDs as 'even'
    ELSE 'odd' -- Label odd IDs as 'odd'
  END AS number_type -- Classify each row as 'even' or 'odd'
FROM
  numbers; -- From the generated dataset
Note: the compatible SQL for each platform is in the Appendix at the end of the article.
Performance and Cost Breakdown
Evaluation:
cost: the cost of executing the query; the lower the cost, the better the solution.
time: the time to execute the query; the lower the time, the better-performing the solution.
cost-time: cost * time, a single metric combining both; the lower the cost-time, the better the solution in terms of cost and performance together.
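These metrics reduce to two small formulas, sketched here in Python (the helper names query_cost and cost_time are illustrative, not from any platform's API):

```python
def query_cost(hourly_rate_usd: float, runtime_s: float, utilization: float = 1.0) -> float:
    """Cost of a query billed at an hourly rate, scaled by resource utilization."""
    return utilization * runtime_s * hourly_rate_usd / 3600


def cost_time(cost_usd: float, runtime_s: float) -> float:
    """Combined cost-time metric: cost multiplied by runtime, in ($s)."""
    return cost_usd * runtime_s


# Example: a Snowflake XS warehouse (~$3/hour) running a 20-second query
cost = query_cost(3.0, 20)
print(round(cost, 4))                 # ≈ 0.0167
print(round(cost_time(cost, 20), 3))  # ≈ 0.333
```

The same two functions apply to every platform below; only the hourly rate and the utilization factor change.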
1. BigQuery:
- Running Time: 20 seconds
- Slot Time Consumed: 40 seconds
- Cost Calculation:
- Pay-as-you-go cost at $0.04 per slot hour.
- Cost = 40 seconds * $0.04 / (60 * 60) ≈ $0.0004
- Cost-time = 20*0.0004 = 0.008 ($s)
2. Snowflake:
- Running Time: 20 seconds
- Instance Size: XS (smallest tier)
- Cost Calculation:
- XS instance costs 1 credit per hour, approximately $3/hour.
- Cost = 20 seconds * $3 / (60 * 60) = $0.0167.
- Cost-time = 20 * 0.0167 = 0.334 ($s)
Note: changing the instance size doesn't affect the running time, so cost is assessed at the smallest size.
3. Redshift:
Redshift Serverless:
- Running Time: 5 minutes (300 seconds)
- Cost Calculation:
- Cost is based on RPUs, where the query consumed 8 RPUs.
- Approximate cost = $3.00.
- Cost-time = 300 * 3.00 = 900 ($s)
Redshift Cluster (dc2.large):
- Running Time: 5 minutes
- Cost Calculation:
- Node cost = $0.33 per hour.
- Utilization = 20%.
- Cost = 0.2 * 300 seconds * $0.33 / (60 * 60) ≈ $0.005.
- Cost-time = 300 * 0.005 = 1.5 ($s)
Note: changing the instance size doesn't affect the running time, so cost is assessed at the smallest size.
4. ClickHouse:
- Instance Type: AWS m6i.xlarge ($0.2354/hour)
- Running Time: 20 seconds
- Cost Calculation:
- CPU Utilization: < 5%.
- Cost = 0.05 * 20 seconds * $0.2354 / (60 * 60) ≈ $0.00007
- Cost-time = 20 * 0.00007 = 0.0014 ($s)
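Putting the figures above together, a short Python script (using the runtimes and costs reported in this article) reproduces the cost-time ranking:

```python
# (runtime_seconds, cost_usd) as reported for each platform in this article
platforms = {
    "BigQuery": (20, 0.0004),
    "Snowflake": (20, 0.0167),
    "Redshift Serverless": (300, 3.00),
    "Redshift Cluster": (300, 0.005),
    "ClickHouse": (20, 0.00007),
}

# Rank by the combined cost-time metric (lower is better)
ranked = sorted(platforms.items(), key=lambda kv: kv[1][0] * kv[1][1])
for name, (runtime_s, cost_usd) in ranked:
    print(f"{name}: cost-time = {runtime_s * cost_usd:.4f} ($s)")
```

Running this places ClickHouse first and Redshift Serverless last, matching the conclusion drawn below.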
Performance and Cost Comparison
Note: we ignore setup and maintenance costs here. For example, ClickHouse requires more operational effort since it is open source and self-managed.
Cost ($): ClickHouse (0.00007) < BigQuery (0.0004) < Redshift Cluster (0.005) < Snowflake (0.0167) < Redshift Serverless (3.00)
Time (s): BigQuery, Snowflake, and ClickHouse (20) < Redshift Serverless and Redshift Cluster (300)
Cost-time ($s): ClickHouse (0.0014) < BigQuery (0.008) < Snowflake (0.334) < Redshift Cluster (1.5) < Redshift Serverless (900)
Conclusion
The cost-performance analysis shows that ClickHouse is the clear winner when both cost and performance are considered, with extremely low CPU utilization and cost. BigQuery offers a good balance of cost and performance, but it is slightly more expensive than ClickHouse. Snowflake, while fast, comes with a higher price tag due to its credit-based pricing model. Redshift (Cluster) provides decent pricing at lower utilization but struggles with higher query runtimes. Redshift Serverless, on the other hand, is significantly more expensive and slower.
While these benchmarks focus purely on the cost of computation, it’s important to note that operational costs (like ease of use, automation, and administrative overhead) are not included in this comparison. Snowflake, being a fully managed SaaS, might justify its higher costs with lower operational overhead, whereas ClickHouse requires more manual intervention but shines in raw performance efficiency.
Appendix
SQL for each data warehouse
BigQuery
WITH numbers AS (
SELECT
ROW_NUMBER() OVER () AS id
FROM
UNNEST(GENERATE_ARRAY(1, 1000)) a
CROSS JOIN
UNNEST(GENERATE_ARRAY(1, 1000)) b
)
SELECT
CAST(id AS string) AS id
,
-- Complex String concatenation with explicit casting
CONCAT(
concat(CAST(id AS string),
cast(LPAD(CAST(id AS string), 10, '0') as string)),
-- CAST(id AS string),
CAST(ABS(CAST(rand() * 1000000 AS BIGINT)) AS string)
) AS user_id
,
-- Double Nested Mathematical Operations with randization
ROUND(POWER(rand() * 1000, 2) + LOG(CAST(rand() * 100 + 1 AS FLOAT64)) * SQRT(ABS(rand() * 1000)), 8) AS rand_float1,
ROUND(SQRT(POWER(rand() * 1000, 3) / (rand() * 10 + 1)) + EXP(rand()), 8) AS rand_float2,
-- Trigonometric Calculations Combined with Logarithmic and Exponential Functions
SIN(rand() * 2 * 3.14) + COS(LOG(CAST(ABS(rand() * 100 + 1) AS FLOAT64))) * EXP(SIN(rand() * 3.14 / 2)) AS trig_log_exp_value,
-- Nested Complex Math: Pow, Sqrt, and Logarithmic Operations Combined
ROUND(POWER(SQRT(ABS(rand() * 10000)), 3) / LOG(POWER(CAST(rand() + 1 AS FLOAT64), 2) + 1), 8) AS complex_math1,
ROUND(EXP(POWER(LOG(CAST(rand() * 1000 + 1 AS FLOAT64)), 2)) * COS(SQRT(ABS(rand() * 500))), 8) AS complex_math2
,
-- Complex String Manipulations and Conditions
CASE
WHEN MOD(id, 2) = 0
THEN
CAST(id AS string) ||
CAST(FLOOR(rand() * 10 + SIN(rand() * 2)) AS string) ||
LPAD(CAST(id AS string), 5, '0')
ELSE
CAST(id AS string) ||
CAST(FLOOR(rand() * 10 + COS(rand() * 3)) AS string) ||
LPAD(CAST(id AS string), 5, '0')
END AS segment
,
-- -- -- More complex string operations (explicit casting)
CONCAT(
SUBSTRING('category_', 1, CAST(FLOOR(rand() * 8) AS INT)),
LPAD(CAST(FLOOR(rand() * 1000) AS string), 5, '0')
) AS category,
-- Comparisons with Multiple Nested Conditions and Mathematical Operations
CASE
WHEN GREATEST(POWER(rand() * 100, 2), rand() * 200) > 150
THEN 'high'
WHEN LEAST(SQRT(rand() * 100), rand() * 200) < 50
THEN 'low'
ELSE CONCAT(
'medium_',
CAST(ABS(CAST(rand() * 100 AS BIGINT)) AS string)
)
END AS value_range
,
-- Double Trigonometric Calculations
ROUND(SIN(SIN(rand() * 2 * 3.14)) * TAN(COS(rand() * 3.14 / 2)) + EXP(SIN(rand())), 8) AS nested_trig_value
,
-- rand String Concatenation with More randized Components
CONCAT(
concat(CAST(FLOOR(rand() * 10) AS string),
CAST(FLOOR(rand() * 10) AS string)),
CAST(FLOOR(rand() * 10) AS string)
) AS rand_array
,
-- Additional Computations Involving Double Nesting and String Manipulation
ROUND((rand() * 1000 + POWER(rand() * 100, 2)) / SQRT(ABS(LOG(CAST(rand() * 100 + 1 AS FLOAT64)))), 8) AS extra_computation_1,
CONCAT(
concat(CAST(id AS string) ,
LPAD(CAST(id AS string), 5, '0')),
CAST(FLOOR(EXP(rand() * 10)) AS string)
) AS extra_computation_2
FROM numbers;
Snowflake
WITH numbers AS (
SELECT
ROW_NUMBER() OVER (ORDER BY SEQ4()) AS id
FROM
TABLE(GENERATOR(ROWCOUNT => 1000000)) -- Generate 1,000,000 rows
)
SELECT
id,
-- Concatenation of id with padded id and random number
CONCAT(id, LPAD(id::STRING, 10, '0'), ABS(CAST(RANDOM() * 1000000 AS BIGINT))) AS user_id,
-- Random float with adjusted LOG usage, using LN instead of LOG when only one argument is needed
(POWER(RANDOM() * 1000, 2) + LN(GREATEST(RANDOM() * 100 + 1, 0.0001)) * SQRT(ABS(RANDOM() * 1000))) AS rand_float1,
-- -- Random square root and exponential
ROUND(SQRT(ABS(POWER(RANDOM() * 10000, 3))) / (RANDOM() + 1) + EXP(RANDOM()), 8) AS rand_float2,
-- -- Trigonometric and logarithmic calculations, with corrected log function
SIN(RANDOM() * 2 * 3.14) + COS(LN(GREATEST(ABS(RANDOM() * 100 + 1), 0.0001))) * EXP(SIN(RANDOM() * 3.14 / 2)) AS trig_log_exp_value,
-- Complex mathematical operations with appropriate functions
ROUND(POWER(SQRT(ABS(RANDOM() * 10000)), 3) / LN(GREATEST(POWER(RANDOM() + 1, 2) + 1, 0.0001)), 8) AS complex_math1,
ROUND(EXP(POWER(LN(GREATEST(RANDOM() * 1000 + 1, 0.0001)), 2)) * COS(SQRT(ABS(RANDOM() * 500))), 8) AS complex_math2,
-- -- Conditional string concatenation based on id
CASE
WHEN MOD(id, 2) = 0 THEN CONCAT(id, FLOOR(RANDOM() * 10 + SIN(RANDOM() * 2)), LPAD(id::STRING, 5, '0'))
ELSE CONCAT(id, FLOOR(RANDOM() * 10 + COS(RANDOM() * 3)), LPAD(id::STRING, 5, '0'))
END AS segment,
-- Simple random category generation
CONCAT('category_', FLOOR(RANDOM() * 8), LPAD(FLOOR(RANDOM() * 1000)::STRING, 5, '0')) AS category,
-- Complex conditionals and mathematical operations
CASE
WHEN GREATEST(POWER(RANDOM() * 100, 2), RANDOM() * 200) > 150 THEN 'high'
WHEN LEAST(SQRT(ABS(RANDOM() * 100)), RANDOM() * 200) < 50 THEN 'low'
ELSE CONCAT('medium_', ABS(CAST(RANDOM() * 100 AS BIGINT)))
END AS value_range,
-- Nested trigonometric calculations
ROUND(SIN(SIN(RANDOM() * 2 * 3.14)) * TAN(COS(RANDOM() * 3.14 / 2)) + EXP(SIN(RANDOM())), 8) AS nested_trig_value,
-- Random array as a string
CONCAT(FLOOR(RANDOM() * 10), FLOOR(RANDOM() * 10), FLOOR(RANDOM() * 10)) AS rand_array,
-- Additional mathematical computations
ROUND((RANDOM() * 1000 + POWER(RANDOM() * 100, 2)) / SQRT(ABS(LN(GREATEST(RANDOM() * 100 + 1, 0.0001)))), 8) AS extra_computation_1,
-- String concatenation with random numbers
CONCAT(id, LPAD(id::STRING, 5, '0'), FLOOR(EXP(RANDOM() * 10))) AS extra_computation_2
FROM numbers;
Redshift
WITH numbers AS (
SELECT
ROW_NUMBER() OVER () AS id
FROM
(SELECT 1 FROM stv_blocklist LIMIT 1000) a
CROSS JOIN
(SELECT 1 FROM stv_blocklist LIMIT 1000) b
)
SELECT
CAST(id AS VARCHAR) AS id
,
-- Complex String concatenation with explicit casting
CONCAT(
concat(CAST(id AS VARCHAR),
cast(LPAD(CAST(id AS VARCHAR), 10, '0') as varchar)),
-- CAST(id AS VARCHAR),
CAST(ABS(CAST(RANDOM() * 1000000 AS BIGINT)) AS VARCHAR)
) AS user_id
,
-- Double Nested Mathematical Operations with Randomization
ROUND(POWER(RANDOM() * 1000, 2) + LOG(CAST(RANDOM() * 100 + 1 AS FLOAT)) * SQRT(ABS(RANDOM() * 1000)), 8) AS random_float1,
ROUND(SQRT(POWER(RANDOM() * 1000, 3) / (RANDOM() * 10 + 1)) + EXP(RANDOM()), 8) AS random_float2,
-- Trigonometric Calculations Combined with Logarithmic and Exponential Functions
SIN(RANDOM() * 2 * PI()) + COS(LOG(CAST(ABS(RANDOM() * 100 + 1) AS FLOAT))) * EXP(SIN(RANDOM() * PI() / 2)) AS trig_log_exp_value,
-- Nested Complex Math: Pow, Sqrt, and Logarithmic Operations Combined
ROUND(POWER(SQRT(ABS(RANDOM() * 10000)), 3) / LOG(POWER(CAST(RANDOM() + 1 AS FLOAT), 2) + 1), 8) AS complex_math1,
ROUND(EXP(POWER(LOG(CAST(RANDOM() * 1000 + 1 AS FLOAT)), 2)) * COS(SQRT(ABS(RANDOM() * 500))), 8) AS complex_math2
,
-- Complex String Manipulations and Conditions
CASE
WHEN MOD(id, 2) = 0
THEN
CAST(id AS VARCHAR) ||
CAST(FLOOR(RANDOM() * 10 + SIN(RANDOM() * 2)) AS VARCHAR) ||
LPAD(CAST(id AS VARCHAR), 5, '0')
ELSE
CAST(id AS VARCHAR) ||
CAST(FLOOR(RANDOM() * 10 + COS(RANDOM() * 3)) AS VARCHAR) ||
LPAD(CAST(id AS VARCHAR), 5, '0')
END AS segment
,
-- -- -- More complex string operations (explicit casting)
CONCAT(
SUBSTRING('category_', 1, CAST(FLOOR(RANDOM() * 8) AS INT)),
LPAD(CAST(FLOOR(RANDOM() * 1000) AS VARCHAR), 5, '0')
) AS category,
-- Comparisons with Multiple Nested Conditions and Mathematical Operations
CASE
WHEN GREATEST(POWER(RANDOM() * 100, 2), RANDOM() * 200) > 150
THEN 'high'
WHEN LEAST(SQRT(RANDOM() * 100), RANDOM() * 200) < 50
THEN 'low'
ELSE CONCAT(
'medium_',
CAST(ABS(CAST(RANDOM() * 100 AS BIGINT)) AS VARCHAR)
)
END AS value_range
,
-- Double Trigonometric Calculations
ROUND(SIN(SIN(RANDOM() * 2 * PI())) * TAN(COS(RANDOM() * PI() / 2)) + EXP(SIN(RANDOM())), 8) AS nested_trig_value
,
-- Random String Concatenation with More Randomized Components
CONCAT(
concat(CAST(FLOOR(RANDOM() * 10) AS VARCHAR),
CAST(FLOOR(RANDOM() * 10) AS VARCHAR)),
CAST(FLOOR(RANDOM() * 10) AS VARCHAR)
) AS random_array
,
-- Additional Computations Involving Double Nesting and String Manipulation
ROUND((RANDOM() * 1000 + POWER(RANDOM() * 100, 2)) / SQRT(ABS(LOG(CAST(RANDOM() * 100 + 1 AS FLOAT)))), 8) AS extra_computation_1,
CONCAT(
concat(CAST(id AS VARCHAR) ,
LPAD(CAST(id AS VARCHAR), 5, '0')),
CAST(FLOOR(EXP(RANDOM() * 10)) AS VARCHAR)
) AS extra_computation_2
FROM numbers;
ClickHouse
WITH numbers AS (
SELECT
ROW_NUMBER() OVER (ORDER BY number) AS id
FROM
numbers(1000000) -- Generate 1,000,000 rows
)
SELECT
id,
-- Concatenation of id with padded id and random number
CONCAT(id, LPAD(toString(id), 10, '0'), toString(ABS(CAST(rand() % 1000000 AS Int64)))) AS user_id,
-- Random float with adjusted LOG usage
(pow(rand() % 1000, 2) + log(GREATEST(rand() % 100 + 1, 0.0001)) * sqrt(ABS(rand() % 1000))) AS rand_float1,
-- Random square root and exponential
ROUND(sqrt(ABS(pow(rand() % 10000, 3))) / (rand() + 1) + exp(rand()), 8) AS rand_float2,
-- Trigonometric and logarithmic calculations
sin(rand() % 1000 * 2 * 3.14) + cos(log(GREATEST(ABS(rand() % 100 + 1), 0.0001))) * exp(sin(rand() % 1000 * 3.14 / 2)) AS trig_log_exp_value,
-- Complex mathematical operations
ROUND(pow(sqrt(ABS(rand() % 10000)), 3) / log(GREATEST(pow(rand() % 1000 + 1, 2) + 1, 0.0001)), 8) AS complex_math1,
ROUND(exp(pow(log(GREATEST(rand() % 1000 + 1, 0.0001)), 2)) * cos(sqrt(ABS(rand() % 500))), 8) AS complex_math2,
-- Conditional string concatenation based on id
CASE
WHEN id % 2 = 0 THEN CONCAT(toString(id), toString(FLOOR(rand() % 10 + sin(rand() % 1000 * 2))), LPAD(toString(id), 5, '0'))
ELSE CONCAT(toString(id), toString(FLOOR(rand() % 10 + cos(rand() % 1000 * 3))), LPAD(toString(id), 5, '0'))
END AS segment,
-- Simple random category generation
CONCAT('category_', toString(FLOOR(rand() % 8)), LPAD(toString(FLOOR(rand() % 1000)), 5, '0')) AS category,
-- Complex conditionals and mathematical operations
CASE
WHEN GREATEST(pow(rand() % 100, 2), rand() % 200) > 150 THEN 'high'
WHEN LEAST(sqrt(ABS(rand() % 100)), rand() % 200) < 50 THEN 'low'
ELSE CONCAT('medium_', toString(ABS(CAST(rand() % 100 AS Int64))))
END AS value_range,
-- Nested trigonometric calculations
ROUND(sin(sin(rand() % 1000 * 2 * 3.14)) * tan(cos(rand() % 1000 * 3.14 / 2)) + exp(sin(rand())), 8) AS nested_trig_value,
-- Random array as a string
CONCAT(toString(rand() % 10), toString(rand() % 10), toString(rand() % 10)) AS rand_array,
-- Additional mathematical computations
ROUND((rand() % 1000 + pow(rand() % 100, 2)) / sqrt(ABS(log(GREATEST(rand() % 100 + 1, 0.0001)))), 8) AS extra_computation_1,
-- String concatenation with random numbers
CONCAT(toString(id), LPAD(toString(id), 5, '0'), toString(FLOOR(exp(rand() % 10)))) AS extra_computation_2
FROM numbers;