Decoding the Cost and Performance of Modern Data Warehouses: BigQuery, Snowflake, Redshift, and ClickHouse


Introduction

In today's data-driven world, the selection of a data warehouse that balances cost and performance is critical for businesses striving to optimize their data pipelines. This document delves into a standardized approach to compare four prominent data warehouses—BigQuery, Snowflake, Redshift, and ClickHouse—focusing on their computational costs and performance for complex SQL queries. Storage costs, while relevant, are generally minimal, making computation the key cost driver. By running a complex SQL query with intensive mathematical functions, we benchmark the performance and cost across these platforms.

Note: unlike many other benchmarks, this document does not test caching performance or complex joins.


The Players: Claims and Key Features

1. BigQuery:

BigQuery, offered by Google Cloud, is renowned for its serverless, highly scalable, and cost-efficient architecture. It operates on a pay-as-you-go model, charging for the compute resources used during query execution. BigQuery’s reputation lies in its simplicity—users do not need to manage infrastructure or scale systems manually. The pricing is transparent and linked directly to query runtime, making it attractive for businesses looking for low operational overhead.

2. Snowflake:

Snowflake provides a cloud-native, fully managed SaaS data warehouse, known for its ability to separate storage and compute. This allows users to scale both independently based on their needs. Snowflake excels in ease of use, multi-cloud support, and concurrency, boasting near-infinite scalability. However, these advantages come with a higher price point, especially in computational-heavy workloads. Snowflake uses credits as the unit of consumption, with the smallest computational tier (XS) used in this comparison.

3. Redshift:

Amazon Redshift offers two deployment models: Redshift serverless and Redshift clusters. Redshift provides high-performance querying capabilities, tightly integrated with the AWS ecosystem. However, Redshift can become complex to manage and scale, especially when dealing with varied workloads. Its pricing structure is based on node types for clusters and RPUs (Redshift Processing Units) for serverless. Redshift clusters require careful optimization to get the best performance for the lowest cost.

4. ClickHouse:

ClickHouse is an open-source columnar database designed for real-time analytics, often praised for its high throughput on OLAP (Online Analytical Processing) queries. Unlike managed services such as Snowflake and BigQuery, ClickHouse requires users to manage their own infrastructure, which typically makes it more cost-efficient in raw compute terms. In this comparison, ClickHouse was run on an AWS m6i.xlarge instance, closely mirroring Redshift's hardware configuration.


Benchmarking SQL Query:

The SQL query we used to benchmark computation across all platforms includes complex mathematical operations like sin(), power(), and log(), along with string manipulations, designed to simulate a real-world, computation-heavy workload. The following is the Pseudo SQL:

-- Generate a series of numbers to simulate a dataset
WITH numbers AS (
  SELECT
    ROW_NUMBER() OVER () AS id -- Generate a unique ID for each row
  FROM
    UNNEST(GENERATE_ARRAY(1, 1000000)) 
)

-- Perform calculations on the generated dataset
SELECT
    CAST(id AS string) AS id, -- Convert the ID to a string
    CONCAT('user_', LPAD(CAST(id AS string), 6, '0')) AS user_id, -- Create a user ID by padding the ID
    ROUND(POWER(RANDOM() * 1000, 2), 2) AS random_square, -- Square a random number between 0 and 1000
    ROUND(LOG(RANDOM() * 100 + 1), 2) AS random_log, -- Take the log of a random number between 1 and 100
    ROUND(SIN(RANDOM() * 2 * 3.14), 2) AS sin_value, -- Calculate the sine of a random angle
    CASE 
        WHEN MOD(id, 2) = 0 THEN 'even' -- Label even IDs as 'even'
        ELSE 'odd' -- Label odd IDs as 'odd'
    END AS number_type -- Classify each row as 'even' or 'odd'
FROM
    numbers; -- From the generated dataset
          

Note: the platform-specific SQL for each warehouse is listed in the Appendix at the end of this article.


Performance and Cost Breakdown

Evaluation:

Cost: the cost of executing the query. The lower the cost, the better the solution.

Time: the time taken to execute the query. The lower the time, the better-performing the solution.

Cost-time: cost * time, a single parameter that combines both, measured in dollar-seconds ($s). The lower the cost-time, the better the solution in terms of cost and time together. A minimal SQL sketch of this metric follows.
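
To make this concrete, the following is a minimal, dialect-agnostic sketch of the computation; the two input values are purely illustrative (they happen to match the BigQuery figures reported below) rather than new measurements.

-- Minimal sketch of the cost-time metric: the dollar cost of one run multiplied
-- by its wall-clock runtime in seconds. Lower is better on both dimensions.
SELECT
    0.0004      AS cost_usd,    -- cost of one query run (illustrative value)
    20          AS runtime_s,   -- wall-clock runtime of that run, in seconds
    0.0004 * 20 AS cost_time;   -- = 0.008 $s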


1. BigQuery:

- Running Time: 20 seconds

- Slot Time Consumed: 40 seconds

- Cost Calculation:

- Pay-as-you-go cost at $0.04 per slot hour.

- Cost = 40 seconds * $0.04 / (60 * 60) ≈ $0.0004

- Cost-time = 20 * 0.0004 = 0.008 ($s). A sketch for reading slot time back from BigQuery's job history follows.
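
For reference, slot consumption per job can be read back from BigQuery's INFORMATION_SCHEMA job views. The query below is a hedged sketch, not part of the benchmark: the region qualifier, the one-day window, and the $0.04-per-slot-hour rate are assumptions to adapt to your own project and pricing model.

-- Sketch (BigQuery SQL): estimate per-job cost from slot time.
SELECT
  job_id,
  total_slot_ms / 1000                 AS slot_seconds,
  (total_slot_ms / 1000) * 0.04 / 3600 AS estimated_cost_usd  -- assumes $0.04 per slot hour
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC
LIMIT 10;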


2. Snowflake:

- Running Time: 20 seconds

- Instance Size: XS (smallest tier)

- Cost Calculation:

- XS instance costs 1 credit per hour, approximately $3/hour.

- Cost = 20 seconds * $3 / (60 * 60) = $0.0167.

- Cost-time = 20 * 0.0167 = 0.334 ($s)

Note: changing the warehouse size doesn't affect the running time for this query, so cost is assessed on the smallest size. A sketch for checking runtimes via query history follows.
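
To reproduce the runtime figure, Snowflake's INFORMATION_SCHEMA exposes a QUERY_HISTORY table function. The sketch below is hedged: the cost estimate simply applies the same 1-credit-per-hour XS / ~$3-per-credit assumption used above, and it ignores Snowflake's 60-second minimum billing each time a warehouse resumes.

-- Sketch (Snowflake SQL): recent query runtimes, with an estimated cost
-- assuming an XS warehouse (1 credit/hour at roughly $3 per credit).
SELECT
    query_id,
    warehouse_name,
    total_elapsed_time / 1000              AS elapsed_s,          -- ms -> s
    (total_elapsed_time / 1000) * 3 / 3600 AS estimated_cost_usd  -- assumed rate
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 10))
ORDER BY start_time DESC;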


3. Redshift:

Redshift Serverless:

- Running Time: 5 minutes (300 seconds)

- Cost Calculation:

- Cost is based on RPUs, where the query consumed 8 RPUs.

- Approximate cost = $3.00.

- Cost-time = 300 * 3.00 = 900 ($s)


Redshift Cluster (dc2.large):

- Running Time: 5 minutes

- Cost Calculation:

- Node cost = $0.33 per hour.

- Utilization = 20%.

- Cost = 0.2 * 300 seconds * $0.3 / (60 * 60) = $0.005.

- Cost-time = 300 * 0.005 = 1.5 $s

Note: changing the cluster size doesn't affect the running time for this query, so cost is assessed on the smallest node type. A sketch for reading query runtimes from STL_QUERY follows.
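
Query runtimes on a Redshift cluster can be checked against the STL_QUERY system table. The sketch below is an assumption-laden illustration rather than an official costing method: the $0.33/hour node rate and 20% utilization are simply the figures used in the calculation above.

-- Sketch (Redshift SQL): wall-clock runtime of recent queries, with a rough
-- cost estimate assuming one dc2.large node at ~$0.33/hour and ~20% utilization.
SELECT
    query,
    TRIM(querytxt)                       AS query_text,
    DATEDIFF(second, starttime, endtime) AS elapsed_s,
    DATEDIFF(second, starttime, endtime) * 0.33 * 0.2 / 3600 AS estimated_cost_usd
FROM stl_query
ORDER BY starttime DESC
LIMIT 10;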


4. ClickHouse:

- Instance Type: AWS m6i.xlarge ($0.2354/hour)

- Running Time: 20 seconds

- Cost Calculation:

- CPU Utilization: < 5%.

- Cost = 0.05 * 20 seconds * $0.2354 / (60 * 60) ≈ $0.00007

- Cost-time = 20 * 0.00007 = 0.0014 ($s). A sketch using system.query_log to check runtimes follows.
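
Runtime and resource usage per query are available in ClickHouse's system.query_log table (query logging is enabled by default on recent releases). The sketch below is hedged: the $0.2354/hour instance rate and the ~5% utilization factor are the same assumptions as in the calculation above.

-- Sketch (ClickHouse SQL): runtime and memory of recently finished queries,
-- with a rough cost estimate using the m6i.xlarge rate and ~5% utilization.
SELECT
    query_duration_ms / 1000                          AS elapsed_s,
    read_rows,
    formatReadableSize(memory_usage)                  AS peak_memory,
    (query_duration_ms / 1000) * 0.2354 * 0.05 / 3600 AS estimated_cost_usd
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 10;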


Performance and Cost Comparison

Note: setup and maintenance costs are ignored here. For example, ClickHouse, being open source and self-managed, requires more operational effort than the fully managed options. The full cost-time ranking is also recomputed in the SQL sketch after the lists below.

Cost:

  1. ClickHouse: $0.00007
  2. BigQuery: $0.0004
  3. Redshift Cluster (dc2.large): $0.005
  4. Snowflake: $0.0167
  5. Redshift Serverless: $3


Time:

  1. ClickHouse, BigQuery, Snowflake: 20 s
  2. Redshift Cluster, Redshift Serverless: 5 min


Cost-time:

  1. ClickHouse: 0.0014 $s
  2. BigQuery: 0.008 $s
  3. Snowflake: 0.334 $s
  4. Redshift Cluster (dc2.large): 1.5 $s
  5. Redshift Serverless: 900 $s
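
The ranking above is simply the product of the measured cost and time figures; as a verification, the following plain SQL (it should run unchanged on any of the four platforms) recomputes it.

-- Recompute the cost-time ranking from the measured figures above.
SELECT platform, cost_usd, runtime_s, cost_usd * runtime_s AS cost_time
FROM (
    SELECT 'ClickHouse' AS platform, 0.00007 AS cost_usd, 20 AS runtime_s UNION ALL
    SELECT 'BigQuery', 0.0004, 20 UNION ALL
    SELECT 'Snowflake', 0.0167, 20 UNION ALL
    SELECT 'Redshift Cluster', 0.005, 300 UNION ALL
    SELECT 'Redshift Serverless', 3.0, 300
) AS measured
ORDER BY cost_time;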

Conclusion

The cost-performance analysis shows that ClickHouse is the clear winner when both cost and performance are considered, with extremely low CPU utilization and cost. BigQuery offers a good balance of cost and performance, but it is slightly more expensive than ClickHouse. Snowflake, while fast, comes with a higher price tag due to its credit-based pricing model. Redshift (Cluster) provides decent pricing at lower utilization but struggles with higher query runtimes. Redshift Serverless, on the other hand, is significantly more expensive and slower.

While these benchmarks focus purely on the cost of computation, it’s important to note that operational costs (like ease of use, automation, and administrative overhead) are not included in this comparison. Snowflake, being a fully managed SaaS, might justify its higher costs with lower operational overhead, whereas ClickHouse requires more manual intervention but shines in raw performance efficiency.


Appendix

SQL for each data warehouse

BigQuery


WITH numbers AS (
SELECT
  ROW_NUMBER() OVER () AS id
FROM
  UNNEST(GENERATE_ARRAY(1, 1000)) a
CROSS JOIN
  UNNEST(GENERATE_ARRAY(1, 1000)) b
)
SELECT 
    CAST(id AS string) AS id
    
    ,

    -- Complex String concatenation with explicit casting
    CONCAT(
        concat(CAST(id AS string), 
        cast(LPAD(CAST(id AS string), 10, '0') as string)), 
        -- CAST(id AS string), 
        CAST(ABS(CAST(rand() * 1000000 AS BIGINT)) AS string)
    ) AS user_id
    ,

    -- Double Nested Mathematical Operations with Randomization
    ROUND(POWER(rand() * 1000, 2) + LOG(CAST(rand() * 100 + 1 AS FLOAT64)) * SQRT(ABS(rand() * 1000)), 8) AS rand_float1,
    ROUND(SQRT(POWER(rand() * 1000, 3) / (rand() * 10 + 1)) + EXP(rand()), 8) AS rand_float2,

    -- Trigonometric Calculations Combined with Logarithmic and Exponential Functions
    SIN(rand() * 2 * 3.4) + COS(LOG(CAST(ABS(rand() * 100 + 1) AS FLOAT64))) * EXP(SIN(rand() * 3.14 / 2)) AS trig_log_exp_value,

    -- Nested Complex Math: Pow, Sqrt, and Logarithmic Operations Combined
    ROUND(POWER(SQRT(ABS(rand() * 10000)), 3) / LOG(POWER(CAST(rand() + 1 AS FLOAT64), 2) + 1), 8) AS complex_math1,
    ROUND(EXP(POWER(LOG(CAST(rand() * 1000 + 1 AS FLOAT64)), 2)) * COS(SQRT(ABS(rand() * 500))), 8) AS complex_math2
    
    ,

    -- Complex String Manipulations and Conditions
CASE 
  WHEN MOD(id, 2) = 0 
  THEN 
      CAST(id AS string) || 
      CAST(FLOOR(rand() * 10 + SIN(rand() * 2)) AS string) || 
      LPAD(CAST(id AS string), 5, '0')
  ELSE 
      CAST(id AS string) || 
      CAST(FLOOR(rand() * 10 + COS(rand() * 3)) AS string) || 
      LPAD(CAST(id AS string), 5, '0')
END AS segment
,

    -- More complex string operations (explicit casting)
    CONCAT(
        SUBSTRING('category_', 1, CAST(FLOOR(rand() * 8) AS INT)), 
        LPAD(CAST(FLOOR(rand() * 1000) AS string), 5, '0')
    ) AS category,

    -- Comparisons with Multiple Nested Conditions and Mathematical Operations
    CASE
      WHEN GREATEST(POWER(rand() * 100, 2), rand() * 200) > 150 
      THEN 'high'
      WHEN LEAST(SQRT(rand() * 100), rand() * 200) < 50 
      THEN 'low'
      ELSE CONCAT(
          'medium_', 
          CAST(ABS(CAST(rand() * 100 AS BIGINT)) AS string)
      )
    END AS value_range
    ,

    -- Double Trigonometric Calculations
    ROUND(SIN(SIN(rand() * 2 * 3.14)) * TAN(COS(rand() * 3.14 / 2)) + EXP(SIN(rand())), 8) AS nested_trig_value
    
    ,

    -- Random String Concatenation with More Randomized Components
    CONCAT(
        concat(CAST(FLOOR(rand() * 10) AS string),
        CAST(FLOOR(rand() * 10) AS string)),

        CAST(FLOOR(rand() * 10) AS string)
    ) AS rand_array
    
    ,


    -- Additional Computations Involving Double Nesting and String Manipulation
    ROUND((rand() * 1000 + POWER(rand() * 100, 2)) / SQRT(ABS(LOG(CAST(rand() * 100 + 1 AS FLOAT64)))), 8) AS extra_computation_1,
    CONCAT(
        concat(CAST(id AS string) , 
        LPAD(CAST(id AS string), 5, '0')), 
        CAST(FLOOR(EXP(rand() * 10)) AS string)
    ) AS extra_computation_2
FROM numbers;        


Snowflake

WITH numbers AS (
  SELECT 
        ROW_NUMBER() OVER (ORDER BY SEQ4()) AS id
    FROM 
        TABLE(GENERATOR(ROWCOUNT => 1000000))  -- Generate 1,000,000 rows
)
SELECT
  id,
  
  -- Concatenation of id with padded id and random number
  CONCAT(id, LPAD(id::STRING, 10, '0'), ABS(CAST(RANDOM() * 1000000 AS BIGINT))) AS user_id,

  -- Random float with adjusted LOG usage, using LN instead of LOG when only one argument is needed
  (POWER(RANDOM() * 1000, 2) + LN(GREATEST(RANDOM() * 100 + 1, 0.0001)) * SQRT(ABS(RANDOM(1000) ))) AS rand_float1,

  -- Random square root and exponential
  ROUND(SQRT(ABS(POWER(RANDOM() * 10000, 3))) / (RANDOM() + 1) + EXP(RANDOM()), 8) AS rand_float2,

  -- Trigonometric and logarithmic calculations, with corrected log function
  SIN(RANDOM() * 2 * 3.14) + COS(LN(GREATEST(ABS(RANDOM() * 100 + 1), 0.0001))) * EXP(SIN(RANDOM() * 3.14 / 2)) AS trig_log_exp_value,

  -- Complex mathematical operations with appropriate functions
  ROUND(POWER(SQRT(ABS(RANDOM() * 10000)), 3) / LN(GREATEST(POWER(RANDOM() + 1, 2) + 1, 0.0001)), 8) AS complex_math1,
  ROUND(EXP(POWER(LN(GREATEST(RANDOM() * 1000 + 1, 0.0001)), 2)) * COS(SQRT(ABS(RANDOM() * 500))), 8) AS complex_math2,

  -- Conditional string concatenation based on id
  CASE
    WHEN MOD(id, 2) = 0 THEN CONCAT(id, FLOOR(RANDOM() * 10 + SIN(RANDOM() * 2)), LPAD(id::STRING, 5, '0'))
    ELSE CONCAT(id, FLOOR(RANDOM() * 10 + COS(RANDOM() * 3)), LPAD(id::STRING, 5, '0'))
  END AS segment,

  -- Simple random category generation
  CONCAT('category_', FLOOR(RANDOM() * 8), LPAD(FLOOR(RANDOM() * 1000)::STRING, 5, '0')) AS category,

  -- Complex conditionals and mathematical operations
  CASE
    WHEN GREATEST(POWER(RANDOM() * 100, 2), RANDOM() * 200) > 150 THEN 'high'
    WHEN LEAST(SQRT(ABS(RANDOM() * 100)), RANDOM() * 200) < 50 THEN 'low'
    ELSE CONCAT('medium_', ABS(CAST(RANDOM() * 100 AS BIGINT)))
  END AS value_range,

  -- Nested trigonometric calculations
  ROUND(SIN(SIN(RANDOM() * 2 * 3.14)) * TAN(COS(RANDOM() * 3.14 / 2)) + EXP(SIN(RANDOM())), 8) AS nested_trig_value,

  -- Random array as a string
  CONCAT(FLOOR(RANDOM() * 10), FLOOR(RANDOM() * 10), FLOOR(RANDOM() * 10)) AS rand_array,

  -- Additional mathematical computations
  ROUND((RANDOM() * 1000 + POWER(RANDOM() * 100, 2)) / SQRT(ABS(LN(GREATEST(RANDOM() * 100 + 1, 0.0001)))), 8) AS extra_computation_1,

  -- String concatenation with random numbers
  CONCAT(id, LPAD(id::STRING, 5, '0'), FLOOR(EXP(RANDOM() * 10))) AS extra_computation_2
FROM numbers;        

Redshift

WITH numbers AS (
  SELECT
    ROW_NUMBER() OVER () AS id
  FROM
    (SELECT 1 FROM stv_blocklist LIMIT 1000) a
  CROSS JOIN
    (SELECT 1 FROM stv_blocklist LIMIT 1000) b
)
SELECT 
    CAST(id AS VARCHAR) AS id
    
    ,


    -- Complex String concatenation with explicit casting
    CONCAT(
        concat(CAST(id AS VARCHAR), 
        cast(LPAD(CAST(id AS VARCHAR), 10, '0') as varchar)), 
        -- CAST(id AS VARCHAR), 
        CAST(ABS(CAST(RANDOM() * 1000000 AS BIGINT)) AS VARCHAR)
    ) AS user_id
    ,


    -- Double Nested Mathematical Operations with Randomization
    ROUND(POWER(RANDOM() * 1000, 2) + LOG(CAST(RANDOM() * 100 + 1 AS FLOAT)) * SQRT(ABS(RANDOM() * 1000)), 8) AS random_float1,
    ROUND(SQRT(POWER(RANDOM() * 1000, 3) / (RANDOM() * 10 + 1)) + EXP(RANDOM()), 8) AS random_float2,


    -- Trigonometric Calculations Combined with Logarithmic and Exponential Functions
    SIN(RANDOM() * 2 * PI()) + COS(LOG(CAST(ABS(RANDOM() * 100 + 1) AS FLOAT))) * EXP(SIN(RANDOM() * PI() / 2)) AS trig_log_exp_value,


    -- Nested Complex Math: Pow, Sqrt, and Logarithmic Operations Combined
    ROUND(POWER(SQRT(ABS(RANDOM() * 10000)), 3) / LOG(POWER(CAST(RANDOM() + 1 AS FLOAT), 2) + 1), 8) AS complex_math1,
    ROUND(EXP(POWER(LOG(CAST(RANDOM() * 1000 + 1 AS FLOAT)), 2)) * COS(SQRT(ABS(RANDOM() * 500))), 8) AS complex_math2
    
    ,


    -- Complex String Manipulations and Conditions
CASE 
  WHEN MOD(id, 2) = 0 
  THEN 
      CAST(id AS VARCHAR) || 
      CAST(FLOOR(RANDOM() * 10 + SIN(RANDOM() * 2)) AS VARCHAR) || 
      LPAD(CAST(id AS VARCHAR), 5, '0')
  ELSE 
      CAST(id AS VARCHAR) || 
      CAST(FLOOR(RANDOM() * 10 + COS(RANDOM() * 3)) AS VARCHAR) || 
      LPAD(CAST(id AS VARCHAR), 5, '0')
END AS segment
,


    -- More complex string operations (explicit casting)
    CONCAT(
        SUBSTRING('category_', 1, CAST(FLOOR(RANDOM() * 8) AS INT)), 
        LPAD(CAST(FLOOR(RANDOM() * 1000) AS VARCHAR), 5, '0')
    ) AS category,


    -- Comparisons with Multiple Nested Conditions and Mathematical Operations
    CASE
      WHEN GREATEST(POWER(RANDOM() * 100, 2), RANDOM() * 200) > 150 
      THEN 'high'
      WHEN LEAST(SQRT(RANDOM() * 100), RANDOM() * 200) < 50 
      THEN 'low'
      ELSE CONCAT(
          'medium_', 
          CAST(ABS(CAST(RANDOM() * 100 AS BIGINT)) AS VARCHAR)
      )
    END AS value_range
    ,


    -- Double Trigonometric Calculations
    ROUND(SIN(SIN(RANDOM() * 2 * PI())) * TAN(COS(RANDOM() * PI() / 2)) + EXP(SIN(RANDOM())), 8) AS nested_trig_value
    
    ,


    -- Random String Concatenation with More Randomized Components
    CONCAT(
        concat(CAST(FLOOR(RANDOM() * 10) AS VARCHAR),
        CAST(FLOOR(RANDOM() * 10) AS VARCHAR)),
        CAST(FLOOR(RANDOM() * 10) AS VARCHAR)
    ) AS random_array
    ,
    -- Additional Computations Involving Double Nesting and String Manipulation
    ROUND((RANDOM() * 1000 + POWER(RANDOM() * 100, 2)) / SQRT(ABS(LOG(CAST(RANDOM() * 100 + 1 AS FLOAT)))), 8) AS extra_computation_1,
    CONCAT(
        concat(CAST(id AS VARCHAR) , 
        LPAD(CAST(id AS VARCHAR), 5, '0')), 
        CAST(FLOOR(EXP(RANDOM() * 10)) AS VARCHAR)
    ) AS extra_computation_2
FROM numbers;        

Clickhouse

WITH numbers AS (
  SELECT 
        ROW_NUMBER() OVER (ORDER BY number) AS id
    FROM 
        numbers(1000000)  -- Generate 1,000,000 rows
)
SELECT
  id,
  
  -- Concatenation of id with padded id and random number
  CONCAT(id, LPAD(toString(id), 10, '0'), toString(ABS(CAST(rand() % 1000000 AS Int64)))) AS user_id,

  -- Random float with adjusted LOG usage
  (pow(rand() % 1000, 2) + log(GREATEST(rand() % 100 + 1, 0.0001)) * sqrt(ABS(rand() % 1000))) AS rand_float1,

  -- Random square root and exponential
  ROUND(sqrt(ABS(pow(rand() % 10000, 3))) / (rand() + 1) + exp(rand()), 8) AS rand_float2,

  -- Trigonometric and logarithmic calculations
  sin(rand() % 1000 * 2 * 3.14) + cos(log(GREATEST(ABS(rand() % 100 + 1), 0.0001))) * exp(sin(rand() % 1000 * 3.14 / 2)) AS trig_log_exp_value,

  -- Complex mathematical operations
  ROUND(pow(sqrt(ABS(rand() % 10000)), 3) / log(GREATEST(pow(rand() % 1000 + 1, 2) + 1, 0.0001)), 8) AS complex_math1,
  ROUND(exp(pow(log(GREATEST(rand() % 1000 + 1, 0.0001)), 2)) * cos(sqrt(ABS(rand() % 500))), 8) AS complex_math2,

  -- Conditional string concatenation based on id
  CASE
    WHEN id % 2 = 0 THEN CONCAT(toString(id), toString(FLOOR(rand() % 10 + sin(rand() % 1000 * 2))), LPAD(toString(id), 5, '0'))
    ELSE CONCAT(toString(id), toString(FLOOR(rand() % 10 + cos(rand() % 1000 * 3))), LPAD(toString(id), 5, '0'))
  END AS segment,

  -- Simple random category generation
  CONCAT('category_', toString(FLOOR(rand() % 8)), LPAD(toString(FLOOR(rand() % 1000)), 5, '0')) AS category,

  -- Complex conditionals and mathematical operations
  CASE
    WHEN GREATEST(pow(rand() % 100, 2), rand() % 200) > 150 THEN 'high'
    WHEN LEAST(sqrt(ABS(rand() % 100)), rand() % 200) < 50 THEN 'low'
    ELSE CONCAT('medium_', toString(ABS(CAST(rand() % 100 AS Int64))))
  END AS value_range,

  -- Nested trigonometric calculations
  ROUND(sin(sin(rand() % 1000 * 2 * 3.14)) * tan(cos(rand() % 1000 * 3.14 / 2)) + exp(sin(rand())), 8) AS nested_trig_value,

  -- Random array as a string
  CONCAT(toString(rand() % 10), toString(rand() % 10), toString(rand() % 10)) AS rand_array,

  -- Additional mathematical computations
  ROUND((rand() % 1000 + pow(rand() % 100, 2)) / sqrt(ABS(log(GREATEST(rand() % 100 + 1, 0.0001)))), 8) AS extra_computation_1,

  -- String concatenation with random numbers
  CONCAT(toString(id), LPAD(toString(id), 5, '0'), toString(FLOOR(exp(rand() % 10)))) AS extra_computation_2
FROM numbers;
        
