UUID in SQL: insert & reindex performance

This article was inspired by a discussion about whether it is worth using a UUID as the primary key for a Hibernate entity in order to overcome issues with the equals() and hashCode() methods. Some say it is, some say it isn't. What are the reasons?

First, because of its larger size (16 bytes instead of bigint's 8), a UUID takes more disk space.

Second, because a UUID is random rather than created by incrementing the previous value by 1 as a bigint is, adjacent rows of a table may end up far from one another in the index.

Despite this, the major benefit of using a UUID is that each one will be unique across a distributed microservice system, so if we base hashCode() and equals() on it, collisions are practically impossible.

The trade-off is obvious: strong uniqueness vs. a performance penalty. So let's clarify how much we have to pay for this benefit on the database side.

I'll use PostgreSQL for this experiment, and the scenario is pretty simple:

- create two similar tables, huge_table_with_uuid and huge_table_with_bigint, with the only difference being the primary key type: UUID in the first table, bigint in the second.

- fill both tables with 60M rows (which will take approximately 5-6 GB of disk space).

- add 1M more rows and measure the re-indexing time.

The third step is needed because every time we add new rows to a table, the index should be rebuilt to keep it at maximum efficiency, so this step shows the cost of adding rows while the system is running.
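As a side note, the on-disk size of the primary-key index can be checked before and after the experiment with PostgreSQL's standard size-reporting functions:

```sql
-- pg_relation_size() reports the on-disk size of a single relation
-- (here the primary-key index); pg_size_pretty() formats it for humans.
SELECT pg_size_pretty(pg_relation_size('huge_table_with_uuid_pkey'));
```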

To fill the tables with data I'll use a custom function:

-- Returns a random hex string whose length varies between
-- _min_length and _min_length + 9 characters.
CREATE OR REPLACE FUNCTION gen_random_string(_min_length INT = 3)
RETURNS VARCHAR
LANGUAGE SQL
AS '
SELECT substring(
  md5(random()::TEXT),                        -- 32 random hex characters
  0,                                          -- starting before position 1 trims one character off the count
  _min_length + floor(random() * 10 + 1)::INT -- random target length
)
';
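A quick sanity check of the function (the output is random, so any values you see are illustrative only):

```sql
-- The generated string is random hex, so each call returns
-- a different value of varying length.
SELECT gen_random_string() AS value,
       length(gen_random_string()) AS len;
```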

Let's start.

I'll create the extension for generating UUIDs and create a table with a UUID primary key:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE IF NOT EXISTS huge_table_with_uuid (
    id UUID DEFAULT uuid_generate_v4() PRIMARY KEY,
    column1 VARCHAR(255) NOT NULL,
    column2 VARCHAR(255) NOT NULL,
    column3 TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);        

Now it's time to fill this table with data....

DO
$do$
BEGIN
  FOR index IN 1..60000000 LOOP
    INSERT INTO huge_table_with_uuid (column1,column2)
    SELECT gen_random_string(),
    gen_random_string();
  END LOOP;
END
$do$;        

.... and get out for a workout :-), because the process will take several hours.

Yeah, it took 3 hours and 6.1 GB of disk to generate 60M rows.
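The numbers above can be verified directly in psql, again using the standard size-reporting functions:

```sql
-- Total size including the table, its indexes and TOAST data.
SELECT pg_size_pretty(pg_total_relation_size('huge_table_with_uuid'));

-- Row count sanity check (slow on a 60M-row table).
SELECT count(*) FROM huge_table_with_uuid;
```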

Let's create the second table and fill it with data.

CREATE TABLE IF NOT EXISTS huge_table_with_bigint (
    id bigint GENERATED ALWAYS AS IDENTITY
             PRIMARY KEY,
    column1 VARCHAR(255) NOT NULL,
    column2 VARCHAR(255) NOT NULL,
    column3 TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

DO
$do$
BEGIN
  FOR index IN 1..60000000 LOOP
    INSERT INTO huge_table_with_bigint (column1,column2)
    SELECT gen_random_string(),
    gen_random_string();
  END LOOP;
END
$do$;        

This time the process was much faster: it took 15 minutes and 4.9 GB of disk space.

Let's switch back to huge_table_with_uuid, add 1M rows, and re-index:

DO
$do$
BEGIN
  FOR index IN 1..1000000 LOOP
    INSERT INTO huge_table_with_uuid (column1,column2)
    SELECT gen_random_string(),
    gen_random_string();
  END LOOP;
END
$do$;

REINDEX INDEX huge_table_with_uuid_pkey;

The same for huge_table_with_bigint:

DO
$do$
BEGIN
  FOR index IN 1..1000000 LOOP
    INSERT INTO huge_table_with_bigint (column1,column2)
    SELECT gen_random_string(),
    gen_random_string();
  END LOOP;
END
$do$;

REINDEX INDEX huge_table_with_bigint_pkey;

INSERT shows the same difference as before, but the time spent on re-indexing is the same for both tables.

To summarize, here are my findings:

huge_table_with_uuid takes 6.1 GB while huge_table_with_bigint takes 4.9 GB: the UUID table needs ~25% more space for a table with two columns of useful payload. For tables with more columns this proportion will be lower.
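For reference, both sizes can be compared with a single query against the statistics views (a sketch using PostgreSQL's pg_total_relation_size()):

```sql
-- Compare total on-disk size (table + indexes + TOAST) of both tables.
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
WHERE relname IN ('huge_table_with_uuid', 'huge_table_with_bigint');
```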

INSERT with a UUID is 12 times slower than with an incrementally increasing bigint.

REINDEX has the same performance in both cases.

Conclusion

I think that if you do not expect frequent batch inserts, the UUID performance penalty won't affect end users, while it will give strong JPA entity uniqueness. I need to add that, to ensure the benefit outweighs the cost in disk space, the table should have a significant number of columns to offset the size of the UUID. Otherwise the game is not worth the candle.

Taking all of the above into account, a UUID is worth using as a primary key in certain situations.

ASHVIN KASHYAP

Helping Luxury Travel Brands, Retreat Hosts & Villa Planners Attract Premium Clients | Funnels · CRM · Automation | DM “LUXE” for a Strategy Guide

7 months

Useful tips

Marjan Sterjev

IT Engineer | CISSP | CCSP | CEH (Master): research | learn | do | MENTOR

7 months

I'm not sure if we need cross domain uniqueness or not. Even if you need it in the service layer, you can always wrap the entity into an object with additional "address" properties. But, that is another story. What you are using in the benchmark is UUID V4 which is known to be more or less random and not B-Tree index friendly. Your results are normal and expected. In order to overcome the insert slowness you can try with Time Based UUIDs with versions 1, 6 and 7. The results obtained with those UUID types would be more relevant.
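[Editor's note: to illustrate the suggestion above, uuid-ossp already ships a time-based generator, uuid_generate_v1(). This is a sketch, not an endorsement: v1 stores the fast-changing part of the timestamp first, so it is only partially ordered, while v6 and v7 reorder the timestamp to be truly B-Tree friendly but are not built into uuid-ossp. The table name below is hypothetical.]

```sql
-- Hypothetical variant of the test table using a time-based UUID.
-- uuid_generate_v1() embeds a timestamp and the MAC address,
-- unlike the fully random uuid_generate_v4().
CREATE TABLE IF NOT EXISTS huge_table_with_uuid_v1 (
    id UUID DEFAULT uuid_generate_v1() PRIMARY KEY,
    column1 VARCHAR(255) NOT NULL,
    column2 VARCHAR(255) NOT NULL,
    column3 TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
```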

Nikola Popovski

Java Tech Lead | Practice Manager @ IWConnect

7 months

Dzmitry Kashlach nice post, I agree with everything written here
