Hybrid Columnar Compression In Autonomous Transaction Processing
Overview
In a nutshell, Hybrid Columnar Compression (HCC) is a compression feature provided by some Oracle smart storage systems, most notably the Exadata Storage Servers. Since Exadata Smart Storage is the underlying storage infrastructure used by Oracle Autonomous Database (ADB), HCC is also available on all ADB deployments. Both the compression algorithm and how it is automatically used in Autonomous Data Warehouse (ADW) are nicely described here; in this article, instead, I will focus on a specific use case on ATP and show some useful PL/SQL snippets.
Why This Article
HCC is most useful in Data Warehouse environments and for analytical and batch use cases; as a matter of fact, HCC is automatically enabled and used in ADW. And while HCC is not suitable or convenient for compressing hot data (data subject to DML operations), it can be very useful for managing cold historical data.
Many companies have strict requirements about the retention of certain data; for example, an insurance company might need to keep insurance policy documents for ten years after their expiration. Typically this data is never updated but must remain accessible for read-only activities, for example, custom queries or periodic reports. Furthermore, this data tends to accumulate over time, consuming a lot of precious storage in the company's OLTP databases. A traditional approach to this problem is to move the cold data to less expensive storage, for example, object storage or HDFS on Big Data platforms. That approach has some advantages but has downsides as well: it introduces complexity and scatters the data across multiple locations, with possible security or performance implications.
In this article I will show:
- how to assess the storage savings when using HCC
- how to enable HCC and verify the compression ratio
Once I have made sure that the cold table is a good candidate for HCC, I will demonstrate that, by using HCC in ATP, we can keep all that data together with the hot data, saving a lot of storage and, often, improving the performance of our queries too.
The code snippets provided here will show how to use the Oracle-supplied PL/SQL packages and SQL statements to assess, enable, and test the HCC feature. To keep things easy, I will consider a simple scenario involving a single table. In a more realistic scenario, we often need to operate at the partition level, with cold partitions where data is compressed and never updated, and hot partitions where data is not compressed.
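As a taste of the partition-level variant, here is a hypothetical sketch; the SALES table and the SALES_2015 partition name are illustrative and not part of the example that follows:

```sql
-- Hypothetical example: SALES is range-partitioned by year and SALES_2015 is cold.
-- Compress only the cold partition; the hot partitions stay uncompressed.
ALTER TABLE sales MOVE PARTITION sales_2015
  COLUMN STORE COMPRESS FOR QUERY HIGH;

-- A MOVE can leave local index partitions UNUSABLE, so rebuild them:
ALTER TABLE sales MODIFY PARTITION sales_2015
  REBUILD UNUSABLE LOCAL INDEXES;
```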
HCC Compression Assessment
As mentioned, the example scenario is quite simple: I have a table that stores historical data and that I want to compress. However, before doing that, I want to see whether there is a real advantage in terms of storage savings. The test table is the following:
desc my_test_table
Name       Null?    Type
_________  _______  _______________
YEAR                NUMBER
AGE                 NUMBER
ETHNIC              NUMBER
SEX                 VARCHAR2(10)
AREA                NUMBER
COUNT               NUMBER DEFAULT COLLATION "USING_NLS_COMP" ;
select count(*) from my_test_table;
COUNT(*)
___________
15710331
First, I want to know the current real storage usage, and to do that I will use the DBMS_SPACE package.
DECLARE
unf NUMBER;
unfb NUMBER;
fs1 NUMBER;
fs1b NUMBER;
fs2 NUMBER;
fs2b NUMBER;
fs3 NUMBER;
fs3b NUMBER;
fs4 NUMBER;
fs4b NUMBER;
full NUMBER;
fullb NUMBER;
BEGIN
dbms_space.space_usage('ADMIN',
'MY_TEST_TABLE',
'TABLE',
unf, unfb,
fs1, fs1b,
fs2, fs2b,
fs3, fs3b,
fs4, fs4b,
full, fullb);
DBMS_OUTPUT.put_line( 'Number of unformatted blocks : ' || unf);
DBMS_OUTPUT.put_line( 'Number of MB in unformatted blocks : ' || unfb/1024/1024);
DBMS_OUTPUT.put_line( 'Number of blocks with free space between 0% and 25% : ' || fs1);
DBMS_OUTPUT.put_line( 'Number of MB in blocks with free space between 0% and 25% : ' || fs1b/1024/1024);
DBMS_OUTPUT.put_line( 'Number of blocks with free space between 25% and 50% : ' || fs2);
DBMS_OUTPUT.put_line( 'Number of MB in blocks with free space between 25% and 50% : ' || fs2b/1024/1024);
DBMS_OUTPUT.put_line( 'Number of blocks with free space between 50% and 75% : ' || fs3);
DBMS_OUTPUT.put_line( 'Number of MB in blocks with free space between 50% and 75% : ' || fs3b/1024/1024);
DBMS_OUTPUT.put_line( 'Number of blocks with free space between 75% and 100% : ' || fs4);
DBMS_OUTPUT.put_line( 'Number of MB in blocks with free space between 75% and 100% : ' || fs4b/1024/1024);
DBMS_OUTPUT.put_line( 'Number of full blocks : ' || full);
DBMS_OUTPUT.put_line( 'Number of MB in full blocks : ' || fullb/1024/1024);
END;
/
In this case I get the following output:
Number of unformatted blocks : 0
Number of MB in unformatted blocks : 0
Number of blocks with free space between 0% and 25% : 0
Number of MB in blocks with free space between 0% and 25% : 0
Number of blocks with free space between 25% and 50% : 0
Number of MB in blocks with free space between 25% and 50% : 0
Number of blocks with free space between 50% and 75% : 0
Number of MB in blocks with free space between 50% and 75% : 0
Number of blocks with free space between 75% and 100% : 0
Number of MB in blocks with free space between 75% and 100% : 0
Number of full blocks : 67006
Number of MB in full blocks : 523.484375
As we can see, the table uses 523 MB, or 67,006 full database blocks (each block is 8 KB, so 67,006 × 8 KB ≈ 523.5 MB).
Now I want to estimate the compression ratio achievable with HCC and, more specifically, with the HCC Query High compression type. Query High is the most effective choice when we want both a good compression ratio and good query performance. To do the evaluation I use the GET_COMPRESSION_RATIO procedure of the DBMS_COMPRESSION package. This procedure executes an HCC dry run, compressing a subset of rows and computing the resulting compression ratio.
DECLARE
my_blkcnt_cmp PLS_INTEGER;
my_blkcnt_uncmp PLS_INTEGER;
my_row_cmp PLS_INTEGER;
my_row_uncmp PLS_INTEGER;
my_cmp_ratio NUMBER;
my_comptype_str VARCHAR2(32767);
BEGIN
DBMS_COMPRESSION.GET_COMPRESSION_RATIO (
scratchtbsname => 'DATA' ,
ownname => 'ADMIN' ,
objname => 'MY_TEST_TABLE' ,
subobjname => NULL ,
comptype => DBMS_COMPRESSION.COMP_QUERY_HIGH,
blkcnt_cmp => my_blkcnt_cmp,
blkcnt_uncmp => my_blkcnt_uncmp,
row_cmp => my_row_cmp,
row_uncmp => my_row_uncmp,
cmp_ratio => my_cmp_ratio,
comptype_str => my_comptype_str,
subset_numrows => DBMS_COMPRESSION.comp_ratio_minrows,
objtype => DBMS_COMPRESSION.objtype_table
);
DBMS_OUTPUT.put_line( 'Number of blocks used by the compressed sample of the object : ' || my_blkcnt_cmp);
DBMS_OUTPUT.put_line( 'Number of blocks used by the uncompressed sample of the object : ' || my_blkcnt_uncmp);
DBMS_OUTPUT.put_line( 'Number of rows in a block in compressed sample of the object : ' || my_row_cmp);
DBMS_OUTPUT.put_line( 'Number of rows in a block in uncompressed sample of the object : ' || my_row_uncmp);
DBMS_OUTPUT.put_line( 'Estimated Compression Ratio of Sample : ' || my_cmp_ratio);
DBMS_OUTPUT.put_line( 'Compression Type : ' || my_comptype_str);
END;
/
In this case I obtain:
Compression Advisor self-check validation successful. select count(*) on both Uncompressed and EHCC Compressed format = 1000001 rows
Number of blocks used by the compressed sample of the object : 110
Number of blocks used by the uncompressed sample of the object : 4182
Number of rows in a block in compressed sample of the object : 9091
Number of rows in a block in uncompressed sample of the object : 239
Estimated Compression Ratio of Sample : 38
Compression Type : "Compress Query High"
So the estimated compression ratio is 38x, which is a very good result for this simple case.
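Query High is only one of the available HCC levels; the same anonymous block can estimate the others by changing the comptype argument. The constants below come from the DBMS_COMPRESSION package; which level is preferable depends on the balance you want between compression ratio and access speed:

```sql
-- Other HCC levels accepted by GET_COMPRESSION_RATIO (swap into the
-- comptype argument of the anonymous block above):
--   DBMS_COMPRESSION.COMP_QUERY_LOW     -- fastest HCC level, lowest ratio
--   DBMS_COMPRESSION.COMP_ARCHIVE_LOW   -- higher ratio, slower access
--   DBMS_COMPRESSION.COMP_ARCHIVE_HIGH  -- highest ratio, for archival data
comptype => DBMS_COMPRESSION.COMP_ARCHIVE_HIGH,
```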
HCC Compression Validation
To verify whether the previous estimate was accurate, I will enable HCC compression on a table cloned from the original one.
CREATE TABLE my_comp_test_table AS SELECT * FROM my_test_table;
Table MY_COMP_TEST_TABLE created.
Elapsed: 00:00:11.198
ALTER TABLE my_comp_test_table MOVE COLUMN STORE COMPRESS FOR QUERY HIGH;
Table MY_COMP_TEST_TABLE altered.
Elapsed: 00:00:16.646
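At this point a couple of sanity checks are worthwhile: the data dictionary should record the new compression attributes, and since ALTER TABLE ... MOVE can invalidate indexes, any unusable ones should be rebuilt (my test table has none, but a real table likely will):

```sql
-- Check the compression attributes recorded in the data dictionary:
SELECT table_name, compression, compress_for
  FROM user_tables
 WHERE table_name = 'MY_COMP_TEST_TABLE';
-- Expected: COMPRESSION = 'ENABLED', COMPRESS_FOR = 'QUERY HIGH'

-- List indexes left UNUSABLE by the MOVE, then rebuild each one:
SELECT index_name FROM user_indexes WHERE status = 'UNUSABLE';
-- ALTER INDEX <index_name> REBUILD;
```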
The next step is to use DBMS_SPACE again to compute the space used by the compressed table:
DECLARE
unf NUMBER;
unfb NUMBER;
fs1 NUMBER;
fs1b NUMBER;
fs2 NUMBER;
fs2b NUMBER;
fs3 NUMBER;
fs3b NUMBER;
fs4 NUMBER;
fs4b NUMBER;
full NUMBER;
fullb NUMBER;
BEGIN
dbms_space.space_usage('ADMIN',
'MY_COMP_TEST_TABLE',
'TABLE',
unf, unfb,
fs1, fs1b,
fs2, fs2b,
fs3, fs3b,
fs4, fs4b,
full, fullb);
DBMS_OUTPUT.put_line( 'Number of unformatted blocks : ' || unf);
DBMS_OUTPUT.put_line( 'Number of MB in unformatted blocks : ' || unfb/1024/1024);
DBMS_OUTPUT.put_line( 'Number of blocks with free space between 0% and 25% : ' || fs1);
DBMS_OUTPUT.put_line( 'Number of MB in blocks with free space between 0% and 25% : ' || fs1b/1024/1024);
DBMS_OUTPUT.put_line( 'Number of blocks with free space between 25% and 50% : ' || fs2);
DBMS_OUTPUT.put_line( 'Number of MB in blocks with free space between 25% and 50% : ' || fs2b/1024/1024);
DBMS_OUTPUT.put_line( 'Number of blocks with free space between 50% and 75% : ' || fs3);
DBMS_OUTPUT.put_line( 'Number of MB in blocks with free space between 50% and 75% : ' || fs3b/1024/1024);
DBMS_OUTPUT.put_line( 'Number of blocks with free space between 75% and 100% : ' || fs4);
DBMS_OUTPUT.put_line( 'Number of MB in blocks with free space between 75% and 100% : ' || fs4b/1024/1024);
DBMS_OUTPUT.put_line( 'Number of full blocks : ' || full);
DBMS_OUTPUT.put_line( 'Number of MB in full blocks : ' || fullb/1024/1024);
END;
/
and I get:
Number of unformatted blocks : 0
Number of MB in unformatted blocks : 0
Number of blocks with free space between 0% and 25% : 0
Number of MB in blocks with free space between 0% and 25% : 0
Number of blocks with free space between 25% and 50% : 0
Number of MB in blocks with free space between 25% and 50% : 0
Number of blocks with free space between 50% and 75% : 0
Number of MB in blocks with free space between 50% and 75% : 0
Number of blocks with free space between 75% and 100% : 0
Number of MB in blocks with free space between 75% and 100% : 0
Number of full blocks : 1796
Number of MB in full blocks : 14.03125
Given the previous output, the actual compression ratio is about 37.3; in fact:
523.48 MB / 14.03 MB = 37.3
Finally, I can run a simple query to check whether there is any change in performance. On the non-compressed table I get:
select count(*),sex from my_test_table group by sex;
COUNT(*) SEX
___________ _____________
7183855 NOT DECLARED
4253247 MALE
4273229 FEMALE
Elapsed: 00:00:00.619
and on the compressed one:
select count(*),sex from my_comp_test_table group by sex;
COUNT(*) SEX
___________ _____________
7183855 NOT DECLARED
4253247 MALE
4273229 FEMALE
Elapsed: 00:00:00.071
So, this specific query is fast on both tables, but on the compressed one it is almost 10 times faster. In this case, the performance improvement is due to the fact that we are not using any index: the query is executed with a full table scan, so when scanning the compressed table we need to retrieve far fewer database blocks.
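To confirm that explanation, the access path can be checked with EXPLAIN PLAN; on Exadata-backed storage the plan should show a full scan step (typically TABLE ACCESS STORAGE FULL):

```sql
-- Explain the query against the compressed table:
EXPLAIN PLAN FOR
  SELECT COUNT(*), sex FROM my_comp_test_table GROUP BY sex;

-- Display the plan just explained:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```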
Takeaways
In this short article I briefly explained how HCC can also be beneficial for some ATP workloads. The use case involves historical data and how we can keep it together with the live data. I also showed a simple method to estimate the final compression ratio for an existing table. The important takeaways are:
- HCC is also available in ATP but must be enabled and managed manually.
- Even in OLTP workloads, there are use cases where HCC adoption might be extremely beneficial; however, its adoption should be preceded by a careful evaluation.
- HCC compression is mostly suitable for cold tables and partitions. Furthermore, to get optimal performance and all the expected benefits, a dataset, once compressed, should only be queried; no DML operations should be executed on it.
For more information about Hybrid Columnar Compression, I suggest the following technical brief and the official documentation.