Amazon RedShift for MySQL DBAs

What is RedShift

Amazon RedShift is a data analytics database provided as a service by AWS, designed specifically for analyzing data using standard SQL. RedShift was built on top of the ParAccel MPP (massively parallel processing) database and also has roots in PostgreSQL 8.0, with the exception of the storage engine, which is very different from PostgreSQL's. RedShift does, however, support standard PostgreSQL clients and drivers. Amazon RedShift is designed to handle petabytes of data, while RedShift Spectrum is designed to handle exabytes.

Columnar Data Store

Unlike an RDBMS, Amazon RedShift does not store data in rows; it stores data in columns that are separate physical entities. As in an RDBMS the data is presented as a table, but physically each column is stored on its own. Why columns? Simply to reduce io (input/output) - that is, the disk reads needed to retrieve the data. Since a user rarely selects all the columns of a table (SELECT * FROM table_name), storing data in columns reduces disk access by reading only the columns named in the select statement; for example, SELECT FIRST_NAME, REGION, ORDER_ID FROM orders retrieves only those three columns from disk. Columnar storage also aids compression, since the data stored in each column is uniform - for example, FIRST_NAME stores only first names. A column store closely resembles an index on a single column in an RDBMS.

A column store, while very efficient for selecting data, is very inefficient for OLTP operations: inserting a single row means physically recording each item in its own column, which increases disk io because every item in the row requires an individual io, unlike an RDBMS, which can record the whole row in a single physical object - the table. This is also why adding indexes to an RDBMS increases io: the more indexes on a table, the more io each write generates.


One could incorrectly conclude that a column store is nothing more than an RDBMS with an index on every column, but the real difference is how the column store maps data to storage. In an RDBMS index the rowid is looked up from the data; in a column store the data is looked up from the rowid. As a result, retrieving all the data for a given object is inefficient in a column store: fetching a single row requires multiple io's, versus a single io in an RDBMS.


In theory a column store is much slower than an RDBMS - it takes multiple io's to record the individual items of a row and multiple io's to retrieve a single row, i.e. all the items for a specific object. In practice, however, one rarely needs every item for an object: in a rolodex one usually looks up either the address or the phone number of a person. That selective retrieval is the main advantage of the column store - because its operations are selective, it can retrieve the data in far fewer io's than an RDBMS.

RedShift Nodes

AWS RedShift has only two categories of nodes:

  • Leader node - one per cluster; stores metadata (pg_catalog), coordinates parallel SQL processing, runs the optimizer, and compiles SQL into C++ for the compute nodes to execute
  • Compute node - up to 120 nodes; stores the actual data and handles backup and restore and data loading


Each RedShift cluster has exactly one leader node and one or more compute nodes. The leader node receives and parses queries; the compute nodes actually store the data.

All SQL interaction, whether via psql or drivers, is performed through the leader node.

Compute nodes, zone maps, slices, columns and blocks

Slices

AWS RedShift stores and distributes data in slices; a slice is a virtual compute node. Each compute node is virtually partitioned into slices - depending on the node size there can be 2, 16, or 32 slices - and slices are not pinned to CPUs. Data is stored per slice, making the slice the unit of data partitioning: the data is distributed (sharded) across slices, so one can think of slices as data shards.
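The slice layout of a cluster can be inspected from SQL; a minimal sketch, assuming access to the STV_SLICES system table:

```sql
-- Show how slices map onto compute nodes in the current cluster,
-- one row per slice.
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```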


Zone Maps

Zone maps are in-memory metadata - the min and max values for each data block; simply put, a map pointing to the location of the blocks. Adding a sort key greatly helps zone maps locate blocks of data. Sort keys are usually based on the query pattern and are typically timestamps. Together, sort keys and zone maps are similar to partitioning in an RDBMS. For example, if a table has a primary key (usually the distribution key) and a timestamp sort key, the sort key sorts the data and in effect updates the zone map. When a query looks for a specific range, the zone map points to the exact blocks where the data is located - much like partitioning by date in MySQL, where selecting a range goes straight to the partition holding the data.

Creating Tables - Compression, encoding, distribution and sorting


Creating RedShift tables is simpler than in an RDBMS: primary keys and foreign keys are not enforced, and there are no constraints, no triggers, no stored procedures, and only two types of optimization - distribution and sorting. In addition, since table data is stored in individual physical structures - columns - each column can be encoded differently from the rest. RedShift supports about a dozen compression encodings; the appropriate encoding depends on the data type.

There are only three aspects to RedShift optimization:

  • Encoding
  • Distribution
  • Sorting

Encoding

Since every column is a separate physical entity, it makes sense that each column can be encoded differently. Why bother with different encodings? Different data types compress differently depending on the encoding, and since the main goal in RedShift is to reduce io as much as possible, selecting the correct encoding makes absolute sense.


If the default encoding is used, RedShift encodes sort keys in RAW; numeric and boolean types are also encoded RAW. All other types are encoded with LZO. RAW means no compression, which makes sense for sort keys and distribution keys - since they update the zone maps - and for numeric types.


For datetime columns, delta encoding is recommended. There are many more encoding types, and the selection depends on the data type.


Here is an example create table with different encoding types:


create table schemaname.encodingshipmode (
    moderaw       varchar(22) encode raw,
    modebytedict  varchar(22) encode bytedict,
    modelzo       varchar(22) encode lzo,
    moderunlength varchar(22) encode runlength,
    modetext255   varchar(22) encode text255,
    modetext32k   varchar(22) encode text32k
);
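The encoding actually applied to each column can be verified from the catalog; a minimal sketch, assuming the schema above is on the search_path (PG_TABLE_DEF only lists tables in schemas on the search_path):

```sql
-- Make the example schema visible to PG_TABLE_DEF.
SET search_path TO schemaname;

-- One row per column, with the encoding chosen for it.
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'encodingshipmode';
```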

Distribution

Distribution spreads the data across nodes; large fact tables are distributed across all nodes in the cluster. The data is actually distributed by slice - the slice being the virtual compute node in RedShift. Each physical node is divided into 2, 16, or 32 slices.


There are three ways to distribute the data on slices:


  • DISTKEY
  • DISTSTYLE EVEN
  • DISTSTYLE ALL

DISTKEY

With DISTKEY distribution you select a column to be used as the distribution criterion. The goal is to choose a high-cardinality column so the data spreads evenly across all the slices.


Example of using a distribution key (note that a low-cardinality column such as state can skew the data; an id column is usually a better choice):


create table userskey distkey(state) as select * from users;

DISTSTYLE EVEN

DISTSTYLE EVEN lets RedShift distribute rows evenly across all slices in round-robin fashion. This is typically used when no good distribution key can be selected, or when all columns in the table are of low cardinality (a rare situation).
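A minimal sketch of the syntax, reusing the users table from the DISTKEY example:

```sql
-- Round-robin distribution: rows are spread evenly across slices,
-- no distribution column required.
create table users_even diststyle even
as select * from users;
```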


DISTSTYLE ALL

This distribution method is very handy for small dimension tables: when DISTSTYLE ALL is selected, RedShift copies the entire table to every node in the cluster. Having a full copy on every node really helps when joining a large fact table to a small dimension table, since the dimension data is present locally on each node.
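A minimal sketch, with illustrative table and column names:

```sql
-- A full copy of this small dimension table lands on every node,
-- so joins against a large fact table never move dimension rows.
create table dim_state (
    state_id   smallint,
    state_name varchar(32)
)
diststyle all;
```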


Sort Keys

Having covered distribution, sort keys are the next concept. Sort keys are very similar to data partitioning in MySQL and are usually datetime types. For example, with a large, high-cardinality ORDERS table, we might use ORDER_ID as the distribution key and ORDER_DATE as the sort key.


Having ORDER_DATE as the sort key keeps the data sorted and updates RedShift's zone maps, which record where each range of the data is located.


For example, when you select all orders for January and ORDER_DATE is the sort key, the zone maps direct the query to the exact blocks where the January data is located, greatly reducing io.
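Putting distribution and sorting together - a minimal sketch with illustrative table and column names:

```sql
-- ORDER_ID (high cardinality) spreads rows across all slices;
-- ORDER_DATE keeps blocks sorted so zone maps can skip whole blocks.
create table orders (
    order_id   bigint    not null,
    order_date timestamp not null,
    amount     decimal(12,2)
)
distkey (order_id)
sortkey (order_date);

-- The zone maps limit this range scan to blocks holding January data.
select count(*), sum(amount)
from orders
where order_date >= '2017-01-01'
  and order_date <  '2017-02-01';
```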


Zone Maps

Zone maps are in-memory metadata stored on the compute nodes, pointing to the location of the data on the slices. Using sort keys effectively partitions the data and keeps the zone maps updated with the data's location.


Comments

"Very precise and informative information. Thanks."

Gene Kalmens, Experienced Software Engineer:

Redshift has some features that, depending on your usage, can make it unfit for production.

1. Conflicting jobs. The only isolation level Redshift supports is serializable. This means that if one job runs a query with a nested select statement and another job makes changes invalidating the result set of that nested select, the first job will fail. To avoid this behavior, jobs need to be appropriately scheduled, or else you need to maintain a watermark column for each changing table, in order to select below the high watermark and insert above it.

2. Table maintenance. Massive inserts/deletes leave behind skewed stats and fragmented table space. Without maintenance, performance deteriorates very quickly: we saw a simple select from a 3-row table take 8 hours. Tables need to be analyzed and vacuumed, which itself may take a very long time, and the most disappointing behavior is that only one vacuum job can run at any given time in the entire cluster. These problems made us decide to stop using Redshift.

Disclaimer: my involvement with Redshift stopped in June 2017.
