Amazon RedShift for MySQL DBAs

What is RedShift

Amazon RedShift is a data analytics database provided as a service by AWS, designed specifically for analyzing data using standard SQL. RedShift was built on top of the ParAccel MPP (massively parallel processing) database and also has roots in PostgreSQL 8.0, with the exception of the storage engine, which is very different from PostgreSQL's. RedShift does, however, support standard PostgreSQL clients and drivers. Amazon RedShift is designed to handle petabytes of data, while RedShift Spectrum is designed to handle exabytes.

Columnar Data Store

Unlike an RDBMS, Amazon RedShift does not store data in rows; it stores data in columns that are separate physical entities. As in an RDBMS the data is presented as a table, but physically each column is stored on its own. Why columns? Simply to reduce io (input/output) - that is, the disk reads needed to retrieve the data. Since a user rarely selects all the columns of a table (SELECT * FROM table_name), storing data in columns reduces disk access by reading only the columns named in the select statement; for example, SELECT FIRST_NAME, REGION, ORDER_ID FROM orders retrieves only those three columns from disk. Columnar storage also aids compression, since the data stored in each column is uniform - for example, FIRST_NAME stores only first names. A column store closely resembles an index on a single column in an RDBMS.

A column store, while very efficient for selecting data, is very inefficient for OLTP operations: inserting a single row means physically recording each item in its own column, which increases disk io because every item in the row requires an individual io, unlike an RDBMS, which can record the whole row in a single physical object - the table. This is also why adding indexes to an RDBMS increases io: the more indexes on a table, the more io each write generates.


One could incorrectly conclude that a column store is nothing more than an RDBMS with an index on every column, but the real difference is how the column store maps data to storage. In an RDBMS index the rowid is looked up from the data; in a column store the data is looked up from the rowid. As a result, retrieving all the data for a given object is inefficient in a column store: fetching a single row requires multiple io's, versus a single io in an RDBMS.


In theory a column store is much slower than an RDBMS - it takes multiple io's to record the individual items of a row and multiple io's to retrieve a single row, i.e. all the items for a specific object. In practice, however, one rarely needs every item for an object: in a rolodex one usually looks up either the address or the phone number of a person. That selective retrieval is the main advantage of the column store - because its operations are selective, it can retrieve the data in far fewer io's than an RDBMS.

RedShift Nodes

AWS RedShift has only two categories of nodes:

  • Leader node - one per cluster; stores metadata (pg_catalog), coordinates parallel SQL processing, runs the optimizer, and compiles SQL into C++ for the compute nodes to execute
  • Compute node - up to 120 nodes; stores the actual data and handles backup and restore and data loading


Each RedShift cluster has exactly one leader node and one or more compute nodes. The leader node receives and parses queries; the compute nodes actually store the data.

All SQL interaction, whether via psql or drivers, is performed through the leader node.

Compute nodes, zone maps, slices, columns and blocks

Slices

AWS RedShift stores and distributes data in slices; a slice is a virtual compute node. Each compute node is virtually partitioned into slices - depending on the node size there can be 2, 16, or 32 slices - and slices are not pinned to CPUs. Data is stored per slice, making the slice the unit of data partitioning: the data is distributed (sharded) across slices, so one can think of slices as data shards.
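The slice layout of a cluster can be inspected from SQL; a minimal sketch, assuming access to the STV_SLICES system table:

```sql
-- Show how slices map onto compute nodes in the current cluster,
-- one row per slice.
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```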


Zone Maps

Zone maps are in-memory metadata - the min and max values for each data block; simply put, a map pointing to the location of the blocks. Adding a sort key greatly helps zone maps locate blocks of data. Sort keys are usually based on the query pattern and are typically timestamps. Together, sort keys and zone maps are similar to partitioning in an RDBMS. For example, if a table has a primary key (usually the distribution key) and a timestamp sort key, the sort key sorts the data and in effect updates the zone map. When a query looks for a specific range, the zone map points to the exact blocks where the data is located - much like partitioning by date in MySQL, where selecting a range goes straight to the partition holding the data.

Creating Tables - Compression, encoding, distribution and sorting


Creating RedShift tables is simpler than in an RDBMS: primary keys and foreign keys are not enforced, and there are no constraints, no triggers, no stored procedures, and only two types of optimization - distribution and sorting. In addition, since table data is stored in individual physical structures - columns - each column can be encoded differently from the rest. RedShift supports about a dozen compression encodings; the appropriate encoding depends on the data type.

There are only three aspects to RedShift optimization:

  • Encoding
  • Distribution
  • Sorting

Encoding

Since every column is a separate physical entity, it makes sense that each column can be encoded differently. Why bother with different encodings? Different data types compress differently depending on the encoding, and since the main goal in RedShift is to reduce io as much as possible, selecting the correct encoding makes absolute sense.


If the default encoding is used, RedShift encodes sort keys in RAW; numeric and boolean types are also encoded RAW. All other types are encoded with LZO. RAW means no compression, which makes sense for sort keys and distribution keys - since they update the zone maps - and for numeric types.


For datetime columns, delta encoding is recommended. There are many more encoding types, and the selection depends on the data type.


Here is an example create table with different encoding types:


create table schemaname.encodingshipmode (
    moderaw       varchar(22) encode raw,
    modebytedict  varchar(22) encode bytedict,
    modelzo       varchar(22) encode lzo,
    moderunlength varchar(22) encode runlength,
    modetext255   varchar(22) encode text255,
    modetext32k   varchar(22) encode text32k
);
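The encoding actually applied to each column can be verified from the catalog; a minimal sketch, assuming the schema above is on the search_path (PG_TABLE_DEF only lists tables in schemas on the search_path):

```sql
-- Make the example schema visible to PG_TABLE_DEF.
SET search_path TO schemaname;

-- One row per column, with the encoding chosen for it.
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'encodingshipmode';
```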

Distribution

Distribution spreads the data across nodes; large fact tables are distributed across all nodes in the cluster. The data is actually distributed by slice - the slice being the virtual compute node in RedShift. Each physical node is divided into 2, 16, or 32 slices.


There are three ways to distribute the data on slices:


  • DISTKEY
  • DISTSTYLE EVEN
  • DISTSTYLE ALL

DISTKEY

With DISTKEY distribution you select a column to be used as the distribution criterion. The goal is to choose a high-cardinality column so the data spreads evenly across all the slices.


Example of using a distribution key (note that a low-cardinality column such as state can skew the data; an id column is usually a better choice):


create table userskey distkey(state) as select * from users;

DISTSTYLE EVEN

DISTSTYLE EVEN lets RedShift distribute rows evenly across all slices in round-robin fashion. This is typically used when no good distribution key can be selected, or when all columns in the table are of low cardinality (a rare situation).
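A minimal sketch of the syntax, reusing the users table from the DISTKEY example:

```sql
-- Round-robin distribution: rows are spread evenly across slices,
-- no distribution column required.
create table users_even diststyle even
as select * from users;
```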


DISTSTYLE ALL

This distribution method is very handy for small dimension tables: when DISTSTYLE ALL is selected, RedShift copies the entire table to every node in the cluster. Having a full copy on every node really helps when joining a large fact table to a small dimension table, since the dimension data is present locally on each node.
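A minimal sketch, with illustrative table and column names:

```sql
-- A full copy of this small dimension table lands on every node,
-- so joins against a large fact table never move dimension rows.
create table dim_state (
    state_id   smallint,
    state_name varchar(32)
)
diststyle all;
```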


Sort Keys

Having covered distribution, sort keys are the next concept. Sort keys are very similar to data partitioning in MySQL and are usually datetime types. For example, with a large, high-cardinality ORDERS table, we might use ORDER_ID as the distribution key and ORDER_DATE as the sort key.


Having ORDER_DATE as the sort key keeps the data sorted and updates RedShift's zone maps, which record where each range of the data is located.


For example, when you select all orders for January and ORDER_DATE is the sort key, the zone maps direct the query to the exact blocks where the January data is located, greatly reducing io.
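Putting distribution and sorting together - a minimal sketch with illustrative table and column names:

```sql
-- ORDER_ID (high cardinality) spreads rows across all slices;
-- ORDER_DATE keeps blocks sorted so zone maps can skip whole blocks.
create table orders (
    order_id   bigint    not null,
    order_date timestamp not null,
    amount     decimal(12,2)
)
distkey (order_id)
sortkey (order_date);

-- The zone maps limit this range scan to blocks holding January data.
select count(*), sum(amount)
from orders
where order_date >= '2017-01-01'
  and order_date <  '2017-02-01';
```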


Zone Maps

Zone maps are in-memory metadata stored on the compute nodes, pointing to the location of the data on the slices. Using sort keys effectively partitions the data and keeps the zone maps updated with the data's location.


Comments

"Very precise and informative information. Thanks."

Gene Kalmens, Experienced Software Engineer:

Redshift has some features that, depending on your usage, can make it unfit for production.

1. Conflicting jobs. The only isolation level Redshift supports is serializable. This means that if one job runs a query with a nested select statement and another job makes changes invalidating the result set of that nested select, the first job will fail. To avoid this behavior, jobs need to be appropriately scheduled, or else you need to maintain a watermark column for each changing table, in order to select below the high watermark and insert above it.

2. Table maintenance. Massive inserts/deletes leave behind skewed stats and fragmented table space. Without maintenance, performance deteriorates very quickly: we saw a simple select from a 3-row table take 8 hours. Tables need to be analyzed and vacuumed, which itself may take a very long time, and the most disappointing behavior is that only one vacuum job can run at any given time in the entire cluster. These problems made us decide to stop using Redshift.

Disclaimer: my involvement with Redshift stopped in June 2017.
