Storage Comes At a Price

In the year 2012, IDC and EMC estimated the size of the digital universe (that is, all the digital data created, replicated and consumed in that year) to be 2,837 exabytes (EB) and forecast this to grow to 40,000EB by 2020 — a doubling time of roughly two years. One exabyte equals a thousand petabytes (PB), or a million terabytes (TB), or a billion gigabytes (GB). So by 2020, according to IDC and EMC, the digital universe will amount to over 5,200GB per person on the planet.

This sort of data growth clearly requires heroic amounts of storage to keep data available for as long as it's required, and to archive much of it thereafter. With that in mind, I'd like to share a few thoughts on how to define the storage model in an analytical data warehouse.

Here are some of the topics that we will be covering today:

• “Steps to Properly Manage and Store the Data”,

• “Choosing the table storage model in Greenplum Database” (GPDB) — The World's First Open Source Shared-Nothing Massively Parallel Processing (MPP) Data Warehouse, and

• An overview of “Greenplum's Polymorphic Data Storage”.

“Steps to Properly Manage and Store the Data”
  1. Know your data.
  2. Don't miss insights by neglecting unstructured data.
  3. Understand your compliance needs.
  4. Establish a data retention policy.
  5. Look for a solution that fits your data needs, not the other way around.
  6. Don't let price dictate your decision.
  7. Don't store redundant copies of your data.
  8. Make sure your data is secure.
  9. Leverage technologies that use deduplication.
  10. Make sure you can access the data you need, once archived.
  11. Eliminate data loss by adding a replication factor.
  12. Have a disaster recovery plan.

“Choosing the proper table storage model in GPDB”

Table storage causes a lot of head-scratching among developers, and it all starts at table creation. Choosing a storage model comes with experience. Whenever you create a table, you can pick the parameters that define its storage policy and storage mechanism. GPDB provides several storage models and a mix of storage options when creating tables.

• 1. Heap Storage: Heap storage works best with On-Line Transaction Processing (OLTP)-type workloads, where the data is often modified after it is initially loaded. Update and delete operations require storing row-level versioning information to ensure reliable database transaction processing. Heap tables are best suited for smaller tables, such as dimension tables, that are often updated after they are initially loaded. Row-oriented heap tables are the default storage type in GPDB.

Use Heap storage for tables and partitions that will receive concurrent update, delete, and insert operations.
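
To make this concrete, here is a minimal sketch of a heap dimension table (the table and column names are illustrative, not from the original example). Since heap, row-oriented storage is the GPDB default, no storage options are needed:

    -- Heap, row-oriented storage is the default, so no WITH clause is required.
    CREATE TABLE customer_dim (
        customer_id   int,
        customer_name text,
        region        text
    )
    DISTRIBUTED BY (customer_id);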

• 2. Append-Optimized Storage: The AO storage model is good for large, denormalized fact tables in a data warehouse environment. These tables are typically loaded in batches and accessed by read-only queries. Moving large fact tables to an append-optimized storage model eliminates the storage overhead of the per-row update-visibility information, saving about 20 bytes per row.

Use AO storage for tables and partitions that are updated infrequently after the initial load. Avoid singleton insert, update, or delete operations on AO tables, and avoid concurrent batch update or delete operations as well. Space occupied by updated and deleted rows in AO tables is not recovered and reused as efficiently as with heap tables, so the AO storage model is inappropriate for frequently updated tables. It is intended mainly for large tables that are loaded once, updated infrequently, and queried frequently for analytical query processing.
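
As a sketch, assuming the classic appendonly=true storage option (renamed appendoptimized in newer releases; table and column names are illustrative):

    -- Append-optimized fact table, loaded in batches and read by analytics.
    CREATE TABLE sales_fact (
        sale_id   bigint,
        sale_date date,
        amount    decimal(10,2)
    )
    WITH (appendonly=true)
    DISTRIBUTED BY (sale_id);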

• 3. Choosing Row- or Column-Oriented Storage: Greenplum created a hybrid storage layer: you can store data in rows, in columns, or in a combination of both. The fact is that for some workloads, row-oriented storage can get you an order-of-magnitude performance improvement over column stores, and for other workloads, column-oriented storage can get you an order-of-magnitude improvement over row stores. Having both together is a big win.

Use row-oriented storage for workloads with iterative transactions, where updates are required and inserts are frequent; use column-oriented storage for tables that have individual columns which are regularly updated without modifying the other columns in the row. If you have mixed workloads, the best practice is to choose row-oriented storage, as it offers the best combination of flexibility and performance.
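
A minimal sketch of a column-oriented table (names are illustrative); the orientation option applies on top of append-optimized storage:

    -- Column-oriented AO table: ideal when queries touch a few of many columns.
    CREATE TABLE clickstream (
        event_time timestamp,
        user_id    bigint,
        url        text,
        referrer   text
    )
    WITH (appendonly=true, orientation=column)
    DISTRIBUTED BY (user_id);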

NOTE: Business and IT users should be aware of what to query (and what not to) when a table is columnar or a combination of both, because reading a complete row from a wide columnar table requires more time and I/O than reading the same row from a row-oriented table.

Column orientation stores column values together on disk; a separate file is created for each column. If the table is partitioned, a separate file is created for each column of each partition. Do not create too many partitions with column-oriented storage, because it multiplies the number of physical files on every segment. Here is how the math works: “Number of Physical Files = Number of Segments * Columns * Partitions”
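
As a hypothetical worked example: on a cluster with 16 segments, a 100-column table split into 36 monthly partitions would produce 16 * 100 * 36 = 57,600 physical files.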

Column-oriented tables can offer optimal query performance on large tables with many columns, where only a small subset of columns are accessed by the queries. Also, column-oriented tables compress better than row-oriented tables.

It is “very important” to know when to use Heap vs. Append-Optimized (AO) storage, and when to use Row-Oriented vs. Column-Oriented storage.

“Greenplum's Polymorphic Data Storage”

Greenplum’s Polymorphic Storage is a very useful feature of GPDB: tables can be created and altered with a mixture of columnar, compressed, and row-oriented storage. GPDB offers a variety of options to compress AO tables and partitions. In general, there are two types of in-database compression in GPDB.

  1. Table-level compression, which applies to an entire table.
  2. Column-level compression, which applies to a specific column. You can apply different column-level compression algorithms to different columns of the same table, as in the sketch below.
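
Here is a hedged sketch of per-column compression using ENCODING clauses (the table, columns, and compression levels are illustrative):

    -- Different compression algorithms on different columns of one
    -- append-optimized, column-oriented table.
    CREATE TABLE web_logs (
        log_time timestamp ENCODING (compresstype=zlib, compresslevel=5),
        status   int       ENCODING (compresstype=rle_type),
        message  text      ENCODING (compresstype=quicklz)
    )
    WITH (appendonly=true, orientation=column)
    DISTRIBUTED BY (log_time);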

Use compression techniques to improve I/O across the system by allowing more data to be read with each disk read operation. Also keep in mind that your segment systems must have available CPU power to compress and uncompress the data. Performance with compressed AO tables depends on hardware, query tuning settings, and other factors.

One of the cool features that came up in Greenplum recently is the ability to define external partitions within partitioned tables. Partitioning reduces the amount of data to be scanned by reading only the relevant data needed to satisfy a query. Storing data by partition makes it easy to get good performance from a query when you only care about a subset of the data; large fact tables are good candidates for table partitioning.

With the polymorphic storage, you can set different storage parameters for different partitions. 

For example, you might store the most recent months as rows, because all the queries on recent data are going to do a “select * from ..” and show the whole row in the user interface. The next set of months you might store as columnar, because your business partners won't query them as much, and when they do, they probably won't need all the columns; so you can have a combination of months stored columnar and uncompressed, and columnar with compression. The less-accessed, archived data is rolled off to Hadoop or a Network-Attached Storage (NAS) share, and you can still reach it through external partitions of the partitioned table.

Hadoop’s storage costs are substantially lower, but HDFS automatically makes three copies of each file across three separate nodes of the Hadoop cluster, and duplicating data three times does consume more space. NAS provides comprehensive data access and file sharing, eliminates the three-way duplication, increases efficiency with centralized storage, and gives you an additional layer of security. Based on your business needs, you can archive data in Hadoop or NAS and still query it from Greenplum.

GPDB has a concept called parallel data loading and unloading, using external tables. External tables let you store data outside of the database and query it directly, in parallel, using SQL commands. You can use the GPHDFS connector to go out and get the data from Hadoop or Isilon into GPDB. As far as the end user is concerned, all of the data is queried the same way, as if it were in the database.
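
As a sketch, a readable external table over HDFS might look like this (the host, port, path, and file format are assumptions for illustration):

    -- External table pointing at archived 2012 sales data in Hadoop.
    CREATE EXTERNAL TABLE ext_sales_2012 (
        sale_id   bigint,
        sale_date date,
        amount    decimal(10,2)
    )
    LOCATION ('gphdfs://hdfs-namenode:8020/archive/sales/2012')
    FORMAT 'TEXT' (DELIMITER '|');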

Defining a Date-Range Table Partition:
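
Here is a hedged reconstruction of such a table; the column list and partition names are illustrative, but the per-partition storage options follow the list below:

    CREATE TABLE sales (
        sale_id   bigint,
        sale_date date,
        amount    decimal(10,2)
    )
    DISTRIBUTED BY (sale_id)
    PARTITION BY RANGE (sale_date)
    (
        PARTITION yr2012 START (date '2012-01-01') END (date '2013-01-01'),
        PARTITION yr2014 START (date '2014-01-01') END (date '2015-01-01'),
        PARTITION jan15  START (date '2015-01-01') END (date '2015-02-01')
            WITH (appendonly=true, orientation=column, compresstype=zlib),
        PARTITION feb15  START (date '2015-02-01') END (date '2015-03-01')
            WITH (appendonly=true, orientation=column, compresstype=rle_type),
        PARTITION apr15  START (date '2015-04-01') END (date '2015-05-01')
            WITH (appendonly=true, orientation=column),
        PARTITION may15  START (date '2015-05-01') END (date '2015-06-01')
            WITH (appendonly=true, compresstype=quicklz)
    );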

As you can see, the sales table has six partitions, each using a different storage technique.

           • January 2015 is columnar and compressed with ZLIB.

           • February 2015 is columnar and compressed with RLE_TYPE.

           • April 2015 is columnar and uncompressed.

           • May 2015 is row-oriented and compressed with QUICKLZ.

           • Years 2014 and 2012 are row-oriented and not compressed.

You can merge the data into one single table by taking the GPHDFS external tables and swapping them in as partitions of the partitioned table, as sketched below. If you select something from “January 2015”, “February 2014” & “March 2012”, the query reads only one column-oriented partition from GPDB, one partition from Hadoop, and another from NAS.
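
A hedged sketch of the exchange, reusing the hypothetical ext_sales_2012 external table from above (WITHOUT VALIDATION is required when exchanging a partition with an external table):

    -- Swap the year-2012 partition for the external table in Hadoop;
    -- queries on sales now read 2012 data transparently from HDFS.
    ALTER TABLE sales EXCHANGE PARTITION yr2012
        WITH TABLE ext_sales_2012 WITHOUT VALIDATION;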

When you do a “select * from sales;” the results merge the data in GPDB, the data out in Hadoop, and the data in Isilon, all coming back from one select statement. It's pretty awesome to merge it all into one table by having external partitions within partitioned tables.

Let me conclude by saying it again, “Storage Comes At a Price” — Be Wise!

References

  1. EMC Pulse Product & Technology Blog.
  2. CIO Magazine.
  3. Pivotal Greenplum.
  4. Inspired by Pivotal Open Source Hub.
