Storage Comes At a Price

In the year 2012, IDC and EMC estimated the size of the digital universe (that is, all the digital data created, replicated and consumed in that year) to be 2,837 exabytes (EB) and forecast this to grow to 40,000EB by 2020 — a doubling time of roughly two years. One exabyte equals a thousand petabytes (PB), or a million terabytes (TB), or a billion gigabytes (GB). So by 2020, according to IDC and EMC, the digital universe will amount to over 5,200GB per person on the planet.

This sort of data growth clearly requires heroic amounts of storage to keep data available for as long as it's required, and to archive much of it thereafter. With that in mind, I'd like to share a few thoughts on how to define the storage model in an analytical data warehouse.

Here are some of the topics that we will be covering today:

• “Steps to Properly Manage and Store the Data”,

• “Choosing the table storage model in Greenplum Database” (GPDB) — The World's First Open Source Shared-Nothing Massively Parallel Processing (MPP) Data Warehouse, and

• An overview of “Greenplum's Polymorphic Data Storage”.

“Steps to Properly Manage and Store the Data”
  1. Know your data.
  2. Don't miss insights by neglecting unstructured data.
  3. Understand your compliance needs.
  4. Establish a data retention policy.
  5. Look for a solution that fits your data needs, not the other way around.
  6. Don't let price dictate your decision.
  7. Don't store redundant copies of your data.
  8. Make sure your data is secure.
  9. Leverage technologies that use deduplication.
  10. Make sure you can access the data you need, once archived.
  11. Eliminate data loss by adding a replication factor.
  12. Have a disaster recovery plan.

“Choosing the proper table storage model in GPDB”

Table storage causes a lot of head-scratching among developers, and it all starts at table creation. Choosing a storage model comes with experience. Whenever you create a table, you can pick the parameters that define its storage policy and storage mechanism. GPDB provides several storage models and a mix of storage options when creating tables.

• 1. Heap Storage: Heap storage works best with On-Line Transaction Processing (OLTP)-type workloads, where the data is often modified after it is initially loaded. Update and delete operations require storing row-level versioning information to ensure reliable database transaction processing. Heap tables are best suited for smaller tables, such as dimension tables, that are often updated after they are initially loaded. Row-oriented heap tables are the default storage type in GPDB.

Use Heap storage for tables and partitions that will receive concurrent update, delete, and insert operations.
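
To make this concrete, here is a minimal sketch of a heap dimension table (the table and column names are illustrative, not from the original example). Since heap, row-oriented storage is the GPDB default, no storage options are needed:

    -- Heap, row-oriented storage is the default, so no WITH clause is required.
    CREATE TABLE customer_dim (
        customer_id   int,
        customer_name text,
        region        text
    )
    DISTRIBUTED BY (customer_id);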

• 2. Append-Optimized Storage: The AO storage model is good for large, denormalized fact tables in a data warehouse environment. These tables are typically loaded in batches and accessed by read-only queries. Moving large fact tables to an append-optimized storage model eliminates the storage overhead of the per-row update-visibility information, saving about 20 bytes per row.

Use AO storage for tables and partitions that are updated infrequently after the initial load. Avoid singleton insert, update, or delete operations on AO tables, and avoid concurrent batch update or delete operations as well. Space occupied by updated and deleted rows in AO tables is not recovered and reused as efficiently as with heap tables, so the AO storage model is inappropriate for frequently updated tables. It is intended mainly for large tables that are loaded once, updated infrequently, and queried frequently for analytical query processing.
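
As a sketch, assuming the classic appendonly=true storage option (renamed appendoptimized in newer releases; table and column names are illustrative):

    -- Append-optimized fact table, loaded in batches and read by analytics.
    CREATE TABLE sales_fact (
        sale_id   bigint,
        sale_date date,
        amount    decimal(10,2)
    )
    WITH (appendonly=true)
    DISTRIBUTED BY (sale_id);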

• 3. Choosing Row- or Column-Oriented Storage: Greenplum created a hybrid storage layer: you can store data in rows, in columns, or in a combination of both. The fact is that for some workloads, row-oriented storage can get you an order-of-magnitude performance improvement over column stores, and for other workloads, column-oriented storage can get you an order-of-magnitude improvement over row stores. Having both together is a big win.

Use row-oriented storage for workloads with iterative transactions, where updates are required and inserts are frequent; use column-oriented storage for tables that have individual columns which are regularly updated without modifying the other columns in the row. If you have mixed workloads, the best practice is to choose row-oriented storage, as it offers the best combination of flexibility and performance.
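
A minimal sketch of a column-oriented table (names are illustrative); the orientation option applies on top of append-optimized storage:

    -- Column-oriented AO table: ideal when queries touch a few of many columns.
    CREATE TABLE clickstream (
        event_time timestamp,
        user_id    bigint,
        url        text,
        referrer   text
    )
    WITH (appendonly=true, orientation=column)
    DISTRIBUTED BY (user_id);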

NOTE: Business and IT users should be aware of what to query (and what not to) when a table is columnar or a combination of both, because reading a complete row from a wide columnar table requires more time and I/O than reading the same row from a row-oriented table.

Column orientation stores column values together on disk; a separate file is created for each column. If the table is partitioned, a separate file is created for each column of each partition. Do not create too many partitions with column-oriented storage, because it multiplies the number of physical files on every segment. Here is how the math works: “Number of Physical Files = Number of Segments * Columns * Partitions”
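
As a hypothetical worked example: on a cluster with 16 segments, a 100-column table split into 36 monthly partitions would produce 16 * 100 * 36 = 57,600 physical files.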

Column-oriented tables can offer optimal query performance on large tables with many columns, where only a small subset of columns are accessed by the queries. Also, column-oriented tables compress better than row-oriented tables.

It is “very important” to know when to use Heap vs. Append-Optimized (AO) storage, and when to use Row-Oriented vs. Column-Oriented storage.

“Greenplum's Polymorphic Data Storage”

Greenplum’s Polymorphic Storage is a very useful feature of GPDB: tables can be created and altered with a mixture of columnar, compressed, and row-oriented storage. GPDB offers a variety of options to compress AO tables and partitions. In general, there are two types of in-database compression in GPDB.

  1. Table-level compression, which applies to an entire table.
  2. Column-level compression, which applies to a specific column. You can apply different column-level compression algorithms to different columns of the same table, as in the sketch below.
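
Here is a hedged sketch of per-column compression using ENCODING clauses (the table, columns, and compression levels are illustrative):

    -- Different compression algorithms on different columns of one
    -- append-optimized, column-oriented table.
    CREATE TABLE web_logs (
        log_time timestamp ENCODING (compresstype=zlib, compresslevel=5),
        status   int       ENCODING (compresstype=rle_type),
        message  text      ENCODING (compresstype=quicklz)
    )
    WITH (appendonly=true, orientation=column)
    DISTRIBUTED BY (log_time);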

Use compression techniques to improve I/O across the system by allowing more data to be read with each disk read operation. Also keep in mind that your segment systems must have available CPU power to compress and uncompress the data. Performance with compressed AO tables depends on hardware, query tuning settings, and other factors.

One of the cool features that came up in Greenplum recently is the ability to define external partitions within partitioned tables. Partitioning reduces the amount of data to be scanned by reading only the relevant data needed to satisfy a query. Storing data by partition makes it easy to get good performance from a query when you only care about a subset of the data; large fact tables are good candidates for table partitioning.

With the polymorphic storage, you can set different storage parameters for different partitions. 

For example, you might store the most recent months as rows, because all the queries on recent data are going to do a “select * from ..” and show the whole row in the user interface. The next set of months you might store as columnar, because your business partners won't query them as much, and when they do, they probably won't need all the columns; so you can have a combination of months stored columnar and uncompressed, and columnar with compression. The less-accessed, archived data is rolled off to Hadoop or a Network-Attached Storage (NAS) share, and you can still reach it through external partitions of the partitioned table.

Hadoop’s storage costs are substantially lower, but HDFS automatically makes three copies of each file across three separate nodes of the Hadoop cluster, and duplicating data three times does consume more space. NAS provides comprehensive data access and file sharing, eliminates the three-way duplication, increases efficiency with centralized storage, and gives you an additional layer of security. Based on your business needs, you can archive data in Hadoop or NAS and still query it from Greenplum.

GPDB has a concept called parallel data loading and unloading, using external tables. External tables let you store data outside of the database and query it directly, in parallel, using SQL commands. You can use the GPHDFS connector to go out and get the data from Hadoop or Isilon into GPDB. As far as the end user is concerned, all of the data is queried the same way, as if it were in the database.
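
As a sketch, a readable external table over HDFS might look like this (the host, port, path, and file format are assumptions for illustration):

    -- External table pointing at archived 2012 sales data in Hadoop.
    CREATE EXTERNAL TABLE ext_sales_2012 (
        sale_id   bigint,
        sale_date date,
        amount    decimal(10,2)
    )
    LOCATION ('gphdfs://hdfs-namenode:8020/archive/sales/2012')
    FORMAT 'TEXT' (DELIMITER '|');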

Defining a Date-Range Table Partition:
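
Here is a hedged reconstruction of such a table; the column list and partition names are illustrative, but the per-partition storage options follow the list below:

    CREATE TABLE sales (
        sale_id   bigint,
        sale_date date,
        amount    decimal(10,2)
    )
    DISTRIBUTED BY (sale_id)
    PARTITION BY RANGE (sale_date)
    (
        PARTITION yr2012 START (date '2012-01-01') END (date '2013-01-01'),
        PARTITION yr2014 START (date '2014-01-01') END (date '2015-01-01'),
        PARTITION jan15  START (date '2015-01-01') END (date '2015-02-01')
            WITH (appendonly=true, orientation=column, compresstype=zlib),
        PARTITION feb15  START (date '2015-02-01') END (date '2015-03-01')
            WITH (appendonly=true, orientation=column, compresstype=rle_type),
        PARTITION apr15  START (date '2015-04-01') END (date '2015-05-01')
            WITH (appendonly=true, orientation=column),
        PARTITION may15  START (date '2015-05-01') END (date '2015-06-01')
            WITH (appendonly=true, compresstype=quicklz)
    );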

As you can see, the sales table has six partitions, each using a different storage technique.

           • January 2015 is columnar and compressed with ZLIB.

           • February 2015 is columnar and compressed with RLE_TYPE.

           • April 2015 is columnar and uncompressed.

           • May 2015 is row-oriented and compressed with QUICKLZ.

           • Years 2014 and 2012 are row-oriented and not compressed.

You can merge the data into one single table by taking the GPHDFS external tables and swapping them in as partitions of the partitioned table, as sketched below. If you select something from “January 2015”, “February 2014” & “March 2012”, the query reads only one column-oriented partition from GPDB, one partition from Hadoop, and another from NAS.
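
A hedged sketch of the exchange, reusing the hypothetical ext_sales_2012 external table from above (WITHOUT VALIDATION is required when exchanging a partition with an external table):

    -- Swap the year-2012 partition for the external table in Hadoop;
    -- queries on sales now read 2012 data transparently from HDFS.
    ALTER TABLE sales EXCHANGE PARTITION yr2012
        WITH TABLE ext_sales_2012 WITHOUT VALIDATION;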

When you do a “select * from sales;” the results merge the data in GPDB, the data out in Hadoop, and the data in Isilon, all coming back from one select statement. It's pretty awesome to merge it all into one table by having external partitions within partitioned tables.

Let me conclude by saying it again, “Storage Comes At a Price” — Be Wise!

References

  1. EMC Pulse Product & Technology Blog.
  2. CIO Magazine.
  3. Pivotal Greenplum.
  4. Inspired by Pivotal Open Source Hub.
