Manipulating Result Sets & Predicting the Likelihood of Future Events Using Greenplum

Like all other databases, Greenplum Database (GPDB) can do different kinds of computations: it can add, subtract, multiply and divide, it can do computations on dates, it can count rows and sum row values, and so on. But what sets it apart from other databases is its advanced analytic functions, which allow users to compose complex Online Analytical Processing (OLAP) queries more easily, using standard SQL. Last but not least, it can also do quite sophisticated statistics using MADlib built-in functions, by leveraging its Massively Parallel Processing (MPP), shared-nothing architecture.

The most basic way to compute statistics in SQL is the GROUP BY clause, and it is really powerful. Let’s look at the details of the GROUP BY clause, starting with the basics, and then explore some advanced analytics to predict the probability of future events.

For your reference: the examples in this article run against medal data from the 2008, 2012 and 2016 Summer Olympics and a sample weather data set.

  • How does “GROUP BY” work?

Let's start by reminding ourselves how the GROUP BY clause works. An aggregate function takes multiple rows of data returned by a query and aggregates them into a single result row.

If we now include the GROUP BY clause, it limits the window of data processed by the aggregate function. This way we get an aggregated value for each distinct combination of values in the columns listed in the GROUP BY clause. The GROUP BY clause comes right after the WHERE clause in a SQL query, and can group rows by more than one column.
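As a minimal sketch (the olympic_medals table and its columns are assumptions, since the article’s original example listing is not reproduced here):

```sql
-- One output row per distinct year, with the aggregate computed per group.
SELECT year,
       count(*) AS medal_count
FROM olympic_medals          -- hypothetical table: (year, country, athlete, medal)
GROUP BY year
ORDER BY year;
```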

With GROUP BY, the aggregates (count, sum, avg, min, max, and others) are computed for each group separately. In the example above, the database counts the number of medals won during a given year.

Note: Rows with NULL values all go into one group, and the aggregates are computed for this group just as for any other. This also holds for multi-column GROUP BYs.

  • How does “ROLLUP” work?

In addition to the regular aggregation results, the ROLLUP extension produces group subtotals from right to left and a grand total. If “n” is the number of columns listed in the ROLLUP, there will be “n+1” levels of subtotals.

A ROLLUP grouping can be thought of as a series of grouping sets. GROUP BY ROLLUP (a,b,c) will generate GROUP BY GROUPING SETS ((a, b, c), (a, b), (a)) and a row for an aggregation of all selected rows. It is simply equivalent to four SELECT queries with various GROUP BY clauses. The additional rows are sometimes called super aggregates.
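A sketch over two columns of a hypothetical olympic_medals table, giving n+1 = 3 levels of aggregation:

```sql
-- Produces: per (year, country) counts, per-year subtotals (country is NULL),
-- and one grand-total row (both columns NULL).
SELECT year, country, count(*) AS medal_count
FROM olympic_medals
GROUP BY ROLLUP (year, country);

-- Equivalent to:
--   GROUP BY GROUPING SETS ((year, country), (year), ())
```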

  • How does “CUBE” work?

In addition to the subtotals generated by the GROUP BY ROLLUP extension, the CUBE extension will generate subtotals for all combinations of the dimensions specified. If “n” is the number of columns listed in the CUBE, there will be “2^n” subtotal combinations.

It works similarly to ROLLUP, but it generates all possible combinations of the columns: (a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), plus a row aggregating all selected rows.
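A sketch with two dimensions (same assumed table as above), yielding 2^2 = 4 grouping combinations:

```sql
-- Grouping combinations: (year, medal), (year), (medal), and ().
SELECT year, medal, count(*) AS medal_count
FROM olympic_medals
GROUP BY CUBE (year, medal);
```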

  • How does “GROUP BY GROUPING SETS” work?

GROUPING SETS works similarly, but it allows you to specify exactly which column combinations should appear in the result. The grouping sets are separated by commas, and any set with more than one column must be put in parentheses. An empty grouping set, written as (), produces the grand-total row with aggregates over the whole result set.
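A sketch requesting only the combinations we care about (table and columns are assumptions carried over from the earlier examples):

```sql
-- Only the requested combinations: per (year, country), per year,
-- and the grand total from the empty grouping set ().
SELECT year, country, count(*) AS medal_count
FROM olympic_medals
GROUP BY GROUPING SETS ((year, country), (year), ());
```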

  • How does the “WINDOW” clause work?

The WINDOW clause is used to define a named window that can be referenced in the OVER() expression of a window function. Keep in mind that window functions are evaluated late in query processing: after the WHERE, GROUP BY and HAVING clauses have already been applied.
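A sketch of a named window applied on top of grouped results, which also shows that window functions run after GROUP BY (the table and its columns remain assumptions):

```sql
-- Rank countries by medal count within each year.
-- The window w can see the grouped aggregate count(*) because
-- window functions are evaluated after GROUP BY.
SELECT year,
       country,
       count(*) AS medals,
       rank() OVER (w) AS rank_in_year
FROM olympic_medals
GROUP BY year, country
WINDOW w AS (PARTITION BY year ORDER BY count(*) DESC);
```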

There is SQL before window functions and SQL after window functions; that's how powerful they are. The whole idea behind window functions is to let you process several rows of the result set at a time: you look through a window onto a set of rows and compute a single output value from them, much like an aggregate function, but without collapsing those rows into one.

Let me make this post more interesting by calculating the probability that people in a given community, sharing the same interests, will say Yes or No to an outdoor data science meetup, based on weather conditions.

Step_1: Unpivot the data (transform the one-dimensional attribute array into rows). If the data is not denormalized, create a view with the identification and classification columns that unpivots all the values, so that the attribute names become the values of a field called attr. If the data is already in denormalized form, you do not need to unpivot it.
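A sketch of such an unpivot view, assuming a raw table that stores each person's weather attributes in a text array (the raw table, its columns, and the attribute names are all hypothetical):

```sql
-- Raw form: one row per person, attributes packed into an array.
CREATE TABLE outdoor_meetup_signup (
    id         int,
    signup     text,    -- classification: 'Yes' or 'No'
    attributes text[]   -- e.g. {outlook, temperature, humidity, wind}
) DISTRIBUTED BY (id);

-- Unpivot: one row per (person, attribute); the attribute names
-- become the values of the field attr.
CREATE VIEW outdoor_meetup_signup_unpivot AS
SELECT id,
       signup,
       unnest(ARRAY['outlook','temperature','humidity','wind']) AS attr,
       unnest(attributes) AS value
FROM outdoor_meetup_signup;
```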

Step_2: Create an "outdoor_meetup_signup_nb_count" table, which pivots the view of the data on the values of the attr field. The SQL below uses pivot_sum, an advanced aggregate function that aggregates the counts and resolves duplicate entries.
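A sketch of this step, in the style of Greenplum's built-in Naive Bayes example; pivot_sum takes the array of class labels, the class column, and the value to sum per class (column names here are assumptions):

```sql
-- One row per (attr, value) pair, with per-class counts packed into an array:
-- signup_count[1] = count of 'Yes' rows, signup_count[2] = count of 'No' rows.
CREATE TABLE outdoor_meetup_signup_nb_count AS
SELECT attr,
       value,
       pivot_sum(ARRAY['Yes','No'], signup, 1) AS signup_count
FROM outdoor_meetup_signup_unpivot
GROUP BY attr, value
DISTRIBUTED BY (attr);
```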

Step_3: Create a summary view (outdoor_meetup_signup_nb_classify_functions) from "outdoor_meetup_signup_nb_count" table.
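A sketch of the summary view, again following the shape of Greenplum's documented Naive Bayes example (column names are assumptions):

```sql
-- Adds, per attribute, the per-class totals and the number of distinct
-- attribute values, which nb_classify / nb_probabilities need as inputs.
CREATE VIEW outdoor_meetup_signup_nb_classify_functions AS
SELECT attr,
       value,
       signup_count,
       ARRAY['Yes','No'] AS classes,
       sum(signup_count) OVER (wa)::integer[] AS class_total,
       count(DISTINCT value) OVER (wa) AS attr_count
FROM outdoor_meetup_signup_nb_count
WINDOW wa AS (PARTITION BY attr);
```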

Step_4: Aggregate the data with nb_classify, nb_probabilities, or both. These are advanced aggregate functions: nb_classify predicts the classification of new rows and returns the class with the largest likelihood, whereas nb_probabilities predicts the classification of new rows and returns the probability of each class. Decide what to classify and load those rows into a table; let us call it "predicting_the_likelihood_of_future_events".
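A sketch of the table of rows to classify (the column layout and the sample weather values are illustrative assumptions):

```sql
-- Rows to classify: weather conditions for upcoming meetup dates.
CREATE TABLE predicting_the_likelihood_of_future_events (
    id         int,
    attributes text[]
) DISTRIBUTED BY (id);

INSERT INTO predicting_the_likelihood_of_future_events VALUES
    (1, ARRAY['sunny',    'mild', 'normal', 'weak']),
    (2, ARRAY['overcast', 'cool', 'high',   'strong']);
```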

The SQL below aggregates the data against the above table. The result gives the Yes / No classification and the probability of signing up for an outdoor data science meetup for each set of conditions in the table t.
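A sketch of the final aggregation, under the same hypothetical column names used in the earlier steps: unpivot the rows to classify, join each (attr, value) pair against the summary view, then aggregate per row with the Naive Bayes functions.

```sql
SELECT t.id,
       nb_classify(f.classes, f.attr_count, f.signup_count, f.class_total)      AS signup,
       nb_probabilities(f.classes, f.attr_count, f.signup_count, f.class_total) AS probabilities
FROM (
    -- Unpivot the conditions to classify into (attr, value) rows.
    SELECT id,
           unnest(ARRAY['outlook','temperature','humidity','wind']) AS attr,
           unnest(attributes) AS value
    FROM predicting_the_likelihood_of_future_events
) t
JOIN outdoor_meetup_signup_nb_classify_functions f USING (attr, value)
GROUP BY t.id;
```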

Conclusion:

The bottom line: learning how to handle data inside the database creates a lot of opportunities for predictive analytics and machine learning at scale.

References

  1. Pivotal Greenplum.
  2. Apache MADlib.

