Manipulating Result Sets & Predicting the Likelihood of Future Events Using Greenplum
Like all other databases, Greenplum Database (GPDB) can perform many kinds of computations: it can add, subtract, multiply, and divide; it can do computations on dates; it can count rows and sum row values; and so on. What sets it apart from other databases is a set of advanced analytic functions that allow users to more easily compose complex Online Analytical Processing (OLAP) queries using standard SQL commands. Last but not least, it can also do quite sophisticated statistics using MADlib built-in functions, by leveraging its Massively Parallel Processing (MPP) shared-nothing architecture.
The most basic way to compute statistics in SQL is the GROUP BY clause, and it is really powerful. Let's look at the details of the GROUP BY clause, starting with the basics, and then explore some advanced analytics to predict the probability of future events.
FYR: The examples in this article run against medal data from the 2008, 2012, and 2016 Summer Olympics and a sample weather data set.
- How “GROUP BY” works
Let's start by reminding ourselves how the GROUP BY clause works. An aggregate function takes multiple rows of data returned by a query and aggregates them into a single result row.
If we now include the GROUP BY clause, it limits the set of rows processed by each invocation of the aggregate function. This way we get an aggregated value for each distinct combination of values present in the columns listed in the GROUP BY clause. The GROUP BY clause comes right after the WHERE clause in a SQL query, and it can group rows by more than one column.
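As a minimal sketch, assume a hypothetical table olympics_medals(games_year int, country text, medal text) with one row per medal won:

-- hypothetical table and column names, for illustration only
SELECT games_year,
       count(*) AS medal_count
FROM olympics_medals
GROUP BY games_year;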
With GROUP BY, the aggregates (count, sum, avg, min, max, and others) are computed separately for each group. In the example above, the database counts the number of medals won in a given year.
Note: Rows with NULL values all go into one group, and the aggregates are computed for this group just as for any other. This also holds for multi-column GROUP BY clauses.
- How “ROLLUP” works
In addition to the regular aggregation results, the ROLLUP extension produces group subtotals from right to left and a grand total. If “n” is the number of columns listed in the ROLLUP, there will be “n+1” levels of subtotals.
A ROLLUP grouping can be thought of as a series of grouping sets. GROUP BY ROLLUP (a,b,c) will generate GROUP BY GROUPING SETS ((a, b, c), (a, b), (a)) and a row for an aggregation of all selected rows. It is simply equivalent to four SELECT queries with various GROUP BY clauses. The additional rows are sometimes called super aggregates.
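Against the same hypothetical olympics_medals table, a ROLLUP sketch looks like this; it returns the per-(games_year, country) counts, a subtotal per games_year, and a grand total:

-- hypothetical table and column names, for illustration only
SELECT games_year,
       country,
       count(*) AS medal_count
FROM olympics_medals
GROUP BY ROLLUP (games_year, country);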
- How “CUBE” works
In addition to the subtotals generated by the GROUP BY ROLLUP extension, the CUBE extension will generate subtotals for all combinations of the dimensions specified. If “n” is the number of columns listed in the CUBE, there will be “2^n” subtotal combinations.
It works similarly to ROLLUP but it generates all possible combinations of columns: (a,b,c), (a,b), (a,c), (b,c), (a), (b), (c) and a row for all rows selected.
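With the same hypothetical table, a CUBE sketch also produces the per-country subtotals that ROLLUP leaves out:

-- hypothetical table and column names, for illustration only
SELECT games_year,
       country,
       count(*) AS medal_count
FROM olympics_medals
GROUP BY CUBE (games_year, country);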
- How “GROUP BY GROUPING SETS” works
GROUPING SETS works similarly, but it allows you to specify exactly which column combinations should appear in the result. The grouping sets have to be separated with commas, and if a grouping set contains more than one column, it should be put in parentheses. A pair of empty parentheses denotes the grand-total row, with aggregates computed over the whole set.
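For example, a sketch that keeps only the detail rows, the per-country subtotals, and the grand total, again against the hypothetical olympics_medals table:

-- hypothetical table and column names, for illustration only
SELECT games_year,
       country,
       count(*) AS medal_count
FROM olympics_medals
GROUP BY GROUPING SETS ((games_year, country), (country), ());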
- How the “WINDOW” clause works
The WINDOW clause is used to define a named window that can be referenced in the OVER() expression of a window function. Keep in mind that window functions are evaluated late in the query, after the WHERE, GROUP BY, and HAVING clauses have been applied. In the sketch below there is no WHERE clause, so the WINDOW clause sits right after GROUP BY.
There is SQL before window functions and SQL after window functions; that's how powerful they are. The whole idea behind window functions is to let you process several rows of the result set at a time: you look through the window at a set of rows and compute a single output value from them, much like when using an aggregate function.
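A minimal sketch against the hypothetical olympics_medals table: the WINDOW clause names a window w, and rank() looks through it to rank countries by medal count within each games_year:

-- hypothetical table and column names, for illustration only
SELECT games_year,
       country,
       count(*) AS medal_count,
       rank() OVER (w) AS country_rank
FROM olympics_medals
GROUP BY games_year, country
WINDOW w AS (PARTITION BY games_year ORDER BY count(*) DESC);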
Let me make this post more interesting by calculating the probability that two or more people with a shared interest in a given community will say Yes or No to an outdoor data science meetup, based on weather conditions.
Step_1: Unpivot the data (transform a one-dimensional array into rows). If the data is not denormalized, create a view with the identification and classification columns that unpivots all the values (in the sketch below, the column names become the values of the attr field). If the data is already in denormalized form, you do not need to unpivot it.
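A sketch of this step, assuming a hypothetical source table outdoor_meetup_signups(id int, signup text, outlook text, temperature text, humidity text, wind text) in wide form, where signup holds the Yes/No answer:

-- hypothetical table and column names, for illustration only
CREATE VIEW outdoor_meetup_signup_unpivot AS
SELECT id,
       signup,
       unnest(array['outlook','temperature','humidity','wind']) AS attr,
       unnest(array[outlook, temperature, humidity, wind]) AS value
FROM outdoor_meetup_signups;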
Step_2: Create an "outdoor_meetup_signup_nb_count" table, which shifts the view of the data to the values of the attr field. The SQL sketch below uses pivot_sum, an advanced aggregate function that aggregates and resolves duplicate entries.
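A sketch of this step, following the pivot_sum(label_array, label, value) pattern from the Greenplum documentation (the exact signature can vary by GPDB version), and reusing the hypothetical names from the previous step:

-- hypothetical table and column names, for illustration only
CREATE TABLE outdoor_meetup_signup_nb_count AS
SELECT attr,
       value,
       pivot_sum(array['Yes','No'], signup, 1) AS class_count
FROM outdoor_meetup_signup_unpivot
GROUP BY attr, value
DISTRIBUTED BY (attr);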
Step_3: Create a summary view (outdoor_meetup_signup_nb_classify_functions) from the "outdoor_meetup_signup_nb_count" table.
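A sketch of the summary view, following the pattern of the naive Bayes example in the Greenplum documentation: attr_count sums the class counts within each attribute, and class_total sums them over the whole training set (this assumes Greenplum's element-wise sum() aggregate over integer arrays):

-- hypothetical table and column names, for illustration only
CREATE VIEW outdoor_meetup_signup_nb_classify_functions AS
SELECT attr,
       value,
       class_count,
       array['Yes','No'] AS classes,
       (sum(class_count) OVER (wa))::integer[] AS attr_count,
       array[sum(class_count[1]) OVER (w),
             sum(class_count[2]) OVER (w)]::integer[] AS class_total
FROM outdoor_meetup_signup_nb_count
WINDOW w AS (),
       wa AS (PARTITION BY attr);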
Step_4: Aggregate the data with nb_classify, nb_probabilities, or both. These are advanced aggregate functions: nb_classify predicts the classification of new rows and returns the class with the largest likelihood of appearing in the new rows, whereas nb_probabilities predicts the classification of new rows and returns the probability that each class will appear in the new rows. Decide what to classify and load those rows into a table; let us call it "predicting_the_likelihood_of_future_events".
The SQL below aggregates the data against that table. The result gives the Yes/No classification and the probability of signing up for an outdoor data science meetup for each set of conditions in the table t.
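A sketch of the final query, following the argument order used in the Greenplum documentation's naive Bayes example (classes, attribute counts, class totals, per-class counts); it assumes the hypothetical table predicting_the_likelihood_of_future_events, aliased t, holds the new conditions as (id, attr, value) rows:

-- hypothetical table and column names, for illustration only
SELECT t.id,
       nb_classify(classes, attr_count, class_total, class_count) AS signup,
       nb_probabilities(classes, attr_count, class_total, class_count) AS signup_probabilities
FROM outdoor_meetup_signup_nb_classify_functions f,
     predicting_the_likelihood_of_future_events t
WHERE f.attr = t.attr
  AND f.value = t.value
GROUP BY t.id;

nb_classify returns 'Yes' or 'No' for each id, and nb_probabilities returns an array with one probability per entry in classes.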
Conclusion:
The bottom line is—learning how to handle data inside the database creates a lot of opportunities for predictive analytics as well as machine learning at scale.