Grouping Data
Roland Schubert
Entscheidungs- und ergebnisorienierte Datenanalyse für Finance, Marketing & HR - Von der Strategie bis zur Umsetzung | Planungs- und Analyseprozesse effizient gestalten, optimieren und automatisieren | Alteryx ACE
Grouping data in some way is an essential part of day-to-day business for data analysts. Many people immediately think of classification or clustering, probably these are the two terms that first come to a Data Scientist's mind.
But sometimes (actually most of the time) things are much simpler - you just want to group products according to their revenue, suppliers according to their on-time delivery or customers based on their payment history - there is simply a criterion (or several criteria) that decide who belongs to a group.
In a simple example, the following data could be available:
To form groups here, you can of course work with a FORMULA tool and use an appropriate IF construction:
You could also use a sequence of FILTER tools that filter out the respective group one by one:
In both cases, the threshold of the specific groups must be defined and implemented in a condition. These values are always valid and independent of the data supplied - in many cases this may be correct.?
Often, however, this logic is not really fitting. If the products are to be classified according to their cumulative revenues, the threshold values would have to be adjusted every month - not really nice.
But there certainly is the TILE tool - and this simplifies the process of grouping and makes it much more flexible.
Of course, you can also specify the respective limit values for individual groups within the TILE tool:
As a result, the TILE tool returns a group number (Tile_Num) for the different data records, in addition they are numbered, however, independent of the data values, but exclusively according to the order in the table.?
But be careful: The TILE tool always specifies the upper limit of a group - i.e. a "<=" always applies to the "manual" setting! There is no chance to change this operator!
It gets really interesting, however, if we take a look at the other options offered by the TILE tool. Besides the manual grouping, there are several other methods:
领英推荐
Equal Sum:
The data records are divided into a predefined number of groups so that the sum of the values in the different groups is identical. Sorting can be done in advance (always in ascending order), but it is not required.
The result can show, for example, in the style of an ABC analysis, the importance of the individual products for the total revenue; each of the groups has the same total revenue, but group 1 only contains two products, while group 5 includes seven.
Without sorting, by the way, the total is added up in the order delivered - here the result is not related to the size ratios in the table, but to the (random) order.
Equal Records:?
Using the "Equal Records" option, on the other hand, groups of equal size are created - regardless of the values. Without sorting, the records are arranged according to their order; with sorting, groups of equal size are formed, but each group contains records containing high and low values:
Smart Tile:
I especially like the option "Smart Tile" - the data sets are divided according to the standard deviation, so the grouping essentially depends on the structure of the supplied data.?
Data values that fall within +/- the standard deviation are assigned to the group "0", between +/- standard deviation and +/- 2 * standard deviation to the groups "-1" and "1", and the process continues accordingly. Here, neither the number nor the size of the individual groups is fixed in advance.
Of course, the type of grouping must be chosen according to the specific application, but most real-life situations should be covered.