Sample or Random Sample?
Roland Schubert
Entscheidungs- und ergebnisorienierte Datenanalyse für Finance, Marketing & HR - Von der Strategie bis zur Umsetzung | Planungs- und Analyseprozesse effizient gestalten, optimieren und automatisieren | Alteryx ACE
In Alteryx there are some tools that can be used for very different tasks and whose name can sometimes be confusing. The SAMPLE tool is definitely one of them. But there is also the RANDOM SAMPLE tool. Both have the task of extracting part of a database. Maybe the difference is that in one of the tools, randomness plays a bigger role in selecting the records in the "sample"? Anyway, this article is about what these tools can - and cannot - be used for.
My first experience with the SAMPLE tool was a bit irritating - the task was to find the "top 5" products which generated the highest revenue in the last year. So, first add up the sales with the SUMMARIZE tool, then sort in descending order of the total sales using the SORT tool, that's obvious.?
And then? What is needed now is a tool that selects the first records from a dataset. There is no such thing as a "First n Rows" tool. The solution was the SAMPLE tool; simply select "First N rows", enter the desired number - done!
The SAMPLE tool thus offers the option of selecting the first "n" rows from a dataset. The term "sample" is deliberately interpreted very comprehensively here; it is not about a "representative partial dataset", but about various forms of a part of the entire data.?The SAMPLE tool thus covers several areas of application for which separate tools would otherwise be required.
For example, the question of which are the worst products in the sales ranking can be answered with the SAMPLE tool; instead of "First N rows", it is now the "Last N rows" that are selected.?
Okay, this could also have been achieved with reverse sorting, but sometimes it is also about simply pulling the first or last records from a file without sorting first - and that is exactly what the SAMPLE tool can do.?
By the way, you can not only choose the "Top 5" (or 10, 20, ...), but also define a percentage, for example the "Top 5%"; here we would determine the 5% of products with which the highest turnover is achieved. The counterpart (i.e. the "Last N%") is missing here - so you would have to sort there after all.
It is particularly convenient that these selections can also be made for groupings - so it is not necessarily the total data set for which the "Top 5" are determined, but perhaps the "Top 5 per sales area". For this purpose, there is the "Group by column", in which it can be specified to which group the first/last selection should refer.
In our example we have now added the sales area and selected it as "Group by column" - now the 5 products with the highest turnover per sales area are selected. Of course, it is also possible to group by several columns!
By the way, we still only have to sort by turnover - the SAMPLE tool selects the first five per sales area without having to sort by this.?
But you can not only select the first rows, but also exclude them - there is the option "Skip 1st N rows". When do you need something like that? For example, in a situation like this:
Rows 1-4 are probably not needed here (report title etc.). In this case, you could skip them with the SAMPLE tool using "Skip 1st N rows".
领英推荐
But of course, this can also be done in the Input Data Tool by simply starting the import in row 5. But after that it will continue - we need row 5 as a column header, rows 6-8 are not needed, then the data that is needed again will follow.
We can solve this with a setting in the Input Data Tool and two SAMPLE tools - the headings (rows 1-4) are excluded by starting the import at row 5, row 5 (which is now row 1) is selected by one SAMPLE tool, the first 4 rows (in the original 1-8) are excluded by the second SAMPLE tool.
The results of the two tools can then be merged again. In situations like this, the SAMPLE tool is used frequently!?
So far, this has hardly anything to do with "representative part of a dataset". But there are two more options!
In order to clarify the difference, I have added record numbers with a RECORDID tool.
The first option ("1 of every N rows") simply returns every nth record (in this case, every hundredth).?This determines which records are included, and the results are reproducible, i.e. identical for each run.
ith the second option ("1 in N chance to include each row"), there is a 1:N chance for each row to be included, regardless of the selection of other rows. This means that the result changes with each run, and the number of result rows is also unpredictable and variable. In any case, chance plays a greater role here!
By the way, even with this form of selection, it is possible to group by columns, i.e. to make a selection per product and/or sales area!
Compared to the SAMPLE tool, the applications for the RANDOM SAMPLE tool is considerably limited. Here it is really only a matter of selecting a part of a dataset. We specify either the number of rows or the proportion of rows to be included in the selection. The number of data records in the result always remains the same, but the selection changes (see example).
To always get the same selection, the option "deterministic seed" can be activated; the number entered determines the selection, which then always remains constant.
For a real random selection, both the RANDOM SAMPLE and the SAMPLE Tool can be used; if a certain number of records or a reproducible result is required, the RANDOM SAMPLE tool has its advantages, but the random character is actually greater with the SAMPLE tool, which also allows grouping according to additional fields.
However, the SAMPLE tool can generally do much more than the name suggests and is very helpful in many workflows - not only when it comes to selecting a partial dataset.