Sample or Random Sample?

Sample or Random Sample?

In Alteryx there are some tools that can be used for very different tasks and whose name can sometimes be confusing. The SAMPLE tool is definitely one of them. But there is also the RANDOM SAMPLE tool. Both have the task of extracting part of a database. Maybe the difference is that in one of the tools, randomness plays a bigger role in selecting the records in the "sample"? Anyway, this article is about what these tools can - and cannot - be used for.

My first experience with the SAMPLE tool was a bit irritating - the task was to find the "top 5" products which generated the highest revenue in the last year. So, first add up the sales with the SUMMARIZE tool, then sort in descending order of the total sales using the SORT tool, that's obvious.?

Es wurde kein Alt-Text für dieses Bild angegeben.

And then? What is needed now is a tool that selects the first records from a dataset. There is no such thing as a "First n Rows" tool. The solution was the SAMPLE tool; simply select "First N rows", enter the desired number - done!

Es wurde kein Alt-Text für dieses Bild angegeben.

The SAMPLE tool thus offers the option of selecting the first "n" rows from a dataset. The term "sample" is deliberately interpreted very comprehensively here; it is not about a "representative partial dataset", but about various forms of a part of the entire data.?The SAMPLE tool thus covers several areas of application for which separate tools would otherwise be required.

For example, the question of which are the worst products in the sales ranking can be answered with the SAMPLE tool; instead of "First N rows", it is now the "Last N rows" that are selected.?

Es wurde kein Alt-Text für dieses Bild angegeben.

Okay, this could also have been achieved with reverse sorting, but sometimes it is also about simply pulling the first or last records from a file without sorting first - and that is exactly what the SAMPLE tool can do.?

By the way, you can not only choose the "Top 5" (or 10, 20, ...), but also define a percentage, for example the "Top 5%"; here we would determine the 5% of products with which the highest turnover is achieved. The counterpart (i.e. the "Last N%") is missing here - so you would have to sort there after all.

Es wurde kein Alt-Text für dieses Bild angegeben.

It is particularly convenient that these selections can also be made for groupings - so it is not necessarily the total data set for which the "Top 5" are determined, but perhaps the "Top 5 per sales area". For this purpose, there is the "Group by column", in which it can be specified to which group the first/last selection should refer.

In our example we have now added the sales area and selected it as "Group by column" - now the 5 products with the highest turnover per sales area are selected. Of course, it is also possible to group by several columns!

Es wurde kein Alt-Text für dieses Bild angegeben.

By the way, we still only have to sort by turnover - the SAMPLE tool selects the first five per sales area without having to sort by this.?

But you can not only select the first rows, but also exclude them - there is the option "Skip 1st N rows". When do you need something like that? For example, in a situation like this:

Es wurde kein Alt-Text für dieses Bild angegeben.

Rows 1-4 are probably not needed here (report title etc.). In this case, you could skip them with the SAMPLE tool using "Skip 1st N rows".

But of course, this can also be done in the Input Data Tool by simply starting the import in row 5. But after that it will continue - we need row 5 as a column header, rows 6-8 are not needed, then the data that is needed again will follow.

Es wurde kein Alt-Text für dieses Bild angegeben.

We can solve this with a setting in the Input Data Tool and two SAMPLE tools - the headings (rows 1-4) are excluded by starting the import at row 5, row 5 (which is now row 1) is selected by one SAMPLE tool, the first 4 rows (in the original 1-8) are excluded by the second SAMPLE tool.

Es wurde kein Alt-Text für dieses Bild angegeben.

The results of the two tools can then be merged again. In situations like this, the SAMPLE tool is used frequently!?

So far, this has hardly anything to do with "representative part of a dataset". But there are two more options!

In order to clarify the difference, I have added record numbers with a RECORDID tool.

The first option ("1 of every N rows") simply returns every nth record (in this case, every hundredth).?This determines which records are included, and the results are reproducible, i.e. identical for each run.

ith the second option ("1 in N chance to include each row"), there is a 1:N chance for each row to be included, regardless of the selection of other rows. This means that the result changes with each run, and the number of result rows is also unpredictable and variable. In any case, chance plays a greater role here!

Es wurde kein Alt-Text für dieses Bild angegeben.

By the way, even with this form of selection, it is possible to group by columns, i.e. to make a selection per product and/or sales area!

Compared to the SAMPLE tool, the applications for the RANDOM SAMPLE tool is considerably limited. Here it is really only a matter of selecting a part of a dataset. We specify either the number of rows or the proportion of rows to be included in the selection. The number of data records in the result always remains the same, but the selection changes (see example).

Es wurde kein Alt-Text für dieses Bild angegeben.

To always get the same selection, the option "deterministic seed" can be activated; the number entered determines the selection, which then always remains constant.

Es wurde kein Alt-Text für dieses Bild angegeben.

For a real random selection, both the RANDOM SAMPLE and the SAMPLE Tool can be used; if a certain number of records or a reproducible result is required, the RANDOM SAMPLE tool has its advantages, but the random character is actually greater with the SAMPLE tool, which also allows grouping according to additional fields.

However, the SAMPLE tool can generally do much more than the name suggests and is very helpful in many workflows - not only when it comes to selecting a partial dataset.

要查看或添加评论,请登录

Roland Schubert的更多文章

  • Time For Summary

    Time For Summary

    When I receive a new table or file, I always try to first get an overview. Usually by bringing an INPUT DATA tool onto…

  • A Little Bit More: Oversampling

    A Little Bit More: Oversampling

    When selecting data records, it is often simply a matter of selecting only the first (or last) data from a table…

    1 条评论
  • Building Groups Based on Relations

    Building Groups Based on Relations

    Grouping is not necessarily a very unusual task - customer groups always come to my mind spontaneously. Common…

  • Different Types of Correlation

    Different Types of Correlation

    I have to admit it - I intuitively tend to look for relationships between different data. And indeed, I often recognize…

  • Grouping Data

    Grouping Data

    Grouping data in some way is an essential part of day-to-day business for data analysts. Many people immediately think…

  • Comparing To Lists

    Comparing To Lists

    Long years ago, a DIY chain in Germany launched a discount campaign entitled "20% discount on all items - except pet…

  • Famous (or Not-So-Famous) Last Words

    Famous (or Not-So-Famous) Last Words

    Sometimes you just have to have the last word - the last word from a text field, of course. When it comes to "breaking…

    1 条评论
  • Break on Error

    Break on Error

    An error has occurred in a workflow and it continues to run anyway? Sometimes that's all right, but only sometimes…

  • Year-To-Date Calculations

    Year-To-Date Calculations

    If you are working in Finance/FP&A/Controlling, calculating "Year to Date" (YTD) values is an essential part of your…

  • Compare Date and Time

    Compare Date and Time

    We often need data only for a specific period of time - a year, a month or a week, sometimes just a few hours, but the…

社区洞察

其他会员也浏览了