登录查看更多内容

Sample or Random Sample?

Roland Schubert

Entscheidungs- und ergebnisorienierte Datenanalyse für Finance, Marketing & HR - Von der Strategie bis zur Umsetzung | Planungs- und Analyseprozesse effizient gestalten, optimieren und automatisieren | Alteryx ACE

发布日期: 2022年12月14日

In Alteryx there are some tools that can be used for very different tasks and whose name can sometimes be confusing. The SAMPLE tool is definitely one of them. But there is also the RANDOM SAMPLE tool. Both have the task of extracting part of a database. Maybe the difference is that in one of the tools, randomness plays a bigger role in selecting the records in the "sample"? Anyway, this article is about what these tools can - and cannot - be used for.

My first experience with the SAMPLE tool was a bit irritating - the task was to find the "top 5" products which generated the highest revenue in the last year. So, first add up the sales with the SUMMARIZE tool, then sort in descending order of the total sales using the SORT tool, that's obvious.?

Es wurde kein Alt-Text für dieses Bild angegeben.

And then? What is needed now is a tool that selects the first records from a dataset. There is no such thing as a "First n Rows" tool. The solution was the SAMPLE tool; simply select "First N rows", enter the desired number - done!

The SAMPLE tool thus offers the option of selecting the first "n" rows from a dataset. The term "sample" is deliberately interpreted very comprehensively here; it is not about a "representative partial dataset", but about various forms of a part of the entire data.?The SAMPLE tool thus covers several areas of application for which separate tools would otherwise be required.

For example, the question of which are the worst products in the sales ranking can be answered with the SAMPLE tool; instead of "First N rows", it is now the "Last N rows" that are selected.?

Okay, this could also have been achieved with reverse sorting, but sometimes it is also about simply pulling the first or last records from a file without sorting first - and that is exactly what the SAMPLE tool can do.?

By the way, you can not only choose the "Top 5" (or 10, 20, ...), but also define a percentage, for example the "Top 5%"; here we would determine the 5% of products with which the highest turnover is achieved. The counterpart (i.e. the "Last N%") is missing here - so you would have to sort there after all.

It is particularly convenient that these selections can also be made for groupings - so it is not necessarily the total data set for which the "Top 5" are determined, but perhaps the "Top 5 per sales area". For this purpose, there is the "Group by column", in which it can be specified to which group the first/last selection should refer.

In our example we have now added the sales area and selected it as "Group by column" - now the 5 products with the highest turnover per sales area are selected. Of course, it is also possible to group by several columns!

By the way, we still only have to sort by turnover - the SAMPLE tool selects the first five per sales area without having to sort by this.?

But you can not only select the first rows, but also exclude them - there is the option "Skip 1st N rows". When do you need something like that? For example, in a situation like this:

Rows 1-4 are probably not needed here (report title etc.). In this case, you could skip them with the SAMPLE tool using "Skip 1st N rows".

领英推荐

How to Count specific characters in a range in Excel

工程关注我们，每天学习?? 11 个月前

Count specific characters in text string in Excel

工程关注我们，每天学习?? 9 个月前

How to use the ISNONTEXT Function in Google Sheets

工程关注我们，每天学习?? 11 个月前

But of course, this can also be done in the Input Data Tool by simply starting the import in row 5. But after that it will continue - we need row 5 as a column header, rows 6-8 are not needed, then the data that is needed again will follow.

We can solve this with a setting in the Input Data Tool and two SAMPLE tools - the headings (rows 1-4) are excluded by starting the import at row 5, row 5 (which is now row 1) is selected by one SAMPLE tool, the first 4 rows (in the original 1-8) are excluded by the second SAMPLE tool.

The results of the two tools can then be merged again. In situations like this, the SAMPLE tool is used frequently!?

So far, this has hardly anything to do with "representative part of a dataset". But there are two more options!

In order to clarify the difference, I have added record numbers with a RECORDID tool.

The first option ("1 of every N rows") simply returns every nth record (in this case, every hundredth).?This determines which records are included, and the results are reproducible, i.e. identical for each run.

ith the second option ("1 in N chance to include each row"), there is a 1:N chance for each row to be included, regardless of the selection of other rows. This means that the result changes with each run, and the number of result rows is also unpredictable and variable. In any case, chance plays a greater role here!

By the way, even with this form of selection, it is possible to group by columns, i.e. to make a selection per product and/or sales area!

Compared to the SAMPLE tool, the applications for the RANDOM SAMPLE tool is considerably limited. Here it is really only a matter of selecting a part of a dataset. We specify either the number of rows or the proportion of rows to be included in the selection. The number of data records in the result always remains the same, but the selection changes (see example).

To always get the same selection, the option "deterministic seed" can be activated; the number entered determines the selection, which then always remains constant.

For a real random selection, both the RANDOM SAMPLE and the SAMPLE Tool can be used; if a certain number of records or a reproducible result is required, the RANDOM SAMPLE tool has its advantages, but the random character is actually greater with the SAMPLE tool, which also allows grouping according to additional fields.

However, the SAMPLE tool can generally do much more than the name suggests and is very helpful in many workflows - not only when it comes to selecting a partial dataset.

要查看或添加评论，请登录

Roland Schubert的更多文章

Time For Summary

2022年12月24日

Time For Summary

When I receive a new table or file, I always try to first get an overview. Usually by bringing an INPUT DATA tool onto…
A Little Bit More: Oversampling

2022年12月23日

A Little Bit More: Oversampling

When selecting data records, it is often simply a matter of selecting only the first (or last) data from a table…

1 条评论
Building Groups Based on Relations

2022年12月22日

Building Groups Based on Relations

Grouping is not necessarily a very unusual task - customer groups always come to my mind spontaneously. Common…
Different Types of Correlation

2022年12月21日

Different Types of Correlation

I have to admit it - I intuitively tend to look for relationships between different data. And indeed, I often recognize…
Grouping Data

2022年12月20日

Grouping Data

Grouping data in some way is an essential part of day-to-day business for data analysts. Many people immediately think…
Comparing To Lists

2022年12月19日

Comparing To Lists

Long years ago, a DIY chain in Germany launched a discount campaign entitled "20% discount on all items - except pet…
Famous (or Not-So-Famous) Last Words

2022年12月18日

Famous (or Not-So-Famous) Last Words

Sometimes you just have to have the last word - the last word from a text field, of course. When it comes to "breaking…

1 条评论
Break on Error

2022年12月17日

Break on Error

An error has occurred in a workflow and it continues to run anyway? Sometimes that's all right, but only sometimes…
Year-To-Date Calculations

2022年12月16日

Year-To-Date Calculations

If you are working in Finance/FP&A/Controlling, calculating "Year to Date" (YTD) values is an essential part of your…
Compare Date and Time

2022年12月15日

Compare Date and Time

We often need data only for a specific period of time - a year, a month or a week, sometimes just a few hours, but the…

See all articles

Sample or Random Sample?

Roland Schubert

Entscheidungs- und ergebnisorienierte Datenanalyse für Finance, Marketing & HR - Von der Strategie bis zur Umsetzung | Planungs- und Analyseprozesse effizient gestalten, optimieren und automatisieren | Alteryx ACE

领英推荐

Roland Schubert的更多文章

社区洞察

其他会员也浏览了

Applying Conditional Formatting Based on Another Cell in Excel

How to Implement data validation to allow only text entries in Excel

How to Identify the Last Row in Text Data in Excel

How to use the AVERAGEIFS Function in Google Sheets

How to Use If cell contains in Excel

How to use the COUNT Function in Google Sheets

How to use the ISNUMBER Function in Google Sheets

How to use the POISSON.DIST Function in Google Sheets

How to use the MAX Function in Google Sheets

Data Validation for Specific Multiples in Excel

领英推荐

Roland Schubert的更多文章

Time For Summary

A Little Bit More: Oversampling

Building Groups Based on Relations

Different Types of Correlation

Grouping Data

Comparing To Lists

Famous (or Not-So-Famous) Last Words

Break on Error

Year-To-Date Calculations

Compare Date and Time

社区洞察

其他会员也浏览了

Applying Conditional Formatting Based on Another Cell in Excel

How to Implement data validation to allow only text entries in Excel

How to Identify the Last Row in Text Data in Excel

How to use the AVERAGEIFS Function in Google Sheets

How to Use If cell contains in Excel

How to use the COUNT Function in Google Sheets

How to use the ISNUMBER Function in Google Sheets

How to use the POISSON.DIST Function in Google Sheets

How to use the MAX Function in Google Sheets

Data Validation for Specific Multiples in Excel