Sampling Sorcery: The Magic of Data Selection

Sampling data is a crucial step in research, data analysis and machine learning. It is the process of drawing a predefined number of data points from a population. Imagine having more than a million data points and needing to draw inferences from them or use them to train a machine learning model; it wouldn’t be easy to work with all the available data. This is where sampling comes into the picture.

It’s computationally expensive and often unproductive to use the whole dataset, so we draw a sample from it instead. This makes the task far more efficient, especially when training models. The catch is that the sample must represent the whole population: since a sample leaves out some data points by definition, we rely on specific sampling methods to keep it representative.

Think of it as tasting a small portion of a new dish before committing to the whole meal. Sampling makes data more manageable for analysis or for training machine learning models. Let’s discuss the process of sampling in detail!

Sampling involves various steps:

  1. Define Your Objective: It is crucial to define why you are sampling, so that your sample aligns with your goals.
  2. Identify the Target Population: Determine the complete set of data points you want to draw conclusions about.
  3. Select the Sampling Frame: The sampling frame is the list of items from the target population that we are actually able to sample from. Not everything in it is relevant to our needs, so irrelevant items are filtered out.
  4. Choose the Sampling Method: The method is chosen based on the kind of data to be sampled. We will discuss the methods in detail below.
  5. Determine the Sample Size: Decide how many records of the data to sample. A larger sample generally represents the population better.
  6. Collect the Sample Data: Once the data has been sampled, extract it and store it in a format that suits your requirements.

The purpose of sampling is to get the “just right” amount of data. There are two types of sampling methods:

1. Probabilistic Sampling:

In probabilistic sampling, every data point has a known, calculable (often equal) probability of being selected into the sample. This type of sampling is typically used to draw inferences about a large population.

2. Non-probabilistic Sampling:

Here, data points are not selected according to probabilities but on the basis of the researcher’s judgement, expertise or certain criteria. It can be useful for getting an initial understanding of a population: the researcher deliberately chooses the data points they believe are most relevant to the research objectives. However, this can introduce bias into the sample, because not all data points have an equal chance of being selected.

Types of Probabilistic Sampling:

1. Simple Random Sampling:

As the name suggests, this is the simplest form of sampling: data points are selected completely at random, so each one has an equal chance of being chosen. The randomness reduces the chance of a biased sample.

For example, say you have a dataset containing customer information for an e-commerce website and want to run some analysis on a sample of it. Simple random sampling is a good fit here.
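A minimal sketch with pandas, assuming a hypothetical customer table (the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical customer data for an e-commerce site; in practice you would
# load your own file, e.g. pd.read_csv("customers.csv").
customers = pd.DataFrame({
    "customer_id": range(1, 1001),
    "total_spend": [(i * 37) % 500 for i in range(1000)],
})

# Simple random sample of 100 rows: every row has an equal chance of being chosen.
sample = customers.sample(n=100, random_state=42)
print(sample.shape)  # (100, 2)
```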

2. Stratified Sampling:

Strata are subgroups of data that share similar characteristics. In this method, we divide the data points into strata and then randomly select an equal fraction of data points from each stratum. This is the most commonly used way of sampling data for classification problems, and it is especially useful with an imbalanced dataset, i.e. one class has many more observations than another, where simple random sampling may produce a skewed training dataset. Selection within each stratum is still random, so every data point keeps its calculable chance of being chosen.

For example, say we have a song dataset that includes audio features and the emotion associated with each song, and we are building a machine-learning model to predict that emotion from the input features. Here, we can use stratified sampling to build the training dataset so that it draws the same fraction of songs for each emotion. This ensures an adequate number of records for every emotion, even if the data is imbalanced.
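A minimal sketch, assuming a hypothetical song table with an emotion label (DataFrameGroupBy.sample needs pandas 1.1 or newer). When splitting a training set, scikit-learn’s train_test_split(..., stratify=labels) achieves the same effect.

```python
import pandas as pd

# Hypothetical (imbalanced) song dataset: "emotion" is the label we stratify on.
songs = pd.DataFrame({
    "tempo":   [90, 120, 140, 100, 80, 160, 130, 110, 95, 150] * 10,
    "emotion": (["happy"] * 6 + ["sad"] * 3 + ["angry"] * 1) * 10,
})

# Draw the same fraction (20%) from every emotion stratum, so the sample
# keeps the class proportions of the full dataset.
stratified = songs.groupby("emotion").sample(frac=0.2, random_state=42)
print(stratified["emotion"].value_counts())
```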

3. Cluster Random Sampling:

A cluster is also a group, but the grouping is more naturally occurring than a stratum. Clusters are formed on the basis of similarities or differences between data points, using certain criteria, clustering algorithms or data characteristics. Each cluster should be representative of the population, and the clusters should be mutually exclusive. In this sampling method, we divide the population into clusters and then randomly select one (or a few) of those clusters as the sample.

Careful planning is essential here, because the method can introduce bias if the clusters are not representative of the population: data points within one cluster may be more similar to each other than to those in other clusters.

For example, suppose you’re conducting a healthcare study in a large region served by many health clinics. We can use cluster random sampling here: divide the records by geography or by clinic, randomly select a few of those clusters, and study only them to infer insights about healthcare practices across the whole region.
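A minimal sketch, assuming a hypothetical table of patient records tagged with the clinic (cluster) they came from:

```python
import numpy as np
import pandas as pd

# Hypothetical patient records; "clinic" is the naturally occurring cluster.
records = pd.DataFrame({
    "patient_id":    range(1, 121),
    "clinic":        ["north", "south", "east", "west"] * 30,
    "wait_time_min": [15, 40, 25, 60] * 30,
})

# Randomly pick one whole cluster and keep every record inside it.
rng = np.random.default_rng(7)
chosen_clinic = rng.choice(records["clinic"].unique(), size=1, replace=False)
cluster_sample = records[records["clinic"].isin(chosen_clinic)]
print(chosen_clinic, cluster_sample.shape)
```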

4. Systematic Random Sampling:

In this method, we put the data into an ordered sequence using some criterion, select a random starting point, and then pick data points from the sequence at regular intervals. The sampling is systematic yet random: the sample can still represent the population, and the random starting point reduces the chance of bias.

For example, let’s say you are conducting a customer satisfaction survey for a mall that has 200 stores. Since it’s not practical to survey every customer, you survey the customers of every 10th store, starting from a randomly chosen store. This gives each store an equal chance of being included in the sample and is a more structured, and often more efficient, alternative to simple random sampling.
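A minimal sketch, assuming a hypothetical ordered list of the mall’s 200 stores:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered list of the mall's 200 stores.
stores = pd.DataFrame({"store_id": range(1, 201)})

k = 10                                     # sampling interval: every 10th store
rng = np.random.default_rng(0)
start = int(rng.integers(0, k))            # random starting point inside the first interval
systematic_sample = stores.iloc[start::k]  # then take every k-th store from there
print(systematic_sample["store_id"].tolist())
```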

Types of Non-Probabilistic Sampling:

1. Convenience Sampling:

In convenience sampling, the researcher selects whatever samples are easiest to collect, i.e. the data points that are most accessible. While it can be cost- and time-effective, it can introduce significant bias because the sample may not represent the whole population; it is less reliable than simple random sampling and often under-covers parts of the population.

For example, consider a street interview about a new product. The people walking down the street we happen to be on are easily accessible, so it is convenient to ask them about the product without wandering any further.
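In code terms, a convenience sample is often just whatever rows are immediately at hand, for example the first rows of a table. A rough, hypothetical illustration:

```python
import pandas as pd

# Hypothetical street-interview responses, recorded in the order people walked by.
responses = pd.DataFrame({
    "respondent_id": range(1, 501),
    "liked_product": [True, False] * 250,
})

# Convenience sample: take the first 50 respondents simply because they were
# the easiest to reach. Nothing guarantees they represent the wider population.
convenience_sample = responses.head(50)
print(convenience_sample.shape)  # (50, 2)
```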

2. Voluntary Response Sampling:

Voluntary response sampling relies on people from the population who willingly choose to participate in a study. It suffers from non-response bias: the people who decline to take part may hold strong opinions that never make it into the sample.

For example, think of a poll run on platforms like Twitter or Instagram, where followers voluntarily vote or respond. People who choose not to join the poll may still hold strong opinions, and their absence is a loss for the sample.

3. Snowball Sampling:

In this sampling, initial participants are recruited and then asked to recruit others. Like a snowball, the sample keeps growing as participants promote the study. However, this can introduce bias, because people recruited by existing participants tend to share similar characteristics and may give similar responses. The method is usually used for hard-to-reach or hidden populations and relies on the social networks and connections of the initial participants.

For example, suppose you want to study the experiences of people who have recently immigrated to your country. They are hard to reach, so snowball sampling suits this well: you ask a few of them to participate in the survey and to refer you to their connections. The process continues, with each new participant referring you to others in their network who have relevant experiences, and over time your sample grows as you follow these referrals.

4. Purposive Sampling:

In this sampling, researchers choose individuals or elements from the population on the basis of predefined conditions or criteria; the selection is deliberately purposive. It is preferred in qualitative research, when data with specific characteristics relevant to the research goal is required.

For example, consider medical research: when studying a particular medical condition, the researcher may purposively select individuals who have been diagnosed with that condition and are undergoing treatment.
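A minimal sketch, assuming a hypothetical patient registry; the column names and selection criteria are illustrative only:

```python
import pandas as pd

# Hypothetical patient registry for a study on one specific condition.
patients = pd.DataFrame({
    "patient_id":   range(1, 9),
    "diagnosis":    ["diabetes", "asthma", "diabetes", "none",
                     "diabetes", "asthma", "none", "diabetes"],
    "in_treatment": [True, True, False, False, True, False, False, True],
})

# Purposive sample: deliberately keep only records meeting the study criteria.
purposive_sample = patients[(patients["diagnosis"] == "diabetes") & patients["in_treatment"]]
print(purposive_sample)
```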

Sampling Accuracy:

The sample we produce must be representative of the population. A good way to gauge how well the sample fits the population is to calculate its standard error:

SE = s / √n

where s is the standard deviation of the sample and n is the sample size.

The standard error tells us how representative the sample is of the population: it measures the variability of a sample statistic (such as the mean) across the different samples that could be drawn from the population.

No sample can represent the population perfectly, because it never includes every data point; it is always an approximation. Every sampling method therefore carries some standard error, and some carry more than others, depending on the data as well.

The smaller the standard error, the greater the accuracy of the sample in reflecting the population.
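A minimal sketch of the formula above on a synthetic population (made up for illustration), showing the standard error shrinking as the sample size grows:

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.normal(loc=50, scale=10, size=100_000)  # synthetic population

for n in (30, 300, 3000):
    sample = rng.choice(population, size=n, replace=False)
    se = sample.std(ddof=1) / np.sqrt(n)  # SE = s / sqrt(n)
    print(f"n={n:>4}  standard error = {se:.3f}")
```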

The sampling distribution of a statistic is approximately normal. It is built by repeatedly drawing different samples from the population and plotting the statistic of interest, such as the sample mean. The peak of this distribution sits at the true population parameter, and the closer a sample’s statistic is to that parameter, the more representative the sample is of the population.

The standard error is the standard deviation of this sampling distribution. The larger the standard deviation, the wider the curve, and the lower the probability that a given sample’s statistic lands near the population parameter (the centre).

With a low standard error, the distribution is narrower, so a sample’s statistic is more likely to fall near the population parameter, and the resulting estimates are more accurate.
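A minimal simulation of this idea: draw many samples, compute each sample’s mean, and check that the spread of those means (the sampling distribution) matches σ/√n. The population below is synthetic and deliberately non-normal:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)  # skewed, non-normal population

n, n_repeats = 100, 2_000
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(n_repeats)
])

# The sampling distribution of the mean is roughly normal, centred on the
# population mean, and its standard deviation matches sigma / sqrt(n).
print("mean of sample means:", sample_means.mean())
print("population mean:     ", population.mean())
print("std of sample means: ", sample_means.std(ddof=1))
print("sigma / sqrt(n):     ", population.std(ddof=0) / np.sqrt(n))
```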

Let’s connect on LinkedIn

You can also find me on Instagram

Thanks for reading!

Have a great day!

