Dealing with extremely small datasets
Welcome to the world of data!


I am passionate about using data science tools and techniques to analyse social and economic issues in Sub-Saharan Africa, and unfortunately finding data on some of these issues can be quite difficult. It is safe to say I have had to get to grips with using small datasets. In this project, I was interested in finding out just how much manufacturing value was being added to the technology industry in Nigeria, and whether increasing it could reduce the unemployment rate. This was the alternative hypothesis for this project.

DATA COLLECTION

I was able to download some data from the World Bank website here: https://data.worldbank.org/country/nigeria, and use this in the project. I selected the % of manufacturing value added to the tech industry, the adult literacy rate, population, unemployment rate and labour force from 1960–2019.


DATA CLEANING

Unfortunately, there were no records for most of the dataset until 1991, so I had to get rid of all the records from 1960 to 1990. I also dropped population from the dataset, as I felt the labour force variable was more closely related to the unemployment rate, our target variable, since the unemployment rate in this dataset is measured relative to the labour force in Nigeria. The literacy variable was mostly empty, with only 4 rows available, leaving me no choice but to drop it too. This meant I was left with only 29 rows of data (1991–2019)!
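For reference, a minimal pandas sketch of this cleaning step might look like the following (the file name and column names are assumptions, not the notebook's actual ones):

```python
import pandas as pd

# Hypothetical file and column names; the original notebook may differ.
df = pd.read_csv("nigeria_indicators.csv")

# Keep only the years that actually have records (1991 onwards).
df = df[df["year"] >= 1991].reset_index(drop=True)

# Drop the sparse literacy column and the redundant population column.
df = df.drop(columns=["population", "literacy_rate"])

print(df.shape)  # expect roughly (29, number_of_remaining_columns)
```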


FEATURE ENGINEERING

It seems like we’re good to go! But there’s another problem! This project is all about testing the alternative hypothesis that increasing the % of manufacturing value added to the tech industry in Nigeria would reduce the unemployment rate. Therefore, having that change as a variable would be a lot better for the analysis. Let’s do that!


To obtain that variable, I need to calculate the percentage change in manufacturing value from one year to the next. The first helper function returns a list of tuples, each containing the current row and the subsequent row. The second function then iterates through that list and calculates the percentage change from the values in each tuple, as sketched below.
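A minimal sketch of the two helpers, assuming a hypothetical column name:

```python
def pairwise(values):
    """Return a list of (current, next) tuples over a sequence."""
    return list(zip(values, values[1:]))

def pct_changes(pairs):
    """Percentage change from the current value to the next, per tuple."""
    return [(nxt - cur) / cur * 100 for cur, nxt in pairs]

# "manufacturing_value" is a hypothetical column name.
changes = pct_changes(pairwise(df["manufacturing_value"].tolist()))

# The first year has no predecessor, so drop it before attaching the new column.
df = df.iloc[1:].assign(manufacturing_pct_change=changes)
```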



Looks like things are starting to come together. Now, because this is an extremely small dataset, the features we have will not be enough to produce good performance from any machine learning we may do. There is a nice feature engineering trick we can use here, and that is interactions! Interactions are pairings we create between variables and are a good way to tell the model that those variables are related. You can learn more about this here: https://www.kaggle.com/matleonard/feature-generation. Now let’s put this all together!
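As a sketch, one common form of interaction simply multiplies pairs of numeric columns (the feature names here are assumptions):

```python
from itertools import combinations

# Multiply every pair of numeric feature columns to create interaction terms.
numeric_features = ["manufacturing_pct_change", "labour_force"]  # hypothetical names
for a, b in combinations(numeric_features, 2):
    df[f"{a}_x_{b}"] = df[a] * df[b]
```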


SPLITTING DATA INTO TRAINING AND VALIDATION DATA

Our dataset is ready! As you probably noticed, we are working with time-series data. When working with time series, we must split the dataset in such a way that future dates cannot leak into the training process. Time to split!
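A minimal chronological split, assuming an 80/20 ratio (the exact ratio used in the original notebook is not shown here):

```python
# Sort chronologically and split without shuffling, so the validation set
# contains only years that come after everything in the training set.
df = df.sort_values("year")
cutoff = int(len(df) * 0.8)
train = df.iloc[:cutoff].copy()
valid = df.iloc[cutoff:].copy()
```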


Our time column is categorical, so we have to encode it in some way, or use other techniques such as recurrent neural networks. There is also a recently released toolbox called sktime that can handle time-series data more effectively. For this project, however, I chose count encoding, which simply replaces each categorical value with the number of times that value appears. We will encode all the categorical values using this technique.
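A count-encoding sketch in plain pandas, treating the year column as the categorical variable as described above:

```python
# Count encoding: map each categorical value to its frequency in the
# training data; values unseen during training get a count of 0.
counts = train["year"].value_counts()
train["year_count"] = train["year"].map(counts)
valid["year_count"] = valid["year"].map(counts).fillna(0)
```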




MACHINE LEARNING AND SOME MORE FEATURE ENGINEERING

The key aim of this project is to see if increasing the manufacturing value added to the tech industry in Nigeria could improve employment, so building a model that can predict the unemployment rate when the manufacturing value is increased is important. When working with extremely small datasets, it is wise to keep things simple, so we are not going to use a complicated model. A simple Decision Tree Regressor will be enough. This model performs a test at each internal node and uses the outcome of that test to decide which branch to follow next.


For regression problems, the prediction is the average target value of the training samples that end up in the same leaf as the new observation. The depth of the tree is the length of the longest path from the root node down to a leaf.

Now on to the exciting part! Let’s train and predict using our chosen model!
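A sketch of the training step with scikit-learn; the max_depth value and column names are assumptions, so the error you get may differ from the figure quoted below:

```python
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Everything except the raw year and the target is used as a feature.
feature_cols = [c for c in train.columns if c not in ("year", "unemployment_rate")]

# A shallow tree is a sensible default for a tiny dataset.
model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(train[feature_cols], train["unemployment_rate"])

preds = model.predict(valid[feature_cols])
print(mean_absolute_error(valid["unemployment_rate"], preds))
```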


We get an error of approximately 0.45 for our predictions, which is not bad at all considering the size of our dataset. Despite this, it is still important to put a confidence interval around our predictions on the test data, because our dataset is extremely small. Since our sample size is less than 30, we will calculate a Student's t confidence interval.
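A sketch of the Student's t interval with scipy, computed over the predictions:

```python
import numpy as np
from scipy import stats

preds = np.asarray(preds)
n = len(preds)

# 95% Student's t confidence interval for the mean prediction:
# t.interval(confidence, degrees_of_freedom, loc=mean, scale=standard_error)
lower, upper = stats.t.interval(0.95, n - 1, loc=preds.mean(), scale=stats.sem(preds))
print(lower, upper)
```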


You can learn all about confidence intervals here: https://sixsigmastudyguide.com/types-of-hypothesis-tests/#:~:text=Types%20of%20Hypothesis%20Tests%3A%20a,t%2Dtests%20compare%20two%20samples.


TESTING THE MODEL

Let us load the test dataset and predict the unemployment rate. The labour force in the test dataset has been forecast using the average % growth of the labour force in Nigeria over the last few years, while the % of manufacturing value added to the tech industry in Nigeria has been deliberately increased by 5% over 5 years to test whether our alternative hypothesis holds.
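A hypothetical sketch of how such a test frame could be built, interpreting "5% over 5 years" as roughly 1% per year (both the names and that interpretation are assumptions):

```python
import pandas as pd

last = df.iloc[-1]  # most recent observed year
avg_growth = df["labour_force"].pct_change().tail(5).mean()  # recent average growth

test = pd.DataFrame({
    "year": range(2020, 2025),
    "labour_force": [last["labour_force"] * (1 + avg_growth) ** (i + 1) for i in range(5)],
    # roughly 1% extra manufacturing value added per year, about 5% over 5 years
    "manufacturing_value": [last["manufacturing_value"] * 1.01 ** (i + 1) for i in range(5)],
})
```

The same feature engineering steps (percentage change, interactions, count encoding) would then have to be applied to this frame before calling model.predict.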


Our model predicted a drop in the unemployment rate in the next year, but no change after that. This is not surprising, as many other factors contribute to high unemployment rates in Nigeria, or in any country for that matter.


ANALYSIS AND CONCLUSIONS

The Student's t confidence interval for our predictions, at a 95% confidence level, is approximately [2.997, 12.45], while that of the unemployment rate in 2019 is approximately [3.26, 12.94]. We can therefore assume that the drop in the unemployment rate our model predicted could be statistically significant, since the lower bound lies outside the confidence interval of the 2019 unemployment rate. However, this is only an assumption, and a larger dataset could prove it wrong, because if we simply go by the point prediction, our model's prediction would not be statistically significant. Still, since there are not enough records, the alternative hypothesis, that increasing the manufacturing value added to the tech industry in Nigeria can reduce unemployment rates, will be accepted.



Working with extremely small datasets is hard, but by keeping things simple, extending features where you can, and using confidence intervals, you can draw sensible conclusions from decently performing models. Another trick that could have been used here is outlier detection and handling, as outliers have a much more significant impact on model performance when the dataset is extremely small. I must also point out that these predictions do not take into consideration the impact of the COVID-19 pandemic.

If you made it this far, you are a gem! Please connect with me here, and like and share if you enjoyed it! You can check out the full code here: https://www.kaggle.com/chizzi25/data-science-project-1-manufacturing-in-ng/edit/run/38483023

Please do provide any feedback if you can and thanks for reading :)

