R Challenge #1: World Population
I'm proud to say I've joined the esteemed ranks of Code Challenge Authors. In this article and a set of future ones, I'll share some backstory behind each challenge. My peers have supplied other code challenges you might find interesting: JavaScript ... Python ... Java ... GitHub ... HTML ... SQL ... SQL for Data Science ... PHP
R Challenge #1: Import the World Population database
Importing a CSV file into an R data object seems straightforward - but you quickly run into nuances that will foul your data. Data types, factors, missing data, incomplete lines, and more can turn an import into a nightmare. In this episode, I challenge you to import a CSV file.
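To make one of those nuances concrete, here is a minimal sketch using a tiny, made-up CSV (the data and the names csv_text, naive, and typed are hypothetical, not taken from the UN file). It shows how a single nonstandard missing-value marker changes the type of a whole column, and how declaring types up front avoids that.

```r
# A tiny, made-up CSV standing in for a real file (hypothetical data).
csv_text <- "id,name,count
1,Alpha,10
2,Beta,N/A
3,Gamma,30"

# Default import: "N/A" is not R's default missing-value marker ("NA"),
# so the whole count column is read as text
# (character in R 4.0 and later, factor in earlier versions).
naive <- read.csv(text = csv_text)
print(class(naive$count))

# Declaring the missing-value marker and the column types up front
# keeps count as an integer column with a proper NA in row 2.
typed <- read.csv(text = csv_text,
                  na.strings = "N/A",
                  colClasses = c("integer", "character", "integer"))
print(class(typed$count))
print(typed$count)
```

The same two arguments, na.strings and colClasses, work when the first argument is a file path or URL instead of text.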
Here's My Solution
# import the United Nations world population database
# all fields should be integer or numeric
# except variant = factor, location = character
worldPop <- read.csv("https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv",
                     colClasses = c("integer","character","integer", "factor", "integer", "numeric","numeric","numeric","numeric","numeric"))
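One way to check that an import like this produced the intended types, without pulling the full file, is to run the same colClasses against a couple of sample rows. This is only a sketch: the ten column names below are my assumption about the file's header, and the values are made up.

```r
# Two made-up rows in the assumed ten-column layout of the UN file.
# Column names are an assumption; the values are invented for illustration.
sample_csv <- "LocID,Location,VarID,Variant,Time,MidPeriod,PopMale,PopFemale,PopTotal,PopDensity
900,Example A,2,Medium,1950,1950.5,10.5,11.5,22.0,1.1
901,Example B,2,Medium,1951,1951.5,,12.5,,1.3"

# The same column classes used in the solution.
wanted <- c("integer", "character", "integer", "factor", "integer",
            rep("numeric", 5))

samplePop <- read.csv(text = sample_csv, colClasses = wanted)

# Collect the first class of every column and compare with what we asked for.
actual <- vapply(samplePop, function(col) class(col)[1], character(1))
stopifnot(identical(unname(actual), wanted))

# Empty numeric fields arrive as NA rather than as text.
print(is.na(samplePop$PopMale))
```

Running the same vapply check on worldPop after the real import confirms the full data set came in with the types the challenge asks for.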
What You Don't See
Writing courses for LinkedIn (and other online learning portals) requires sample code to be both public domain and accessible - AND - it should be interesting to work with. The UN world population data meets all three requirements, but MAN, it was difficult to find. I spent more time looking for the dataset than I did actually writing the code.
One option would be to download the data and include it with the example files, but that would contribute to a multi-gigabyte download. Some of you don't have that kind of patience and bandwidth, so it's easier to have the code grab the file directly from the UN site.
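A middle ground, if you plan to run the import more than once, is to download the file a single time and read the local copy afterward. The sketch below assumes a helper I'm calling fetch_once (a hypothetical name, not part of base R) and a destination path of your choosing.

```r
# Download a file only if no local copy exists yet, then return its path.
# fetch_once is a hypothetical helper name, not part of base R.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" avoids line-ending translation on Windows.
    download.file(url, destfile = dest, mode = "wb")
  }
  dest
}

wpp_url <- "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv"

# Usage (needs network access on the first call only):
# local_csv <- fetch_once(wpp_url, file.path(tempdir(), "WPP2019_TotalPopulationBySex.csv"))
# worldPop  <- read.csv(local_csv)
```

Pass the same colClasses as in the solution to read.csv when reading the local copy. tempdir() is cleared when the R session ends, so pick a permanent folder if you want the copy to survive restarts.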
An Alternate Solution
I have an ongoing debate with my peers regarding base R vs tidy R. Should I teach the tidyverse (read_csv)? Or should I teach base R (read.csv)? So far, I've focused on base R - but I'm aware there are cleaner solutions available in the tidyverse. For example, compare this code to the above.
library(readr)
WPP2019_TotalPopulationBySex <- read_csv("https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv",
                                         col_types = "icifnnnnnn")
It looks similar, but read_csv is generally faster on large files, never turns text into factors unless you ask for it, and records parsing problems so you can inspect them afterward. Plus it creates a tibble, an object tailored to other routines in the tidyverse.
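The compact col_types string assigns one letter per column, in order: i for integer, c for character, f for factor, n for number. Here is a small sketch on made-up inline data (the data and the names tiny and tiny_classes are hypothetical). It assumes the readr package is installed and does nothing otherwise.

```r
# Sketch only: the body runs when the readr package is installed.
tiny_classes <- NULL

if (requireNamespace("readr", quietly = TRUE)) {
  # I() marks the string as literal CSV data rather than a file path.
  tiny <- readr::read_csv(
    I("id,name,group,value\n1,Alpha,a,1.5\n2,Beta,b,2.5\n"),
    col_types = "icfn"  # integer, character, factor, number
  )

  # First class of each column, in column order.
  tiny_classes <- vapply(tiny, function(col) class(col)[1], character(1))
  print(tiny_classes)
}
```

Ten letters in "icifnnnnnn" line up with the ten columns named in the base R colClasses vector, with n standing in for both the integer Time column and the numeric population columns.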
Choosing a Problem
Choosing data is one aspect of authoring a course. Choosing a problem is another. Good challenges require some thought, possibly resulting in failure the first time through. I believe it's possible to import the data within two or three attempts.
And... good challenges are focused. I hope to illustrate one concept - maybe two. If a challenge requires multiple solutions, an early failure obscures the remaining lessons. Less complex challenges make it easier to focus on a single concept - but they may be less interesting to solve.
How about you?
Do you have some opinions on this challenge? Please share them in the comments below.
mnr
I help busy solopreneur parents save 12+ hours per week by putting systems in place and automating more of their work.
3 years ago: Welcome to the pack! I love this course format!
Epidemiology & Biostatistics Consultant a/k/a Data Scientist | Exclusive and innovative solutions for data science challenges in public health, research and education
3 years ago: Hey Mark Niemann-Ross I love your videos! They are so fun! Hey Daniel Wanjiru - take a look at this video to see the R version of one of those SAS data steps where you use "input" and "cards". Hey healthcare analytics people: If you want to play with smaller datasets, try copying and pasting from this site with data about hospitals in the US: https://www.ahd.com/state_statistics.html
PhD, Researcher
3 years ago: Since the .csv file is relatively large (about 20 MB), I tried the following code to download it into the web browser's default download folder, so it can then be read locally:
url <- "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv"
browseURL(url)
I hope this helps people who have limited internet bandwidth or need to access the file frequently.
Generative AI, Prompt Engineering and Full Stack Development. LinkedIn Top Voice. Senior Staff Instructor at LinkedIn, Instructor at Stanford University.
3 years ago: I think it's Scotty's turn to show you the secret handshake.
Product Manager | Engineer | Instructor | Veteran
3 years ago: Congrats Mark! Welcome to the club!!! I had Metallica blasting in the background when I started playing your first challenge video. I misheard your opening words of "data science" as "data sucks." Seemed like an odd way to kick off an R video!