bReakfast: subsetting data
Mark Niemann-Ross
Author of "Stupid Machine" and educator at LinkedIn Learning
Slice and Dice the Stack Overflow Developer Survey
Stack Overflow surveys developers every year and their resulting analysis is recommended reading. In addition, they publish the complete (anonymized) data set for anyone interested in further research.
Since I teach R (and Raspberry Pi), I'm interested in anything related to R learners, so I want to subset that survey to just R programmers. In the next couple of posts, I'll show you how I did it. (Are you interested in how to do this with the tidyverse? See how fellow LinkedIn Learning author Martin John Hadley accomplishes this task.)
My first step is to import the data...
survey_results_public <- read.csv("~/so developer survey 2019/developer_survey_2019/survey_results_public.csv",
stringsAsFactors=FALSE)
... then explore what I've imported. Exploration is a fancy word for just looking at the data.frame that resulted from read.csv. I can see two interesting variables...
survey_results_public$LanguageWorkedWith
survey_results_public$LanguageDesireNextYear
These variables contain semicolon-delimited strings...
sample(survey_results_public$LanguageWorkedWith, 1)
[1] "Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;PHP;Python;SQL"
...or...
[1] "C;C++;C#;HTML/CSS;Java;JavaScript;PHP;Python;SQL;TypeScript"
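To see what "semicolon-delimited" means in practice, here is a minimal sketch that splits one response into its individual language names with strsplit(). This is only an illustration of the data's shape, not the approach used in the rest of this post. The variable names response and languages are made up for the example.

```r
# One response, copied from the sample output above
response <- "Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;PHP;Python;SQL"

# strsplit() returns a list with one character vector per input string;
# fixed = TRUE treats ";" as a literal character, not a regular expression
languages <- strsplit(response, ";", fixed = TRUE)[[1]]
languages

# Comparing against whole names avoids partial matches
"R" %in% languages
"Python" %in% languages
```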
But I only want R programmers. Is there a simple way to subset R users? Sure...use grep
grep("R", survey_results_public$LanguageWorkedWith, value = TRUE)
...which results in...
"Bash/Shell/PowerShell;HTML/CSS;Ruby;SQL"
Maybe I shouldn't be smug. Grepping for "R" matches any string that contains a capital R, so it also finds Ruby and Rust and possibly others. I need to be more selective. Regular expressions are a black art, but worth adding to your toolbox. In this case, I'm looking for...
- strings beginning with "R;". For example: "R;HTML;SQL". The regular expression is ^R;
- strings containing ";R;". For example: "HTML;R;SQL". The regular expression is ;R;
- strings ending with ";R". For example: "HTML;SQL;R". The regular expression is ;R$
Regular expressions allow us to combine search strings with a pipe to indicate "or". So the regular expression...
^R;|;R;|;R$
... can be read as "strings beginning with R; OR ;R; OR ending with ;R"
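A quick way to check a pattern like this is to run it against a handful of strings where the right answer is obvious. The sketch below uses made-up responses, not survey data, and grepl(), which returns TRUE or FALSE for each element. It also shows one case the three alternatives do not cover: a response that lists R and nothing else. The second pattern, whole_entry, is my own assumption about how such a response might be caught if it should count.

```r
# Made-up responses covering each case; not taken from the survey data
responses <- c(
  "R;HTML;SQL",        # R first
  "HTML;R;SQL",        # R in the middle
  "HTML;SQL;R",        # R last
  "HTML/CSS;Ruby;SQL", # contains a capital R, but not the language R
  "Rust;Python",       # starts with a capital R, but is not R
  "R"                  # R is the only language listed
)

pattern <- "^R;|;R;|;R$"

# TRUE for the first three, FALSE for Ruby, Rust and the lone "R"
grepl(pattern, responses)

# Assumption: a response listing only R should match too.
# This variant treats start-of-string and end-of-string like a semicolon.
whole_entry <- "(^|;)R(;|$)"
grepl(whole_entry, responses)
```

Whether a lone "R" shows up in the real survey is something to check against the data before deciding which pattern to use.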
Here's the grep with this new regular expression...
grep(c("^R;|;R;|;R$"), survey_results_public$LanguageWorkedWith, value = TRUE)
...which finds...
[999] "Assembly;Python;R"
[996] "R;SQL"
[982] "C++;C#;HTML/CSS;JavaScript;Python;R;SQL"
EXCELLENT. This regular expression works. Now I want the positions of the survey responses that match this regular expression, so I can use them as row numbers later. Removing "value = TRUE" from the grep does just that...
grep(c("^R;|;R;|;R$"), survey_results_public$LanguageWorkedWith)
...which returns...
[1] 6 10 12 18 20 33 51 66 95 96 97 129 170 173
[15] 187 196 198 213 224 306 336 369 380 384 407 410 418 437
...and so on
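Those numbers are positions in the vector, which means they can also serve as row numbers in the data.frame. Here is a minimal sketch of that idea, using a small made-up data frame (toy_survey, r_rows and r_users are hypothetical names, and the values are invented) in place of the real survey file.

```r
# Hypothetical stand-in for survey_results_public; values are made up
toy_survey <- data.frame(
  Respondent = 1:4,
  LanguageWorkedWith = c("R;SQL", "C#;Ruby", "Python;R", NA),
  stringsAsFactors = FALSE
)

# Without value = TRUE, grep() returns the positions of the matches.
# Missing answers (NA) never match, so they are left out.
r_rows <- grep("^R;|;R;|;R$", toy_survey$LanguageWorkedWith)
r_rows

# Use those positions as row numbers; the empty column slot keeps every column
r_users <- toy_survey[r_rows, ]
r_users
```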
Next Time...
That's enough for now; I need to get on with other tasks. Next time, I'll discover how to combine...
grep(c("^R;|;R;|;R$"), survey_results_public$LanguageWorkedWith)
...with...
grep(c("^R;|;R;|;R$"), survey_results_public$LanguageDesireNextYear)
Or perhaps you'll beat me to it. Feel free to add your solution in the comments below...
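One possible starting point, offered as a sketch rather than the answer the next post will settle on: because both greps return row positions, base R's set functions can combine them. The vectors below are made up for the example, and which combination is "right" depends on whether you want people who use R, people who want to use R, or both.

```r
# Made-up answers; the real columns live in survey_results_public
worked_with <- c("R;SQL", "Python", "C#;Ruby", "Python;R")
desire_next_year <- c("Python;R", "Go;R", "Rust", "Julia")

pattern <- "^R;|;R;|;R$"
uses_r <- grep(pattern, worked_with)
wants_r <- grep(pattern, desire_next_year)

# Respondents who use R OR want to use it next year
union(uses_r, wants_r)

# Respondents who use R AND want to use it next year
intersect(uses_r, wants_r)
```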
bReakfast is an ongoing look over my shoulder as I use R to explore data.
#rstats #linkedinlearning