bReakfast: subsetting data
Janine from Mililani, Hawaii, United States CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)

Slice and Dice the Stack Overflow Developer Survey

Stack Overflow surveys developers every year and their resulting analysis is recommended reading. In addition, they publish the complete (anonymized) data set for anyone interested in further research.

Since I teach R (and Raspberry Pi), I'm interested in anything related to R learners, so I want to subset that survey to just R programmers. In the next couple of posts, I'll show you how I did it. (Interested in how to do this with the tidyverse? See how fellow LinkedIn Learning author Martin John Hadley accomplishes this task.)

My first step is to import the data...

survey_results_public <- read.csv("~/so developer survey 2019/developer_survey_2019/survey_results_public.csv", 
                                  stringsAsFactors=FALSE)

... then explore what I've imported. Exploration is a fancy word for just looking at the data.frame that resulted from read.csv. I can see two interesting variables...

survey_results_public$LanguageWorkedWith
survey_results_public$LanguageDesireNextYear

These variables contain semicolon-delimited strings...

sample(survey_results_public$LanguageWorkedWith, 1)

[1] "Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;PHP;Python;SQL"

...or...

[1] "C;C++;C#;HTML/CSS;Java;JavaScript;PHP;Python;SQL;TypeScript"
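To see the individual languages inside one of these strings, it can help to split it on the semicolon. This is only a sketch for looking at the structure: the value below is copied from the sample output above, and the names langs and parts are made up for illustration.

```r
# Sample value copied from the output shown above
langs <- "Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;PHP;Python;SQL"

# strsplit() returns a list with one element per input string,
# so [[1]] pulls out the character vector for our single string.
# fixed = TRUE treats ";" as a literal character, not a regex.
parts <- strsplit(langs, ";", fixed = TRUE)[[1]]

print(parts)
print(length(parts))      # 7 languages in this response
print("R" %in% parts)     # FALSE: this respondent does not list R
```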

But I only want R programmers. Is there a simple way to subset R users? Sure...use grep

grep("R", survey_results_public$LanguageWorkedWith, value = TRUE)

...which results in...

"Bash/Shell/PowerShell;HTML/CSS;Ruby;SQL"

Maybe I shouldn't be smug. Grepping for "R" also finds Ruby and Rust and possibly others, because the pattern matches a capital R anywhere in the string. I need to be more selective. Regular expressions are a black art, but worth adding to your toolbox. In this case, I'm looking for...

  • strings beginning with R; For example: "R;HTML;SQL". The regular expression is ^R;
  • strings containing ;R; For example: "HTML;R;SQL". The regular expression is ;R;
  • strings ending with ;R For example: "HTML;SQL;R". The regular expression is ;R$
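Each of the three patterns above can be tried on its own before combining them. Here is a small sketch using a made-up vector (toy is hypothetical, not survey data) built from the three examples plus one Ruby string that should never match.

```r
# Hypothetical test strings: the three examples above, plus a Ruby decoy
toy <- c("R;HTML;SQL", "HTML;R;SQL", "HTML;SQL;R", "HTML/CSS;Ruby;SQL")

# grepl() returns TRUE/FALSE for each element of the vector
starts <- grepl("^R;", toy)   # R is the first language listed
middle <- grepl(";R;", toy)   # R sits between two other languages
ends   <- grepl(";R$", toy)   # R is the last language listed

print(starts)  # TRUE FALSE FALSE FALSE
print(middle)  # FALSE TRUE FALSE FALSE
print(ends)    # FALSE FALSE TRUE FALSE
```

Each pattern picks out exactly one of the three examples, and none of them matches the Ruby string.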

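Regular expressions allow us to combine search strings with a pipe to indicate "or". So the regular expression...

^R;|;R;|;R$

... can be read as "strings beginning with R; OR containing ;R; OR ending with ;R". Note that there are no spaces around the pipes: a space inside a regular expression is matched literally.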

Here's the grep with this new regular expression...

grep("^R;|;R;|;R$", survey_results_public$LanguageWorkedWith, value = TRUE)

...which finds...

[999] "Assembly;Python;R"
[996] "R;SQL"
[982] "C++;C#;HTML/CSS;JavaScript;Python;R;SQL"
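A quick sanity check on a made-up vector confirms that Ruby and Rust no longer sneak in. The vector toy and the name compact_pattern are hypothetical; the compact pattern is an alternative I'm suggesting for comparison, not something the survey or grep requires.

```r
# Hypothetical test strings, including decoys and a lone "R"
toy <- c("Assembly;Python;R", "R;SQL", "HTML/CSS;Ruby;SQL", "C++;Rust", "R")

# The pattern used in this post
author_pattern <- "^R;|;R;|;R$"

# An alternative (assumption): R preceded by start-or-semicolon
# and followed by semicolon-or-end
compact_pattern <- "(^|;)R(;|$)"

author_hits  <- grep(author_pattern, toy, value = TRUE)
compact_hits <- grep(compact_pattern, toy, value = TRUE)

print(author_hits)   # "Assembly;Python;R" "R;SQL"
print(compact_hits)  # "Assembly;Python;R" "R;SQL" "R"
```

One edge case to watch: a response that is just "R" contains no semicolon, so none of the three alternatives match it. If any respondent listed R as their only language, the compact variant would catch them.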

EXCELLENT. This regular expression works. Now I want to know which survey responses match it. Removing "value = TRUE" makes grep return the row positions of the matching responses instead of the strings themselves...

grep("^R;|;R;|;R$", survey_results_public$LanguageWorkedWith)

...which returns...

   [1]     6    10    12    18    20    33    51    66    95    96    97   129   170   173
  [15]   187   196   198   213   224   306   336   369   380   384   407   410   418   437

...and so on
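Those numbers are row positions, so they can be used directly to subset the data.frame. Here is a minimal sketch on a made-up data.frame (toy_survey and its values are hypothetical; only the column name LanguageWorkedWith comes from the real survey).

```r
# Hypothetical stand-in for survey_results_public
toy_survey <- data.frame(
  Respondent = 1:5,
  LanguageWorkedWith = c("R;SQL", "HTML/CSS;Ruby;SQL", "Python;R", NA, "C;R;SQL"),
  stringsAsFactors = FALSE
)

# grep() returns the positions of matching rows; NA values never match
r_rows <- grep("^R;|;R;|;R$", toy_survey$LanguageWorkedWith)

# Use the positions as row indices; the empty column slot keeps all columns
r_users <- toy_survey[r_rows, ]

print(r_rows)         # 1 3 5
print(nrow(r_users))  # 3
```

The same indexing should work on survey_results_public with the index vector returned above.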

Next Time...

That's enough for now; I need to get on with other tasks. Next time, I'll discover how to combine...

grep("^R;|;R;|;R$", survey_results_public$LanguageWorkedWith)

...with...

grep("^R;|;R;|;R$", survey_results_public$LanguageDesireNextYear)

Or perhaps you'll beat me to it. Feel free to add your solution in the comments below...
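If you want a head start, here is one possible approach. It is a sketch, not necessarily where the next post will land. Because both greps return row positions, base R's set functions can combine them: union() for "either column mentions R" and intersect() for "both columns mention R". The data.frame toy_survey and its values are made up for illustration.

```r
# Hypothetical stand-in for survey_results_public
toy_survey <- data.frame(
  LanguageWorkedWith     = c("R;SQL", "Python;SQL", "Ruby", "C;R"),
  LanguageDesireNextYear = c("Python;R", "R;Rust", "Rust", "Go"),
  stringsAsFactors = FALSE
)

pattern <- "^R;|;R;|;R$"

worked  <- grep(pattern, toy_survey$LanguageWorkedWith)      # rows 1 and 4
desired <- grep(pattern, toy_survey$LanguageDesireNextYear)  # rows 1 and 2

# Rows where R appears in at least one of the two columns
either <- sort(union(worked, desired))

# Rows where R appears in both columns
both <- intersect(worked, desired)

print(either)  # 1 2 4
print(both)    # 1
```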

bReakfast is an ongoing look over my shoulder as I use R to explore data.

#rstats #linkedinlearning
