
bReakfast: subsetting data

Mark Niemann-Ross

Author of "Stupid Machine" and educator at LinkedIn Learning

Published: May 22, 2019

Slice and Dice the StackOverflow Developer Survey

Stack Overflow surveys developers every year and their resulting analysis is recommended reading. In addition, they publish the complete (anonymized) data set for anyone interested in further research.

Since I teach R (and Raspberry Pi), I'm interested in anything related to R learners, so I want to subset that survey to just R programmers. In the next couple of posts, I'll show you how I did it. (Interested in how to do this with the tidyverse? See how fellow LinkedIn Learning author Martin John Hadley accomplishes this task.)

My first step is to import the data...

survey_results_public <- read.csv("~/so developer survey 2019/developer_survey_2019/survey_results_public.csv", 
                                  stringsAsFactors=FALSE)

... then explore what I've imported. Exploration is a fancy word for just looking at the data.frame that resulted from read.csv. I can see two interesting variables...

survey_results_public$LanguageWorkedWith
survey_results_public$LanguageDesireNextYear
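
A minimal sketch of that kind of exploration, assuming a small made-up data frame so it runs without the survey download. The name toy_survey and its values are hypothetical stand-ins for survey_results_public, not real survey data.

```r
# Hypothetical stand-in for survey_results_public; the real file is far larger.
toy_survey <- data.frame(
  Respondent = 1:3,
  LanguageWorkedWith = c("HTML/CSS;R;SQL", "Python;Ruby", "R"),
  LanguageDesireNextYear = c("R;Python", "Rust", NA),
  stringsAsFactors = FALSE
)

str(toy_survey)    # column names, types, and the first few values
names(toy_survey)  # just the column names

# Pick out the columns whose names mention "Language"
language_columns <- grep("Language", names(toy_survey), value = TRUE)
print(language_columns)
```

Running the same three calls against the real data.frame is one way to spot the two language variables among the other columns.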

These variables contain semicolon-delimited strings...

sample(survey_results_public$LanguageWorkedWith, 1)

[1] "Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;PHP;Python;SQL"

...or...

[1] "C;C++;C#;HTML/CSS;Java;JavaScript;PHP;Python;SQL;TypeScript"
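
To see what one of these strings holds, it can help to break it apart at the semicolons. This is only a sketch for inspecting a single response (the variable names one_response and languages are mine), not the subsetting approach used in the rest of this post.

```r
# One response copied from the sample output above
one_response <- "Bash/Shell/PowerShell;C#;HTML/CSS;JavaScript;PHP;Python;SQL"

# strsplit returns a list with one element per input string;
# fixed = TRUE treats ";" as a literal character, not a regular expression
languages <- strsplit(one_response, ";", fixed = TRUE)[[1]]
print(languages)

# Exact comparison against each language name
uses_r <- "R" %in% languages
print(uses_r)
```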

But I only want R programmers. Is there a simple way to subset R users? Sure...use grep

grep("R", survey_results_public$LanguageWorkedWith, value = TRUE)

...which results in...

"Bash/Shell/PowerShell;HTML/CSS;Ruby;SQL"

Maybe I shouldn't be smug. Grepping for "R" also finds Ruby and Rust and possibly others. I need to be more selective. Regular expressions are a black art, but worth adding to your toolbox. In this case, I'm looking for...

strings beginning with "R;", for example "R;HTML;SQL". The regular expression is ^R;
strings containing ";R;", for example "HTML;R;SQL". The regular expression is ;R;
strings ending with ";R", for example "HTML;SQL;R". The regular expression is ;R$

Regular expressions allow us to combine search strings with a pipe to indicate "or". So the regular expression...

^R;|;R;|;R$

... can be read as "strings beginning with R; OR ;R; OR ending with ;R"

Here's the grep with this new regular expression...

grep(c("^R;|;R;|;R$"), survey_results_public$LanguageWorkedWith, value = TRUE)

...which finds...

[999] "Assembly;Python;R"
[996] "R;SQL"
[982] "C++;C#;HTML/CSS;JavaScript;Python;R;SQL"
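
A quick check on made-up strings, comparing the plain "R" search with the anchored pattern. The vector toy_languages is hypothetical. It also includes a respondent whose only language is R: that string has no semicolon, so the three-part pattern skips it. The last pattern, (^|;)R(;|$), is my own variation that also covers that case; I have not checked whether such rows occur in the 2019 data.

```r
# Hypothetical responses, not taken from the survey
toy_languages <- c(
  "R;HTML/CSS;SQL",                           # R first
  "HTML/CSS;R;SQL",                           # R in the middle
  "HTML/CSS;SQL;R",                           # R last
  "Bash/Shell/PowerShell;HTML/CSS;Ruby;SQL",  # Ruby, not R
  "C;Rust",                                   # Rust, not R
  "R"                                         # R only, no semicolon
)

naive_hits    <- grep("R", toy_languages)             # matches every element
anchored_hits <- grep("^R;|;R;|;R$", toy_languages)   # elements 1, 2, 3
whole_hits    <- grep("(^|;)R(;|$)", toy_languages)   # elements 1, 2, 3, 6

print(naive_hits)
print(anchored_hits)
print(whole_hits)
```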

EXCELLENT. This regular expression works. Now I want the row numbers of the survey responses that match it. Removing "value = TRUE" from the grep makes it return those positions instead of the matching strings...

grep(c("^R;|;R;|;R$"), survey_results_public$LanguageWorkedWith)

...which returns...

   [1]     6    10    12    18    20    33    51    66    95    96    97   129   170   173
  [15]   187   196   198   213   224   306   336   369   380   384   407   410   418   437

...and so on
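
Those numbers are row positions, so they can be used to pull the matching rows out of the data.frame. A minimal sketch on a made-up data frame (toy_survey, r_rows, r_users, and is_r_user are hypothetical names):

```r
# Hypothetical stand-in for survey_results_public
toy_survey <- data.frame(
  Respondent = 1:4,
  LanguageWorkedWith = c("R;SQL", "Python;Ruby", "Assembly;Python;R", NA),
  stringsAsFactors = FALSE
)

# Row positions of the responses that list R
r_rows <- grep("^R;|;R;|;R$", toy_survey$LanguageWorkedWith)

# Index the data frame by row to keep every column for those respondents
r_users <- toy_survey[r_rows, ]
print(r_users)

# grepl returns TRUE/FALSE per row instead of positions; NA counts as no match
is_r_user <- grepl("^R;|;R;|;R$", toy_survey$LanguageWorkedWith)
print(is_r_user)
```

Positions from grep suit row indexing; the logical vector from grepl is handy when the result should become a new column.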

Next Time...

That's enough for now; I need to get on with other tasks. Next time, I'll discover how to combine...

grep(c("^R;|;R;|;R$"), survey_results_public$LanguageWorkedWith)

...with...

grep(c("^R;|;R;|;R$"), survey_results_public$LanguageDesireNextYear)

Or perhaps you'll beat me to it. Feel free to add your solution in the comments below...
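
One possible way to combine the two, offered as a sketch and not necessarily where the next post ends up: because both greps return row positions, base R's union and intersect can merge them. The data frame and variable names here are made up for illustration.

```r
# Hypothetical stand-in for survey_results_public
toy_survey <- data.frame(
  LanguageWorkedWith     = c("R;SQL", "Python", "Java", "C;R"),
  LanguageDesireNextYear = c("Python", "Python;R", "Go", "R;Rust"),
  stringsAsFactors = FALSE
)

r_pattern <- "^R;|;R;|;R$"
worked  <- grep(r_pattern, toy_survey$LanguageWorkedWith)      # rows 1 and 4
desired <- grep(r_pattern, toy_survey$LanguageDesireNextYear)  # rows 2 and 4

# Rows that mention R in either column, in ascending order
either <- sort(union(worked, desired))

# Rows that mention R in both columns
both <- intersect(worked, desired)

print(either)
print(both)
```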

bReakfast is an ongoing look over my shoulder as I use R to explore data.

#rstats #linkedinlearning
