bReakfast: Who Wants More R?
Mark Niemann-Ross
Author of "Stupid Machine" and educator at LinkedIn Learning
Slice and Dice the Stack Overflow Developer Survey
Stack Overflow surveys developers every year and their analysis is recommended reading. In addition, they publish the complete (anonymized) data set for further research.
Since I teach R (and Raspberry Pi), I'm interested in anything related to R learners. In the next couple of posts, I'll show you how I did research on the R dataset. (Are you interested in how to do this with the tidyverse? See how fellow LinkedIn Learning author Martin John Hadley accomplishes this task.)
In the last post, I imported the survey and found the R programmers in a semicolon-delimited text string using grep and regular expressions.
Which Programmers Want To Learn More R?
I'm interested in creating awareness of my LinkedIn Learning courses on R programming. To do that, I'm going to advertise on a website - but which website will be the most valuable? I'd like to generate a table showing the amount of interest in R, broken down by the languages people already know.
It's really important to clearly state the question. Which seems like a silly thing to point out - but if you aren't clear on where you're trying to go, how are you going to get there?
What I intend to do is ...
- Calculate how many programmers use each language
- Of the subset of programmers who want to learn more R, calculate how many programmers use each language
- Divide the result of the second step by the result of the first. This gives me the percentage of programmers who want to learn R, listed by the language they currently use.
So I need a list of languages used by developers from the Stack Overflow developer survey.
# I'll start with ...
survey_results_public$LanguageWorkedWith
# ...then split it at the string delimiter ";"...
strsplit(survey_results_public$LanguageWorkedWith, ";")
# ...this produces a list which I'll flatten...
unlist(strsplit(survey_results_public$LanguageWorkedWith, ";"))
# ...then count per language ...
table(unlist(strsplit(survey_results_public$LanguageWorkedWith, ";")))
# ...then convert to a data.frame and store in "totPop_Lang"
totPop_Lang <- as.data.frame(table(unlist(strsplit(survey_results_public$LanguageWorkedWith, ";"))))
Here are links to video lessons on unlist, table, and as.data.frame.
totPop_Lang is now a data.frame that contains ...
Assembly   Bash/Shell/PowerShell  C       C#
5833       31991                  18017   27097
C++        Clojure                Dart    Elixir
20524      1254                   1683    1260
Erlang     F#                     Go      HTML/CSS
777        973                    7201    55466
Java       JavaScript             Kotlin  Objective-C
35917      59219                  5620    4191
Other(s):  PHP                    Python  R
7920       23030                  36443   5048
Ruby       Rust                   Scala   SQL
7331       2794                   3309    47544
Swift      TypeScript             VBA     WebAssembly
5744       18523                  4781    1015
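If you'd like to see the split, flatten, and count steps on something smaller than the full survey, here's a minimal sketch. The toy_worked vector and its responses are made up for illustration; they stand in for survey_results_public$LanguageWorkedWith.

```r
# Made-up responses standing in for survey_results_public$LanguageWorkedWith
toy_worked <- c("R;Python;SQL", "Python", "C;R", NA)

# Split each response at ";" : one character vector per respondent
toy_split <- strsplit(toy_worked, ";")

# Flatten the list into a single vector of language mentions
toy_flat <- unlist(toy_split)

# Count mentions per language (table() leaves out NA by default),
# then convert the counts to a data.frame
toy_counts <- as.data.frame(table(toy_flat), stringsAsFactors = FALSE)
print(toy_counts)
```

Each respondent can contribute several mentions, so the counts add up to more than the number of respondents.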
Next I'll count the languages used by programmers who want to learn more R. In this example, I've stored each intermediate result in its own variable. This is the sort of thing best done with a pipe (i.e. %>%). I didn't use one here because I'm trying to keep the example clear.
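# keep rows whose LanguageDesireNextYear lists R next to at least one other language,
# and return their LanguageWorkedWith strings
step3a <- survey_results_public[grep("^R;|;R;|;R$", survey_results_public$LanguageDesireNextYear), "LanguageWorkedWith"]
step3b <- strsplit(step3a, ";")   # split each string at ";"
step3c <- unlist(step3b)          # flatten the list into one vector
step3d <- table(step3c)           # count per language
step3e <- as.data.frame(step3d)   # convert the counts to a data.frame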
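For comparison, here's a rough sketch of the same chain written with a pipe. I'm assuming R 4.1 or later so I can use base R's native pipe |> instead of %>% (which needs the magrittr package). The toy data.frame and its contents are made up; only the column names come from the survey.

```r
# Made-up stand-in for survey_results_public (requires R >= 4.1 for |>)
toy <- data.frame(
  LanguageWorkedWith     = c("R;Python", "VBA;SQL", "Python;SQL"),
  LanguageDesireNextYear = c("Python;R", "R;SQL", "Go"),
  stringsAsFactors = FALSE
)

# Rows where R shows up in LanguageDesireNextYear (same pattern as above)
wants_R <- grep("^R;|;R;|;R$", toy$LanguageDesireNextYear)

# Same steps as step3a..step3e, chained instead of stored one by one
piped <- toy[wants_R, "LanguageWorkedWith"] |>
  strsplit(";") |>
  unlist() |>
  table() |>
  as.data.frame()

print(piped)
```

One difference to watch for: because table() no longer receives a named variable, the language column comes out as Var1 rather than step3c.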
step3e now contains a count by language, but only for programmers who want to learn R...
Assembly   Bash/Shell/PowerShell  C       C#
615        2864                   1767    1887
C++        Clojure                Dart    Elixir
1871       114                    145     121
Erlang     F#                     Go      HTML/CSS
110        129                    467     4139
Java       JavaScript             Kotlin  Objective-C
2755       4017                   293     269
Other(s):  PHP                    Python  R
684        1662                   3987    2541
Ruby       Rust                   Scala   SQL
524        158                    341     4454
Swift      TypeScript             VBA     WebAssembly
314        1032                   758     108
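It's worth checking what the grep pattern above does and doesn't match. Here's a small sketch on made-up strings. The pattern_alt expression is my own hypothetical alternative, not something used for the counts in this post; whether it changes the survey numbers depends on how many respondents listed R as their only desired language.

```r
# Made-up LanguageDesireNextYear values
toy_desire <- c("R", "R;Python", "Python;R;SQL", "SQL;R", "Rust;Ruby", "C;Rust", NA)

# Pattern used in this post: R must sit next to at least one ";"
pattern_post <- "^R;|;R;|;R$"

# Hypothetical alternative: R bounded by start/";" on the left and ";"/end on the right
pattern_alt <- "(^|;)R(;|$)"

match_post <- grepl(pattern_post, toy_desire)
match_alt  <- grepl(pattern_alt, toy_desire)

print(data.frame(toy_desire, match_post, match_alt))
```

Both patterns skip Rust and Ruby. The only difference on this toy data is the response that is exactly "R", which the alternative catches and the original pattern does not.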
Next, I merge the two data sets into one data.frame. I'm doing this to simplify the example so someone else has a chance of understanding what I'm doing...
lang_tot_R <- merge(totPop_Lang, step3e,
by.x = "Var1", by.y = "step3c")
# then I clean up the names
names(lang_tot_R) <- c("Language", "worked with", "desire")
Here are video lessons on merge and names.
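To see how merge lines up the two tables, here's a minimal sketch with made-up counts. The all_langs and r_learners data.frames are hypothetical stand-ins for totPop_Lang and step3e; I've kept the same column names (Var1, step3c, Freq) so the merge call matches the one above.

```r
# Made-up counts standing in for totPop_Lang and step3e
all_langs  <- data.frame(Var1   = c("C", "Python", "R", "VBA"),
                         Freq   = c(10, 20, 8, 4))
r_learners <- data.frame(step3c = c("Python", "R", "VBA"),
                         Freq   = c(2, 4, 1))

# Join on the language column, which has a different name in each table
merged <- merge(all_langs, r_learners, by.x = "Var1", by.y = "step3c")

# Both tables have a Freq column, so merge adds .x and .y suffixes
print(names(merged))

# Rename to something readable
names(merged) <- c("Language", "worked with", "desire")
print(merged)
```

By default merge keeps only languages found in both tables, so "C" drops out here. Adding all.x = TRUE would keep it, with NA in the desire column.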
I'm interested in the relative interest among users of each language for learning R. So I divide the second count by the first....
# divide "desire" by "worked with" and store in "quotient"
lang_tot_R$quotient <- lang_tot_R$desire / lang_tot_R$`worked with`
# sort by interest (quotient)
lang_tot_R <- lang_tot_R[order(lang_tot_R$quotient, decreasing = TRUE), ]
# convert the quotient to a percentage
lang_tot_R$quotient <- lang_tot_R$quotient * 100
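As a quick sanity check on the arithmetic, here's a sketch that repeats the divide, sort, and scale steps on three rows. The counts are the Python, R, and VBA figures from the two tables above; the check data.frame and its column names are just for illustration.

```r
# Counts for three languages, copied from the two tables above
check <- data.frame(
  Language    = c("Python", "R", "VBA"),
  worked_with = c(36443, 5048, 4781),
  desire      = c(3987, 2541, 758)
)

# Share of each language's users who want to learn R, as a percentage
check$percent <- check$desire / check$worked_with * 100

# Sort from most to least interested
check <- check[order(check$percent, decreasing = TRUE), ]
print(check)
```

For R, that works out to 2541 / 5048, or a little over 50%.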
And presto - I have the result...
    Language               worked with  desire  quotient
20  R                      5048         2541    50.336767
27  VBA                    4781         758     15.854424
9   Erlang                 777          110     14.157014
10  F#                     973          129     13.257965
19  Python                 36443        3987    10.940373
28  WebAssembly            1015         108     10.640394
1   Assembly               5833         615     10.543460
23  Scala                  3309         341     10.305228
3   C                      18017        1767    9.807404
8   Elixir                 1260         121     9.603175
24  SQL                    47544        4454    9.368164
5   C++                    20524        1871    9.116157
6   Clojure                1254         114     9.090909
2   Bash/Shell/PowerShell  31991        2864    8.952518
17  Other(s):              7920         684     8.636364
7   Dart                   1683         145     8.615567
13  Java                   35917        2755    7.670462
12  HTML/CSS               55466        4139    7.462229
18  PHP                    23030        1662    7.216674
21  Ruby                   7331         524     7.147729
4   C#                     27097        1887    6.963871
14  JavaScript             59219        4017    6.783296
11  Go                     7201         467     6.485210
16  Objective-C            4191         269     6.418516
22  Rust                   2794         158     5.654975
26  TypeScript             18523        1032    5.571452
25  Swift                  5744         314     5.466574
15  Kotlin                 5620         293     5.213523
...and a barplot of the results...
par(mar=c(11,4,4,4)) # increase the bottom margin so the rotated labels fit
barplot(lang_tot_R$quotient,
names.arg = lang_tot_R$Language,
ylab = "% wanting to learn more R",
main = "Who wants to learn more R?",
las=2)
...that's the code. The plot is shown at the top of this article, but here it is in case it gets munged up.
Here are takeaways from this chart...
- 50% of R programmers want to learn more R
- The next interesting groups are VBA, Erlang, and F#
- Python programmers are a contented bunch - only about 11% feel a need to learn R
So - perhaps I should be advertising on R-centric sites, followed by sites catering to VBA, Erlang, and F#.
bReakfast is an ongoing look over my shoulder as I use R to explore data.
#rstats #linkedinlearning