bReakfast: Who Wants More R?

Slice and Dice the StackOverflow Developer Survey

Stack Overflow surveys developers every year and their analysis is recommended reading. In addition, they publish the complete (anonymized) data set for further research.

Since I teach R (and Raspberry Pi), I'm interested in anything related to R learners. In the next couple of posts, I'll show you how I researched R in the survey data set. (Are you interested in how to do this with the tidyverse? See how fellow LinkedIn Learning author Martin John Hadley accomplishes this task.)

In the last post, I imported the survey and found the R programmers in a semicolon-delimited text string using grep and regular expressions.

Which Programmers Want To Learn More R?

I'm interested in creating awareness of my LinkedIn Courses on R programming. To do that, I'm going to advertise on a website - but which website will be the most valuable? I'd like to generate a table showing the amount of interest in R against languages people already know.

It's really important to state the question clearly. That may seem like a silly thing to point out, but if you aren't clear on where you're trying to go, how are you going to get there?

What I intend to do is ...

  1. Calculate how many programmers use each language
  2. Of the subset of programmers who want to learn more R, calculate how many programmers use each language
  3. Divide the second step by the first step. This will give me a % of programmers who want to learn R listed by their currently used language.
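
The three steps above can be sketched on a tiny made-up data set before touching the real survey. Everything here is an assumption for illustration: the `toy` data frame, the helper name `count_languages`, and the regular expression are not from the survey or from this post; only the two column names are the survey's.

```r
# Toy stand-in for the survey; the real columns hold ";"-delimited strings.
toy <- data.frame(
  LanguageWorkedWith     = c("R;Python", "Python;SQL", "SQL", "R;SQL"),
  LanguageDesireNextYear = c("R;Go", "Python;R", "Go", "Rust"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count how often each language appears.
count_languages <- function(x) table(unlist(strsplit(x, ";")))

# Step 1: how many respondents use each language
used_all <- count_languages(toy$LanguageWorkedWith)

# Step 2: the same count, restricted to respondents who want R next year
wants_r <- grepl("(^|;)R(;|$)", toy$LanguageDesireNextYear)
used_by_r_fans <- count_languages(toy$LanguageWorkedWith[wants_r])

# Step 3: percentage of each language's users who want R
# (indexing by name keeps the two tables aligned)
pct <- 100 * used_by_r_fans[names(used_all)] / used_all
print(pct)
```

In this toy set every language appears in both counts, so the name-based indexing in step 3 never hits a missing entry; real data may need a check for that.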

So I need a list of the languages used by developers in the Stack Overflow developer survey.

# I'll start with ...

survey_results_public$LanguageWorkedWith

# ...then split it at the string delimiter ";"...

strsplit(survey_results_public$LanguageWorkedWith, ";")

# ...this produces a list which I'll flatten...

unlist(strsplit(survey_results_public$LanguageWorkedWith, ";"))

# ...then count per language ...

table(unlist(strsplit(survey_results_public$LanguageWorkedWith, ";")))

# ...then convert to a data.frame and store in "totPop_Lang"

totPop_Lang <- as.data.frame(table(unlist(strsplit(survey_results_public$LanguageWorkedWith, ";"))))

Here are links to video lessons on unlist, table, and as.data.frame.
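
To see what each intermediate step returns, here is a small sketch on a made-up vector (the `answers` values are hypothetical, with NA standing in for a skipped question). It also shows why the first column of the resulting data.frame is named after the object passed to table(), which is where names like "Var1" and "step3c" come from later in this post.

```r
# Toy answers; NA stands for a respondent who skipped the question.
answers <- c("R;Python", NA, "Python;SQL")

pieces <- strsplit(answers, ";")   # list: one character vector per respondent
flat   <- unlist(pieces)           # one long character vector, NA included
counts <- table(flat)              # table() ignores NA unless useNA is set
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)

# The first column is called "flat" because table() was given the variable
# `flat`; passing an expression instead yields the generic name "Var1".
print(counts_df)
```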

totPop_Lang is now a data.frame with one row per language. Here are the counts, shown in table() layout ...

Assembly Bash/Shell/PowerShell                     C                   C#
    5833                 31991                 18017                 27097 

     C++               Clojure                  Dart                Elixir 
   20524                  1254                  1683                  1260 

  Erlang                    F#                    Go              HTML/CSS 
     777                   973                  7201                 55466
 
    Java            JavaScript                Kotlin           Objective-C 
   35917                 59219                  5620                  4191 

Other(s):                   PHP                Python                     R 
     7920                 23030                 36443                  5048 

    Ruby                  Rust                 Scala                   SQL 
    7331                  2794                  3309                 47544 

   Swift            TypeScript                   VBA           WebAssembly 
    5744                 18523                  4781                  1015 

Next I'll count the languages used by programmers who want to learn more R. In this example, I've stored the result of each step in its own variable. This is the sort of thing best done with a pipe (i.e. %>%); I didn't use one here because I'm trying to keep the example clear.

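# keep LanguageWorkedWith for respondents whose LanguageDesireNextYear includes R
# (the pattern matches R at the start, middle, or end of the ";"-delimited list;
# an answer that is exactly "R" is not matched)
step3a <- survey_results_public[grep("^R;|;R;|;R$", survey_results_public$LanguageDesireNextYear), "LanguageWorkedWith"]
step3b <- strsplit(step3a, ";")   # split each answer into languages
step3c <- unlist(step3b)          # flatten to one vector
step3d <- table(step3c)           # count per language
step3e <- as.data.frame(step3d)   # columns: step3c, Freq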
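
A quick way to check what the regular expression in step3a does and does not match is to run it against a few made-up answers. The `desired` vector below is hypothetical, and the second pattern, `(^|;)R(;|$)`, is an assumed alternative rather than anything used in this post; it is included only to show the one case where the two differ.

```r
# Hypothetical LanguageDesireNextYear values
desired <- c("R", "R;Python", "Go;R;SQL", "Python;R", "Rust;Ruby", "C;C++")

original <- grepl("^R;|;R;|;R$", desired)   # the pattern used in this post
anchored <- grepl("(^|;)R(;|$)", desired)   # assumed alternative

# Side-by-side comparison; the two columns differ only for the answer "R"
print(data.frame(desired, original, anchored))
```

Neither pattern is fooled by Rust or Ruby. Whether any survey respondent listed R alone is not something this sketch can tell you; it would need checking against the data.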

step3e now contains a count by language, but only for programmers that want to learn R...

 Assembly Bash/Shell/PowerShell                     C                   C#
      615                  2864                  1767                  1887 

      C++               Clojure                  Dart                Elixir 
     1871                   114                   145                   121 

   Erlang                    F#                    Go              HTML/CSS 
      110                   129                   467                  4139 

     Java            JavaScript                Kotlin           Objective-C 
     2755                  4017                   293                   269 

Other(s):                   PHP                Python                     R 
      684                  1662                  3987                  2541 

     Ruby                  Rust                 Scala                   SQL 
      524                   158                   341                  4454 

    Swift            TypeScript                   VBA           WebAssembly 
      314                  1032                   758                   108 
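
For comparison, the five step3 assignments can be chained into a single pipeline. This is a sketch on made-up data (`toy` and `r_fans_lang` are hypothetical names), and it uses the native `|>` pipe, which assumes R 4.1 or later; `%>%` from the magrittr package reads the same way.

```r
toy <- data.frame(
  LanguageWorkedWith     = c("R;Python", "Python;SQL", "SQL"),
  LanguageDesireNextYear = c("Python;R", "R;Go", "Go"),
  stringsAsFactors = FALSE
)

# Same pattern as in the post
wants_r <- grepl("^R;|;R;|;R$", toy$LanguageDesireNextYear)

r_fans_lang <- toy[wants_r, "LanguageWorkedWith"] |>
  strsplit(";") |>     # split each answer into languages
  unlist() |>          # flatten to one vector
  table() |>           # count per language
  as.data.frame()      # columns: Var1, Freq

print(r_fans_lang)
```

One side effect of piping: table() no longer sees a named variable, so the language column comes out as "Var1" rather than "step3c". Any later merge would have to use that name.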

Next, I merge the two data sets into one data.frame. I'm doing this to simplify the example so someone else has a chance of understanding what I'm doing...

lang_tot_R <- merge(totPop_Lang, step3e, 
                    by.x = "Var1", by.y = "step3c")

# then I clean up the names
names(lang_tot_R) <- c("Language", "worked with", "desire")

Here are video lessons on merge and names.
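
One thing worth knowing about merge: by default it keeps only the keys found in both data frames. In this post every language shows up in both counts, so nothing is lost, but a small sketch with made-up numbers shows what would happen otherwise. The data frames and the choice to fill missing counts with zero are assumptions for illustration.

```r
all_users <- data.frame(Var1   = c("Python", "R", "SQL"), Freq = c(10, 4, 8))
r_fans    <- data.frame(step3c = c("Python", "R"),        Freq = c(2, 3))

# Default merge: SQL is dropped because r_fans has no SQL row.
# The two Freq columns are renamed Freq.x and Freq.y.
inner <- merge(all_users, r_fans, by.x = "Var1", by.y = "step3c")

# all.x = TRUE keeps every row of all_users; unmatched counts become NA
outer <- merge(all_users, r_fans, by.x = "Var1", by.y = "step3c", all.x = TRUE)
outer$Freq.y[is.na(outer$Freq.y)] <- 0   # treat "no match" as zero interested users

print(inner)
print(outer)
```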

I'm interested in how much the users of each language want to learn R, relative to how many of them there are. So I divide the second count by the first...

# divide "desire" by "total population" and store in "quotient"
lang_tot_R$quotient <- lang_tot_R$desire / lang_tot_R$`worked with`

# sort by interest (quotient)
lang_tot_R <- lang_tot_R[order(lang_tot_R$quotient, decreasing = TRUE), ]

# convert the quotient to a percentage
lang_tot_R$quotient <- lang_tot_R$quotient * 100
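
As a sanity check on the arithmetic, the same divide, sort, and scale steps can be run on three rows whose counts are copied from the tables earlier in this post. The data frame name `check` is hypothetical.

```r
check <- data.frame(
  Language      = c("Kotlin", "R", "VBA"),
  `worked with` = c(5620, 5048, 4781),
  desire        = c(293, 2541, 758),
  check.names      = FALSE,   # keep the space in "worked with"
  stringsAsFactors = FALSE
)

# percentage of each language's users who want to learn R
check$quotient <- 100 * check$desire / check$`worked with`

# highest interest first
check <- check[order(check$quotient, decreasing = TRUE), ]
print(check)
```

Multiplying by 100 before or after sorting makes no difference to the order, since every quotient is scaled by the same factor.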

And presto - I have the result...

                Language worked with desire  quotient
20                     R        5048   2541 50.336767
27                   VBA        4781    758 15.854424
9                 Erlang         777    110 14.157014
10                    F#         973    129 13.257965
19                Python       36443   3987 10.940373
28           WebAssembly        1015    108 10.640394
1               Assembly        5833    615 10.543460
23                 Scala        3309    341 10.305228
3                      C       18017   1767  9.807404
8                 Elixir        1260    121  9.603175
24                   SQL       47544   4454  9.368164
5                    C++       20524   1871  9.116157
6                Clojure        1254    114  9.090909
2  Bash/Shell/PowerShell       31991   2864  8.952518
17             Other(s):        7920    684  8.636364
7                   Dart        1683    145  8.615567
13                  Java       35917   2755  7.670462
12              HTML/CSS       55466   4139  7.462229
18                   PHP       23030   1662  7.216674
21                  Ruby        7331    524  7.147729
4                     C#       27097   1887  6.963871
14            JavaScript       59219   4017  6.783296
11                    Go        7201    467  6.485210
16           Objective-C        4191    269  6.418516
22                  Rust        2794    158  5.654975
26            TypeScript       18523   1032  5.571452
25                 Swift        5744    314  5.466574
15                Kotlin        5620    293  5.213523

...and a barplot of the results...

par(mar=c(11,4,4,4)) #increase margin
barplot(lang_tot_R$quotient,
        names.arg = lang_tot_R$Language,
        ylab = "% wanting to learn more R",
        main = "Who wants to learn more R?",
        las=2)

...that's the code. The plot is shown at the top of this article, but here it is in case it gets munged up.

[Bar plot: "Who wants to learn more R?", y-axis "% wanting to learn more R", one bar per language]
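
If the image doesn't survive, the plot can be regenerated and written to a file. This is a minimal sketch, assuming a PDF in a temporary directory is acceptable; the file name is hypothetical and the three values are rounded from the result table above.

```r
toy <- data.frame(
  Language = c("R", "VBA", "Kotlin"),
  quotient = c(50.3, 15.9, 5.2),
  stringsAsFactors = FALSE
)

out_file <- file.path(tempdir(), "who_wants_r.pdf")  # hypothetical file name

pdf(out_file, width = 7, height = 5)   # open a file-based graphics device
par(mar = c(11, 4, 4, 4))              # room for the rotated labels
mids <- barplot(toy$quotient,
                names.arg = toy$Language,
                ylab = "% wanting to learn more R",
                main = "Who wants to learn more R?",
                las = 2)
dev.off()                              # close the device so the file is written
```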

Here are takeaways from this chart...

  1. 50% of R programmers want to learn more R
  2. The next interesting groups are VBA, Erlang, and F#
  3. Python programmers are a contented bunch - only about 11% feel a need to learn R

So - perhaps I should be advertising on R-centric sites, followed by sites catering to VBA, Erlang, and F#.

bReakfast is an ongoing look over my shoulder as I use R to explore data.

#rstats #linkedinlearning
