If Marie Kondo did SQL...

In a number of previous posts I have mentioned the dbplyr package as a really cool way to handle SQL databases in R. dbplyr is a very clever SQL translator which is getting more and more powerful with every update. It allows you to treat your database table like it is an R dataframe object and to manipulate it using simple, clean tidyverse commands. It’s so much nicer and more intuitive than handling all those messy SQL strings in your R console.

I thought I would use a quick example to illustrate how dbplyr works, and I hope you can pick this example up and play further with it. Before you know it, you’ll never want to write in SQL again!

Using the chinook database

chinook is a SQLite database which can be downloaded here, and it’s really easy to use for practice. Once you’ve unzipped it to a folder, get your R session open and we will use the following packages:

library(tidyverse)
library(dbplyr)
library(DBI)
library(RSQLite)

First we will establish the connection with the SQLite database. Although this database is simply a locally saved file, you would use a similar connection into your Oracle or MSSQL databases or whatever[1].

chinook <- DBI::dbConnect(
  drv = RSQLite::SQLite(),
  dbname = "chinook.db"
)

We now have an active connection into the chinook database. We can browse the tables in there using RSQLite::dbListTables(chinook), which tells us that it contains the following tables:

"albums"          "artists"         "customers"       "employees"       "genres"         "invoice_items"   "invoices"        "media_types"     "playlist_track"  "playlists"      "sqlite_sequence" "sqlite_stat1"    "tracks"

We are going to play around with a few of these tables using dbplyr — in particular the following:

  • employees — details of company employees
  • customers — table of customer details
  • invoices — table of invoice details attached to customer IDs
  • invoice_items — table of details of units ordered and prices in each invoice ID

We can easily set these tables up as database objects using dplyr as follows[2]:

employees <- dplyr::tbl(chinook, "employees")
customers <- dplyr::tbl(chinook, "customers")
invoices <- dplyr::tbl(chinook, "invoices")
invoice_items <- dplyr::tbl(chinook, "invoice_items")

Each of these tables is now a SQL database object in your R session which you can manipulate in the same way as a dataframe. Note that you can see an extract of the data in each table by simply typing the table name in your R console. Importantly, in defining these tables you are not physically downloading them; you are just creating a lazy reference for dbplyr to work with.

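For example, typing employees at the console prints the first few rows on demand, under a header along these lines (a sketch; the exact format depends on your dbplyr version):

# Source:   table<employees> [?? x 15]
# Database: sqlite 3.x [chinook.db]

The [?? x 15] is the giveaway that nothing has been downloaded: dbplyr knows how many columns the table has, but it won’t count the rows until it has to.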

Querying using dbplyr

Now let’s try a really simple query: counting how many employees we have, grouped by the year we hired them. If we were to write this as a straight SQL query, it would look like this:

qry <- "SELECT HireYear, COUNT(EmployeeId) FROM
(SELECT SUBSTR(HireDate, 1, 4) AS HireYear, EmployeeId
FROM employees)
GROUP BY HireYear"

Now if we run this

DBI::dbGetQuery(chinook, qry)

we get this returned:

   HireYear count(EmployeeId)
1     2002                 3
2     2003                 3
3     2004                 2

But we can also do this using nice tidyverse code as follows:

employees_by_hire_year <- employees %>% 
  dplyr::mutate(HireYear = substr(HireDate, 1, 4)) %>% 
  dplyr::group_by(HireYear) %>% 
  dplyr::summarise(Total = n())

Since this is a very small table, you can see it by simply typing its name in the console, and it does indeed return the same table as the one just above. If it were larger, you would have to collect it from the database (more on that later).
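Typing the name gives you something like this (the header will vary with your dbplyr version):

# Source:   SQL [?? x 2]
# Database: sqlite 3.x [chinook.db]
  HireYear Total
  <chr>    <int>
1 2002         3
2 2003         3
3 2004         2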

How does this magic happen? Well, because employees is recognized by R as a SQL database object, R invokes dbplyr in the background to translate the piped tidyverse code into SQL. You can see the translated SQL using the dbplyr::sql_render() function. So for example

dbplyr::sql_render(employees_by_hire_year)

returns this:

<SQL> SELECT `HireYear`, COUNT() AS `Total`
FROM (SELECT `EmployeeId`, `LastName`, `FirstName`, `Title`, `ReportsTo`, `BirthDate`, `HireDate`, `Address`, `City`, `State`, `Country`, `PostalCode`, `Phone`, `Fax`, `Email`, SUBSTR(`HireDate`, 1, 4) AS `HireYear`
FROM `employees`)
GROUP BY `HireYear`
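If you just want to inspect the SQL interactively, dplyr::show_query() prints the same translation to the console:

dplyr::show_query(employees_by_hire_year)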

Making complicated queries simple

The example above is pretty simple, and it’s not until your queries become quite complex that you really see the benefits of using dbplyr.

Let’s say that you want to work out how much customers spent per year on their orders. This means that you will have to extract the year from InvoiceDate, join invoices to invoice_items using InvoiceId, group by year, and then calculate the total of unit price times quantity for each year. Thanks to dbplyr, we can just use our beautiful tidyverse grammar to do this:

amount_per_year <- invoices %>% 
  dplyr::mutate(InvoiceYear = as.integer(substr(InvoiceDate, 1, 4))) %>% 
  dplyr::inner_join(invoice_items, by = "InvoiceId") %>% 
  dplyr::mutate(ItemTotal = UnitPrice * Quantity) %>% 
  dplyr::group_by(InvoiceYear) %>% 
  dplyr::summarise(Total = sum(ItemTotal))

This is so much nicer and easier to work with than its SQL equivalent[3]. Note that in its current form amount_per_year is still only a database object; the data has not yet been extracted from the database. To actually pull the results into your R session you can use dplyr's collect() function, as follows:

amount_per_year %>% dplyr::collect()

which will return:

# A tibble: 5 x 2
  InvoiceYear Total
        <int> <dbl>
1        2009  449.
2        2010  481.
3        2011  470.
4        2012  478.
5        2013  451.

If you were working on a remote database as opposed to a local SQLite database, the added advantage is that dbplyr ensures that all the manipulation is done on the database and you simply collect the results, rather than having to draw all that data down into your session.
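One related trick worth knowing: if you want to reuse an intermediate result across several queries without pulling it into R, dplyr's compute() function materializes it as a temporary table on the database side:

# Store the summarised result in a temporary table on the database,
# still without bringing any rows into the R session
amount_per_year_db <- amount_per_year %>% dplyr::compute()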

Practice exercises

See if you can get dbplyr working for you on the chinook dataset by trying these practice exercises:

  1. How many customers are there by Country?
  2. How many invoices were issued in the first quarter of 2009, grouped by the country of the customer?
  3. What were the names of the top ten tracks sold to customers in the USA? (You’ll also need the tracks table for this one; see the snippet below.)
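For the last exercise, you can set up the tracks table the same way we set up the others:

tracks <- dplyr::tbl(chinook, "tracks")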

Appendix

[1] The general code for connecting to a remote database is as follows:

conn <- DBI::dbConnect(
  drv = [database driver, e.g. odbc::odbc()],
  dsn = "database_name",
  uid = "User_ID",
  pwd = "Password"
)
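As a concrete illustration, a connection to a Microsoft SQL Server database through the odbc package might look like this (every value below is a placeholder for your own settings):

conn <- DBI::dbConnect(
  drv = odbc::odbc(),
  driver = "ODBC Driver 17 for SQL Server",
  server = "my_server.example.com",
  database = "my_database",
  uid = "my_user",
  pwd = "my_password"
)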

[2] To create a similar table inside a schema of a remote database, you’d use the in_schema() function in dbplyr:

my_db_tbl <- dplyr::tbl(
  conn, 
  dbplyr::in_schema("SCHEMA_NAME", "TABLE_NAME")
)

[3] Here’s the SQL translation of that last dbplyr query I wrote.

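It comes out roughly like this (a sketch of the dbplyr::sql_render() output; the exact subquery structure and identifier quoting vary between dbplyr versions):

<SQL> SELECT `InvoiceYear`, SUM(`ItemTotal`) AS `Total`
FROM (SELECT *, `UnitPrice` * `Quantity` AS `ItemTotal`
FROM (SELECT `LHS`.*, `InvoiceLineId`, `TrackId`, `UnitPrice`, `Quantity`
FROM (SELECT *, CAST(SUBSTR(`InvoiceDate`, 1, 4) AS INTEGER) AS `InvoiceYear`
FROM `invoices`) AS `LHS`
INNER JOIN `invoice_items`
ON (`LHS`.`InvoiceId` = `invoice_items`.`InvoiceId`)))
GROUP BY `InvoiceYear`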

I lead McKinsey's internal People Analytics and Measurement function. Originally I was a Pure Mathematician, then I became a Psychometrician. I am passionate about applying the rigor of both those disciplines to complex people questions. I'm also a coding geek and a massive fan of Japanese RPGs.

All opinions expressed are my own and not to be associated with my employer or any other organization I am connected with.

Roberta Chinn

Senior Information Systems Analyst at California Public Employees Retirement System

5y

Great analogy

Katty Polyak

Software Engineer at Pythia Labs

5y

Keith McNulty, thank you for sharing. I'm just getting started in learning about people analytics and I really appreciate articles like this. Why do you prefer R over Python for analyzing employee data?

Bryan Sam

Sr. Analytics Engineer @ Level Ex | MSBA Graduate of 2021 from The Paul Merage School of Business, UC Irvine

5y

Thank you for sharing. I’ll give it a try. Again, thank you.
