I figure there is some merit to the claim that “SQL has become an even more indispensable tool for the savvy analyst or data scientist” made on the cover of SQL for Data Analysis by Cathy Tanimura.
However, I think one can make a solid case for learning to write SQL without writing SQL. Instead, use the R package dbplyr to generate SQL behind the scenes. One can do a lot with dbplyr: as I show here, one can reproduce every query in SQL for Data Analysis using dbplyr, and do so in a way that leaves the heavy lifting to the database.
But why use dbplyr rather than SQL? One reason might be that you already know dplyr, but not SQL. But even if you know SQL, there are benefits to using dbplyr:
- Makes it easy to build up complex queries from subqueries. Simply store subqueries in R objects (“lazy remote data frames”) and click them together like Lego blocks (see here for some examples).
- Offers ready access to functions like pivot_wider() or pivot_longer() that aren’t supported by all databases. SQL for Data Analysis often uses lengthy CASE statements to achieve the same end (examples illustrating the dplyr approach can be seen here).
- Gives access to plotting and statistical tools in one place. While it’s often easy enough to copy your SQL query from your preferred SQL editor to Python or Tableau to make histograms or box plots, with dbplyr it can be as easy as piping a query to ggplot(). All plots in SQL for Data Analysis are recreated in this way here.
- Allows easy creation of functions. Functions are first-class citizens in Python and R, and it can be surprisingly easy to turn a query into a more general-purpose function, in a way that can be difficult even in SQL back-ends that support functions (see here for an example).
- Makes reporting easy using Quarto. One can quickly go from a workbook to polished PDF output with a few edits to code. (Actually RStudio with Quarto can make for a pretty decent SQL notebook platform.) Pretty much this entire book is based on queries written using dbplyr and Quarto.
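The first point above can be sketched using dbplyr’s simulated connections, which generate SQL without touching a real database. The table and column names here (orders, customer_id, amount) are made up for illustration:

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# A simulated lazy table standing in for a real database table;
# nothing is computed locally.
orders <- lazy_frame(customer_id = integer(), amount = double(),
                     con = simulate_postgres(), .name = "orders")

# Store a "subquery" in an ordinary R object ...
totals <-
  orders |>
  group_by(customer_id) |>
  summarise(total = sum(amount, na.rm = TRUE))

# ... then click it together with further verbs and inspect the SQL.
totals |>
  filter(total > 1000) |>
  show_query()
```

With a real connection (e.g., via DBI), the same pipeline would run entirely in the database until you call collect().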
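To illustrate the pivot_wider() point, here is a minimal sketch using memdb_frame(), which copies a toy data frame into an in-memory SQLite database (it assumes the RSQLite package is installed; the data are invented):

```r
library(dplyr, warn.conflicts = FALSE)

# memdb_frame() creates a lazy table backed by in-memory SQLite.
sales <- dbplyr::memdb_frame(
  region = c("east", "east", "west"),
  year   = c(2023L, 2024L, 2023L),
  amount = c(10, 20, 30)
)

# dbplyr translates pivot_wider() into the kind of
# MAX(CASE WHEN ...) ... GROUP BY query that SQL for Data Analysis
# writes by hand, so the reshaping still happens in the database.
sales |>
  tidyr::pivot_wider(names_from = year, values_from = amount) |>
  show_query()
```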
Co-founder at MetaLearner | Berkeley SkyDeck B19 (1 week ago):
Ian Gow, love this. Two years ago I wrote the following lectures:
- F1 Records with SQLite and dplyr: https://rpubs.com/rafael_nicolas/f1_queries
- F1 Ergast DB with SQLite and dplyr: https://rpubs.com/rafael_nicolas/f1_db
Providing premium translation services between business and IT for 30 years (3 months ago):
I think there's a case for moving this conversation up a level: a grammar of data manipulation (DQL: Data Query Logic, anyone?) that could then be implemented in any particular language or package. Thoughts?
PhD Candidate in Finance @ SFI Swiss Finance Institute | Data Analytics, Machine Learning (3 months ago):
Completely agree! dbplyr makes it incredibly easy and convenient to access and work with WRDS, for instance, directly in R.
Professor (3 months ago):
I forgot to mention Python. One can do much the same thing as I discuss above using packages like Ibis in place of dbplyr. That said, I think a pandas user would face a steeper learning curve moving to Ibis than a dplyr user would moving to dbplyr (often the latter shift is just a couple of lines of code). Also, R's "non-standard evaluation" can often make for slicker code. Still, Ibis offers goodies such as user-defined functions, and it's definitely worth a look if you're a Python user. (See https://www.dhirubhai.net/posts/iangow_empirical-research-in-accounting-with-python-activity-7143645442377990146-yr1F for an earlier post using Ibis.)