Write SQL without writing SQL

Write SQL without writing SQL

I figure there is some merit to the claim that “SQL has become an even more indispensable too for the savvy analyst or data scientist” made on the cover of SQL for Data Analysis by Cathy Tanimura.

However, I think one can make a solid case for learning to write SQL without writing SQL. Instead use the R package dbplyr to generate SQL behind the scenes. One can do a lot with dbplyr: as I show here, one can produce every query in SQL for Data Analysis using dbplyr and do so in a way that the heavy lifting is still being done in the database.

But why use dbplyr rather than SQL? One reason might be that you already know dplyr, but not SQL. But even if you know SQL, there are benefits to using dbplyr:

  1. Makes it easy to build up complex queries from subqueries. Simply store subqueries in R objects (“lazy remote data frames”) and click them together like Lego blocks (see here for some examples).
  2. Offers ready access to functions like pivot_wider() or pivot_longer() that aren’t supported by all databases. SQL for Data Analysis often uses length CASE statements to achieve the same end (examples illustrating the dplyr approach can be see here).
  3. Gives access to plotting and statistical tools in one place. While it’s often easy enough to copy your SQL query from your preferred SQL editor to Python or Tableau to make histograms or box plots, with dbplyr it can be as easy as piping a query to ggplot(). All plots in SQL for Data Analysis are recreated in this way here.
  4. Allows creation of functions with ease. We know that functions are first-class citizens in Python or R. It can be surprisingly easy to turn a query into a more general-purpose function in a way that can be difficult even in SQL back-ends that support functions (see here for an example).
  5. Makes reporting easy using Quarto. One can quickly go from a workbook to polished PDF output with a few edits to code. (Actually RStudio with Quarto can make for a pretty decent SQL notebook platform.) Pretty much this entire book is based on queries written using dbplyr and Quarto.

Rafael Nicolas Fermin Cota

Co-founder at MetaLearner | Berkeley SkyDeck B19

1 周

Ian Gow love this ??. Two years ago I wrote the following lectures: F1 Records with SQLite and dplyr: https://rpubs.com/rafael_nicolas/f1_queries F1 Ergast DB with SQLite and dplyr: https://rpubs.com/rafael_nicolas/f1_db

回复
Derick Lyle

Providing premium translation services between business and IT for 30 years.

3 个月

I think there's a case for moving this conversation up a level. A grammar of data manipulations (DQL: Data Query Logic, anyone?) that can then be implemented in any particular language/package. Thoughts?

Evgeniy Mayorov

PhD Candidate in Finance @ SFI Swiss Finance Institute | Data Analytics, Machine Learning

3 个月

Completely agree!?dbplyr?makes it incredibly easy and convenient to access and work with?WRDS, for instance, directly in R

Ian Gow

Professor

3 个月

I forgot to mention Python. One can do much the same thing as I discuss above using packages like Ibis in place of dbplyr. That said, I think that a pandas user would have a steeper learning curve to move to Ibis than a dplyr user would have moving to dbplyr (often it's just a couple of lines of code for the latter shift). Also the "non-standard evaluation" in R can often make for slicker code. That said, Ibis seems to offer goodies such as user-defined functions and it's definitely worth taking a look at it if you're a Python user. (See https://www.dhirubhai.net/posts/iangow_empirical-research-in-accounting-with-python-activity-7143645442377990146-yr1F for an earlier post using Ibis.)

要查看或添加评论,请登录

Ian Gow的更多文章

  • db2pq: A Python library for making parquet data

    db2pq: A Python library for making parquet data

    If you use WRDS data, there are many benefits to storing the data as parquet files. The parquet format is described in…

    2 条评论
  • DuckDB/parquet versus WRDS PostgreSQL

    DuckDB/parquet versus WRDS PostgreSQL

    From a coding perspective, Empirical Research in Accounting: Tools and Methods is kind of two books in one. While the…

  • Missing Form APs?

    Missing Form APs?

    In May of 2024, I posted on LinkedIn a brief note about working with data from Form APs filed with the PCAOB. In a…

    4 条评论
  • Working with dates and times

    Working with dates and times

    In a recent post on LinkedIn, I mentioned that one goal of Empirical Research in Accounting: Tools and Methods…

    1 条评论
  • Recommendation: Chanticleer podcast

    Recommendation: Chanticleer podcast

    For Australian students of business and finance, I strongly recommend the Chanticleer podcast as a great way to stay on…

    2 条评论
  • WRDS at Speed

    WRDS at Speed

    Users of data from Wharton Research Data Services (WRDS) may find the newly updated version of my Python package…

  • Gino's response to Data Colada

    Gino's response to Data Colada

    For some the big legal filing this week was the indictment of Donald Trump in Washington DC. But others in academia…

    5 条评论
  • Videos for our research course

    Videos for our research course

    Two weeks I posted a short piece about the PhD course book that Tony Ding and I are working on. One thing I feel may…

  • An accounting research course book

    An accounting research course book

    For those of you involved in academic accounting research, please check out the work-in-process draft of the course…

    8 条评论
  • The purpose of an accounting professor?

    The purpose of an accounting professor?

    Recently, Professor Alex Edmans of London Business School posted a paper entitled “The Purpose of a Finance Professor”…

    7 条评论