I figure there is some merit to the claim that “SQL has become an even more indispensable tool for the savvy analyst or data scientist” made on the cover of SQL for Data Analysis by Cathy Tanimura.
However, I think one can make a solid case for learning to write SQL without writing SQL. Instead, use the R package dbplyr to generate SQL behind the scenes. One can do a lot with dbplyr: as I show here, one can reproduce every query in SQL for Data Analysis using dbplyr, and do so in a way that leaves the heavy lifting to the database.
But why use dbplyr rather than SQL? One reason might be that you already know dplyr, but not SQL. But even if you know SQL, there are benefits to using dbplyr:
- Makes it easy to build up complex queries from subqueries. Simply store subqueries in R objects (“lazy remote data frames”) and click them together like Lego blocks (see here for some examples).
- Offers ready access to functions like pivot_wider() or pivot_longer() that aren’t supported by all databases. SQL for Data Analysis often uses lengthy CASE statements to achieve the same end (examples illustrating the dplyr approach can be seen here).
- Gives access to plotting and statistical tools in one place. While it’s often easy enough to copy your SQL query from your preferred SQL editor to Python or Tableau to make histograms or box plots, with dbplyr it can be as easy as piping a query to ggplot(). All plots in SQL for Data Analysis are recreated in this way here.
- Allows easy creation of functions. Functions are first-class citizens in Python and R, and it can be surprisingly easy to turn a query into a more general-purpose function, in a way that can be difficult even in SQL back-ends that support functions (see here for an example).
- Makes reporting easy using Quarto. One can quickly go from a workbook to polished PDF output with a few edits to code. (Actually RStudio with Quarto can make for a pretty decent SQL notebook platform.) Pretty much this entire book is based on queries written using dbplyr and Quarto.
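The first point above can be sketched using dbplyr’s simulated connections, which generate SQL without touching a real database. The table and column names here (orders, customer_id, amount) are made up for illustration:

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

# A simulated lazy table standing in for a real database table;
# nothing is computed locally.
orders <- lazy_frame(customer_id = integer(), amount = double(),
                     con = simulate_postgres(), .name = "orders")

# Store a "subquery" in an ordinary R object ...
totals <-
  orders |>
  group_by(customer_id) |>
  summarise(total = sum(amount, na.rm = TRUE))

# ... then click it together with further verbs and inspect the SQL.
totals |>
  filter(total > 1000) |>
  show_query()
```

With a real connection (e.g., via DBI), the same pipeline would run entirely in the database until you call collect().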
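To illustrate the pivot_wider() point, here is a minimal sketch using memdb_frame(), which copies a toy data frame into an in-memory SQLite database (it assumes the RSQLite package is installed; the data are invented):

```r
library(dplyr, warn.conflicts = FALSE)

# memdb_frame() creates a lazy table backed by in-memory SQLite.
sales <- dbplyr::memdb_frame(
  region = c("east", "east", "west"),
  year   = c(2023L, 2024L, 2023L),
  amount = c(10, 20, 30)
)

# dbplyr translates pivot_wider() into the kind of
# MAX(CASE WHEN ...) ... GROUP BY query that SQL for Data Analysis
# writes by hand, so the reshaping still happens in the database.
sales |>
  tidyr::pivot_wider(names_from = year, values_from = amount) |>
  show_query()
```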
Co-founder at MetaLearner | Berkeley SkyDeck B19 (1 week ago):
Ian Gow, love this. Two years ago I wrote the following lectures:
- F1 Records with SQLite and dplyr: https://rpubs.com/rafael_nicolas/f1_queries
- F1 Ergast DB with SQLite and dplyr: https://rpubs.com/rafael_nicolas/f1_db
Providing premium translation services between business and IT for 30 years (3 months ago):
I think there's a case for moving this conversation up a level: a grammar of data manipulation (DQL: Data Query Logic, anyone?) that could then be implemented in any particular language or package. Thoughts?
PhD Candidate in Finance @ SFI Swiss Finance Institute | Data Analytics, Machine Learning (3 months ago):
Completely agree! dbplyr makes it incredibly easy and convenient to access and work with WRDS, for instance, directly in R.
Professor (3 months ago):
I forgot to mention Python. One can do much the same thing as I discuss above using packages like Ibis in place of dbplyr. That said, I think a pandas user would face a steeper learning curve moving to Ibis than a dplyr user would moving to dbplyr (often the latter shift is just a couple of lines of code). Also, R's "non-standard evaluation" can often make for slicker code. Still, Ibis offers goodies such as user-defined functions, and it's definitely worth a look if you're a Python user. (See https://www.dhirubhai.net/posts/iangow_empirical-research-in-accounting-with-python-activity-7143645442377990146-yr1F for an earlier post using Ibis.)