DuckDB/parquet versus WRDS PostgreSQL
From a coding perspective, Empirical Research in Accounting: Tools and Methods is kind of two books in one. While the print version focuses on using the WRDS PostgreSQL database, the online version offers an alternative approach using DuckDB and parquet files. Because the differences in the code are so minor, I felt it was not worthwhile cluttering the book with alternative versions of the code (for one thing, one can’t put these in alternative tabs on a printed page).
Not only are the code differences minor, but performance is often similar whether you use the PostgreSQL version or the DuckDB version. Unfortunately, not always. One place where there is a real difference is in the chapter on Ball and Brown (1968).
Using the WRDS PostgreSQL database server, the code to produce rets_all takes much longer than I would like. One can reduce the time taken by judicious insertion of collect() statements, but this entails downloading more data with each query and using more RAM. Even then, the query still takes around two minutes.
Using a local PostgreSQL server helps a bit. One reason the query does poorly against the WRDS PostgreSQL server is probably the lack of indexes and similar tuning that would make the query efficient. Using SQL's EXPLAIN command, I can confirm that the same SQL query requires about one-tenth the "cost" on my server that it needs on the WRDS server. But even here the query takes 96 seconds. And part of the reason for using the WRDS PostgreSQL server was to make things easier for readers … and setting up a PostgreSQL server is not for everyone (though it's not too difficult, as explained in Appendix D of our book, which also includes a Python script for getting all the files needed for the analyses in the book).
Another reason the query takes so long seems to be that the query optimizer built into PostgreSQL is not able to “see” the best approach for such a complex query. One can give PostgreSQL a nudge by creating temporary tables to simplify the final query. Putting compute() (the command to create a temporary table) at the end of the queries creating each of the components (annc_events, td_link, and ccm_link) reduces the time to create rets_all to 3.5 seconds.
Unfortunately, compute() is not always the best answer. First, we cannot run compute() on the WRDS PostgreSQL server. Second, materializing a component of a larger query can result in a lot of wasted effort and use additional resources such as RAM.
So how does the DuckDB version of the code do? Well, it takes less than 0.5 seconds, and one doesn’t need to worry about whether to compute() or not. In teaching from Empirical Research in Accounting: Tools and Methods in 2023, I used the PostgreSQL version. But in 2024, I switched over to the DuckDB/parquet approach. I simply shared a folder of parquet files (about 5 GB) with my students and had them set DATA_DIR accordingly. Appendix E of our book explains how to work with parquet files and includes a Python script for getting all the files needed for the analyses in the book.
An additional benefit of the DuckDB/parquet approach is that it is something that students can carry over to their own research using data from sources other than WRDS.