DuckDB/parquet versus WRDS PostgreSQL
From a coding perspective, Empirical Research in Accounting: Tools and Methods is kind of two books in one. While the print version focuses on using the WRDS PostgreSQL database, the online version offers an alternative approach using DuckDB and parquet files. Because the differences in the code are so minor, I felt it was not worthwhile cluttering the book with alternative versions of the code (for one thing, one can’t put these in alternative tabs on a printed page).
Not only are the code differences minor, but performance is often similar whether you use the PostgreSQL version or the DuckDB version. Unfortunately, not always. One place where there is a real difference is in the chapter on Ball and Brown (1968).
Using the WRDS PostgreSQL database server, the code to produce rets_all takes much longer than I would like. One can reduce the time taken by judicious insertion of collect() statements, but this entails downloading more data with each query and using more RAM. Even then, the query still takes around two minutes.
Using a local PostgreSQL server helps a bit. One reason the query does poorly against the WRDS PostgreSQL server is probably the lack of indexes and similar tuning that would make the query efficient. Using SQL's EXPLAIN command, I can confirm that the same SQL query requires about one-tenth the "cost" on my server that it needs on the WRDS server. But even here the query takes 96 seconds. And part of the reason for using the WRDS PostgreSQL server was to make things easier for readers … and setting up a PostgreSQL server is not for everyone (though it's not too difficult, as explained in Appendix D of our book, which also includes a Python script for getting all the files needed for the analyses in the book).
Another reason the query takes so long seems to be that the query optimizer built into PostgreSQL is not able to “see” the best approach for such a complex query. One can give PostgreSQL a nudge by creating temporary tables to simplify the final query. Putting compute() (the command to create a temporary table) at the end of the queries creating each of the components (annc_events, td_link, and ccm_link) reduces the time to create rets_all to 3.5 seconds.
Unfortunately, compute() is not always the best answer. First, we cannot run compute() on the WRDS PostgreSQL server. Second, materializing a component of a larger query can result in a lot of wasted effort and use additional resources such as RAM.
So how does the DuckDB version of the code do? Well, it takes less than 0.5 seconds, and one doesn’t need to worry about whether to compute() or not. In teaching from Empirical Research in Accounting: Tools and Methods in 2023, I used the PostgreSQL version. But in 2024, I switched over to the DuckDB/parquet approach. I simply shared a folder of parquet files (about 5 GB) with my students and had them set DATA_DIR accordingly. Appendix E of our book explains how to work with parquet files and includes a Python script for getting all the files needed for the analyses in the book.
An additional benefit of the DuckDB/parquet approach is that it is something that students can carry over to their own research using data from sources other than WRDS.