DuckDB/parquet versus WRDS PostgreSQL

DuckDB/parquet versus WRDS PostgreSQL

From a coding perspective, Empirical Research in Accounting: Tools and Methods is kind of two books in one. While the print version focuses on using the WRDS PostgreSQL database, the online version offers an alternative approach using DuckDB and parquet files. Because the differences in the code are so minor, I felt it was not worthwhile cluttering the book with alternative versions of the code (for one thing, one can’t put these in alternative tabs on a printed page).

Not only are the code differences minor, but how the code performs is often similar whether you use the PostgreSQL version or the DuckDB version. But, unfortunately, not always. One place where there is a difference is in the chapter on Ball and Brown (1968).

Using the WRDS PostgreSQL database server, the code to produce rets_all takes much longer than I would like. While one can reduce the time taken by judicious insertion of collect() statements, this entails downloading more data with each query and using more RAM. Also even then the time is still around 2 minutes.

Using a local PostgreSQL server helps a bit. One reason why the query does poorly against the WRDS PostgreSQL server is probably down to the lack of indexes and the like to make the query efficient. Using SQL’s EXPLAIN command I can confirm that the same SQL query requires about one-tenth the "cost" on my server than it needs on the WRDS server. But even here the query takes 96 seconds. And part of the reason for using the WRDS PostgreSQL server was to make it easier for readers … and setting up a PostgreSQL server is not for everyone (though it’s not too difficult, as explained in Appendix D of our book, which also includes a Python script for getting all files needed for the analyses in the book).

Another reason the query takes so long seems to be because the query optimizer built into PostgreSQL is not able to “see” the best approach for such a complex query. One can give PostgreSQL a nudge by creating temporary tables to simplify the final query. Putting compute()—the command to create a temporary table—at the end of the queries creating each of the components —annc_events, td_link, and ccm_link—reduces the time to create rets_all to 3.5 seconds.

Unfortunately, compute() is not always the best answer. First, we cannot run compute() on the WRDS PostgreSQL server. Second, materializing a component of a larger query can result in a lot of wasted effort and uses additional resources such as RAM.

So how does the DuckDB version of the code do? Well it takes less than 0.5 seconds and one doesn't need to worry about whether to compute() or not. In teaching from Empirical Research in Accounting: Tools and Methods in 2023, I used the PostgreSQL version. But in 2024, I switched over to using the DuckDB/parquet. I simply shared a folder of parquet files (about 5GB) with my students and had them set up DATA_DIR accordingly. Appendix E of our book explains how to work with parquet files and includes a Python script for getting all files needed for the analyses in the book.

An additional benefit of the DuckDB/parquet approach is that it is something that students can carry over to their own research using data from sources other than WRDS.

要查看或添加评论,请登录

Ian Gow的更多文章

  • db2pq: A Python library for making parquet data

    db2pq: A Python library for making parquet data

    If you use WRDS data, there are many benefits to storing the data as parquet files. The parquet format is described in…

    2 条评论
  • Write SQL without writing SQL

    Write SQL without writing SQL

    I figure there is some merit to the claim that “SQL has become an even more indispensable too for the savvy analyst or…

    5 条评论
  • Missing Form APs?

    Missing Form APs?

    In May of 2024, I posted on LinkedIn a brief note about working with data from Form APs filed with the PCAOB. In a…

    4 条评论
  • Working with dates and times

    Working with dates and times

    In a recent post on LinkedIn, I mentioned that one goal of Empirical Research in Accounting: Tools and Methods…

    1 条评论
  • Recommendation: Chanticleer podcast

    Recommendation: Chanticleer podcast

    For Australian students of business and finance, I strongly recommend the Chanticleer podcast as a great way to stay on…

    2 条评论
  • WRDS at Speed

    WRDS at Speed

    Users of data from Wharton Research Data Services (WRDS) may find the newly updated version of my Python package…

  • Gino's response to Data Colada

    Gino's response to Data Colada

    For some the big legal filing this week was the indictment of Donald Trump in Washington DC. But others in academia…

    5 条评论
  • Videos for our research course

    Videos for our research course

    Two weeks I posted a short piece about the PhD course book that Tony Ding and I are working on. One thing I feel may…

  • An accounting research course book

    An accounting research course book

    For those of you involved in academic accounting research, please check out the work-in-process draft of the course…

    8 条评论
  • The purpose of an accounting professor?

    The purpose of an accounting professor?

    Recently, Professor Alex Edmans of London Business School posted a paper entitled “The Purpose of a Finance Professor”…

    7 条评论

社区洞察

其他会员也浏览了