WRDS at Speed

Users of data from Wharton Research Data Services (WRDS) may find the newly updated version of my Python package wrds2pg useful. This package started life as code converted from Perl for getting WRDS SAS data into my PostgreSQL database (this was before the WRDS PostgreSQL database existed).

Recently, with the help of Evan Jo of Queen’s University, I have expanded the capability of the package to support parquet files and compressed CSV files. The parquet format is “an open standards-based format widely used by big data systems” and is emerging as a kind of “modern CSV” (see here for some discussion of the benefits of the format).

The original idea of the wrds2pg package was to make it easy to maintain a local PostgreSQL database, which is often more convenient than using the WRDS database. The addition of support for parquet files makes it possible to get many of the benefits of a local PostgreSQL database with less set-up. Additionally, queries against parquet files are very fast.

What are some benefits of having a package like wrds2pg? Here I discuss two:

Increased reproducibility. Looking through the code-and-data repository provided by the Journal of Accounting Research, one often sees code referring to data sets such as CRSPmonthly79to18.sas7bdat or Compustat.dta. While one can often guess how such data sets were created by the authors, the need to guess increases the cost of replication relative to code that starts with crsp.msf or comp.funda, canonical versions of which are supplied by WRDS. The functions supplied by wrds2pg produce data sets such as crsp.msf and comp.funda that are functionally equivalent to the versions supplied by WRDS.

This approach dramatically increases reproducibility of code. For example, a core promise of my (work-in-progress) book with Tony Ding, Empirical Research in Accounting: Tools and Methods, is that all tables and plots in the book can be reproduced by a reader with a WRDS account and a computer. This is true whether one uses the WRDS PostgreSQL database, or one’s own database populated using wrds2pg. And with very minor modifications to the code, the same is true using a parquet repository created using wrds2pg.
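To make the idea of canonical names concrete, here is a minimal sketch of how code might resolve a WRDS-style name such as crsp.msf to a file in a local parquet repository. The helper name parquet_path and the one-folder-per-schema layout (data_dir/crsp/msf.parquet) are assumptions made for illustration, not something this post specifies.

```python
from pathlib import Path


def parquet_path(qualified_name: str, data_dir: str) -> Path:
    """Map a canonical 'schema.table' name to a parquet file path.

    Hypothetical helper: the layout data_dir/schema/table.parquet is an
    assumption for illustration, not a documented guarantee.
    """
    parts = qualified_name.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'schema.table', got {qualified_name!r}")
    schema, table = parts
    return Path(data_dir) / schema / f"{table}.parquet"


# Replication code can then refer to crsp.msf or comp.funda by name,
# rather than to an ad hoc file such as CRSPmonthly79to18.sas7bdat.
print(parquet_path("crsp.msf", "/data"))
print(parquet_path("comp.funda", "/data"))
```

The point of such a helper is that analysis code only ever mentions the canonical name, so switching between a database and a parquet repository touches one function rather than every script.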

Blazing performance. A local PostgreSQL database will be very fast, but data analysis using parquet files can reach another level. For example, the data steps shown here take nearly a minute on the WRDS servers (using SAS and the 16.5 GB version of crsp.dsf). But on my three-year-old Mac mini using a 2.1 GB parquet version of crsp.dsf, they take under three seconds. (The parquet file contains the same data, but is smaller due to compression.)
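The claim that compression shrinks a file without changing its contents can be illustrated using only the Python standard library. The sketch below uses gzip on a CSV, not parquet's own columnar compression, so it shows the general effect rather than the mechanism behind the 2.1 GB figure. The rows are made-up toy data, and the column names are merely chosen to resemble CRSP fields.

```python
import csv
import gzip
import io


def to_csv_bytes(header, rows):
    """Serialise a header and rows to CSV, returned as UTF-8 bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def read_gzipped_csv(payload):
    """Decompress gzip bytes and parse them back into a list of rows."""
    text = gzip.decompress(payload).decode("utf-8")
    return list(csv.reader(io.StringIO(text)))


# Toy data only: these values are invented for the example.
header = ["permno", "date", "ret"]
rows = [
    [str(10000 + i % 50), f"2020-01-{1 + i % 28:02d}", "0.0100"]
    for i in range(5000)
]

raw = to_csv_bytes(header, rows)
packed = gzip.compress(raw)
restored = read_gzipped_csv(packed)

print(f"uncompressed: {len(raw)} bytes, compressed: {len(packed)} bytes")
print("same data after round trip:", restored[1:] == rows)
```

How much a real table shrinks depends on how repetitive its columns are, so the ratio printed here says nothing about the ratio one should expect for crsp.dsf.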

Guidance on setting up a PostgreSQL database of your own is provided here. We explain how to set up a parquet data repository here. Quarto templates to run code and do the exercises in our book using the WRDS PostgreSQL database (or your own) can be found here. Some templates using parquet data are also included there. (Parquet templates for all chapters should be available early in 2024.) As the wrds2pg package is hosted on PyPI (thanks to Jingyu Zhang), one can install it using standard Python tools (pip3 install wrds2pg).
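Once the package is installed, fetching a table into a parquet repository might look like the sketch below. It assumes that wrds2pg exposes a function named wrds_update_pq taking a table name and a schema, and that the WRDS ID and the repository location are read from environment variables named WRDS_ID and DATA_DIR. Both are assumptions to check against the package's own documentation. The helper names missing_settings and update_local_parquet are hypothetical.

```python
import os


def missing_settings(env=os.environ):
    """Return the names of assumed-required settings that are not set.

    WRDS_ID and DATA_DIR are assumptions about what wrds2pg reads.
    """
    return [name for name in ("WRDS_ID", "DATA_DIR") if not env.get(name)]


def update_local_parquet(table, schema):
    """Fetch one WRDS table into the local parquet repository.

    Requires a WRDS account and network access, so it is defined here
    but not called.
    """
    problems = missing_settings()
    if problems:
        raise RuntimeError("please set: " + ", ".join(problems))
    # Imported lazily so this file loads even without wrds2pg installed.
    # The function name and argument order are assumptions.
    from wrds2pg import wrds_update_pq

    wrds_update_pq(table, schema)


# Intended use, with credentials configured:
#     update_local_parquet("msf", "crsp")
#     update_local_parquet("funda", "comp")
print(missing_settings({"WRDS_ID": "your_wrds_id"}))
```

Checking the settings up front gives a readable error instead of a failure partway through a long download.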

Many code examples can be found in our book. We have a contract with CRC Press to publish the book in print form and it should be available mid-2024. The book will remain online even after publication. Accompanying the book are an R package farr, the templates referred to above, and (for instructors) solution manuals to the exercises in the book. Comments and suggestions are welcomed.

