Users of data from Wharton Research Data Services (WRDS) may find the newly updated version of my Python package wrds2pg useful. This package started life as code converted from Perl for getting WRDS SAS data into my PostgreSQL database (this was before the WRDS PostgreSQL database existed).
Recently, with the help of Evan Jo of Queen’s University, I have expanded the capability of the package to support parquet files and compressed CSV files. The parquet format is “an open standards-based format widely used by big data systems” and is emerging as a kind of “modern CSV” (see here for some discussion of the benefits of the format).
The original idea of the wrds2pg package was to make it easy to maintain a local PostgreSQL database, which is often more convenient than using the WRDS database. The addition of support for parquet files makes it possible to get many of the benefits of a local PostgreSQL database with less set-up. Additionally, queries against parquet files can be very fast.
What are some benefits of having a package like wrds2pg? Here I discuss two:
Increased reproducibility. Looking through the code-and-data repository provided by the Journal of Accounting Research, one often sees code referring to data sets such as CRSPmonthly79to18.sas7bdat or Compustat.dta. While one can often guess how such data sets were created by the authors, the need to guess increases the cost of replication relative to code that starts with crsp.msf or comp.funda, canonical versions of which are supplied by WRDS. The functions supplied by wrds2pg produce data sets such as crsp.msf and comp.funda that are functionally equivalent to the versions supplied by WRDS.
This approach dramatically increases the reproducibility of code. For example, a core promise of my (work-in-progress) book with Tony Ding, Empirical Research in Accounting: Tools and Methods, is that all tables and plots in the book can be reproduced by a reader with a WRDS account and a computer. This is true whether one uses the WRDS PostgreSQL database or one’s own database populated using wrds2pg. And with very minor modifications to the code, the same is true using a parquet repository created using wrds2pg.
Blazing performance. A local PostgreSQL database will be very fast, but data analysis using parquet files can reach another level. For example, the data steps shown here take nearly a minute on the WRDS servers (using SAS and the 16.5 GB version of crsp.dsf). But on my three-year-old Mac mini using a 2.1 GB parquet version of crsp.dsf, they take under three seconds. (The parquet file contains the same data, but is smaller due to compression.)
Guidance on setting up a PostgreSQL database of your own is provided here. We explain how to set up a parquet data repository here. Quarto templates to run code and do the exercises in our book using the WRDS PostgreSQL database (or your own) can be found here. Some templates using parquet data are also included there. (Parquet templates for all chapters should be available early in 2024.) As the wrds2pg package is hosted on PyPI (thanks to Jingyu Zhang), one can install it using standard Python tools (pip3 install wrds2pg).
Many code examples can be found in our book. We have a contract with CRC Press to publish the book in print form, and it should be available in mid-2024. The book will remain online even after publication. Accompanying the book are an R package, farr, the templates referred to above, and (for instructors) solution manuals for the exercises in the book. Comments and suggestions are welcome.