db2pq: A Python library for making parquet data
If you use WRDS data, there are many benefits to storing the data as parquet files.
The parquet format is described in R for Data Science (Wickham et al., 2023, p. 393) as “an open standards-based format widely used by big data systems.” Parquet files provide a format optimized for data analysis, with a rich type system. More details on the parquet format can be found in Chapter 22 of R for Data Science.
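As a concrete example of what analysis against parquet files can look like, the following sketch queries a local parquet file in place with DuckDB (the file path here is hypothetical):

import duckdb

# DuckDB can query a parquet file directly, with no import step.
con = duckdb.connect()
df = con.execute("""
    SELECT permno, date, ret
    FROM 'data/crsp/dsf.parquet'
    WHERE permno = 14593
""").df()
print(df.head())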
About a year ago, I created a Python library (db2pq) that efficiently extracts data from the WRDS PostgreSQL server and saves it as a local parquet file. The parquet files are generally much smaller than the original PostgreSQL tables or equivalent SAS data files. Significant data steps typically run about 10 times faster with parquet files on a laptop than they would with SAS on the WRDS server.
While Empirical Research in Accounting: Tools and Methods ostensibly focuses on direct access to the WRDS PostgreSQL database, all code in the book is also offered in a form that works with parquet files (see here, for example). If a reader has about 8 GB of space available, the parquet approach is generally faster and easier.
In creating db2pq, I needed to address a number of challenges. First, I wanted to limit the RAM consumed when downloading large data tables. Here pyarrow batches provided a solution. Second, I wanted to store metadata (“last updated” from WRDS) in the parquet files so that I could skip costly downloads when I already have the latest data. Finally, I needed a source for such “last updated” metadata. For this I looked to the WRDS SAS files, as WRDS does not store such data with its PostgreSQL tables. (Note that the core functions of db2pq work with any PostgreSQL database, not just the WRDS database. Because the code uses the Ibis framework, with minor modifications it could probably work with any supported backend.)
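To illustrate the batching idea, here is a minimal sketch of streaming a WRDS table to a parquet file with Ibis and pyarrow. This is not db2pq's actual code, and the metadata key is an assumption:

import ibis
import pyarrow.parquet as pq

# Connect to the WRDS PostgreSQL server.
con = ibis.postgres.connect(
    host="wrds-pgdata.wharton.upenn.edu",
    port=9737,
    database="wrds",
    user="iangow",  # your WRDS ID
)

# to_pyarrow_batches() yields record batches, so the full table
# never needs to fit in RAM at once.
batches = con.table("dsf", database="crsp").to_pyarrow_batches()

# Attach "last updated" metadata (the key name here is an assumption)
# so a later run can skip the download if nothing has changed.
schema = batches.schema.with_metadata({b"last_modified": b"2025-03-10"})

with pq.ParquetWriter("crsp/dsf.parquet", schema) as writer:
    for batch in batches:
        writer.write_batch(batch)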
The db2pq package manages to solve these issues and is easy to install and use. If you have Python installed on your computer (see here) and have set yourself up to connect to the WRDS database (see here), then you can install db2pq simply by typing pip install db2pq at the terminal. I have a script here that I have saved as update_tables_pq_alt.py on my computer.
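For reference, a stripped-down version of such a script might look like the following (the function name wrds_update_pq and its argument order reflect my reading of the db2pq documentation, so treat this as a sketch and check the linked script for the real thing):

#!/usr/bin/env python3
from db2pq import wrds_update_pq

# Download each table from WRDS as a local parquet file, skipping any
# table whose stored "last updated" metadata is still current.
wrds_update_pq("ccmxpf_lnkhist", "crsp")
wrds_update_pq("dsf", "crsp")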
I can tell my computer my WRDS ID and then run the script:
% export WRDS_ID=iangow
% ./update_tables_pq_alt.py
Updated crsp.ccmxpf_lnkhist is available.
Getting from WRDS.
Beginning file download at 2025-03-10 15:25:27 UTC.
Completed file download at 2025-03-10 15:25:32 UTC.
Updated crsp.dsf is available.
Getting from WRDS.
Beginning file download at 2025-03-10 15:25:37 UTC.
Completed file download at 2025-03-10 15:34:36 UTC.
The script continues to download all the WRDS data needed to run all the code in Empirical Research in Accounting: Tools and Methods.
There is also a function update_schema() that can be used to update every table found in a directory if more recent data are available on WRDS:
% python3
Python 3.12.7 (v3.12.7:0b05ead877f, Sep 30 2024, 23:18:00) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from db2pq import update_schema
>>> update_schema(“crsp")
crsp.ccmxpf_lnkhist already up to date.
crsp.msi already up to date.
crsp.dsf already up to date.
Updated crsp.msedelist is available.
Getting from WRDS.
Beginning file download at 2025-03-11 14:01:43 UTC.
Completed file download at 2025-03-11 14:01:50 UTC.
Updated crsp.msf_v2 is available.
Getting from WRDS.
Beginning file download at 2025-03-11 14:02:00 UTC.
100% ▕████████████████████████████████████
Completed file download at 2025-03-11 14:02:55 UTC.
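Because the “last updated” information lives in the parquet files themselves, you can inspect it without touching WRDS at all. A minimal sketch with pyarrow (the exact metadata key db2pq uses may differ):

import pyarrow.parquet as pq

# Read only the schema, not the data, and print any stored metadata.
meta = pq.read_schema("crsp/dsf.parquet").metadata
print(meta)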