db2pq: A Python library for making parquet data

db2pq: A Python library for making parquet data

If you use WRDS data, there are many benefits to storing the data as parquet files.

The parquet format is described in R for Data Science (Wickham et al., 2023, p. 393) as “an open standards-based format widely used by big data systems.” Parquet files provide a format optimized for data analysis, with a rich type system. More details on the parquet format can be found in Chapter 22 of R for Data Science.

About a year ago, I created a Python library (db2pq) that efficiently extracts data from WRDS PostgreSQL server and saves it as a local parquet file. The parquet files are generally much smaller than the original PostgreSQL tables or equivalent SAS data files. Significant data steps typically run about 10 times faster with parquet files on a laptop typically than they would with SAS on the WRDS server.

While Empirical Research in Accounting: Tools and Methods ostensibly focuses on direct access to the WRDS PostgreSQL database, all code in the book is also offered in a form that works with parquet files (see here, for example). If a reader has about 8 GB of space available, the parquet approach is generally faster and easier.

In creating db2pq, I needed to address a number of challenges. First, I wanted to limit the RAM consumed in downloaded large data tables. Here pyarrow batches provided a solution. Second, I wanted to store metadata (“last updated” from WRDS) in the parquet files that would allow me to skip costly downloads if I already have the latest data. Finally, I needed a source for such “last updated” metadata. For this I needed to look at the WRDS SAS files, as WRDS does not store such data with its PostgreSQL tables. (Note that core functions of db2pq actually work with any PostgreSQL database, not just the WRDS database. Because the code uses the Ibis framework, with minor modifications it could probably work with any supported backend.)

The db2pq packages manages to solve these issues and is easy to install and use. If you have Python installed on your computer (see here) and have set yourself up to connect to the WRDS database (see here), then you can install db2pq by simply typing pip install db2pq at the terminal. I have a script here that I have saved as update_tables_pq_alt.py on my computer.

I can tell my computer my WRDS ID and then run the script:

% export WRDS_ID=iangow
% ./update_tables_pq_alt.py
Updated crsp.ccmxpf_lnkhist is available.
Getting from WRDS.
Beginning file download at 2025-03-10 15:25:27 UTC.
Completed file download at 2025-03-10 15:25:32 UTC.
Updated crsp.dsf is available.
Getting from WRDS.
Beginning file download at 2025-03-10 15:25:37 UTC.
Completed file download at 2025-03-10 15:34:36 UTC.        

The script continues to download all the WRDS data needed to run all the code in Empirical Research in Accounting: Tools and Methods.

There is also a function update_schema() that can be used to update every table found in a directory if more recent data are available on WRDS:

% python3
Python 3.12.7 (v3.12.7:0b05ead877f, Sep 30 2024, 23:18:00) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from db2pq import update_schema
>>> update_schema(“crsp")
crsp.ccmxpf_lnkhist already up to date.
crsp.msi already up to date.
crsp.dsf already up to date.
Updated crsp.msedelist is available.
Getting from WRDS.
Beginning file download at 2025-03-11 14:01:43 UTC.
Completed file download at 2025-03-11 14:01:50 UTC.
Updated crsp.msf_v2 is available.
Getting from WRDS.
Beginning file download at 2025-03-11 14:02:00 UTC.
100% ▕████████████████████████████████████
Completed file download at 2025-03-11 14:02:55 UTC.        
Jukka Sihvonen

Assistant Professor at Aalto University | Accounting Digitalization & Analytics

1 周

Ian, I'm having a hard time handling SEC log files, which I now have as one huge DuckDB file. Querying, for example, Tesla's 10-K downloads over time takes ages. Any suggestions?

回复

要查看或添加评论,请登录

Ian Gow的更多文章

  • DuckDB/parquet versus WRDS PostgreSQL

    DuckDB/parquet versus WRDS PostgreSQL

    From a coding perspective, Empirical Research in Accounting: Tools and Methods is kind of two books in one. While the…

  • Write SQL without writing SQL

    Write SQL without writing SQL

    I figure there is some merit to the claim that “SQL has become an even more indispensable too for the savvy analyst or…

    5 条评论
  • Missing Form APs?

    Missing Form APs?

    In May of 2024, I posted on LinkedIn a brief note about working with data from Form APs filed with the PCAOB. In a…

    4 条评论
  • Working with dates and times

    Working with dates and times

    In a recent post on LinkedIn, I mentioned that one goal of Empirical Research in Accounting: Tools and Methods…

    1 条评论
  • Recommendation: Chanticleer podcast

    Recommendation: Chanticleer podcast

    For Australian students of business and finance, I strongly recommend the Chanticleer podcast as a great way to stay on…

    2 条评论
  • WRDS at Speed

    WRDS at Speed

    Users of data from Wharton Research Data Services (WRDS) may find the newly updated version of my Python package…

  • Gino's response to Data Colada

    Gino's response to Data Colada

    For some the big legal filing this week was the indictment of Donald Trump in Washington DC. But others in academia…

    5 条评论
  • Videos for our research course

    Videos for our research course

    Two weeks I posted a short piece about the PhD course book that Tony Ding and I are working on. One thing I feel may…

  • An accounting research course book

    An accounting research course book

    For those of you involved in academic accounting research, please check out the work-in-process draft of the course…

    8 条评论
  • The purpose of an accounting professor?

    The purpose of an accounting professor?

    Recently, Professor Alex Edmans of London Business School posted a paper entitled “The Purpose of a Finance Professor”…

    7 条评论

社区洞察