Users of data from Wharton Research Data Services (WRDS) may find the newly updated version of my Python package wrds2pg useful. This package started life as code converted from Perl for getting WRDS SAS data into my PostgreSQL database (this was before the WRDS PostgreSQL database existed).
Recently, with the help of Evan Jo of Queen’s University, I have expanded the capability of the package to support parquet files and compressed CSV files. The parquet format is “an open standards-based format widely used by big data systems” and is emerging as a kind of “modern CSV” (see here for some discussion of the benefits of the format).
The original idea of the wrds2pg package was to make it easy to maintain a local PostgreSQL database, which is often more convenient than using the WRDS database. The addition of support for parquet files makes it possible to get many of the benefits of a local PostgreSQL database with less set-up. Additionally, queries against parquet files can be very fast.
What are some benefits of having a package like wrds2pg? Here I discuss two:
Increased reproducibility. Looking through the code-and-data repository provided by the Journal of Accounting Research, one often sees code referring to data sets such as CRSPmonthly79to18.sas7bdat or Compustat.dta. While one can often guess how such data sets were created by the authors, the need to guess increases the cost of replication relative to code that starts with crsp.msf or comp.funda, canonical versions of which are supplied by WRDS. The functions supplied by wrds2pg produce data sets such as crsp.msf and comp.funda that are functionally equivalent to the versions supplied by WRDS.
This approach dramatically increases the reproducibility of code. For example, a core promise of my (work-in-progress) book with Tony Ding, Empirical Research in Accounting: Tools and Methods, is that all tables and plots in the book can be reproduced by a reader with a WRDS account and a computer. This is true whether one uses the WRDS PostgreSQL database or one’s own database populated using wrds2pg. And with very minor modifications to the code, the same is true using a parquet repository created using wrds2pg.
Blazing performance. A local PostgreSQL database will be very fast, but data analysis using parquet files can reach another level. For example, the data steps shown here take nearly a minute on the WRDS servers (using SAS and the 16.5 GB version of crsp.dsf). But on my three-year-old Mac mini using a 2.1 GB parquet version of crsp.dsf, they take under three seconds. (The parquet file contains the same data, but is smaller due to compression.)
Guidance on setting up a PostgreSQL database of your own is provided here. We explain how to set up a parquet data repository here. Quarto templates to run code and do the exercises in our book using the WRDS PostgreSQL database (or your own) can be found here. Some templates using parquet data are also included there. (Parquet templates for all chapters should be available early in 2024.) As the wrds2pg package is hosted on PyPI (thanks to Jingyu Zhang), one can install it using standard Python tools (pip3 install wrds2pg).
Many code examples can be found in our book. We have a contract with CRC Press to publish the book in print form, and it should be available in mid-2024. The book will remain online even after publication. Accompanying the book are an R package, farr, the templates referred to above, and (for instructors) solution manuals for the exercises in the book. Comments and suggestions are welcome.