C++ DataFrame ala Pandas or R’s data.frame

Hossein Moein

Author of C++ DataFrame | Data Infrastructure | Software Engineering

Published: May 6, 2020

I started implementing a C++ data frame about 3 years ago in my spare time (https://github.com/hosseinmoein/DataFrame). I was motivated by a few factors. First, I wanted to enrich the C++ ecosystem, which significantly lags behind languages such as Python, JS, and even Java. Second, I wanted to create an analytical tool that could scale well beyond the limits of Pandas and R and at the same time outperform them. Third, I wanted to generically solve a problem within the algorithmic-trading industry, which does its research in Python and implements its production systems in C++.

I started by setting some principles for the project, in no particular order:

1. Support any type, either built-in or user-defined, without needing new code
2. Never chase pointers via linked lists, std::any, pointer to base, ..., including virtual function calls
3. Never use more space than you need. Think unions, std::variant, ...
4. Have all column data in contiguous memory space. Also, be mindful of cache-line aliasing misses between multiple columns
5. Avoid copying data as much as possible. Unfortunately, sometimes you have to
6. Use multi-threading, but only when it makes sense
7. Do not attempt to protect the user against “garbage in, garbage out”

The most challenging part was to establish a framework within which I could accomplish item 1. As you know, C++ is a statically and somewhat strongly typed language. That means all types must be known at compile time. So, to achieve item 1 under the constraints of items 2 and 3, I basically had to come up with a “truly” heterogeneous container. That lies at the heart of DataFrame. It makes the interface a bit strange if you are not used to it, but it allowed me to stay within my principles.

I implemented heterogeneity by using static vectors stored in hash tables keyed by object pointers (i.e. the this pointer). So, to find a named vector, you have to do two hash lookups: one to find the hetero-object associated with the name, and one to find the particular typed vector you are looking for. After that you are dealing with standard or standard-like vectors with contiguous memory space (adjusted to avoid cache-line aliasing misses). This is not too bad, since finding vectors is something you do far less frequently than analyzing data.
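To make the two-lookup idea concrete, here is a minimal sketch of the technique. It is not DataFrame's actual code: HeteroColumn, TinyFrame, table_for, and column are hypothetical names chosen for illustration, and the cache-line adjustments mentioned above are left out.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical hetero-object: it holds no typed data itself. Each element
// type T gets one static hash table, keyed by the address of the object.
class HeteroColumn {
public:
    HeteroColumn() = default;
    HeteroColumn(const HeteroColumn&) = delete;             // copying would alias keys
    HeteroColumn& operator=(const HeteroColumn&) = delete;

    ~HeteroColumn() {
        // Remove this object's entries from every table it ever touched.
        for (auto& erase : erasers_) erase();
    }

    // Lookup by type: returns the std::vector<T> owned by this object,
    // creating an empty one on first use.
    template <typename T>
    std::vector<T>& get() {
        auto& table = table_for<T>();
        auto it = table.find(this);
        if (it == table.end()) {
            it = table.emplace(this, std::vector<T>{}).first;
            erasers_.push_back([this] { table_for<T>().erase(this); });
        }
        return it->second;
    }

private:
    template <typename T>
    static std::unordered_map<const HeteroColumn*, std::vector<T>>& table_for() {
        static std::unordered_map<const HeteroColumn*, std::vector<T>> table;
        return table;
    }

    std::vector<std::function<void()>> erasers_;  // used only at destruction
};

// Hypothetical frame: the first lookup is by column name, the second by type.
class TinyFrame {
public:
    template <typename T>
    std::vector<T>& column(const std::string& name) {
        return columns_[name].get<T>();
    }

private:
    std::unordered_map<std::string, HeteroColumn> columns_;
};
```

Once column<T>() returns, the caller works with an ordinary contiguous std::vector<T>, so the cost of the two hash lookups is paid once per column access rather than once per element.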

Now that I have built a framework to contain any type of data, I could start with designing the interface. I separated the interface into two categories:

1. Slicing and dicing, joining, grouping, ... the data
2. Analytical algorithms: statistical, machine-learning, financial analysis, ...

I used regular parameterized methods to implement item 1. For item 2, I chose the visitor pattern. For example, to calculate a rolling standard deviation, you do something like:

SimpleRollAdopter<StdVisitor<double, size_t>, double, size_t> std_roller;

df.visit<double>("IBM returns", std_roller);

You could do far more complicated visits on single or multiple columns. I chose a visitor to implement analytics because an analytical algorithm may have multiple states and results that cannot simply be returned from a method. So, in this particular example, “std_roller” holds everything related to the result.
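The following is a rough, self-contained sketch of why a visitor object is convenient here: it carries its own state and exposes its results after the visit. SketchStdVisitor and SketchRoller are hypothetical stand-ins written for this illustration; they do not reproduce the signatures of the library's StdVisitor or SimpleRollAdopter.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical visitor: accumulates a sample standard deviation using
// Welford's online update, keeping its state inside the object.
struct SketchStdVisitor {
    void operator()(double x) {
        ++count;
        const double delta = x - mean;
        mean += delta / static_cast<double>(count);
        m2 += delta * (x - mean);
    }

    double result() const {
        return count > 1
                   ? std::sqrt(m2 / static_cast<double>(count - 1))
                   : std::numeric_limits<double>::quiet_NaN();
    }

    std::size_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;
};

// Hypothetical rolling adapter: runs a fresh inner visitor over every
// window of the column and stores one result per row.
template <typename Visitor>
class SketchRoller {
public:
    explicit SketchRoller(std::size_t window) : window_(window) {}

    void visit(const std::vector<double>& column) {
        results_.assign(column.size(),
                        std::numeric_limits<double>::quiet_NaN());
        if (window_ == 0) return;
        for (std::size_t end = window_; end <= column.size(); ++end) {
            Visitor inner;
            for (std::size_t i = end - window_; i < end; ++i) inner(column[i]);
            results_[end - 1] = inner.result();  // aligned to the window's last row
        }
    }

    const std::vector<double>& results() const { return results_; }

private:
    std::size_t window_;
    std::vector<double> results_;
};
```

After a call such as roller.visit(column), the roller holds the whole result series. Rows that do not yet have a full window stay NaN in this sketch, which is a choice made for the example rather than a statement about the library's behavior.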

I also implemented views. Views are a special kind of DataFrame (they appear exactly as a regular DataFrame) that refer to a slice or parts (contiguous or disjoint) of another DataFrame. So if you change something in a view, the corresponding data item(s) in the original DataFrame will also change. Views are somewhat restricted in functionality. For example, you cannot create or delete columns.
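As a simplified illustration of view semantics on a single column, the sketch below stores pointers to selected elements of an existing vector. SketchColumnView is a hypothetical name, and this is only one possible way to build a disjoint view, not necessarily how the library implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column view: refers to selected elements of another vector
// instead of copying them. The view becomes invalid if the original vector
// reallocates or is destroyed.
template <typename T>
class SketchColumnView {
public:
    // `positions` may be contiguous or disjoint; at() throws on a bad index.
    SketchColumnView(std::vector<T>& column,
                     const std::vector<std::size_t>& positions) {
        refs_.reserve(positions.size());
        for (std::size_t pos : positions) refs_.push_back(&column.at(pos));
    }

    T& operator[](std::size_t i) { return *refs_[i]; }
    const T& operator[](std::size_t i) const { return *refs_[i]; }
    std::size_t size() const { return refs_.size(); }

private:
    std::vector<T*> refs_;  // fixed at construction: no adding or removing
};
```

Writing through the view, for example view[1] = 99.0, changes the corresponding element of the original vector, which mirrors the behavior described above.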

Threads are dangerous and counterproductive if not used wisely. Not only can threads slow down your process, they can also make it incorrect. DataFrame is inherently thread-unsafe because it uses static vectors. So, I provide hooks to inject spin-locks to protect the static parts in case the user decides to use multi-threading. By default there is no protection. I also provide asynchronous interfaces. For example, I have sort() and sort_async(). The latter returns a future. Also, in a few algorithms I allow the user to choose whether she wants to use multiple threads.
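A minimal sketch of the opt-in locking idea follows. SketchSpinLock and with_optional_lock are hypothetical names for this illustration; the library's real hook for injecting a lock may look quite different.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical spin-lock built on std::atomic_flag. It provides
// lock/try_lock/unlock, so it works with std::lock_guard.
class SketchSpinLock {
public:
    void lock() noexcept {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // Busy-wait until the current holder calls unlock().
        }
    }

    bool try_lock() noexcept {
        return !flag_.test_and_set(std::memory_order_acquire);
    }

    void unlock() noexcept { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical hook: a null pointer means "no protection" (the default),
// so single-threaded users pay nothing for locking.
template <typename Body>
void with_optional_lock(SketchSpinLock* lock, Body&& body) {
    if (lock != nullptr) {
        std::lock_guard<SketchSpinLock> guard(*lock);
        body();
    } else {
        body();
    }
}
```

A multi-threaded caller would pass the same lock object around every access to the shared static tables, while a single-threaded caller passes nullptr and skips the synchronization entirely.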

And finally I am reaching the weakest part of my implementation, namely persisting the data. Although I provide functionality to read and write data in CSV and JSON formats (and it works fine), it is implemented in a rather haphazard way. I would love to enhance and expand this part in the future, and also to be able to exchange data with the Apache Arrow project.

In the past few weeks, I have significantly enhanced the documentation, both in content and in format. I encourage you to visit the repository at https://github.com/hosseinmoein/DataFrame, and I would appreciate your feedback. I also have some statistics about DataFrame performance that you may find interesting.

I also accept contributions from people with expertise, time, and interest.

fur kan

Student at İstanbul Teknik Üniversitesi

3 years ago

I am using the "dataframe" library that you created for my university assignment, but I can't get DataFrame filtering to work. Can you help me?

Ivan Vaghi

Entrepreneur & Tech Strategist | Thinking solutions for data driven clients in Finance, Banking, Fintech and Energy

4 years ago

Looks very interesting. Where could I find some code samples of the library in use? Does the library also support pivots?
