C++ DataFrame ala Pandas or R’s data.frame


I started implementing a C++ data frame about 3 years ago in my spare time (https://github.com/hosseinmoein/DataFrame). I was motivated by a few factors. First, I wanted to enrich the C++ ecosystem, which significantly lags behind those of languages such as Python, JS, and even Java. Second, I wanted to create an analytical tool that could scale well beyond the limits of Pandas and R and at the same time outperform them. Third, I wanted to generically solve a problem within the algorithmic-trading industry, which does its research in Python and implements its production systems in C++.


I started by setting some principles for the project as follows in no particular order:

  1. Support any type either built-in or user defined without needing new code
  2. Never chase pointers via linked lists, std::any, pointer to base, ..., including virtual function calls
  3. Never use more space than you need. Think unions, std::variant, ...
  4. Have all column data in contiguous memory space. Also, be mindful of cache-line aliasing misses between multiple columns
  5. Avoid copying data as much as possible. Unfortunately, sometimes you have to
  6. Use multi-threading but only when it makes sense
  7. Do not attempt to protect the user against “garbage in, garbage out”

The most challenging part was establishing a framework within which I could accomplish item 1. As you know, C++ is a statically and somewhat strongly typed language, which means all types must be known at compile time. So, to achieve item 1 under the constraints of items 2 and 3, I basically had to come up with a “truly” heterogeneous container. That container lies at the heart of DataFrame. It makes the interface a bit strange if you are not used to it, but it allowed me to stick to my principles.

I implemented heterogeneity by using static vectors stored in hash tables keyed by object pointers (i.e. the this pointer). So, to find a named vector, you have to do two hash lookups: one to find the hetero-object associated with the name, and one to find the particular typed vector you are looking for. After that, you are dealing with standard or standard-like vectors with contiguous memory space (adjusted to avoid cache-line aliasing misses). The double lookup is not too costly, since finding vectors is something you do far less often than analyzing the data in them.
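A minimal sketch of this idea, not the actual DataFrame code: each type T gets its own static hash table, and each hetero-object finds its own vector in that table by using its this pointer as the key. The names HeteroVector, MiniFrame, store_, cleaners_ and example_usage are hypothetical and exist only for illustration. The sketch also ignores the cache-line adjustments and the thread-safety concerns discussed elsewhere in this article.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical hetero-object: owns at most one std::vector<T> per type T.
// The vectors live in per-type static hash tables keyed by `this`.
class HeteroVector {
public:
    HeteroVector() = default;
    // Identity is the address, so copying is disabled in this sketch.
    HeteroVector(const HeteroVector &) = delete;
    HeteroVector &operator=(const HeteroVector &) = delete;

    // Second hash lookup: (type T, this) -> std::vector<T>.
    template<typename T>
    std::vector<T> &get() {
        auto [it, inserted] = store_<T>().try_emplace(this);
        if (inserted) {
            // Remember how to erase this entry when the object dies.
            cleaners_.push_back(
                [](const HeteroVector *self) { store_<T>().erase(self); });
        }
        return it->second;
    }

    ~HeteroVector() {
        for (auto fn : cleaners_) fn(this);
    }

private:
    // One static table per type T, created on first use.
    template<typename T>
    static std::unordered_map<const HeteroVector *, std::vector<T>> &store_() {
        static std::unordered_map<const HeteroVector *, std::vector<T>> table;
        return table;
    }

    std::vector<void (*)(const HeteroVector *)> cleaners_;
};

// Hypothetical frame: first hash lookup maps a column name to a hetero-object.
class MiniFrame {
public:
    template<typename T>
    std::vector<T> &column(const std::string &name) {
        return columns_[name].get<T>();
    }

private:
    std::unordered_map<std::string, HeteroVector> columns_;
};

// Usage: columns of different types, no base class, no std::any.
inline std::size_t example_usage() {
    MiniFrame df;
    df.column<double>("price") = {1.5, 2.5};
    df.column<std::string>("ticker") = {"IBM", "AAPL", "MSFT"};
    return df.column<double>("price").size() +
           df.column<std::string>("ticker").size();
}
```

Once column<T>() returns, the caller holds a plain std::vector<T>&, so the analysis loop itself involves no hashing and no indirection.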


Once I had built a framework to contain any type of data, I could start designing the interface. I separated the interface into two categories:

  1. Slicing and dicing, joining, grouping, ... the data
  2. Analytical algorithms: statistical, machine-learning, financial analysis, ...

I used regular parameterized methods to implement item 1. For item 2, I chose the visitor pattern. For example, to calculate a rolling standard deviation, you do something like:

SimpleRollAdopter<StdVisitor<double, size_t>, double, size_t> std_roller;

df.visit<double>("IBM returns", std_roller);

You can do far more complicated visits on single or multiple columns. I chose visitors for analytics because an analytical algorithm may carry multiple states and results that cannot simply be returned from a method. So, in this particular example, “std_roller” holds everything related to the result.
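To illustrate why a stateful visitor is convenient, here is a self-contained sketch that does not use the DataFrame API. RunningStd and visit are hypothetical names, and the visitor computes a plain (non-rolling) sample standard deviation with Welford's algorithm rather than the rolling version above. The point is only that the visitor object keeps the count, the mean, and the deviation together after the pass.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical visitor: it is fed one value at a time and keeps all
// intermediate state and results inside itself.
class RunningStd {
public:
    void operator()(double x) {
        // Welford's online update of the mean and the sum of squared deviations.
        ++n_;
        const double delta = x - mean_;
        mean_ += delta / static_cast<double>(n_);
        m2_ += delta * (x - mean_);
    }

    std::size_t count() const { return n_; }
    double mean() const { return mean_; }

    // Sample standard deviation (divides by n - 1); 0 for fewer than 2 values.
    double std_dev() const {
        return n_ > 1 ? std::sqrt(m2_ / static_cast<double>(n_ - 1)) : 0.0;
    }

private:
    std::size_t n_ = 0;
    double mean_ = 0.0;
    double m2_ = 0.0;
};

// Hypothetical stand-in for a visit method: passes every value of the
// column to the visitor and hands the same visitor back.
template<typename V>
V &visit(const std::vector<double> &column, V &visitor) {
    for (double x : column) visitor(x);
    return visitor;
}
```

After visit(returns, v), the caller can read v.count(), v.mean() and v.std_dev() from the same object, which a single return value could not express as naturally.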


I also implemented views. Views are a special kind of DataFrame (they appear exactly as a regular DataFrame) that refer to a slice or parts (contiguous or disjoint) of another DataFrame. So, if you change something in a view, the corresponding data item(s) in the original DataFrame will also change. Views are somewhat restricted in functionality. For example, you cannot create or delete columns.
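The write-through behavior of a view can be sketched for a single column as follows. This is an illustration under my own assumptions, not the library's implementation: ColumnView is a hypothetical class that stores references to selected elements of an existing vector.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical single-column view: refers to selected elements of another
// vector without copying them. The indices may be contiguous or disjoint.
template<typename T>
class ColumnView {
public:
    ColumnView(std::vector<T> &source, const std::vector<std::size_t> &indices) {
        refs_.reserve(indices.size());
        for (std::size_t i : indices) {
            refs_.push_back(std::ref(source.at(i)));  // at() checks bounds
        }
    }

    std::size_t size() const { return refs_.size(); }

    // Element access goes straight to the original storage.
    T &operator[](std::size_t i) { return refs_[i].get(); }
    const T &operator[](std::size_t i) const { return refs_[i].get(); }

private:
    std::vector<std::reference_wrapper<T>> refs_;
};
```

Writing through view[i] changes the source vector. In this sketch the view becomes invalid if the source vector reallocates or is destroyed, which is one reason a view cannot be allowed to add or remove data.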


Threads are dangerous and counterproductive if not used wisely. Not only can threads slow down your process, they can also make it incorrect. DataFrame is inherently thread-unsafe because it uses static vectors. So, I provide hooks to inject spin-locks that protect the static parts in case the user decides to use multi-threading. By default, there is no protection. I also provide asynchronous interfaces. For example, I have sort() and sort_async(). The latter returns a future. Also, in a few algorithms, I allow the user to choose whether she wants to use multiple threads.
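The lock-injection idea can be sketched like this. It is a simplified illustration and not DataFrame's actual interface: SpinLock, SharedStore, set_lock and guarded_increment are hypothetical names, and the shared static state is reduced to a single counter. Protection is off until the user installs a lock, matching the default described above.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Minimal spin-lock built on std::atomic_flag. It has lock()/unlock(),
// so it can be used with std::lock_guard.
class SpinLock {
public:
    void lock() noexcept {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // Busy-wait until the current holder calls unlock().
        }
    }
    bool try_lock() noexcept {
        return !flag_.test_and_set(std::memory_order_acquire);
    }
    void unlock() noexcept { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical class with shared static state and an optional lock hook.
class SharedStore {
public:
    // Install a lock (or pass nullptr to remove it). Not itself thread-safe:
    // call it before starting any threads.
    static void set_lock(SpinLock *lock) noexcept { lock_ = lock; }

    static void guarded_increment() {
        if (lock_ != nullptr) {
            std::lock_guard<SpinLock> guard(*lock_);
            ++counter_;
        } else {
            ++counter_;  // default: no protection, no overhead
        }
    }

    static int counter() noexcept { return counter_; }

private:
    static inline SpinLock *lock_ = nullptr;
    static inline int counter_ = 0;
};
```

Single-threaded users pay only for one null check, while multi-threaded users opt in by calling set_lock once before spawning threads.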


And finally, I come to the weakest part of my implementation, namely persisting the data. Although I provide functionality to read and write data in CSV and JSON formats (and it works fine), it is implemented in a somewhat haphazard way. I would love to enhance and expand this part in the future, and also to be able to exchange data with the Apache Arrow project.


In the past few weeks, I have significantly enhanced the documentation in terms of both content and format. I encourage you to visit the repository at https://github.com/hosseinmoein/DataFrame, and I would appreciate your feedback. I have some statistics about DataFrame performance that you may find interesting.

I also accept contributions from people with expertise, time, and interest.

fur kan

Student at İstanbul Teknik Üniversitesi

3 years ago

I am using the "dataframe" library that you created for my university assignment, but I can't do dataframe filtering. Can you help me?

Ivan Vaghi

Entrepreneur & Tech Strategist | Thinking solutions for data driven clients in Finance, Banking, Fintech and Energy

4 years ago

Looks very interesting. Where could I find some code samples of the library in use? Does the library also support pivots?
