Apache Arrow: The future of in-memory columnar data, like dataframes

I mentioned Parquet briefly in my last article; if you have not read it, do take a look here.

Parquet provides efficient columnar storage and processing. I think it was born out of the need for an on-disk columnar format that improves compression and query performance.

Today I want to talk about Apache Arrow, which originated from the need for an in-memory columnar format for fast processing and cross-language data transfer.

What is Apache Arrow?

As the website describes it, Arrow is a software development platform for building high-performance applications that process and transfer data; in other words, it can be used for in-memory analytics.

Before I go further, it is important to understand what a dataframe is. I will assume not everyone reading this is familiar with the term; people sometimes confuse it with a data structure or a data API. To me, a dataframe is a tabular dataset used for data manipulation in a programming language, and the most commonly used implementation is the pandas DataFrame.


Apache Arrow can be a game changer for OLAP workloads, helping them behave like real-time systems that answer queries in milliseconds.




It promotes: DATA IS THE NEW API


Apache Arrow itself is neither a storage format nor a processing engine; it enables high-performance processing and transfer of columnar data. Some benchmarks posted online claim up to a 1000x improvement over reading from CSV. I have not personally reproduced that magnitude, but I have seen around a 30x improvement while experimenting with a small dataset.


BUT HOW?



Useful Links:

Apache Arrow Overview: https://arrow.apache.org/overview/

Apache Arrow Python Support: https://arrow.apache.org/docs/python/index.html#

Apache Arrow Pandas Integration: https://arrow.apache.org/docs/python/pandas.html




