Apache Arrow: The future of in-memory columnar data (think dataframes)
Paras Nigam
VP, Engineering | AI & Cybersecurity | Generative AI Expert | GCC & GVC | IIM Calcutta | MIT USA | 3AI Thought Leader | Entrepreneur |
I mentioned Parquet briefly in my last article; if you have not read it, do take a look here.
Parquet provides efficient columnar storage and processing. It was born out of the need for an on-disk columnar format that improves compression and scan performance.
Today I want to talk about Apache Arrow, which originated from the need for an in-memory columnar format for fast processing and cross-language data transfer.
What is Apache Arrow?
As the website says, it is a software development platform for building high-performance applications that process and transfer data; in other words, it is built for in-memory analytics.
Before I go further, it is important to know what a dataframe is. I will assume not everyone reading this is aware of dataframes; sometimes they are confused with some other data structure or data API. To me, a dataframe is a tabular dataset used for data manipulation in programming languages, and the most commonly used are pandas dataframes.
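To make that concrete, here is a minimal sketch of a pandas dataframe and its Arrow counterpart. The column names and values are made up for illustration:

```python
import pandas as pd
import pyarrow as pa

# A dataframe: a tabular, column-oriented dataset.
df = pd.DataFrame({
    "city": ["Mumbai", "Bengaluru", "Kolkata"],
    "population_mn": [20.7, 13.6, 15.1],
})

# The same data as an in-memory Arrow Table; the conversion is cheap
# because both sides store data column by column.
table = pa.Table.from_pandas(df)
print(table.schema)
```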
Apache Arrow can be a game changer for OLAP systems, moving them towards real-time behaviour where queries are answered in milliseconds.
It promotes: DATA IS THE NEW API. Because the in-memory format is standardized, any system that speaks Arrow can share data directly instead of converting it back and forth.
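Here is a minimal sketch of that idea: writing a table in Arrow's language-agnostic IPC file format, which any Arrow implementation (Python, Java, Rust, C++, Go, ...) can read back without a format-specific parser. The file name and values are arbitrary choices for the example:

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "score": [0.9, 0.7, 0.8]})

# Write the table in Arrow's IPC file format.
with pa.OSFile("shared.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Any Arrow-capable process or language can now read the same bytes back;
# memory-mapping means the data is not even copied into the process.
with pa.memory_map("shared.arrow") as source:
    shared = pa.ipc.open_file(source).read_all()
print(shared.num_rows)
```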
Apache Arrow itself is not a storage or processing engine; it enables high-performance processing and transfer of columnar data. People have posted results on the internet showing up to 1000x improvement over reading from CSV. I have personally not yet found a use case with that level of improvement, but I have seen a 30x improvement while playing with a small dataset.
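If you want to try this yourself, here is a self-contained benchmark sketch that times a CSV read against a read of the same data in the Arrow-based Feather format. The file names and dataset size are arbitrary, and your numbers will vary with data shape and hardware:

```python
import time

import numpy as np
import pandas as pd
import pyarrow.feather as feather

# Build a throwaway dataset: one million rows, a few numeric columns.
df = pd.DataFrame({
    "a": np.random.rand(1_000_000),
    "b": np.random.randint(0, 100, 1_000_000),
    "c": np.random.rand(1_000_000),
})
df.to_csv("demo.csv", index=False)         # text format, parsed row by row
feather.write_feather(df, "demo.feather")  # Arrow columnar format on disk

start = time.perf_counter()
pd.read_csv("demo.csv")
csv_s = time.perf_counter() - start

start = time.perf_counter()
feather.read_feather("demo.feather")
arrow_s = time.perf_counter() - start

print(f"CSV: {csv_s:.3f}s, Feather: {arrow_s:.3f}s, "
      f"speedup: {csv_s / arrow_s:.0f}x")
```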
BUT HOW?
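The overview linked below covers this in depth, but the short version is that Arrow arrays are flat, contiguous buffers laid out for vectorized, cache-friendly processing, and many operations are zero-copy. A quick sketch of that, with made-up values:

```python
import pyarrow as pa

arr = pa.array([10, 20, 30, 40, 50])

# The array is backed by raw buffers (a validity bitmap plus the data),
# ready for vectorized processing.
print(arr.buffers())

# Slicing returns a view over the same buffers; no data is copied.
view = arr.slice(1, 3)
print(view)
```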
Useful Links:
Apache Arrow Overview: https://arrow.apache.org/overview/
Apache Arrow Python Support: https://arrow.apache.org/docs/python/index.html#
Apache Arrow Pandas Integration: https://arrow.apache.org/docs/python/pandas.html