Using Fast loading libraries like Vaex

Vaex is a high-performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets.

It calculates statistics such as mean, sum, count, standard deviation, etc, on an N-dimensional grid for more than a billion (10^9) samples/rows per second.
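As a quick illustration of that grid-based style of computation, the snippet below uses vaex's small built-in example dataset (so the column names, such as x, come from that sample rather than from our data) to count samples on a regular grid and compute a couple of scalar statistics:

import vaex

# Small astronomy sample that ships with vaex, used here purely for illustration.
df = vaex.example()

# Count samples on a regular 1-D grid of 64 bins over the x column;
# passing a list of expressions to binby extends this to N dimensions.
counts = df.count(binby=df.x, limits=[-10, 10], shape=64)
print(counts.shape)

# Scalar statistics are just as terse.
print(df.mean(df.x), df.std(df.x))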

Visualization is done using histograms, density plots, and 3d volume rendering, allowing interactive exploration of big data.

Vaex uses memory mapping, zero memory copy policy, and lazy computations for best performance (no memory wasted).

Now we will use the vaex library on a randomly generated dataset to observe its performance.
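The script that generated the dataset is not shown here; a minimal sketch along the following lines (assuming pandas and NumPy, with an illustrative column layout of a, b, and c) produces a comparable CSV file:

import numpy as np
import pandas as pd

# Hypothetical generator: three random columns written to dataset_vaex.csv.
# Scale n_rows up (towards 10**8) to reproduce a multi-gigabyte file.
n_rows = 1_000_000
rng = np.random.default_rng(42)
pd.DataFrame({
    'a': rng.normal(size=n_rows),
    'b': rng.integers(0, 100, size=n_rows),
    'c': rng.random(size=n_rows),
}).to_csv('dataset_vaex.csv', index=False)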

Step 1: Import the vaex library.

import vaex        

Step 2: Convert the CSV file to an HDF5 file using the vaex library.

#converting csv to hdf5 format

df = vaex.from_csv('dataset_vaex.csv', convert=True)

After executing the above code, a dataset_vaex.csv.hdf5 file is generated in your working directory.

Notice that the conversion from CSV to HDF5 completes in relatively little time given the size of the file, and it only has to be done once; later runs can open the generated HDF5 file directly.
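If the raw CSV is too large to fit in memory at all, recent vaex releases let from_csv read and convert it in chunks. This is only a sketch, not what the article ran, and the chunk_size value is illustrative:

# Sketch: convert the CSV in pieces so the whole file never has to sit in RAM.
df = vaex.from_csv('dataset_vaex.csv', convert=True, chunk_size=5_000_000)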

Step 3: Read the HDF5 file using vaex.

Now we open the HDF5 file with the open function from the vaex library.

%%time

#opening hdf5 file

df_vaex = vaex.open('dataset_vaex.csv.hdf5')

print(df_vaex.head())        

Looking at the output of the code above, reading the roughly 3 GB HDF5 file takes very little time. This is the real advantage of the vaex library: vaex.open only memory-maps the file, so data is read lazily as statistics and previews are requested.

Using vaex, we can perform different operations on large DataFrames, such as the following (sketched in code after the list):

  • Expression System
  • Out of core data frame
  • Fast groupby / aggregations
  • Fast and efficient join
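Here is a rough sketch of what those operations look like in practice, continuing from the df_vaex DataFrame opened above. The column names a, b, and c come from the illustrative generator earlier, and the lookup table is invented for the example:

import numpy as np
import vaex

# Expression system / virtual column: defined lazily, no extra memory is allocated.
df_vaex['a_plus_c'] = df_vaex.a + df_vaex.c

# Filtering creates a view of the data, not a copy.
positive_a = df_vaex[df_vaex.a > 0]
print(len(positive_a))

# Fast groupby / aggregation using the vaex.agg aggregators.
grouped = df_vaex.groupby(by='b', agg={'mean_a': vaex.agg.mean('a')})
print(grouped)

# Fast join against a small lookup table with unique keys on the right-hand side.
lookup = vaex.from_arrays(b=np.arange(100), b_squared=np.arange(100) ** 2)
joined = df_vaex.join(lookup, on='b')
print(joined.head())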

Why vaex:

  • Performance: works with huge tabular data, processes more than a billion (10^9) rows per second
  • Lazy / Virtual columns: compute on the fly, without wasting RAM
  • Memory efficient: no memory copies when doing filtering/selections/subsets.
  • Visualization: directly supported, a one-liner is often enough (a sketch follows this list).
  • User friendly API: you will only need to deal with the DataFrame object, and tab completion + docstring will help you out: ds.mean<tab>, feels very similar to Pandas.
  • Lean: separated into multiple packages
  • Jupyter integration: vaex-jupyter will give you interactive visualization and selection in the Jupyter notebook and Jupyter lab.
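To illustrate the visualization point, a one-liner like the following draws a histogram. This sketch assumes the vaex-viz extension (and matplotlib) is installed; on older vaex versions the equivalent call is df.plot1d(df.x):

import vaex

# Built-in example dataset, used just to have something to plot.
df = vaex.example()

# One-liner histogram of the x column (requires the vaex-viz extension).
df.viz.histogram(df.x)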

For more information, kindly check my GitHub profile:


