How to Handle Large Data for Machine Learning
Rohan Chikorde
VP - AIML at BNY Mellon | AIML Corporate Trainer | University Professor | Speaker
Data scientists and analysts often struggle to fit large datasets (multiple #GB/#TB) into memory, and this is a common problem in the data science world.
This article walks through several ways to handle huge #data when solving #datascience problems.
1) Progressive Loading
The Keras deep learning library offers flow_from_directory for progressively loading image files from disk in batches.
Similarly, the pandas library can load large CSV files in chunks via the chunksize argument of read_csv.
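A minimal sketch of both approaches is shown below; the file name, directory layout, chunk size, and image dimensions are placeholders for your own data.

```python
import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Read a large CSV in chunks instead of loading it all at once
# ("big_file.csv" and the chunk size are placeholders).
total_rows = 0
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    total_rows += len(chunk)  # process each chunk: aggregate, clean, or partially fit
print(total_rows)

# Stream images from disk in batches rather than loading them all into memory
# ("data/train" is a hypothetical directory with one subfolder per class).
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    "data/train",
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```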
2) #Dask
Dask is a parallel computing library that scales #NumPy, pandas, and #scikit-learn workloads for fast computation with low memory usage. It exploits the fact that a single machine has more than one core and schedules work across those cores in parallel.
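A minimal Dask sketch, assuming a large CSV with a numeric price column (both the file name and the column are placeholders):

```python
import dask.dataframe as dd

# Dask reads the CSV lazily and splits it into partitions processed across cores.
ddf = dd.read_csv("big_file.csv")

# Operations build a task graph; nothing runs until .compute() is called.
mean_price = ddf["price"].mean().compute()
print(mean_price)
```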
3) Using Fast loading libraries like #Vaex
Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to pandas) that lets you visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at more than a billion (10^9) samples/rows per second. Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. #Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for the best performance (no memory is wasted).
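A minimal Vaex sketch, assuming the data has already been converted to a memory-mappable format such as HDF5 (the file and column names are placeholders):

```python
import vaex

# Open a memory-mapped file; the data is not copied into RAM.
df = vaex.open("big_file.hdf5")

# Statistics are evaluated lazily and out-of-core.
print(df["price"].mean())
print(df.count())
```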
4) Change the Data Format
Is your data stored in raw ASCII text, like a CSV file?
Perhaps you can speed up data loading and use less memory by switching to another data format. Good examples are binary formats like #GRIB, #NetCDF, or #HDF. There are many command-line tools that can transform one data format into another without loading the entire dataset into memory. Another format may also let you store the data in a more compact form that saves memory, such as 2-byte integers or 4-byte floats.
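As a hedged example, pandas can convert a CSV to a binary HDF5 store chunk by chunk, so the full dataset never has to sit in memory (this requires the PyTables package; the file names, column names, and chunk size are placeholders):

```python
import pandas as pd

# Append chunks of the CSV to an HDF5 table without loading everything at once.
with pd.HDFStore("big_file.h5", mode="w") as store:
    for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
        store.append("data", chunk, format="table")

# Later: read back only the columns you need from the binary store.
subset = pd.read_hdf("big_file.h5", key="data", columns=["price", "region"])
```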
5) Object Size reduction with correct datatypes
Generally, the memory usage of a DataFrame can be reduced by converting its columns to the correct datatypes. Almost every dataset includes object-dtype columns, which store values as strings and are not memory efficient. Dates and categorical features such as region, city, or place name are often kept as strings, which consumes far more memory; converting them to appropriate types such as datetime and category can reduce memory usage by a factor of 10 or more.
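A sketch of the idea with hypothetical column names (order_date, region, quantity, price):

```python
import pandas as pd

df = pd.read_csv("big_file.csv")
print(df.memory_usage(deep=True).sum())  # bytes before conversion

# Convert object columns to richer, more compact types.
df["order_date"] = pd.to_datetime(df["order_date"])
df["region"] = df["region"].astype("category")
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
df["price"] = pd.to_numeric(df["price"], downcast="float")

print(df.memory_usage(deep=True).sum())  # bytes after conversion, usually far smaller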
6) Use a Relational Database
#Relational #databases provide a standard way of storing and accessing very large datasets.
Internally, the data is stored on disk, can be progressively loaded in batches, and can be queried using a standard query language (SQL).
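A minimal sketch using SQLite and pandas, where the database file, table, and column names are all hypothetical:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")
query = "SELECT region, price FROM sales"

# chunksize turns read_sql_query into an iterator of DataFrames,
# so results stream from disk in batches instead of all at once.
total = 0.0
for chunk in pd.read_sql_query(query, conn, chunksize=50_000):
    total += chunk["price"].sum()  # process each batch as it arrives
conn.close()
print(total)
```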
7) A Big Data Platform
In some cases, you may need to resort to a big data platform.
That is, a platform designed for handling very large datasets, that allows you to use data transforms and machine learning algorithms on top of it.
Two good examples are #Hadoop with the Mahout machine learning library and #Spark with the #MLlib library.
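A minimal PySpark MLlib sketch (the input path and column names are placeholders, and the cluster configuration is omitted):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("large-data-example").getOrCreate()

# Data stays distributed across the cluster; only results are collected.
df = spark.read.parquet("hdfs:///data/train.parquet")
features = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = features.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
spark.stop()
```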
8) Store your files in the #Parquet format available in #Pandas (Python Data Analysis Library).
If you have trouble reading large data files in CSV format, a worthwhile optimization is to read the CSV file once and store it in Parquet format to speed up future reads.
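A hedged sketch of that one-time conversion (the file and column names are placeholders; pandas needs the pyarrow or fastparquet package for Parquet support):

```python
import pandas as pd

# Parse the CSV once, then write a compressed, columnar Parquet copy.
df = pd.read_csv("big_file.csv")
df.to_parquet("big_file.parquet", index=False)

# Every later run reads the Parquet file instead, optionally only selected columns.
df = pd.read_parquet("big_file.parquet", columns=["price", "region"])
```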
9) Choose your Machine Learning model carefully.
Indeed, some models are known to scale better than others on very large volumes of data. Among boosting algorithms, #LightGBM is much more efficient than XGBoost on very large volumes of data, with similar performance in terms of accuracy.
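A minimal LightGBM sketch, assuming a binary classification problem with a label column (the file, column names, and hyperparameters are placeholders):

```python
import lightgbm as lgb
import pandas as pd

df = pd.read_parquet("big_file.parquet")
X, y = df.drop(columns=["label"]), df["label"]

# LightGBM builds histograms of the features, which keeps training fast
# and memory-friendly on large datasets.
train_set = lgb.Dataset(X, label=y)
params = {"objective": "binary", "learning_rate": 0.1, "num_leaves": 63}
model = lgb.train(params, train_set, num_boost_round=200)
```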
10) Allocate More #Memory
Some machine learning tools or libraries may be limited by a default memory configuration. Check if you can re-configure your tool or library to allocate more memory.
A good example is Weka, where you can increase the available memory via a parameter when starting the application.
11) Work with a Smaller Sample
Are you sure you need to work with all of the data?
Take a smaller sample of your data, such as the first 1,000 or 100,000 rows, or a random subset. Use this smaller sample to work through your problem before fitting a final model on all of your data (using progressive data loading techniques).
I think this is good practice in general for machine learning, as it gives you quick spot-checks of algorithms and fast turnaround of results.
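For example, in pandas (the file name, row count, and sample fraction are placeholders):

```python
import pandas as pd

# Prototype on the first 100,000 rows only...
df = pd.read_csv("big_file.csv", nrows=100_000)

# ...or take a random 1% sample of a frame already in memory.
sample = df.sample(frac=0.01, random_state=42)
```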
12) Use a Computer with More Memory
Do you have to work on your computer?
Perhaps you can get access to a much larger computer with an order of magnitude more memory.
For example, a good option is to rent compute time on a cloud service like Amazon Web Services (#AWS), #Azure, or Google Cloud Platform (#GCP), which offer machines with tens of gigabytes of RAM for less than a US dollar per hour.
---------------------------------------------------------------------------------------------------------------
Hope you find this article helpful. Follow me for interesting articles in the world of data science, machine learning, deep learning, and artificial intelligence.
Happy Learning. Enjoy !!!
References and Credits:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
https://keras.io/api/preprocessing/image/
https://vijayabhaskar96.medium.com/tutorial-on-keras-imagedatagenerator-with-flow-from-dataframe-8bd5776e45c1
https://towardsdatascience.com/machine-learning-with-big-data-86bcb39f2f0b
https://medium.com/analytics-vidhya/how-to-deal-with-large-datasets-in-machine-learning-61b966a338fe
https://www.askpython.com/python/examples/handling-large-datasets-machine-learning
https://www.analyticsvidhya.com/blog/2018/08/dask-big-datasets-machine_learning-python/
https://machinelearningmastery.com/large-data-files-machine-learning/
https://assets.weforum.org/editor/66qF74aouhl5mREnEb1zaAypEGqljfNjfvcKhx1VeMY.jpg