How to Handle Large Data for Machine Learning
Rohan Chikorde
VP - AIML at BNY Mellon | AIML Corporate Trainer | University Professor | Speaker
Data scientists and analysts often struggle to fit large datasets (multiple #GB/#TB) into memory, and this is a common problem in the data science world.
This article walks through several ways to handle huge #data when solving #datascience problems.
1) Progressive Loading
The Keras deep learning library offers flow_from_directory for progressively loading image files from disk in batches.
Similarly, the pandas library can load large CSV files in chunks via the chunksize argument of read_csv.
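A minimal sketch of both approaches is shown below; the file name, directory layout, chunk size, and image dimensions are placeholders for your own data.

```python
import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Read a large CSV in chunks instead of loading it all at once
# ("big_file.csv" and the chunk size are placeholders).
total_rows = 0
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    total_rows += len(chunk)  # process each chunk: aggregate, clean, or partially fit
print(total_rows)

# Stream images from disk in batches rather than loading them all into memory
# ("data/train" is a hypothetical directory with one subfolder per class).
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    "data/train",
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
```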
2) #Dask
Dask is a parallel computing library that scales #NumPy, pandas, and #scikit-learn workloads for fast computation with low memory usage. It exploits the fact that a single machine has more than one core and schedules work across those cores in parallel.
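A minimal Dask sketch, assuming a large CSV with a numeric price column (both the file name and the column are placeholders):

```python
import dask.dataframe as dd

# Dask reads the CSV lazily and splits it into partitions processed across cores.
ddf = dd.read_csv("big_file.csv")

# Operations build a task graph; nothing runs until .compute() is called.
mean_price = ddf["price"].mean().compute()
print(mean_price)
```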
3) Using Fast loading libraries like #Vaex
Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to pandas) that lets you visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at more than a billion (10^9) samples/rows per second. Visualization is done using histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. #Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for the best performance (no memory is wasted).
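A minimal Vaex sketch, assuming the data has already been converted to a memory-mappable format such as HDF5 (the file and column names are placeholders):

```python
import vaex

# Open a memory-mapped file; the data is not copied into RAM.
df = vaex.open("big_file.hdf5")

# Statistics are evaluated lazily and out-of-core.
print(df["price"].mean())
print(df.count())
```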
4) Change the Data Format
Is your data stored in raw ASCII text, like a CSV file?
Perhaps you can speed up data loading and use less memory by switching to another data format. Good examples are binary formats like #GRIB, #NetCDF, or #HDF. There are many command-line tools that can transform one data format into another without loading the entire dataset into memory. Another format may also let you store the data in a more compact form that saves memory, such as 2-byte integers or 4-byte floats.
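As a hedged example, pandas can convert a CSV to a binary HDF5 store chunk by chunk, so the full dataset never has to sit in memory (this requires the PyTables package; the file names, column names, and chunk size are placeholders):

```python
import pandas as pd

# Append chunks of the CSV to an HDF5 table without loading everything at once.
with pd.HDFStore("big_file.h5", mode="w") as store:
    for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
        store.append("data", chunk, format="table")

# Later: read back only the columns you need from the binary store.
subset = pd.read_hdf("big_file.h5", key="data", columns=["price", "region"])
```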
5) Object Size reduction with correct datatypes
Generally, the memory usage of a DataFrame can be reduced by converting its columns to the correct datatypes. Almost every dataset includes object-dtype columns, which store values as strings and are not memory efficient. Dates and categorical features such as region, city, or place name are often kept as strings, which consumes far more memory; converting them to appropriate types such as datetime and category can reduce memory usage by a factor of 10 or more.
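A sketch of the idea with hypothetical column names (order_date, region, quantity, price):

```python
import pandas as pd

df = pd.read_csv("big_file.csv")
print(df.memory_usage(deep=True).sum())  # bytes before conversion

# Convert object columns to richer, more compact types.
df["order_date"] = pd.to_datetime(df["order_date"])
df["region"] = df["region"].astype("category")
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
df["price"] = pd.to_numeric(df["price"], downcast="float")

print(df.memory_usage(deep=True).sum())  # bytes after conversion, usually far smaller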
6) Use a Relational Database
#Relational #databases provide a standard way of storing and accessing very large datasets.
Internally, the data is stored on disk, can be progressively loaded in batches, and can be queried using a standard query language (SQL).
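A minimal sketch using SQLite and pandas, where the database file, table, and column names are all hypothetical:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")
query = "SELECT region, price FROM sales"

# chunksize turns read_sql_query into an iterator of DataFrames,
# so results stream from disk in batches instead of all at once.
total = 0.0
for chunk in pd.read_sql_query(query, conn, chunksize=50_000):
    total += chunk["price"].sum()  # process each batch as it arrives
conn.close()
print(total)
```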
7) A Big Data Platform
In some cases, you may need to resort to a big data platform.
That is, a platform designed for handling very large datasets, that allows you to use data transforms and machine learning algorithms on top of it.
Two good examples are #Hadoop with the Mahout machine learning library and #Spark with the #MLlib library.
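A minimal PySpark MLlib sketch (the input path and column names are placeholders, and the cluster configuration is omitted):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("large-data-example").getOrCreate()

# Data stays distributed across the cluster; only results are collected.
df = spark.read.parquet("hdfs:///data/train.parquet")
features = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = features.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
spark.stop()
```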
8) Store your files in the #Parquet format available in #Pandas (Python Data Analysis Library).
If you have trouble reading large data files in CSV format, a worthwhile optimization is to read the CSV file once and store it in Parquet format to speed up future reads.
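A hedged sketch of that one-time conversion (the file and column names are placeholders; pandas needs the pyarrow or fastparquet package for Parquet support):

```python
import pandas as pd

# Parse the CSV once, then write a compressed, columnar Parquet copy.
df = pd.read_csv("big_file.csv")
df.to_parquet("big_file.parquet", index=False)

# Every later run reads the Parquet file instead, optionally only selected columns.
df = pd.read_parquet("big_file.parquet", columns=["price", "region"])
```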
9) Choose your Machine Learning model carefully.
Indeed, some models are known to scale better than others on very large volumes of data. Among boosting algorithms, #LightGBM is much more efficient than XGBoost on very large volumes of data, with similar performance in terms of accuracy.
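A minimal LightGBM sketch, assuming a binary classification problem with a label column (the file, column names, and hyperparameters are placeholders):

```python
import lightgbm as lgb
import pandas as pd

df = pd.read_parquet("big_file.parquet")
X, y = df.drop(columns=["label"]), df["label"]

# LightGBM builds histograms of the features, which keeps training fast
# and memory-friendly on large datasets.
train_set = lgb.Dataset(X, label=y)
params = {"objective": "binary", "learning_rate": 0.1, "num_leaves": 63}
model = lgb.train(params, train_set, num_boost_round=200)
```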
10) Allocate More #Memory
Some machine learning tools or libraries may be limited by a default memory configuration. Check if you can re-configure your tool or library to allocate more memory.
A good example is Weka, where you can increase the available memory via a parameter when starting the application.
11) Work with a Smaller Sample
Are you sure you need to work with all of the data?
Take a smaller sample of your data, such as the first 1,000 or 100,000 rows, or a random subset. Use this smaller sample to work through your problem before fitting a final model on all of your data (using progressive data loading techniques).
I think this is good practice in general for machine learning, as it gives you quick spot-checks of algorithms and fast turnaround of results.
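For example, in pandas (the file name, row count, and sample fraction are placeholders):

```python
import pandas as pd

# Prototype on the first 100,000 rows only...
df = pd.read_csv("big_file.csv", nrows=100_000)

# ...or take a random 1% sample of a frame already in memory.
sample = df.sample(frac=0.01, random_state=42)
```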
12) Use a Computer with More Memory
Do you have to work on your computer?
Perhaps you can get access to a much larger computer with an order of magnitude more memory.
For example, a good option is to rent compute time on a cloud service like Amazon Web Services (#AWS), #Azure, or Google Cloud Platform (#GCP), which offer machines with tens of gigabytes of RAM for less than a US dollar per hour.
---------------------------------------------------------------------------------------------------------------
Hope you find this article helpful. Follow me for interesting articles in the world of data science, machine learning, deep learning, and artificial intelligence.
Happy Learning. Enjoy !!!
References and Credits:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
https://keras.io/api/preprocessing/image/
https://vijayabhaskar96.medium.com/tutorial-on-keras-imagedatagenerator-with-flow-from-dataframe-8bd5776e45c1
https://towardsdatascience.com/machine-learning-with-big-data-86bcb39f2f0b
https://medium.com/analytics-vidhya/how-to-deal-with-large-datasets-in-machine-learning-61b966a338fe
https://www.askpython.com/python/examples/handling-large-datasets-machine-learning
https://www.analyticsvidhya.com/blog/2018/08/dask-big-datasets-machine_learning-python/
https://machinelearningmastery.com/large-data-files-machine-learning/
https://assets.weforum.org/editor/66qF74aouhl5mREnEb1zaAypEGqljfNjfvcKhx1VeMY.jpg