How to Handle Large Data for Machine Learning

Data scientists and analysts often struggle to fit large datasets (multiple GB or TB) into memory, and this is a common problem in the data science world.

This article walks you through several ways to handle huge #data when solving #datascience problems.

1) Progressive Loading

The Keras deep learning library offers a feature for progressively loading image files from disk, called flow_from_directory.

The Pandas library can load large CSV files in chunks via the chunksize argument of read_csv.
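A minimal sketch of both ideas, assuming a hypothetical big_data.csv file and an images/train directory laid out with one sub-folder per class (the file names, paths, and sizes below are placeholders):

```python
import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Pandas: iterate over the CSV in chunks instead of loading it all at once
total_rows = 0
for chunk in pd.read_csv("big_data.csv", chunksize=100_000):
    total_rows += len(chunk)          # replace with your own per-chunk processing
print("rows processed:", total_rows)

# Keras: stream image batches from disk during training
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    "images/train",                   # assumed layout: one sub-folder per class
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
)
# model.fit(train_gen, epochs=5)      # images are loaded batch by batch, not all at once
```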

2) #Dask

Dask is a parallel computing library that scales #NumPy, pandas, and #scikit-learn workloads for fast computation with a low memory footprint. It exploits the fact that a single machine has more than one core and uses those cores for parallel computation.
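A short sketch of Dask's pandas-like API, again with a hypothetical big_data.csv and placeholder column names:

```python
import dask.dataframe as dd

# Lazily point Dask at the CSV; it is split into partitions, not loaded whole
df = dd.read_csv("big_data.csv")

# Operations build a task graph lazily ...
mean_per_region = df.groupby("region")["sales"].mean()

# ... and only the small aggregated result is materialized, with partitions
# processed in parallel across the available cores
print(mean_per_region.compute())
```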

3) Use fast loading libraries like #Vaex

Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It calculates statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at a rate of more than a billion (10^9) samples/rows per second. Visualization is done with histograms, density plots, and 3D volume rendering, allowing interactive exploration of big data. #Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory is wasted).
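A rough sketch of the Vaex workflow, assuming a hypothetical data.hdf5 file and placeholder column names (Vaex works best with memory-mappable formats such as HDF5 or Arrow):

```python
import vaex

# Open a memory-mapped file; the data is not copied into RAM
df = vaex.open("data.hdf5")

# Statistics are evaluated lazily and out-of-core
print(df.mean(df.price), df.std(df.price))

# Filters are lazy views as well; no copy of the data is made
expensive = df[df.price > 100]
print(expensive.count())
```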

4) Change the Data Format

Is your data stored in raw ASCII text, like a CSV file?

Perhaps you can speed up data loading and use less memory by using another data format. A good example is a binary format like #GRIB, #NetCDF, or #HDF. There are many command-line tools that you can use to transform one data format into another that do not require the entire dataset to be loaded into memory. Using another format may allow you to store the data in a more compact form that saves memory, such as 2-byte integers, or 4-byte floats.
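As one illustration, the sketch below converts a CSV to HDF5 with pandas, reading in chunks so the whole file never sits in memory (the file names and column choices are placeholders, and writing HDF5 from pandas requires the PyTables package):

```python
import pandas as pd

# Convert a large CSV to a compact binary HDF5 file, chunk by chunk
with pd.HDFStore("data.h5", mode="w") as store:
    for chunk in pd.read_csv("big_data.csv", chunksize=100_000):
        # Downcast 8-byte floats to 4 bytes (assumes the precision loss is acceptable)
        float_cols = chunk.select_dtypes("float64").columns
        chunk[float_cols] = chunk[float_cols].astype("float32")
        store.append("data", chunk, index=False)

# Later reads of the binary file avoid re-parsing the CSV
df = pd.read_hdf("data.h5", "data")
```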

5) Reduce object size with correct datatypes

Generally, the memory usage of a DataFrame can be reduced by converting its columns to the correct datatypes. Almost every dataset includes object (string) columns, which are not memory efficient. Dates and categorical features such as region, city, or place names are often stored as strings, which consume far more memory than necessary; converting them to the appropriate types (datetime, category) can reduce memory usage by a factor of ten or more.
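A small sketch with pandas (the file and column names are placeholders for your own data):

```python
import pandas as pd

df = pd.read_csv("big_data.csv")                 # hypothetical file
print(df.memory_usage(deep=True).sum())          # bytes used before conversion

# Strings with few unique values -> category
df["city"] = df["city"].astype("category")

# Date strings -> datetime64
df["order_date"] = pd.to_datetime(df["order_date"])

# Numeric columns -> the smallest type that holds the values
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")
df["price"] = pd.to_numeric(df["price"], downcast="float")

print(df.memory_usage(deep=True).sum())          # bytes used after conversion
```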

6) Use a Relational Database

#Relational #databases provide a standard way of storing and accessing very large datasets.

Internally, the data is stored on disk, can be loaded progressively in batches, and can be queried using a standard query language (SQL).
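A small sketch using Python's built-in sqlite3 module together with pandas (the database file, table, and column names are all placeholders):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("data.db")                # hypothetical SQLite database file

# Stream query results in batches instead of loading the whole table into memory
for batch in pd.read_sql_query(
    "SELECT * FROM transactions WHERE amount > 100", conn, chunksize=50_000
):
    print(len(batch))                            # replace with your own per-batch logic

conn.close()
```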

7) A Big Data Platform

In some cases, you may need to resort to a big data platform.

That is, a platform designed for handling very large datasets, one that lets you run data transforms and machine learning algorithms on top of it.

Two good examples are #Hadoop with the Mahout machine learning library and #Spark with the MLlib library.
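As an illustration of the Spark route, here is a minimal PySpark sketch that trains a logistic regression with MLlib (the file path, feature columns, and label column are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("large-data-demo").getOrCreate()

# Spark reads and processes the file in a distributed, partitioned fashion
df = spark.read.csv("big_data.csv", header=True, inferSchema=True)

# Assemble the numeric feature columns into a single vector column
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df).select("features", "label")

model = LogisticRegression(labelCol="label", featuresCol="features").fit(train)
print(model.coefficients)

spark.stop()
```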

8) Store your files in the #Parquet format, available in #Pandas (Python Data Analysis Library).

If you have trouble reading large data files in CSV format, a good optimization is to read the CSV file once, store it in Parquet format, and read from the Parquet file thereafter to speed up future loads.
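A short sketch, assuming a hypothetical big_data.csv and the pyarrow (or fastparquet) engine installed:

```python
import pandas as pd

# One-time conversion: pay the slow CSV parse once
df = pd.read_csv("big_data.csv")
df.to_parquet("big_data.parquet")

# Subsequent reads are much faster and can load only the columns you need
subset = pd.read_parquet("big_data.parquet", columns=["price", "quantity"])
```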

9) Choose your Machine Learning model carefully.

Indeed, some models are known to scale better than others on very large volumes of data. Among boosting algorithms, for example, #LightGBM is much more efficient than XGBoost on very large datasets, while delivering similar performance in terms of accuracy.
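A minimal LightGBM sketch on a synthetic stand-in dataset (the dataset size and hyperparameters are illustrative rather than tuned values, and the early-stopping callback assumes LightGBM 3.3 or newer):

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large dataset; replace with your own X, y
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)

# Histogram-based gradient boosting, designed to handle large data efficiently
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print("validation accuracy:", model.score(X_valid, y_valid))
```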

10) Allocate More #Memory

Some machine learning tools or libraries may be limited by a default memory configuration. Check if you can re-configure your tool or library to allocate more memory.

A good example is Weka, where you can increase the memory as a parameter when starting the application.

11) Work with a Smaller Sample

Are you sure you need to work with all of the data?

Take a sample of your data, such as the first 1,000 or 100,000 rows, or a random subset. Use this smaller sample to work through your problem before fitting a final model on all of your data (using progressive data loading techniques).
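Two quick ways to do this with pandas (the file name and sample sizes are placeholders):

```python
import pandas as pd

# Option 1: load only the first 100,000 rows
small_df = pd.read_csv("big_data.csv", nrows=100_000)

# Option 2: a reproducible 1% random sample, built chunk by chunk
pieces = [
    chunk.sample(frac=0.01, random_state=42)
    for chunk in pd.read_csv("big_data.csv", chunksize=100_000)
]
sample_df = pd.concat(pieces, ignore_index=True)
```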

I think this is a good practice in general for machine learning, as it gives you quick spot-checks of algorithms and fast turnaround of results.

12) Use a Computer with More Memory

Do you have to work on your computer?

Perhaps you can get access to a much larger computer with an order of magnitude more memory.

For example, a good option is to rent compute time on a cloud service like Amazon Web Services (#AWS), Microsoft #Azure, or Google Cloud Platform (#GCP), which offer machines with tens of gigabytes of RAM for less than a US dollar per hour.

---------------------------------------------------------------------------------------------------------------

Hope you find this article helpful. Follow me for interesting articles in the world of data science, machine learning, deep learning, and artificial intelligence.

Happy Learning. Enjoy !!!

References and Credits:

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

https://keras.io/api/preprocessing/image/

https://vijayabhaskar96.medium.com/tutorial-on-keras-imagedatagenerator-with-flow-from-dataframe-8bd5776e45c1

https://towardsdatascience.com/machine-learning-with-big-data-86bcb39f2f0b

https://medium.com/analytics-vidhya/how-to-deal-with-large-datasets-in-machine-learning-61b966a338fe

https://www.askpython.com/python/examples/handling-large-datasets-machine-learning

https://www.analyticsvidhya.com/blog/2018/08/dask-big-datasets-machine_learning-python/

https://machinelearningmastery.com/large-data-files-machine-learning/
