Data Cleansing with Apache Spark and Optimus

Favio Vazquez

CTO | Lead AI Scientist | LinkedIn Top Voice | AI & ML Evangelist | Drummer

Published: August 17, 2017

Outdated, inaccurate, or duplicated data won’t drive optimal data-driven solutions. When data is inaccurate, leads are harder to track and nurture, and insights may be flawed. The data on which you base your big data strategy must be accurate, up to date, as complete as possible, and free of duplicate entries. Clean data results in better decisions.

Cleaning data is the most time-consuming and least enjoyable data science task (until Optimus), but also one of the most important. No one can start a data science, machine learning, or data-driven solution without being sure that the data they will consume is in optimal shape. Although several data cleansing solutions exist, they either cannot keep up with the emergence of Big Data or are really hard to use.

Right now more and more companies are entering (or at least trying to enter) the Big Data and Machine Learning revolution. Every data-driven approach needs to clean, wrangle, normalize, and fix the data that will feed the models it creates. With Optimus we are launching an easy-to-use, easy-to-deploy, open source framework to clean and analyze data in parallel using state-of-the-art technologies. It can be used by small, medium, and large companies, and by startups that want to create data science solutions but don’t have the money to hire lots of data scientists and build their own cluster to clean the data they are going to use.

Optimus is the missing library for cleaning and pre-processing data in a distributed fashion. It uses all the power of Apache Spark to do so, and it implements several handy tools for data wrangling and munging that will make your life much easier. The first obvious advantage over other public data cleaning libraries is that it works on your laptop or on your big cluster; the second is that it is amazingly easy to install, use, and understand.
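
To make the kind of wrangling described above concrete, here is a minimal plain-Python sketch of three common cleaning steps: trimming whitespace, dropping incomplete rows, and removing duplicates. It does not use Optimus or Spark, and it is not the Optimus API; the function name `clean_records`, its `required_fields` parameter, and the sample data are all hypothetical and chosen only for illustration.

```python
def clean_records(records, required_fields):
    """Trim strings, drop incomplete rows, and remove case-insensitive duplicates.

    Hypothetical helper for illustration only; not part of Optimus.
    """
    cleaned = []
    seen = set()
    for record in records:
        # Strip surrounding whitespace; treat empty strings as missing values.
        row = {}
        for key, value in record.items():
            if isinstance(value, str):
                value = value.strip() or None
            row[key] = value
        # Skip rows that lack any required field.
        if any(row.get(field) is None for field in required_fields):
            continue
        # Build a case-insensitive fingerprint so "Ana" and "ana" match.
        fingerprint = tuple(
            (k, v.lower() if isinstance(v, str) else v)
            for k, v in sorted(row.items())
        )
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


# Made-up sample data: one padded value, one duplicate, one incomplete row.
raw = [
    {"name": "  Ana ", "city": "Lima"},
    {"name": "ana", "city": "Lima "},
    {"name": "Luis", "city": ""},
    {"name": "Marta", "city": "Quito"},
]
result = clean_records(raw, required_fields=["name", "city"])
print(result)
```

This loop runs on a single machine and holds everything in memory, which is the limitation a distributed engine such as Apache Spark is meant to remove.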

Requirements:

Apache Spark 2.2.0
Python 3.5
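
Before installing, it can help to check the environment against the requirements listed above. The following is a minimal sketch using only the Python standard library. The helper name `check_requirements` is hypothetical, and treating Python 3.5 as a minimum rather than an exact version is an assumption. It only checks whether a `pyspark` package can be located; it does not verify the Spark version.

```python
import importlib.util
import sys


def check_requirements(min_python=(3, 5)):
    """Report whether the interpreter and PySpark look usable.

    Hypothetical helper; treating 3.5 as a minimum is an assumption.
    """
    python_ok = sys.version_info[:2] >= min_python
    # find_spec returns None when the package cannot be located;
    # the package itself is not imported.
    pyspark_found = importlib.util.find_spec("pyspark") is not None
    return {"python_ok": python_ok, "pyspark_found": pyspark_found}


report = check_requirements()
print(report)
```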

Installation (Windows, Mac & Linux):

In your terminal just type:

pip install optimuspyspark

For complete documentation on how to use it, please visit our GitHub repository:

https://github.com/ironmussa/Optimus

If you want a peek at what Optimus can do for you, check out this demo:

https://nbviewer.jupyter.org/github/ironmussa/Optimus/blob/master/examples/Optimus_Example.ipynb

Contributors:
Project Manager: Argenis León.
Original developers: Andrea Rosales, Hugo Reyes, Alberto Bonsanto.
Principal developer and maintainer: Favio Vázquez.

License:

Apache 2.0 © Iron.
