Data Cleansing with Apache Spark and Optimus

Outdated, inaccurate, or duplicated data won't drive optimal data-driven solutions. When data is inaccurate, leads are harder to track and nurture, and insights may be flawed. The data on which you base your big data strategy must be accurate, up to date, as complete as possible, and free of duplicate entries. Clean data results in better decisions.
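
To make those criteria concrete, here is a minimal sketch in plain Python of what "complete" and "free of duplicate entries" can mean for a handful of records. It does not use Optimus or Spark; the function name clean_records and the sample records are made up purely for illustration.

```python
def clean_records(records):
    """Trim whitespace, lower-case e-mails, drop incomplete rows and duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        name = (rec.get("name") or "").strip()
        email = (rec.get("email") or "").strip().lower()
        if not name or not email:
            continue  # incomplete record: a required field is missing
        key = (name.lower(), email)
        if key in seen:
            continue  # duplicate of a record we already kept
        seen.add(key)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Hypothetical sample data: one duplicate, one incomplete row, stray whitespace.
raw = [
    {"name": " Ada Lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "Alan Turing", "email": None},
    {"name": "Grace Hopper", "email": "grace@example.com "},
]

print(clean_records(raw))
```

On a laptop-sized list this is trivial; the hard part, and the reason for a Spark-based tool, is applying the same rules to data that does not fit on one machine.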

Cleaning data is the most time-consuming and least enjoyable data science task (until Optimus), but also one of the most important. No one can start a data science, machine learning, or data-driven project without being sure that the data they will consume is in good shape. Although several data cleansing solutions exist, they either cannot keep up with the scale of Big Data or are really hard to use.

Right now more and more companies are entering (or at least trying to enter) the Big Data and Machine Learning revolution. Every data-driven approach needs to clean, wrangle, normalize, and fix the data that will be fed to its models. With Optimus we are launching an open source framework that is easy to use and easy to deploy to production, and that cleans and analyzes data in parallel using state-of-the-art technologies. It can be used by small, medium, and large companies, and by startups that want to build data science solutions but don't have the money to hire lots of data scientists or set up their own cluster to clean their data.

Optimus is the missing library for cleaning and pre-processing data in a distributed fashion. It uses all the power of Apache Spark to do so. It implements several handy tools for data wrangling and munging that will make your life much easier. The first obvious advantage over any other public data cleaning library is that it will work on your laptop or your big cluster, and second, it is amazingly easy to install, use and understand.

Requirements:

  • Apache Spark 2.2.0
  • Python 3.5

Installation (Windows, Mac & Linux):

In your terminal just type:

pip install optimuspyspark
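
After installing, a quick sanity check of the environment can save some debugging. The sketch below uses only the Python standard library. It assumes that the installed packages are importable as pyspark and optimus, and that Python versions newer than 3.5 are acceptable; neither assumption is stated in this article. It does not check the Spark version.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 5)  # the Python version listed under Requirements


def check_environment(modules=("pyspark", "optimus")):
    """Report whether Python is recent enough and the given modules can be found."""
    report = {"python_ok": sys.version_info[:2] >= MIN_PYTHON}
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        report[name] = importlib.util.find_spec(name) is not None
    return report


print(check_environment())
```

A False entry for a module means it could not be found on the current Python path, which usually points to pip having installed into a different environment.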

For complete documentation on how to use it, please visit our GitHub repository:

https://github.com/ironmussa/Optimus

If you want a peek at what Optimus can do for you, check out this demo:

https://nbviewer.jupyter.org/github/ironmussa/Optimus/blob/master/examples/Optimus_Example.ipynb



License:

Apache 2.0 © Iron.


