Improve your Data Science workflow with Optimus

Optimus 1.0.2 is out now!

The new release of Optimus comes with a lot of new functionalities that will make your life much easier.

We implemented a data pipeline for data cleansing and exploration that you can customize as much as you want to fit your project.

New Features

Read dataset from a URL to a Spark DataFrame

With Optimus 1.0.2 you can read a DataFrame from the web and save it in a Spark DataFrame with a simple function:

# Import optimus
import optimus as op
# Instance of Utilities class
tools = op.Utilities()
# Reading df from web
url = "https://raw.githubusercontent.com/ironmussa/Optimus-examples/master/examples/foo.csv"
df = tools.read_url(path=url)

The new read_url function takes a URL pointing to a CSV or JSON file, parses it for you, and creates a Spark DataFrame from its contents.
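Under the hood, a function like read_url has to download the file and parse it into rows. The helper below is a hypothetical plain-Python sketch of the CSV-parsing step (not Optimus's actual implementation), using an inline sample in place of a live download:

```python
import csv
import io

def parse_csv(text):
    """Parse CSV text into a header list and a list of row tuples."""
    reader = csv.reader(io.StringIO(text))
    rows = list(reader)
    return rows[0], [tuple(r) for r in rows[1:]]

# Inline sample standing in for a downloaded CSV file
sample = "bill_id,foods\n1,Pizza\n2,Beer\n"
header, rows = parse_csv(sample)
print(header)  # ['bill_id', 'foods']
print(rows)    # [('1', 'Pizza'), ('2', 'Beer')]
```

Optimus does this (plus schema inference) for you and returns the result as a distributed Spark DataFrame rather than plain Python lists.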

Count items for each id

Let's say you have a dataset of bills, like bills from a restaurant, and you would like to count how many "items" (popcorn, soda, beer, etc.) appear on each bill. This is not that hard in Pandas or dplyr, but what if you have Big Data? Optimus to the rescue:

Let's create a sample DataFrame:

# Importing sql types
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
# Importing optimus
import optimus as op

# Building a simple dataframe:
schema = StructType([
        StructField("bill_id", IntegerType(), True),
        StructField("foods", StringType(), True)])

id_ = [1, 2, 2, 3, 3, 3, 3, 4, 4]
foods = ['Pizza', 'Pizza', 'Beer', 'Hamburger', 'Beer', 'Beer', 'Beer', 'Pizza', 'Beer']


# Dataframe:
df = op.spark.createDataFrame(list(zip(id_, foods)), schema=schema)
df.show()
+-------+---------+
|bill_id|    foods|
+-------+---------+
|      1|    Pizza|
|      2|    Pizza|
|      2|     Beer|
|      3|Hamburger|
|      3|     Beer|
|      3|     Beer|
|      3|     Beer|
|      4|    Pizza|
|      4|     Beer|
+-------+---------+

Let's instantiate the transformer:

transformer = op.DataFrameTransformer(df)

We would like to count how many Beers are in each bill, so with Optimus we just do:

df_count = transformer.explode_table(col_id="bill_id", col_search="foods", new_col_feature="beer_count", search_string="Beer")

df_count.show()

And this will return:

+-------+----------+
|bill_id|beer_count|
+-------+----------+
|      3|         3|
|      4|         1|
|      2|         1|
+-------+----------+
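To make explicit what the explode_table call computes, the same counting logic can be sketched in plain Python (this illustrates the result only, not Optimus's distributed implementation):

```python
from collections import Counter

# Same sample data as the DataFrame above
bills = [(1, 'Pizza'), (2, 'Pizza'), (2, 'Beer'),
         (3, 'Hamburger'), (3, 'Beer'), (3, 'Beer'),
         (3, 'Beer'), (4, 'Pizza'), (4, 'Beer')]

# Count how many times "Beer" appears on each bill
beer_count = Counter(bill_id for bill_id, food in bills if food == 'Beer')
print(dict(beer_count))  # {2: 1, 3: 3, 4: 1}
```

With Big Data this in-memory approach no longer works, which is where the Spark-backed version above comes in.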

Plot the correlation matrix from a Spark DataFrame

It is very common in Data Science to build Machine Learning or Deep Learning models from the features we have created. But we need to be sure that the variables we are using are not correlated (i.e., that they do not carry the same information), or we will introduce bias into our model.

The function Optimus introduces works with Big Data as well as with a small sample DataFrame, which is what sets it apart from Pandas and similar tools. Check it out.

Let's create a sample DataFrame:

# Importing Vectors
from pyspark.ml.linalg import Vectors
# Importing optimus
import optimus as op
data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
df = op.spark.createDataFrame(data, ["features"])

Let's instantiate the analyzer:

analyzer = op.DataFrameAnalyzer(df)

And now with Optimus's power:

analyzer.correlation(vec_col="features", method="pearson")

and we will get a plot of the correlation matrix.

And voilà.
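For reference, each entry in such a correlation matrix is a Pearson coefficient, which can be computed by hand. Here is a minimal plain-Python sketch for two of the columns in the sample data above (an illustration of the math, not the Optimus implementation, which computes and plots the full matrix on Spark):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# First and last components of the four feature vectors above
col0 = [1.0, 4.0, 6.0, 9.0]
col3 = [-2.0, 3.0, 8.0, 1.0]
print(round(pearson(col0, col3), 4))  # 0.4005
```

Values near 1 or -1 flag pairs of features that carry essentially the same information and are candidates for removal before modeling.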

Please check this and other examples at:

https://github.com/ironmussa/Optimus-examples

And remember to visit our webpage for more information and docs:

https://hioptimus.com

Contributors:

Project Manager: Argenis León.

Original developers: Andrea Rosales, Hugo Reyes, and Alberto Bonsanto.

Principal developer and maintainer: Favio Vázquez.

License:

Apache 2.0 © Iron.





