Improve your Data Science workflow with Optimus

Optimus 1.0.2 is out now!

The new release of Optimus comes with a lot of new functionalities that will make your life much easier.

We implemented a data pipeline for data cleansing and exploration that you can customize as much as you want to fit your project.

New Features

Read dataset from a URL to a Spark DataFrame

With Optimus 1.0.2 you can read a DataFrame from the web and save it in a Spark DataFrame with a simple function:

# Import optimus
import optimus as op
# Instance of Utilities class
tools = op.Utilities()
# Reading df from web
url = "https://raw.githubusercontent.com/ironmussa/Optimus-examples/master/examples/foo.csv"
df = tools.read_url(path=url)

The new read_url function takes a URL pointing to a CSV or JSON file, parses it for you, and creates a Spark DataFrame from its contents.
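Under the hood, a function like read_url has to download the file and parse it into rows. The helper below is a hypothetical plain-Python sketch of the CSV-parsing step (not Optimus's actual implementation), using an inline sample in place of a live download:

```python
import csv
import io

def parse_csv(text):
    """Parse CSV text into a header list and a list of row tuples."""
    reader = csv.reader(io.StringIO(text))
    rows = list(reader)
    return rows[0], [tuple(r) for r in rows[1:]]

# Inline sample standing in for a downloaded CSV file
sample = "bill_id,foods\n1,Pizza\n2,Beer\n"
header, rows = parse_csv(sample)
print(header)  # ['bill_id', 'foods']
print(rows)    # [('1', 'Pizza'), ('2', 'Beer')]
```

Optimus does this (plus schema inference) for you and returns the result as a distributed Spark DataFrame rather than plain Python lists.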

Count items for each id

Let's say you have a dataset of bills, like bills from a restaurant, and you would like to count how many "items" (popcorn, soda, beer, etc.) appear on each bill. This is not that hard in Pandas or dplyr, but what if you have Big Data? Optimus to the rescue:

Let's create a sample DataFrame:

# Importing sql types
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
# Importing optimus
import optimus as op

# Building a simple dataframe:
schema = StructType([
        StructField("bill_id", IntegerType(), True),
        StructField("foods", StringType(), True)])

id_ = [1, 2, 2, 3, 3, 3, 3, 4, 4]
foods = ['Pizza', 'Pizza', 'Beer', 'Hamburger', 'Beer', 'Beer', 'Beer', 'Pizza', 'Beer']


# Dataframe:
df = op.spark.createDataFrame(list(zip(id_, foods)), schema=schema)
df.show()
+-------+---------+
|bill_id|    foods|
+-------+---------+
|      1|    Pizza|
|      2|    Pizza|
|      2|     Beer|
|      3|Hamburger|
|      3|     Beer|
|      3|     Beer|
|      3|     Beer|
|      4|    Pizza|
|      4|     Beer|
+-------+---------+

Let's instantiate the transformer:

transformer = op.DataFrameTransformer(df)

We would like to count how many Beers are in each bill, so with Optimus we just do:

df_count = transformer.explode_table(col_id="bill_id", col_search="foods", new_col_feature="beer_count", search_string="Beer")

df_count.show()

And this will return:

+-------+----------+
|bill_id|beer_count|
+-------+----------+
|      3|         3|
|      4|         1|
|      2|         1|
+-------+----------+
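To make explicit what the explode_table call computes, the same counting logic can be sketched in plain Python (this illustrates the result only, not Optimus's distributed implementation):

```python
from collections import Counter

# Same sample data as the DataFrame above
bills = [(1, 'Pizza'), (2, 'Pizza'), (2, 'Beer'),
         (3, 'Hamburger'), (3, 'Beer'), (3, 'Beer'),
         (3, 'Beer'), (4, 'Pizza'), (4, 'Beer')]

# Count how many times "Beer" appears on each bill
beer_count = Counter(bill_id for bill_id, food in bills if food == 'Beer')
print(dict(beer_count))  # {2: 1, 3: 3, 4: 1}
```

With Big Data this in-memory approach no longer works, which is where the Spark-backed version above comes in.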

Plot the correlation matrix from a Spark DataFrame

It is very common in Data Science to build Machine Learning or Deep Learning models from the features we have created. But we need to be sure that the variables we are using are not correlated (i.e., that they do not carry the same information), or we will introduce bias into our model.

The function Optimus introduces works with Big Data as well as with a small sample DataFrame, which is what sets it apart from Pandas and similar tools. Check it out.

Let's create a sample DataFrame:

# Importing Vectors
from pyspark.ml.linalg import Vectors
# Importing optimus
import optimus as op
data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
df = op.spark.createDataFrame(data, ["features"])

Let's instantiate the analyzer:

analyzer = op.DataFrameAnalyzer(df)

And now with Optimus's power:

analyzer.correlation(vec_col="features", method="pearson")

and we will get a plot of the correlation matrix.

And voilà.
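For reference, each entry in such a correlation matrix is a Pearson coefficient, which can be computed by hand. Here is a minimal plain-Python sketch for two of the columns in the sample data above (an illustration of the math, not the Optimus implementation, which computes and plots the full matrix on Spark):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# First and last components of the four feature vectors above
col0 = [1.0, 4.0, 6.0, 9.0]
col3 = [-2.0, 3.0, 8.0, 1.0]
print(round(pearson(col0, col3), 4))  # 0.4005
```

Values near 1 or -1 flag pairs of features that carry essentially the same information and are candidates for removal before modeling.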

Please check this and other examples at:

https://github.com/ironmussa/Optimus-examples

And remember to visit our webpage for more information and docs:

https://hioptimus.com

Contributors:

Project Manager: Argenis León.

Original developers: Andrea Rosales, Hugo Reyes, and Alberto Bonsanto.

Principal developer and maintainer: Favio Vázquez.

License:

Apache 2.0 © Iron.





