Improve your Data Science workflow with Optimus
Favio Vazquez
Lead AI Scientist | LinkedIn Top Voice | AI & ML Evangelist | Drummer
Optimus 1.0.2 is out now!
The new release of Optimus comes with a lot of new functionalities that will make your life much easier.
We implemented a data pipeline for data cleansing and exploration that can be customized as much as you want to fit your project.
New Features
Read dataset from a URL to a Spark DataFrame
With Optimus 1.0.2 you can read a dataset from the web and load it into a Spark DataFrame with a simple function:
# Import optimus
import optimus as op
# Instance of Utilities class
tools = op.Utilities()
# Reading df from web
url = "https://raw.githubusercontent.com/ironmussa/Optimus-examples/master/examples/foo.csv"
df = tools.read_url(path=url)
The new read_url function receives a URL pointing to a CSV or JSON file; Optimus will parse it and create a Spark DataFrame with its contents.
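To see what read_url is doing conceptually, here is a minimal sketch of the fetch-and-parse step using only Python's standard csv module. The inline CSV payload and its column names are illustrative stand-ins for the body read_url would actually download; the real function goes on to build a Spark DataFrame from the parsed records.

```python
import csv
import io

# A tiny CSV payload standing in for the body that read_url would download
# from the given URL (columns here are made up for illustration).
csv_body = "id,firstName,billingId\n1,Luis,123\n2,Andre,456\n"

# Parse the text into a list of dicts, one per row.
rows = list(csv.DictReader(io.StringIO(csv_body)))

print(rows[0]["firstName"])  # Luis
print(len(rows))             # 2
```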
Count items for each id
Let's say you have a dataset of bills, like bills from a restaurant. You would like to count how many items (popcorn, soda, beers, etc.) appear on each bill. Doing this is not that hard in Pandas or dplyr, but what if you have Big Data? Optimus to the rescue:
Let's create a sample DataFrame:
# Importing sql types
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
# Importing optimus
import optimus as op
# Building a simple dataframe:
schema = StructType([
    StructField("bill_id", IntegerType(), True),
    StructField("foods", StringType(), True)])
id_ = [1, 2, 2, 3, 3, 3, 3, 4, 4]
foods = ['Pizza', 'Pizza', 'Beer', 'Hamburger', 'Beer', 'Beer', 'Beer', 'Pizza', 'Beer']
# Dataframe:
df = op.spark.createDataFrame(list(zip(id_, foods)), schema=schema)
df.show()
+-------+---------+
|bill_id| foods|
+-------+---------+
| 1| Pizza|
| 2| Pizza|
| 2| Beer|
| 3|Hamburger|
| 3| Beer|
| 3| Beer|
| 3| Beer|
| 4| Pizza|
| 4| Beer|
+-------+---------+
Let's instantiate the transformer:
transformer = op.DataFrameTransformer(df)
We would like to count how many Beers are in each bill, so with Optimus we just do:
df_count = transformer.explode_table(col_id="bill_id", col_search="foods", new_col_feature="beer_count", search_string="Beer")
df_count.show()
And this will return:
+-------+----------+
|bill_id|beer_count|
+-------+----------+
| 3| 3|
| 4| 1|
| 2| 1|
+-------+----------+
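The logic behind explode_table is filter-then-group-and-count: keep the rows whose search column matches the search string, then count matches per id. A minimal pure-Python sketch of the same idea on the sample data above (the Spark version distributes this work over the cluster):

```python
from collections import Counter

# The same sample data as the DataFrame above.
bill_ids = [1, 2, 2, 3, 3, 3, 3, 4, 4]
foods = ['Pizza', 'Pizza', 'Beer', 'Hamburger', 'Beer', 'Beer', 'Beer', 'Pizza', 'Beer']

# Keep only the rows matching the search string, then count per bill_id.
beer_count = Counter(b for b, f in zip(bill_ids, foods) if f == 'Beer')

print(dict(beer_count))  # {2: 1, 3: 3, 4: 1}
```

Bill 1 has no beers, so it simply does not appear in the result, matching the Optimus output above.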
Plot the correlation matrix from a Spark DataFrame
It is very common in Data Science exercises to create Machine Learning or Deep Learning models with the features we have engineered. But we need to be sure that the variables we are using are not correlated or carrying the same information, or we will introduce bias into our model.
This function that Optimus introduces works with Big Data, not just a sample DataFrame that fits in memory, which is what sets it apart from Pandas or similar tools. Check it out.
Let's create a sample DataFrame:
# Importing Vectors
from pyspark.ml.linalg import Vectors
# Importing optimus
import optimus as op
data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
df = op.spark.createDataFrame(data, ["features"])
Let's instantiate the analyzer:
analyzer = op.DataFrameAnalyzer(df)
And now with Optimus's power:
analyzer.correlation(vec_col="features", method="pearson")
and we will get a plot of the correlation matrix. And voilà.
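The Pearson coefficients behind that matrix can be checked by hand. A minimal sketch (pure Python, no Spark needed) computing the correlation between the first and last components of the four feature vectors above; note the third component is constant (all zeros), so its correlation with anything is undefined:

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# The four feature vectors from the example, written out densely
# (the sparse ones expanded to their full four components).
rows = [[1.0, 0.0, 0.0, -2.0],
        [4.0, 5.0, 0.0, 3.0],
        [6.0, 7.0, 0.0, 8.0],
        [9.0, 0.0, 0.0, 1.0]]

col0 = [r[0] for r in rows]
col3 = [r[3] for r in rows]

print(round(pearson(col0, col3), 2))  # 0.4
```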
Please check this and other examples at:
https://github.com/ironmussa/Optimus-examples
And remember to visit our webpage for more information and docs.
Contributors:
Project Manager: Argenis León.
Original developers: Andrea Rosales, Hugo Reyes, Alberto Bonsanto.
Principal developer and maintainer: Favio Vázquez.
License:
Apache 2.0 © Iron.