Towards Easy and Fast Data Science Workflows with Optimus
Favio Vazquez
Lead AI Scientist | LinkedIn Top Voice | AI & ML Evangelist | Drummer
Optimus 1.0.3 is out now!
This new version brings improved testing, several bug fixes and a number of general improvements.
We are building what we hope will be the best framework for a complete Data Science workflow, so hang in there with us and watch for the paradigm shift we are driving.
New Features
Plotting Spark DataFrames easily.
With this new version of Optimus you can create histograms directly from Spark DataFrames, even if you have a really big dataset. Let's assume you have a file named 'foo.csv' in your current directory (it is the one from our GitHub example: https://github.com/ironmussa/Optimus/blob/master/examples/Optimus_Example.ipynb):
# Import optimus
import optimus as op
# Instance of Utilities class
tools = op.Utilities()
# Read the DataFrame; in this case the local file
# system (the hard drive of the PC) is used.
df = tools.read_csv(path="foo.csv", delimiter_mark=',')
Let's instantiate the analyzer:
analyzer = op.DataFrameAnalyzer(df)
And now let's plot a histogram for a column using the plot_hist function:
analyzer.plot_hist("price","numerical")
And we will get a histogram of the "price" column.
Really easy, right?
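If you are curious why a histogram stays cheap even for big data, here is a minimal sketch of the idea in plain Python: only one count per bin is needed for the plot, not the raw rows. This is not how Optimus implements plot_hist; the function name histogram_counts, the num_bins parameter and the sample prices are all made up for illustration.

```python
from collections import Counter


def histogram_counts(values, num_bins=10):
    """Count how many values fall into each of num_bins equal-width bins."""
    values = list(values)
    if not values:
        return []
    low, high = min(values), max(values)
    if low == high:
        # All values are identical: a single bin holds everything.
        return [((low, high), len(values))]
    width = (high - low) / num_bins
    counts = Counter()
    for v in values:
        # The maximum value is clamped into the last bin.
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return [
        ((low + i * width, low + (i + 1) * width), counts.get(i, 0))
        for i in range(num_bins)
    ]


# Hypothetical "price" values, just to exercise the function.
prices = [1, 2, 2, 3, 5, 8, 8, 9, 10, 10]
bins = histogram_counts(prices, num_bins=3)
for (start, end), count in bins:
    # One text bar per bin; a plotting library would draw these as rectangles.
    print(f"{start:5.1f} to {end:5.1f}: {'#' * count}")
```

Whatever the size of the input, the result is just num_bins pairs of (range, count), which is all a plotting library needs to draw the bars.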
Get the frequencies for values inside the specified columns.
With the new get_frequency() function you can get the frequencies for values inside the specified columns. For each column, this method outputs a Spark DataFrame with the count of each existing value.
To use it, first let's create a sample DataFrame:
import random
import optimus as op
from pyspark.sql.types import StringType, StructType, IntegerType, FloatType, DoubleType, StructField
schema = StructType(
    [
        StructField("strings", StringType(), True),
        StructField("integers", IntegerType(), True),
        StructField("integers2", IntegerType(), True),
        StructField("floats", FloatType(), True),
        StructField("double", DoubleType(), True)
    ]
)
size = 200
# Generating strings column:
foods = [' pizza! ', 'pizza', 'PIZZA;', 'pizza', 'pízza?', 'Pizza', 'Piz;za']
foods = [foods[random.randint(0,6)] for count in range(size)]
# Generating integer column:
num_col_1 = [random.randint(0,9) for number in range(size)]
# Generating integer column:
num_col_2 = [random.randint(0,9) for number in range(size)]
# Generating float column:
num_col_3 = [random.random() for number in range(size)]
# Generating double column:
num_col_4 = [random.random() for number in range(size)]
# Building DataFrame
df = op.spark.createDataFrame(list(zip(foods, num_col_1, num_col_2, num_col_3, num_col_4)),schema=schema)
Now let's instantiate the analyzer:
analyzer = op.DataFrameAnalyzer(df)
And finally let's find the frequencies for the columns "strings" and "integers" in the DataFrame:
# Get frequency DataFrame
df_counts = analyzer.get_frequency(["strings", "integers"], True)
You will get something like this (the exact counts will vary because the sample data is random):
+-----------------+-----+
|          strings|count|
+-----------------+-----+
|            pizza|   48|
+-----------------+-----+
|           Piz;za|   38|
+-----------------+-----+
|            Pizza|   37|
+-----------------+-----+
|           pízza?|   29|
+-----------------+-----+
|          pizza! |   25|
+-----------------+-----+
|           PIZZA;|   23|
+-----------------+-----+

+--------+-----+
|integers|count|
+--------+-----+
|       8|   31|
+--------+-----+
|       5|   24|
+--------+-----+
|       1|   24|
+--------+-----+
|       9|   20|
+--------+-----+
|       6|   20|
+--------+-----+
|       2|   19|
+--------+-----+
|       3|   19|
+--------+-----+
|       0|   17|
+--------+-----+
|       4|   14|
+--------+-----+
|       7|   12|
+--------+-----+
Pretty simple, right?
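To make the shape of that result concrete, here is a small plain-Python sketch that builds the same kind of per-column frequency tables with collections.Counter. It only illustrates the idea and is not Optimus's implementation: the function name frequency_tables, the sort_by_count parameter (my guess at what the boolean argument in the call above controls) and the sample rows are assumptions.

```python
from collections import Counter


def frequency_tables(rows, columns, sort_by_count=True):
    """Return {column: [(value, count), ...]} for the requested columns."""
    tables = {}
    for column in columns:
        # Count how often each distinct value appears in this column.
        counts = Counter(row[column] for row in rows)
        items = list(counts.items())
        if sort_by_count:
            # Most frequent values first; ties keep first-seen order.
            items.sort(key=lambda pair: pair[1], reverse=True)
        tables[column] = items
    return tables


# Hypothetical rows standing in for the Spark DataFrame.
rows = [
    {"strings": "pizza", "integers": 8},
    {"strings": "Pizza", "integers": 5},
    {"strings": "pizza", "integers": 8},
    {"strings": "PIZZA;", "integers": 8},
]
tables = frequency_tables(rows, ["strings", "integers"])
for column, items in tables.items():
    print(column, items)
```

Note that "pizza", "Pizza" and "PIZZA;" are counted as different values, exactly as in the output above, because nothing here normalizes the strings first.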
----------------------------------------------------------------------------------------------------------
Please check this and other examples at:
https://github.com/ironmussa/Optimus-examples
And remember to visit our webpage for more information and docs.
Contributors:
- Project Manager: Argenis León.
- Original developers: Andrea Rosales, Hugo Reyes, Alberto Bonsanto.
- Principal developer and maintainer: Favio Vázquez.
License:
Apache 2.0 © Iron.