Towards Easy and Fast Data Science Workflows with Optimus

Favio Vazquez

Lead AI Scientist | LinkedIn Top Voice | AI & ML Evangelist | Drummer

Published: October 2, 2017

Optimus 1.0.3 is out now!

This new version improves testing, fixes several bugs, and includes general improvements.

We are building the best framework for a complete Data Science workflow, so hang with us and wait for the paradigm shift we're driving.

New Features

Plotting Spark DataFrames easily.

With this new version of Optimus you can create histograms directly from Spark DataFrames, even if you have a really big dataset. Let's assume you have a file named 'foo.csv' in your current directory (it is the one from our GitHub example: https://github.com/ironmussa/Optimus/blob/master/examples/Optimus_Example.ipynb):

# Import optimus
import optimus as op
# Instance of Utilities class
tools = op.Utilities()
# Read the DataFrame; in this case the local file
# system (hard drive of the PC) is used.
df = tools.read_csv(path="foo.csv", delimiter_mark=',')

Let's instantiate the analyzer:

analyzer = op.DataFrameAnalyzer(df)

And now let's plot a histogram for a column using the plot_hist function:

analyzer.plot_hist("price","numerical")

And we will get a histogram of the "price" column.

Really easy, right?
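To make the idea concrete, here is a minimal plain-Python sketch of the kind of summary a histogram needs: a handful of bin edges and one count per bin. This is not Optimus's implementation; the function name histogram_counts, the equal-width binning, and the sample prices are assumptions made purely for illustration. The point is that only the per-bin counts are required for plotting, however many rows the dataset has.

```python
def histogram_counts(values, num_bins=10):
    """Return (bin_edges, counts) for equal-width bins over `values`.

    Hypothetical helper for illustration; not part of Optimus.
    """
    values = list(values)
    if not values:
        return [], []
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: a single bin holds everything.
        return [lo, hi], [len(values)]
    width = (hi - lo) / num_bins
    edges = [lo + i * width for i in range(num_bins + 1)]
    counts = [0] * num_bins
    for v in values:
        # The maximum value falls on the last edge; keep it in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return edges, counts


# Made-up sample values standing in for a "price" column.
prices = [1, 2, 2, 3, 5, 8, 8, 9, 10]
edges, counts = histogram_counts(prices, num_bins=3)
print(edges)
print(counts)
```

The resulting counts list has one entry per bin, which is all a bar chart needs.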

Get the frequencies for values inside the specified columns.

With the new get_frequency() function you can get the frequencies for values inside the specified columns. This method outputs a Spark DataFrame with the count of each existing value in each column.

To use it, first let's create a sample DataFrame:

import random
import optimus as op
from pyspark.sql.types import StringType, StructType, IntegerType, FloatType, DoubleType, StructField

schema = StructType(
        [
        StructField("strings", StringType(), True),
        StructField("integers", IntegerType(), True),
        StructField("integers2", IntegerType(), True),
        StructField("floats",  FloatType(), True),
        StructField("double",  DoubleType(), True)
        ]
)

size = 200
# Generating strings column:
foods = ['    pizza!       ', 'pizza', 'PIZZA;', 'pizza', 'pízza?', 'Pizza', 'Piz;za']
foods = [foods[random.randint(0,6)] for count in range(size)]
# Generating first integer column:
num_col_1 = [random.randint(0,9) for number in range(size)]
# Generating second integer column:
num_col_2 = [random.randint(0,9) for number in range(size)]
# Generating float column:
num_col_3 = [random.random() for number in range(size)]
# Generating double column:
num_col_4 = [random.random() for number in range(size)]

# Building DataFrame
df = op.spark.createDataFrame(list(zip(foods, num_col_1, num_col_2, num_col_3, num_col_4)),schema=schema)

Now let's instantiate the analyzer:

analyzer = op.DataFrameAnalyzer(df)

And finally let's find the frequencies for the columns "strings" and "integers" in the DataFrame:

# Get frequency DataFrame
df_counts = analyzer.get_frequency(["strings", "integers"], True)

You will get:

+-----------------+-----+
|          strings|count|
+-----------------+-----+
|            pizza|   48|
|           Piz;za|   38|
|            Pizza|   37|
|           pízza?|   29|
|    pizza!       |   25|
|           PIZZA;|   23|
+-----------------+-----+

+--------+-----+
|integers|count|
+--------+-----+
|       8|   31|
|       5|   24|
|       1|   24|
|       9|   20|
|       6|   20|
|       2|   19|
|       3|   19|
|       0|   17|
|       4|   14|
|       7|   12|
+--------+-----+

Pretty simple, right?
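For readers who want to see what such a frequency table contains without starting Spark, here is a small plain-Python sketch using collections.Counter. It is not how Optimus computes the result; the helper name frequency_table and the sample rows are hypothetical. It produces one (value, count) list per column, ordered by descending count like the tables shown above.

```python
from collections import Counter


def frequency_table(rows, column_index):
    """Count how often each value appears in one column of `rows`.

    Hypothetical helper for illustration; not part of Optimus.
    Returns (value, count) pairs sorted by descending count.
    """
    counts = Counter(row[column_index] for row in rows)
    return counts.most_common()


# Made-up rows with a "strings" column and an "integers" column.
rows = [
    ("pizza", 1),
    ("Pizza", 2),
    ("pizza", 1),
    ("PIZZA;", 2),
    ("pizza", 3),
]

strings_freq = frequency_table(rows, 0)
integers_freq = frequency_table(rows, 1)
print(strings_freq)
print(integers_freq)
```

Note that differently written variants such as "pizza" and "Pizza" are counted as separate values, exactly as in the "strings" table above.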

----------------------------------------------------------------------------------------------------------

Please check this and other examples at:

https://github.com/ironmussa/Optimus-examples

And remember to visit our webpage for more information and docs:

https://hioptimus.com

Contributors:

Project Manager: Argenis León.
Original developers: Andrea Rosales, Hugo Reyes, Alberto Bonsanto.
Principal developer and maintainer: Favio Vázquez.

License:

Apache 2.0 © Iron.
