Simplifying Apache Spark usage with Optimus

Optimus 1.0.4 is out now!

With this new version we have improved the I/O functionality, fixed some bugs, and made general improvements.

We are in the process of building the best framework for a complete Data Science workflow, so stay with us and get ready for the paradigm shift we are driving.

New Features

Write a Spark Data Frame as a CSV easily

We all love Apache Spark, but it is no secret that the way it works can be a little weird. For example, let's say that you have finished cleaning and preparing your data (with Optimus, of course) and you want to write your Data Frame to disk.

You will have to do something like this:

df.write.option("header", "true").csv("/path/to/file.csv")

Not that hard, but not as simple as Pandas in Python or plain R. In Pandas, for example, this looks like:

df.to_csv('/path/to/file.csv', header=True)

That is simpler and more intuitive. Of course, there are a lot of options you can pass to the Pandas to_csv() function or to Spark's write(), such as the separator, the write mode, or the representation of null values.
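For example, here is how those options look on both sides. The Spark calls below use the standard DataFrameWriter API and the Pandas call uses standard to_csv() keywords; only the paths are made up:

# Spark: set the separator, the write mode and the null representation
df.write.mode("overwrite") \
    .option("sep", ";") \
    .option("nullValue", "NA") \
    .option("header", "true") \
    .csv("/path/to/output")

# Pandas: the same options as keyword arguments in a single call
df.to_csv("/path/to/file.csv", sep=";", na_rep="NA", header=True)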

The problem is that with Spark it is not that intuitive to pass those options. Enter Optimus.

With this new version of Optimus we have mimicked the Pandas functionality for reading and writing Spark Data Frames as CSV, Parquet, etc.

So, let's start. Let's assume you have a file named 'foo.csv' in your current directory (it is the one from our GitHub example: https://github.com/ironmussa/Optimus/blob/master/examples/Optimus_Example.ipynb):

# Import optimus
import optimus as op
# Instance of Utilities class
tools = op.Utilities()
# Read the dataframe; in this case the local file
# system (the hard drive of the PC) is used.
df = tools.read_csv(path="foo.csv", sep=',')

As simple as a read_csv().
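The object that read_csv() gives you back should be a plain Spark Data Frame (that is how this version of Optimus works), so the usual Spark methods apply:

# Inspect what was read using regular Spark Data Frame methods
df.show(5)        # print the first five rows
df.printSchema()  # print the schema Spark inferred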

And now let's try writing a Data Frame to disk. Let's create a sample Data Frame:

# Importing sql types
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
# Importing optimus
import optimus as op

# Building a simple dataframe:
schema = StructType([
        StructField("bill_id", IntegerType(), True),
        StructField("foods", StringType(), True)])

id_ = [1, 2, 2, 3, 3, 3, 3, 4, 4]
foods = ['Pizza', 'Pizza', 'Beer', 'Hamburger', 'Beer', 'Beer', 'Beer', 'Pizza', 'Beer']


# Dataframe:
df = op.spark.createDataFrame(list(zip(id_, foods)), schema=schema)
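If everything went well, calling df.show() should print something like this (the rows come straight from the two lists above):

+-------+---------+
|bill_id|    foods|
+-------+---------+
|      1|    Pizza|
|      2|    Pizza|
|      2|     Beer|
|      3|Hamburger|
|      3|     Beer|
|      3|     Beer|
|      3|     Beer|
|      4|    Pizza|
|      4|     Beer|
+-------+---------+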

And now for Optimus's amazing new feature, to_csv():

# Instantiation of the DataFrameTransformer class:
transformer = op.DataFrameTransformer(df)

# Write DF as CSV
transformer.to_csv("path/to/file.csv")

And that's it. Simple, right?

There is one catch. This will create a folder named "file.csv" in the specified path, and inside it will be the CSV file(s) with the contents. But with the read_csv function you can just pass the name "file.csv" and Optimus will understand.
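So a full round trip, reusing the read_csv() call from before, would look roughly like this (the path is made up):

# Write the Data Frame, then read the resulting folder back;
# Optimus resolves the folder that Spark created
transformer.to_csv("path/to/file.csv")
df_back = tools.read_csv(path="path/to/file.csv", sep=",")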

With to_csv() you can set the behavior of the save operation when data already exists using the "mode" argument:

  • “append”: Append contents of this DataFrame to existing data.
  • “overwrite” (default case): Overwrite existing data.
  • “ignore”: Silently ignore this operation if data already exists.
  • “error”: Throw an exception if data already exists.

And with the "sep" argument you can set a single character as the separator for each field and value. If None is set, it uses the default value (",").
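Putting both arguments together, a call would look roughly like this (we are assuming here that mode and sep are passed as plain keyword arguments, following the descriptions above):

# Append to existing data, using ";" as the field separator
transformer.to_csv("path/to/file.csv", mode="append", sep=";")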

You can also pass all of these options (all of which exist in Spark); a sketch combining several of them follows the list:

  • compression – compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).
  • sep – sets the single character as a separator for each field and value. If None is set, it uses the default value, ",".
  • quote – sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
  • escape – sets the single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, "\".
  • escapeQuotes – a flag indicating whether values containing quotes should always be enclosed in quotes. If None is set, it uses the default value, true, escaping all values containing a quote character.
  • quoteAll – a flag indicating whether all values should always be enclosed in quotes. If None is set, it uses the default value false, only escaping values containing a quote character.
  • header – writes the names of columns as the first line. If None is set, it uses the default value, false.
  • nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.
  • dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. If None is set, it uses the default value, yyyy-MM-dd.
  • timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss.SSSXXX.
  • ignoreLeadingWhiteSpace – a flag indicating whether or not leading whitespaces from values being written should be skipped. If None is set, it uses the default value, true.
  • ignoreTrailingWhiteSpace – a flag indicating whether or not trailing whitespaces from values being written should be skipped. If None is set, it uses the default value, true.
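As a sketch, and assuming to_csv() simply forwards these keyword arguments to Spark's CSV writer, a call combining several of the options above could look like this:

# Hypothetical call combining several of the options listed above
transformer.to_csv("path/to/file.csv",
                   mode="overwrite",
                   sep=";",
                   header=True,
                   nullValue="NA",
                   compression="gzip",
                   dateFormat="yyyy-MM-dd")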

----------------------------------------------------------------------------------------------------------

Please check this and other examples at:

https://github.com/ironmussa/Optimus-examples

And remember to visit our webpage for more information and docs:

https://hioptimus.com

License:

Apache 2.0 © Iron.


