Simplifying Apache Spark usage with Optimus

Optimus 1.0.4 is out now!

With this new version we have improved the I/O functionality, fixed some bugs, and made general improvements.

We are in the process of building the best framework for a complete Data Science workflow, so stay with us and get ready for the paradigm shift we are driving.

New Features

Write a Spark Data Frame as a CSV easily

We all love Apache Spark, but it is no secret that the way it works can be a little weird. For example, let's say that you have finished cleaning and preparing your data (with Optimus, of course) and you want to write your Data Frame to disk.

You will have to do something like this:

df.write.option("header", "true").csv("/path/to/file.csv")

Not that hard, but not as simple as Pandas in Python or plain R. In Pandas, for example, this looks like:

df.to_csv('/path/to/file.csv', header=True)

That is simpler and more intuitive. Of course, there are a lot of options you can pass to the Pandas to_csv() function or to Spark's write(), such as the separator, the write mode, or the representation of null values.
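For example, here is how those options look on both sides. The Spark calls below use the standard DataFrameWriter API and the Pandas call uses standard to_csv() keywords; only the paths are made up:

# Spark: set the separator, the write mode and the null representation
df.write.mode("overwrite") \
    .option("sep", ";") \
    .option("nullValue", "NA") \
    .option("header", "true") \
    .csv("/path/to/output")

# Pandas: the same options as keyword arguments in a single call
df.to_csv("/path/to/file.csv", sep=";", na_rep="NA", header=True)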

The problem is that with Spark it is not that intuitive to pass those options. Enter Optimus.

With this new version of Optimus we have mimicked the Pandas functionality for reading and writing Spark Data Frames as CSV, Parquet, etc.

So, let's start. Let's assume you have a file named 'foo.csv' in your current directory (it is the one from our GitHub example: https://github.com/ironmussa/Optimus/blob/master/examples/Optimus_Example.ipynb):

# Import optimus
import optimus as op
# Instance of Utilities class
tools = op.Utilities()
# Read the dataframe; in this case the local file
# system (the hard drive of the PC) is used.
df = tools.read_csv(path="foo.csv", sep=',')

As simple as a read_csv().
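The object that read_csv() gives you back should be a plain Spark Data Frame (that is how this version of Optimus works), so the usual Spark methods apply:

# Inspect what was read using regular Spark Data Frame methods
df.show(5)        # print the first five rows
df.printSchema()  # print the schema Spark inferred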

And now let's try writing a Data Frame to disk. Let's create a sample Data Frame:

# Importing sql types
from pyspark.sql.types import StringType, IntegerType, StructType, StructField
# Importing optimus
import optimus as op

# Building a simple dataframe:
schema = StructType([
        StructField("bill_id", IntegerType(), True),
        StructField("foods", StringType(), True)])

id_ = [1, 2, 2, 3, 3, 3, 3, 4, 4]
foods = ['Pizza', 'Pizza', 'Beer', 'Hamburger', 'Beer', 'Beer', 'Beer', 'Pizza', 'Beer']


# Dataframe:
df = op.spark.createDataFrame(list(zip(id_, foods)), schema=schema)
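If everything went well, calling df.show() should print something like this (the rows come straight from the two lists above):

+-------+---------+
|bill_id|    foods|
+-------+---------+
|      1|    Pizza|
|      2|    Pizza|
|      2|     Beer|
|      3|Hamburger|
|      3|     Beer|
|      3|     Beer|
|      3|     Beer|
|      4|    Pizza|
|      4|     Beer|
+-------+---------+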

And now for Optimus's amazing new feature, to_csv():

# Instantiation of the DataFrameTransformer class:
transformer = op.DataFrameTransformer(df)

# Write DF as CSV
transformer.to_csv("path/to/file.csv")

And that's it. Simple, right?

There is one catch. This will create a folder named "file.csv" in the specified path, and inside it will be the CSV file(s) with the contents. But with the read_csv function you can just pass the name "file.csv" and Optimus will understand.
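So a full round trip, reusing the read_csv() call from before, would look roughly like this (the path is made up):

# Write the Data Frame, then read the resulting folder back;
# Optimus resolves the folder that Spark created
transformer.to_csv("path/to/file.csv")
df_back = tools.read_csv(path="path/to/file.csv", sep=",")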

With to_csv() you can set the behavior of the save operation when data already exists using the "mode" argument:

  • “append”: Append contents of this DataFrame to existing data.
  • “overwrite” (default case): Overwrite existing data.
  • “ignore”: Silently ignore this operation if data already exists.
  • “error”: Throw an exception if data already exists.

And with the "sep" argument you can set a single character as the separator for each field and value. If None is set, it uses the default value (",").
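Putting both arguments together, a call would look roughly like this (we are assuming here that mode and sep are passed as plain keyword arguments, following the descriptions above):

# Append to existing data, using ";" as the field separator
transformer.to_csv("path/to/file.csv", mode="append", sep=";")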

You can also pass all of these options (all of which exist in Spark); a sketch combining several of them follows the list:

  • compression – compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).
  • sep – sets the single character as a separator for each field and value. If None is set, it uses the default value, ",".
  • quote – sets the single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
  • escape – sets the single character used for escaping quotes inside an already quoted value. If None is set, it uses the default value, "\".
  • escapeQuotes – a flag indicating whether values containing quotes should always be enclosed in quotes. If None is set, it uses the default value, true, escaping all values containing a quote character.
  • quoteAll – a flag indicating whether all values should always be enclosed in quotes. If None is set, it uses the default value false, only escaping values containing a quote character.
  • header – writes the names of columns as the first line. If None is set, it uses the default value, false.
  • nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.
  • dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. If None is set, it uses the default value, yyyy-MM-dd.
  • timestampFormat – sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type. If None is set, it uses the default value, yyyy-MM-dd'T'HH:mm:ss.SSSXXX.
  • ignoreLeadingWhiteSpace – a flag indicating whether or not leading whitespaces from values being written should be skipped. If None is set, it uses the default value, true.
  • ignoreTrailingWhiteSpace – a flag indicating whether or not trailing whitespaces from values being written should be skipped. If None is set, it uses the default value, true.
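As a sketch, and assuming to_csv() simply forwards these keyword arguments to Spark's CSV writer, a call combining several of the options above could look like this:

# Hypothetical call combining several of the options listed above
transformer.to_csv("path/to/file.csv",
                   mode="overwrite",
                   sep=";",
                   header=True,
                   nullValue="NA",
                   compression="gzip",
                   dateFormat="yyyy-MM-dd")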

----------------------------------------------------------------------------------------------------------

Please check this and other examples at:

https://github.com/ironmussa/Optimus-examples

And remember to visit our webpage for more information and docs:

https://hioptimus.com

License:

Apache 2.0 © Iron.


