RDD vs DataFrame vs Dataset
Guilherme Lisbôa
Senior Data Engineer | Google Cloud Certified | Python | GCP | BigQuery | Dataflow
When we talk about Spark, there are a few things we need to know. The first is the difference between RDD, DataFrame, and Dataset. The main difference between them is the data structure each one exposes.
The RDD (Resilient Distributed Dataset) is an immutable collection of data partitioned across many machines in the cluster. That kind of structure is very helpful for parallel processing, because each partition can be processed independently.
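To make the idea of a partitioned collection concrete, here is a minimal sketch in Scala. It assumes a local Spark session with the `spark-sql` dependency on the classpath; the object name, app name, and sample numbers are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]")
      .getOrCreate()

    // Distribute a local collection across 4 partitions.
    val numbers = spark.sparkContext.parallelize(1 to 10, numSlices = 4)

    // Transformations are lazy: nothing runs until an action is called.
    val squares = numbers.map(n => n * n)

    // The action triggers execution; partitions are processed in parallel
    // and the partial sums are combined into one result.
    val total = squares.reduce(_ + _)

    println(s"partitions = ${numbers.getNumPartitions}, sum of squares = $total")

    spark.stop()
  }
}
```

On a real cluster the same code would run unchanged, with the partitions spread over the executors instead of local threads.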
The DataFrame is similar to the one we commonly use in pandas: a collection of data organized into named columns, much like a table in a relational database. Unlike a pandas DataFrame, it is distributed across the cluster.
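As a rough illustration of the named-column, table-like model, the sketch below builds a small DataFrame and queries it both through the DataFrame API and through SQL. The sample rows, column names, and the `people` view name are assumptions chosen for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

object DataFrameSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // enables .toDF on local collections

    // Hypothetical sample data with three named columns.
    val people = Seq(
      ("Ana", "BR", 34),
      ("Luis", "BR", 28),
      ("Mia", "US", 41)
    ).toDF("name", "country", "age")

    // The schema (column names and types) is known to Spark.
    people.printSchema()

    // Columns are accessed by name, as in a database table.
    people.filter(col("age") > 30).select("name", "age").show()
    people.groupBy("country").agg(avg("age").as("avg_age")).show()

    // The same data can be queried with SQL through a temporary view.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT country, COUNT(*) AS n FROM people GROUP BY country").show()

    spark.stop()
  }
}
```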
With the Dataset we get the best of both worlds: it provides the type-safe, object-oriented programming interface of the RDD API together with the performance benefits of the Catalyst query optimizer and the off-heap storage mechanism of a DataFrame. The typed Dataset API is available in Scala and Java; in Python we work with DataFrames only.
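The type-safety point is easier to see in code. The following is a minimal sketch, assuming a hypothetical `Person` case class and made-up sample rows; it is not the only way to build a Dataset.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; defined at top level so Spark can derive an encoder.
case class Person(name: String, country: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // provides encoders and .toDS / .as[T]

    val people: Dataset[Person] = Seq(
      Person("Ana", "BR", 34),
      Person("Luis", "BR", 28),
      Person("Mia", "US", 41)
    ).toDS()

    // Typed lambdas: a typo such as p.agee is rejected by the compiler.
    val over30: Dataset[Person] = people.filter(p => p.age > 30)
    val names: Dataset[String] = over30.map(_.name)
    names.show()

    // A DataFrame is a Dataset of untyped rows; as[Person] restores the typed view.
    val untyped = people.toDF()
    val typedAgain: Dataset[Person] = untyped.as[Person]
    typedAgain.show()

    spark.stop()
  }
}
```

With the untyped DataFrame, a misspelled column name in `select("agee")` would only fail at runtime, when Spark analyzes the query.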
Use Cases
RDD (Resilient Distributed Dataset): we can use RDDs when we need low-level transformations and actions on the dataset, when the data is unstructured, such as media streams or streams of text, and when we don't need to impose a schema, such as a columnar format, or to access data attributes by name or column.
DataFrame and Dataset: we use both the DataFrame and Dataset APIs when we need a high level of abstraction and high-level expressions, for example filters, maps, aggregations, sums, SQL queries, and columnar access on structured or semi-structured data.
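The contrast between the two use cases can be sketched side by side: a word count over raw text lines with the RDD API, and an aggregation over named columns with the DataFrame API. The sample sentences, the `category` and `amount` columns, and the object name are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object UseCaseSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("use-case-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Unstructured text with no schema: low-level RDD transformations.
    val lines = spark.sparkContext.parallelize(Seq(
      "spark makes big data simple",
      "big data needs spark"
    ))
    val wordCounts = lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word
    wordCounts.collect().foreach(println)

    // Structured data with named columns: high-level DataFrame expressions.
    val sales = Seq(
      ("books", 12.5),
      ("books", 7.5),
      ("games", 30.0)
    ).toDF("category", "amount")
    sales.groupBy("category").agg(sum("amount").as("total")).show()

    spark.stop()
  }
}
```

In the RDD half we describe how to compute the result step by step, while in the DataFrame half we describe what we want and let the Catalyst optimizer plan the execution.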