
RDD vs DataFrame vs Dataset

Guilherme Lisbôa

Senior Data Engineer | Google Cloud Certified | Python | GCP | Bigquery | Dataflow

Published: April 6, 2022

When working with Spark, there are a few things we need to know. The first is the difference between RDD, DataFrame, and Dataset. The main difference between them is the data structure.

The RDD (Resilient Distributed Dataset) is a collection of data distributed across many machines in the cluster. This kind of structure is very helpful for parallel processing.

The DataFrame is similar to the one we commonly use in pandas: a distributed collection of data organized into named columns, much like a table in a relational database.

The Dataset offers the best of both worlds: it provides the type-safe, object-oriented programming interface of the RDD API together with the performance benefits of the Catalyst query optimizer and the off-heap storage mechanism of the DataFrame.

Use Cases

RDD (Resilient Distributed Dataset): we can use RDDs when we need low-level transformations and actions on the dataset, when the data is unstructured, such as media streams or streams of text, and when we don't need to impose a schema, such as a columnar format, or to access data attributes by name or column.

We use both the DataFrame and Dataset APIs when we need a high level of abstraction and high-level expressions, for example filters, maps, aggregations, sums, SQL queries, and columnar access.
