RDD vs DataFrame vs Dataset

When working with Spark, one of the first things we need to understand is the difference between RDD, DataFrame, and Dataset. The main difference between them is the data structure each one exposes.

An RDD (Resilient Distributed Dataset) is a collection of data partitioned across the machines of a cluster. This structure is very helpful for parallel processing, because each machine can work on its own partition at the same time.
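To make the idea concrete, here is a minimal sketch in plain Python, not the Spark API. It assumes a small list of numbers and a hypothetical helper `split_into_partitions`; threads stand in for the machines of a cluster. Each partition is transformed independently and the partial results are combined at the end, which is roughly the shape of a map followed by a reduce on an RDD.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(data, num_partitions):
    """Hypothetical helper: deal the records round-robin into num_partitions lists."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def square_partition(partition):
    """Transform one partition without looking at any other partition."""
    return [x * x for x in partition]


numbers = list(range(1, 9))                  # 1..8
partitions = split_into_partitions(numbers, 3)

# Each worker thread plays the role of one machine in the cluster.
with ThreadPoolExecutor(max_workers=3) as pool:
    squared_partitions = list(pool.map(square_partition, partitions))

# Combine the per-partition results into a single value.
partial_sums = [sum(p) for p in squared_partitions]
total = reduce(lambda a, b: a + b, partial_sums)

print(partitions)  # [[1, 4, 7], [2, 5, 8], [3, 6]]
print(total)       # 204
```

In real Spark the partitioning, scheduling, and recovery from failed machines are handled by the framework; the sketch only shows why independent partitions make parallel work easy.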

A DataFrame is similar to the one we commonly use in pandas: a collection of data organized into named columns, much like a table in a relational database. Unlike a pandas DataFrame, it is distributed across the cluster.
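As a rough illustration of named columns, the sketch below uses plain Python dictionaries rather than Spark or pandas. The rows, the column names, and the values are made up for the example; the point is only that every value is reached through a column name.

```python
# Each row maps a column name to a value, like one record of a table.
people = [
    {"name": "Ana", "age": 34, "city": "Recife"},
    {"name": "Bruno", "age": 27, "city": "Curitiba"},
    {"name": "Carla", "age": 41, "city": "Recife"},
]

# Select a single column by name.
ages = [row["age"] for row in people]

# Filter rows with a condition on one column, then keep another column.
names_in_recife = [row["name"] for row in people if row["city"] == "Recife"]

print(ages)             # [34, 27, 41]
print(names_in_recife)  # ['Ana', 'Carla']
```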

A Dataset gives us the best of both worlds: it provides the type-safe, object-oriented programming interface of the RDD API together with the performance benefits that a DataFrame gets from the Catalyst query optimizer and off-heap storage. The typed Dataset API is available in Scala and Java.

When to Use Each

RDD (Resilient Distributed Dataset): we can use RDDs when we need low-level transformations and actions on the dataset, when the data is unstructured, such as media streams or streams of text, and when we don't need to impose a schema, such as a columnar format, or to access data attributes by name or column.

We use both the DataFrame and Dataset APIs when we need a high level of abstraction and high-level expressions, for example filters, maps, aggregations, sums, SQL queries, and columnar access.
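The contrast between low-level and high-level expression can be sketched in plain Python, again without Spark. The example assumes a made-up `sales` list; the first half spells out how records are combined one by one, while the second half uses the standard `sqlite3` module to state the same aggregation as a SQL query over named columns and leaves the "how" to the engine.

```python
import sqlite3
from functools import reduce

# Made-up records: (category, amount)
sales = [("books", 12.0), ("games", 30.0), ("books", 8.0), ("games", 5.0)]


# Low-level style: we describe step by step how each record updates the result.
def add_sale(totals, record):
    category, amount = record
    totals[category] = totals.get(category, 0.0) + amount
    return totals


low_level_totals = reduce(add_sale, sales, {})

# High-level style: we declare what we want in terms of named columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
rows = conn.execute(
    "SELECT category, SUM(amount) FROM sales GROUP BY category ORDER BY category"
).fetchall()
conn.close()
high_level_totals = dict(rows)

print(low_level_totals)   # {'books': 20.0, 'games': 35.0}
print(high_level_totals)  # {'books': 20.0, 'games': 35.0}
```

Both halves give the same totals. The declarative form is the kind of expression an optimizer can rearrange, which is the advantage the DataFrame and Dataset APIs get from Catalyst.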

