Different ways of creating a DataFrame in PySpark

Using spark.read
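
A minimal sketch, assuming an existing SparkSession named spark and a hypothetical CSV file at /data/orders.csv:

# read a CSV file into a DataFrame; header and inferSchema are optional
df = spark.read \
    .format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/data/orders.csv")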

Using spark.sql
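
spark.sql runs a SQL query and returns the result as a DataFrame. A minimal sketch; the table name orders is hypothetical and must already be registered as a table or view:

df = spark.sql("SELECT order_id, amount FROM orders WHERE amount > 100")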

Using spark.table
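
spark.table loads an existing table or view directly as a DataFrame. Again, the table name orders is hypothetical:

df = spark.table("orders")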

Using spark.range

spark.range gives a single-column DataFrame; the column is named id and is of type long.
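
For example:

# spark.range(start, end, step) gives one column named id of type long
df = spark.range(1, 11, 2)
df.show()   # rows: 1, 3, 5, 7, 9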

Creating a DataFrame from a local list

Two-step process of creating a DataFrame

Use the two-step process when we want to specify the column names explicitly instead of going with the default values (_1, _2, ...).
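
A minimal sketch of the two-step pattern, assuming a hypothetical local list of tuples:

data = [(1, "alice"), (2, "bob")]

# step 1: create the DataFrame; columns get the default names _1, _2
df = spark.createDataFrame(data)

# step 2: rename the columns with toDF
df = df.toDF("id", "name")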

One-step process of creating a DataFrame
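
The same result in a single chained statement; data is the same hypothetical list as above:

df = spark.createDataFrame(data).toDF("id", "name")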

To enforce the schema explicitly

Approach 1 - Fixing only the column names
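
A minimal sketch: passing a list of column names fixes the names, while the datatypes are still inferred from the data:

df = spark.createDataFrame(data, ["id", "name"])
df.printSchema()   # types are inferred (id: long, name: string)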

Approach 2 - Fixing the column names and datatypes
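
A minimal sketch using a StructType schema so that both the column names and the datatypes are fixed up front:

from pyspark.sql.types import StructType, StructField, LongType, StringType

schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])
df = spark.createDataFrame(data, schema)

A DDL-style string such as "id long, name string" can also be passed as the schema for the same effect.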

Creating a DataFrame from an RDD
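
A minimal sketch: parallelize a hypothetical local list into an RDD, then convert it with createDataFrame:

rdd = spark.sparkContext.parallelize([(1, "alice"), (2, "bob")])
df = spark.createDataFrame(rdd, ["id", "name"])

Calling rdd.toDF("id", "name") would achieve the same conversion.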

Credits - Sumit Mittal sir
