This article is part of the "DP-600 Lab Summary Series" that summarises the key lessons of each of Microsoft's Lab exercises as well as explaining the code (where relevant).
The aim is to use these as study notes and a refresher for the DP-600 exam, especially for those who are beginners or newbies to Microsoft Fabric (just like me).
I suggest you first complete the Lab exercise and then come back to this article for the study notes.
Introduction to Apache Spark:
- Apache Spark is an open-source engine for distributed data processing.
- It is designed for processing large-scale data sets across clusters of computers.
- Spark provides APIs in multiple languages including Java, Scala, Python, and SQL.
- It is used for various data processing tasks including ETL (Extract, Transform, Load), machine learning, and data analytics.
Creating a Workspace:
- Fabric provides a workspace where you can perform data engineering tasks.
- Workspaces can be created with different licensing modes including Trial, Premium, or Fabric.
- The workspace provides an environment for creating and managing data lakehouses.
Creating a Lakehouse and Uploading Files:
- A lakehouse is a storage architecture that combines the best features of data lakes and data warehouses.
- You can create a lakehouse in Fabric to store your data files.
- Files can be uploaded to the lakehouse from your local machine or other sources.
Creating a Notebook:
- Notebooks provide an interactive environment for writing and running code.
- You can use notebooks to explore and analyze data using Apache Spark.
- Notebooks support multiple languages such as Python, Scala, and SQL.
Loading Data into a DataFrame:
- DataFrames are a distributed collection of data organized into named columns.
- You can load data into a DataFrame from various sources including CSV files, databases, and streaming data sources.
- DataFrames support both schema inference and explicit schema definition (see the sketch below).
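As a quick illustration of schema inference (a minimal sketch, assuming the same "Files/orders/2019.csv" path used later in the lab), you can ask Spark to sample the file and guess the column types instead of defining them yourself:
df_inferred = (spark.read.format("csv")
    .option("header", "true")       # treat the first row as column names
    .option("inferSchema", "true")  # sample the data and guess each column's type
    .load("Files/orders/2019.csv"))
df_inferred.printSchema()           # check what Spark inferred before relying on it
Inference is convenient for exploration, but the lab uses an explicit schema because inferred types can sometimes be wrong (for example, numbers read as strings).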
Exploring Data in a DataFrame:
- DataFrames provide functions for filtering, grouping, and aggregating data.
- You can use these functions to explore the structure and content of your data.
- DataFrames offer a powerful API for data manipulation and analysis (a few common exploration calls are sketched below).
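A minimal sketch of a few common exploration calls (using the df loaded in the lab; the filter condition is purely illustrative):
df.printSchema()                              # column names and data types
print(df.count())                             # total number of rows
df.describe("Quantity", "UnitPrice").show()   # basic statistics for numeric columns
df.filter(df["Quantity"] > 1).show(5)         # preview rows matching a condition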
Transforming and Saving Data:
- Data transformation involves modifying the structure or content of data.
- Spark provides functions and methods for transforming data in DataFrames.
- Transformation tasks may include adding or removing columns, filtering rows, and aggregating data.
- Transformed data can be saved in various formats including Parquet, ORC, and JSON.
- Partitioning data can improve performance when querying large datasets.
- Spark provides options for saving data in partitioned and non-partitioned formats.
Working with Tables and SQL:
- Spark supports SQL queries on DataFrames and tables.
- You can create temporary or permanent tables in a Spark session.
- SQL queries can be used to analyze and transform data in Spark (see the sketch below).
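A minimal sketch of querying a DataFrame with SQL via a temporary view (the view name and query are illustrative; the lab itself uses saveAsTable and the %%sql magic shown later):
df.createOrReplaceTempView("salesorders_tmp")   # session-scoped view over the DataFrame
top_items = spark.sql("""
    SELECT Item, SUM(Quantity) AS TotalQuantity
    FROM salesorders_tmp
    GROUP BY Item
    ORDER BY TotalQuantity DESC
""")
top_items.show(5)
A temporary view disappears when the Spark session ends, whereas saveAsTable persists the data as a table in the lakehouse.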
Visualizing Data with Spark:
- Spark provides basic charting capabilities for visualizing data in notebooks.
- Additional libraries like matplotlib and seaborn can be used for more advanced charting.
- Charts can help in understanding patterns and trends in the data.
Explanation of the code used throughout the Lab
df = spark.read.format("csv").option("header","true").load("Files/orders/2019.csv")
# df now is a Spark DataFrame containing CSV data from "Files/orders/2019.csv".
display(df)
This code reads a CSV file named "2019.csv" located in the "Files/orders" directory using Apache Spark's DataFrame API. Here's a breakdown of each part of the code:
- spark.read.format("csv"): This specifies that you want to read a CSV file using the Spark DataFrame API.
- .option("header","true"): This sets an option to indicate that the first row of the CSV file contains the header, which will be used as column names in the DataFrame.
- .load("Files/orders/2019.csv"): This loads the CSV file located at "Files/orders/2019.csv" into a DataFrame. The load method is used to load data from a specific file path.
- df: This assigns the loaded DataFrame to a variable named df, which you can use to perform operations on the CSV data.
- display(df): This is a helper function available in Microsoft Fabric (and Databricks) notebooks that renders the contents of the DataFrame df as an interactive table. It's not a standard Spark method; in a plain Spark environment, you would typically use df.show() to display the DataFrame contents.
df = spark.read.format("csv").option("header","false").load("Files/orders/2019.csv")
# df now is a Spark DataFrame containing CSV data from "Files/orders/2019.csv".
display(df)
This code is similar to the previous code, but with a slight difference in the options used:
- spark.read.format("csv"): Specifies that you want to read a CSV file using the Spark DataFrame API.
- .option("header","false"): Sets an option to indicate that the CSV file does not contain a header row. Spark will use default column names (like c0, c1, etc.) for the columns.
- .load("Files/orders/2019.csv"): Loads the CSV file located at "Files/orders/2019.csv" into a DataFrame.
- df: Assigns the loaded DataFrame to a variable named df.
- display(df): Displays the contents of the DataFrame df using the notebook's display function.
Since you've set "header" to "false", Spark will treat the first row of the CSV file as data rather than column names. This is useful when your CSV file doesn't have a header row.
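If you want meaningful names without a header row, one option (a minimal sketch, assuming the file has the nine columns described by the schema defined in the next step) is to rename the default columns with toDF:
df_no_header = spark.read.format("csv").option("header","false").load("Files/orders/2019.csv")
df_named = df_no_header.toDF("SalesOrderNumber", "SalesOrderLineNumber", "OrderDate",
                             "CustomerName", "Email", "Item", "Quantity", "UnitPrice", "Tax")
display(df_named)
Note that toDF only renames the columns; every column is still read as a string, which is one reason the lab moves on to defining an explicit schema.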
from pyspark.sql.types import *
orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])
df = spark.read.format("csv").schema(orderSchema).load("Files/orders/2019.csv")
display(df)
This code defines a custom schema for the DataFrame and then uses that schema to read a CSV file into a DataFrame. Here's what each part does:
- from pyspark.sql.types import *: This imports all the data types and structures needed to define the schema for the DataFrame.
- orderSchema = StructType([...]): This defines a custom schema for the DataFrame. Each StructField specifies a column name along with its data type. For example, "SalesOrderNumber" is a string, "SalesOrderLineNumber" is an integer, and so on.
- df = spark.read.format("csv").schema(orderSchema).load("Files/orders/2019.csv"): This reads the CSV file located at "Files/orders/2019.csv" into a DataFrame. The schema(orderSchema) part specifies that the DataFrame should use the custom schema defined earlier.
- display(df): This displays the contents of the DataFrame df using the notebook's display function.
Using a custom schema can be beneficial when you want to enforce a specific data structure or when you want to avoid inferring the schema from the data, which can sometimes lead to incorrect schema inference.
from pyspark.sql.types import *
orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])
df = spark.read.format("csv").schema(orderSchema).load("Files/orders/*.csv")
display(df)
This code is similar to the previous code, but with a small change in the load method to read multiple CSV files. Here's what each part does:
- from pyspark.sql.types import *: This imports all the data types and structures needed to define the schema for the DataFrame.
- orderSchema = StructType([...]): This defines a custom schema for the DataFrame. Each StructField specifies a column name along with its data type.
- df = spark.read.format("csv").schema(orderSchema).load("Files/orders/*.csv"): This reads all CSV files in the "Files/orders" directory into a single DataFrame. The *.csv syntax is used to read all files with a CSV extension in the specified directory.
- display(df): This displays the contents of the DataFrame df using the notebook's display function.
Reading multiple CSV files like this can be useful when you have data split across multiple files and you want to combine them into a single DataFrame for analysis.
customers = df['CustomerName', 'Email']
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())
This code operates on the DataFrame df to extract the "CustomerName" and "Email" columns, create a new DataFrame called customers, and then perform some operations on it. Here's what each part does:
- customers = df['CustomerName', 'Email']: This creates a new DataFrame customers containing only the "CustomerName" and "Email" columns from the original DataFrame df.
- print(customers.count()): This prints the number of rows in the customers DataFrame, which effectively gives the total number of customer records.
- print(customers.distinct().count()): This prints the number of distinct rows in the customers DataFrame. The distinct() function removes duplicate rows, so this will give the count of unique customer records based on both "CustomerName" and "Email".
- display(customers.distinct()): This displays the distinct rows in the customers DataFrame using the notebook's display function. This will show a table containing unique customer records based on both "CustomerName" and "Email".
customers = df.select("CustomerName", "Email").where(df['Item']=='Road-250 Red, 52')
print(customers.count())
print(customers.distinct().count())
display(customers.distinct())
This code filters the DataFrame df to select only rows where the "Item" column is equal to 'Road-250 Red, 52', and then performs operations on the resulting DataFrame customers. Here's a breakdown of each part:
- customers = df.select("CustomerName", "Email").where(df['Item']=='Road-250 Red, 52'): This creates a new DataFrame customers by selecting only the "CustomerName" and "Email" columns from the original DataFrame df, but only for rows where the "Item" column is equal to 'Road-250 Red, 52'.
- print(customers.count()): This prints the number of rows in the customers DataFrame, which effectively gives the total number of customers who purchased the item 'Road-250 Red, 52'.
- print(customers.distinct().count()): This prints the number of distinct rows in the customers DataFrame. Since this DataFrame was already filtered based on the item, this count represents the number of unique customers who purchased the item.
- display(customers.distinct()): This displays the distinct rows in the customers DataFrame using the notebook's display function. This will show a table containing unique customer records for those who purchased the item 'Road-250 Red, 52', including their "CustomerName" and "Email".
productSales = df.select("Item", "Quantity").groupBy("Item").sum()
display(productSales)
This code calculates the total quantity sold for each item in the DataFrame df and then displays the result. Here's a breakdown of each part:
- productSales = df.select("Item", "Quantity").groupBy("Item").sum(): This creates a new DataFrame productSales by selecting the "Item" and "Quantity" columns from the original DataFrame df, grouping the rows by "Item", and then calculating the sum of the "Quantity" for each group (i.e., for each item).
- display(productSales): This displays the productSales DataFrame, showing the total quantity sold for each item. The display function is provided by Fabric (and Databricks) notebooks for visualizing DataFrames; in a standard Spark environment, you would typically use productSales.show() to display the DataFrame.
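As a side note, the aggregated column produced by sum() is named "sum(Quantity)" by default. A minimal alternative sketch using agg() and alias() gives it a friendlier name (the name "TotalQuantity" is just an illustration):
from pyspark.sql.functions import sum as sum_   # avoid shadowing Python's built-in sum
productSales = (df.select("Item", "Quantity")
    .groupBy("Item")
    .agg(sum_("Quantity").alias("TotalQuantity")))
display(productSales)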
from pyspark.sql.functions import *
yearlySales = df.select(year(col("OrderDate")).alias("Year")).groupBy("Year").count().orderBy("Year")
display(yearlySales)
This code calculates the total number of sales made in each year based on the "OrderDate" column in the DataFrame df and then displays the result. Here's a breakdown of each part:
- from pyspark.sql.functions import *: This imports all functions from the pyspark.sql.functions module. These functions are used for various DataFrame operations.
- yearlySales = df.select(year(col("OrderDate")).alias("Year")).groupBy("Year").count().orderBy("Year"): This creates a new DataFrame yearlySales by first selecting the year component of the "OrderDate" column using the year function, and then renaming this new column as "Year" using the alias function. Next, it groups the rows by the "Year" column, calculates the count of rows in each group using the count function, and finally orders the result by the "Year" column using the orderBy function.
- display(yearlySales): This displays the yearlySales DataFrame, showing the total number of sales made in each year. The display function is provided by Fabric (and Databricks) notebooks for visualizing DataFrames; in a standard Spark environment, you would typically use yearlySales.show() to display the DataFrame.
from pyspark.sql.functions import *
# Create Year and Month columns
transformed_df = df.withColumn("Year", year(col("OrderDate"))).withColumn("Month", month(col("OrderDate")))
# Create the new FirstName and LastName fields
transformed_df = transformed_df.withColumn("FirstName", split(col("CustomerName"), " ").getItem(0)).withColumn("LastName", split(col("CustomerName"), " ").getItem(1))
# Filter and reorder columns
transformed_df = transformed_df["SalesOrderNumber", "SalesOrderLineNumber", "OrderDate", "Year", "Month", "FirstName", "LastName", "Email", "Item", "Quantity", "UnitPrice", "Tax"]
# Display the first five orders
display(transformed_df.limit(5))
This code transforms the DataFrame df by adding new columns for "Year" and "Month" extracted from the "OrderDate" column, as well as splitting the "CustomerName" column into "FirstName" and "LastName". It then filters and reorders the columns, and finally displays the first five orders. Here's a breakdown of each part:
- from pyspark.sql.functions import *: This imports all functions from the pyspark.sql.functions module, which are used for various DataFrame operations.
- transformed_df = df.withColumn("Year", year(col("OrderDate"))).withColumn("Month", month(col("OrderDate"))): This creates a new DataFrame transformed_df by adding two new columns, "Year" and "Month", to the original DataFrame df. The withColumn function is used to add these columns, with the year and month functions extracting the year and month components from the "OrderDate" column, respectively.
- transformed_df = transformed_df.withColumn("FirstName", split(col("CustomerName"), " ").getItem(0)).withColumn("LastName", split(col("CustomerName"), " ").getItem(1)): This further transforms the transformed_df DataFrame by splitting the "CustomerName" column into "FirstName" and "LastName". The split function is used to split the "CustomerName" column by spaces, and then the getItem(0) and getItem(1) functions are used to extract the first and second parts, respectively, which correspond to the first name and last name.
- transformed_df = transformed_df["SalesOrderNumber", "SalesOrderLineNumber", "OrderDate", "Year", "Month", "FirstName", "LastName", "Email", "Item", "Quantity", "UnitPrice", "Tax"]: This filters and reorders the columns in the transformed_df DataFrame, selecting only the specified columns and arranging them in the specified order.
- display(transformed_df.limit(5)): This displays the first five rows of the transformed_df DataFrame using the notebook's display function. In a standard Spark environment, you would typically use transformed_df.show(5) to display the first five rows.
transformed_df.write.mode("overwrite").parquet('Files/transformed_data/orders')
print ("Transformed data saved!")
This code writes the transformed DataFrame transformed_df to disk in the Parquet file format. Parquet is a columnar storage format that is efficient for both storage and processing. Here's a breakdown of the code:
- transformed_df.write.mode("overwrite").parquet('Files/transformed_data/orders'): This writes the DataFrame transformed_df to the specified file path 'Files/transformed_data/orders' in Parquet format. The mode("overwrite") parameter specifies that if the file already exists, it should be overwritten.
- print("Transformed data saved!"): This prints a message indicating that the transformed data has been saved to disk.
The Parquet file format is commonly used in big data processing with Spark due to its efficiency and compatibility with the Spark ecosystem.
orders_df = spark.read.format("parquet").load("Files/transformed_data/orders")
display(orders_df)
This code reads the Parquet file containing the transformed data back into a DataFrame orders_df and then displays the contents of the DataFrame. Here's a breakdown of each part:
- orders_df = spark.read.format("parquet").load("Files/transformed_data/orders"): This reads the Parquet file located at "Files/transformed_data/orders" into a DataFrame orders_df using the read.format("parquet") method. Parquet files can be efficiently read back into DataFrames, which makes them suitable for storing and processing large datasets.
- display(orders_df): This displays the contents of the DataFrame orders_df using the notebook's display function; in a standard Spark environment, you would typically use orders_df.show() to display the DataFrame.
orders_df.write.partitionBy("Year","Month").mode("overwrite").parquet("Files/partitioned_data")
print ("Transformed data saved!")
This code writes the DataFrame orders_df to disk in the Parquet file format, partitioned by the "Year" and "Month" columns. Partitioning data can improve query performance, especially when filtering by the partitioned columns. Here's a breakdown of the code:
- orders_df.write.partitionBy("Year","Month").mode("overwrite").parquet("Files/partitioned_data"): This writes the DataFrame orders_df to the specified file path "Files/partitioned_data" in Parquet format, partitioned by the "Year" and "Month" columns. The partitionBy("Year","Month") method specifies the columns to use for partitioning.
- print("Transformed data saved!"): This prints a message indicating that the transformed data has been saved to disk.
After running this code, the data will be saved in a partitioned Parquet format, which can be beneficial for performance optimizations in Spark SQL queries.
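As an alternative to building the partition path by hand (shown next), a minimal sketch is to load the whole partitioned folder and filter on the partition columns; Spark discovers "Year" and "Month" as columns and can prune the folders it scans:
from pyspark.sql.functions import col
orders_2021_df = (spark.read.format("parquet")
    .load("Files/partitioned_data")       # read the root of the partitioned dataset
    .filter(col("Year") == 2021))         # partition pruning limits the folders scanned
display(orders_2021_df)
With this approach the Year and Month columns remain in the result, whereas reading a specific partition path (as below) drops the columns encoded in that path.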
orders_2021_df = spark.read.format("parquet").load("Files/partitioned_data/Year=2021/Month=*")
display(orders_2021_df)
This code reads the data partitioned for the year 2021 from the Parquet files into a DataFrame orders_2021_df and then displays the contents of the DataFrame. Here's a breakdown of each part:
- orders_2021_df = spark.read.format("parquet").load("Files/partitioned_data/Year=2021/Month=*"): This reads the Parquet files located in the directory "Files/partitioned_data/Year=2021/" into a DataFrame orders_2021_df. The * is a wildcard that matches all values for the "Month" partition column.
- display(orders_2021_df): This displays the contents of the DataFrame orders_2021_df using the notebook's display function; in a standard Spark environment, you would typically use orders_2021_df.show() to display the DataFrame.
# Create a new table
df.write.format("delta").saveAsTable("salesorders")
# Get the table description
spark.sql("DESCRIBE EXTENDED salesorders").show(truncate=False)
This code creates a new Delta table named "salesorders" from the DataFrame df and then retrieves and displays the extended description of the table. Here's a breakdown of each part:
- df.write.format("delta").saveAsTable("salesorders"): This creates a new Delta table named "salesorders" from the DataFrame df using the write.format("delta").saveAsTable("salesorders") method chain. Delta is an optimized version of Parquet that adds features like ACID transactions, schema enforcement, and time travel capabilities.
- spark.sql("DESCRIBE EXTENDED salesorders").show(truncate=False): This retrieves the extended description of the "salesorders" table using a SQL query. The DESCRIBE EXTENDED statement provides detailed information about the table, including its schema, storage format, and other properties. The show(truncate=False) method is used to display the full content of the description without truncation.
df = spark.sql("SELECT * FROM [your_lakehouse].salesorders LIMIT 1000")
display(df)
This code reads the first 1000 rows from the "salesorders" Delta table in your lakehouse ("[your_lakehouse]" is a placeholder for your actual lakehouse name) into a DataFrame df and then displays the contents of the DataFrame. Here's a breakdown of each part:
- df = spark.sql("SELECT FROM [your_lakehouse].salesorders LIMIT 1000"): This uses a SQL query to select all columns (SELECT ) from the "salesorders" Delta table located in the "your_lakehouse" lakehouse and limits the result to the first 1000 rows (LIMIT 1000). The result is stored in the DataFrame df.
- display(df): This displays the contents of the DataFrame df using the notebook's display function; in a standard Spark environment, you would typically use df.show() to display the DataFrame.
%%sql
SELECT YEAR(OrderDate) AS OrderYear,
SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue
FROM salesorders
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;
This code uses the %%sql magic command in a Fabric notebook to run a SQL query against the "salesorders" Delta table to calculate the gross revenue for each year. Here's a breakdown of the SQL query:
- %%sql: This is a magic command in Fabric (and Databricks) notebooks that switches the cell to Spark SQL, allowing you to write and execute SQL queries directly.
- SELECT YEAR(OrderDate) AS OrderYear, SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue: This selects the year component of the "OrderDate" column as "OrderYear" and calculates the gross revenue for each year by summing the line revenue (UnitPrice * Quantity) plus the tax.
- FROM salesorders: This specifies that the data is being selected from the "salesorders" Delta table.
- GROUP BY YEAR(OrderDate): This groups the data by the year component of the "OrderDate" column, so that the SUM function is applied to each group (i.e., each year).
- ORDER BY OrderYear: This orders the results by the "OrderYear" column in ascending order.
Running this query will return a table with two columns: "OrderYear" and "GrossRevenue", showing the gross revenue for each year.
%%sql
SELECT * FROM salesorders
This code uses the %%sql magic command in a Fabric notebook to run a SQL query against the "salesorders" Delta table to select all columns and all rows from the table. Here's a breakdown of the SQL query:
- %%sql: This is a magic command in Fabric (and Databricks) notebooks that allows you to write and execute SQL queries in a cell.
- SELECT * FROM salesorders: This selects all columns (the * wildcard means all columns) from the "salesorders" Delta table; FROM salesorders specifies the table from which to select the data.
Running this query will return the entire contents of the "salesorders" Delta table, displaying all columns and rows.
sqlQuery = "SELECT CAST(YEAR(OrderDate) AS CHAR(4)) AS OrderYear, \
SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue \
FROM salesorders \
GROUP BY CAST(YEAR(OrderDate) AS CHAR(4)) \
ORDER BY OrderYear"
df_spark = spark.sql(sqlQuery)
df_spark.show()
This code defines a SQL query as a string and then uses that query to create a DataFrame df_spark by running the query against the "salesorders" Delta table. Here's a breakdown of each part:
- sqlQuery = "SELECT CAST(YEAR(OrderDate) AS CHAR(4)) AS OrderYear, ...": This defines a SQL query as a string. The query calculates the gross revenue for each year by summing the total revenue per order (UnitPrice * Quantity) and adding the tax. It also casts the year component of the "OrderDate" column as a string and renames it as "OrderYear". The results are grouped by the "OrderYear" column and ordered by the same column.
- df_spark = spark.sql(sqlQuery): This creates a new DataFrame df_spark by running the SQL query against the "salesorders" Delta table using the spark.sql method.
- df_spark.show(): This displays the contents of the DataFrame df_spark using the show() method, which shows the first 20 rows of the DataFrame. If you want to show more or fewer rows, you can specify the number in the show() method (e.g., df_spark.show(50) to show the first 50 rows).
Running this code will execute the SQL query and display the calculated gross revenue for each year in the "salesorders" Delta table.
from matplotlib import pyplot as plt
# matplotlib requires a Pandas dataframe, not a Spark one
df_sales = df_spark.toPandas()
# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'])
# Display the plot
plt.show()
This code uses Matplotlib to create a bar plot of the gross revenue by year. Here's a breakdown of each part:
- from matplotlib import pyplot as plt: This imports the pyplot module from Matplotlib, which provides functions for creating plots.
- df_sales = df_spark.toPandas(): This converts the Spark DataFrame df_spark to a Pandas DataFrame df_sales. Matplotlib works with Pandas DataFrames, so this conversion is necessary.
- plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue']): This creates a bar plot using the bar function from Matplotlib. It uses the "OrderYear" column as the x-axis (the years) and the "GrossRevenue" column as the height of the bars (the revenue).
- plt.show(): This displays the plot. plt.show() is necessary to actually show the plot in the output.
Matplotlib is pre-installed in the Fabric Spark runtime; if you run this code elsewhere, make sure Matplotlib is installed first (pip install matplotlib).
from matplotlib import pyplot as plt
# Clear the plot area
plt.clf()
# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
# Customize the chart
plt.title('Revenue by Year')
plt.xlabel('Year')
plt.ylabel('Revenue')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=45)
# Show the figure
plt.show()
This code uses Matplotlib to create a customized bar plot of the gross revenue by year. Here's a breakdown of each part:
- from matplotlib import pyplot as plt: This imports the pyplot module from Matplotlib, which provides functions for creating plots.
- plt.clf(): This clears the current plot, in case there is any existing plot in the plot area.
- plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange'): This creates a bar plot using the bar function from Matplotlib. It uses the "OrderYear" column as the x-axis (the years), the "GrossRevenue" column as the height of the bars (the revenue), and sets the bar color to orange.
- plt.title('Revenue by Year'): This sets the title of the plot to "Revenue by Year".
- plt.xlabel('Year'): This sets the label for the x-axis to "Year".
- plt.ylabel('Revenue'): This sets the label for the y-axis to "Revenue".
- plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7): This customizes the grid lines on the plot. It sets the color to a light grey ('#95a5a6'), the line style to dashed ('--'), the line width to 2, the grid to be displayed only on the y-axis, and the transparency (alpha) to 0.7.
- plt.xticks(rotation=45): This rotates the x-axis labels by 45 degrees to improve readability.
- plt.show(): This displays the customized plot.
This code will create a bar plot with a customized appearance for the revenue by year data.
from matplotlib import pyplot as plt
# Clear the plot area
plt.clf()
# Create a Figure
fig = plt.figure(figsize=(8,3))
# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
# Customize the chart
plt.title('Revenue by Year')
plt.xlabel('Year')
plt.ylabel('Revenue')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)
plt.xticks(rotation=45)
# Show the figure
plt.show()
This code snippet uses Matplotlib to create a bar plot with a custom figure size. Here's a breakdown of each part:
- from matplotlib import pyplot as plt: Importing the pyplot module from Matplotlib.
- plt.clf(): Clearing the current plot, if any.
- fig = plt.figure(figsize=(8,3)): Creating a new figure with a specific size. The figsize parameter specifies the width and height of the figure in inches.
- plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange'): Creating a bar plot using the bar function. The "OrderYear" column is used as the x-axis (years), the "GrossRevenue" column is the height of the bars (revenue), and the bars are colored orange.
- Customizing the chart: the title, axis labels, dashed grid, and rotated x-axis labels are set exactly as in the previous example.
- plt.show(): Displaying the figure.
This code will create a bar plot with the specified customizations and figure size.
from matplotlib import pyplot as plt
# Clear the plot area
plt.clf()
# Create a figure for 2 subplots (1 row, 2 columns)
fig, ax = plt.subplots(1, 2, figsize = (10,4))
# Create a bar plot of revenue by year on the first axis
ax[0].bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'], color='orange')
ax[0].set_title('Revenue by Year')
# Create a pie chart of yearly order counts on the second axis
yearly_counts = df_sales['OrderYear'].value_counts()
ax[1].pie(yearly_counts)
ax[1].set_title('Orders per Year')
ax[1].legend(yearly_counts.keys().tolist())
# Add a title to the Figure
fig.suptitle('Sales Data')
# Show the figure
plt.show()
This code snippet uses Matplotlib to create a figure with two subplots: a bar plot of revenue by year and a pie chart of yearly order counts. Here's a breakdown of each part:
- from matplotlib import pyplot as plt: Importing the pyplot module from Matplotlib.
- plt.clf(): Clearing the current plot, if any.
- fig, ax = plt.subplots(1, 2, figsize = (10,4)): Creating a figure with 1 row and 2 columns for the subplots, and setting the figure size to (10, 4) inches. The subplots function returns a figure (fig) and an array of axes (ax) corresponding to each subplot.
- Creating the bar plot on the first axis (ax[0]): ax[0].bar(...) draws the revenue by year in orange, and ax[0].set_title('Revenue by Year') gives that subplot its title.
- Creating the pie chart on the second axis (ax[1]): value_counts() counts how many order rows fall in each year, ax[1].pie(yearly_counts) plots those counts as a pie chart, ax[1].set_title('Orders per Year') titles the subplot, and ax[1].legend(...) adds a legend listing the years.
- fig.suptitle('Sales Data'): Adding a title to the entire figure.
- plt.show(): Displaying the figure with both subplots.
This code will create a figure with two subplots: a bar plot of revenue by year and a pie chart of yearly order counts.
import seaborn as sns
# Clear the plot area
plt.clf()
# Create a bar chart
ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()
This code uses the Seaborn library to create a bar plot of the gross revenue by year. Here's a breakdown of each part:
- import seaborn as sns: This imports the Seaborn library, which provides a high-level interface for drawing attractive and informative statistical graphics.
- plt.clf(): This clears the current plot, if any.
- ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales): This creates a bar plot using Seaborn's barplot function. It takes the DataFrame df_sales as input and specifies the "OrderYear" column for the x-axis and the "GrossRevenue" column for the y-axis.
- plt.show(): This displays the bar plot.
Seaborn provides a high-level interface for creating visually appealing plots with less code compared to Matplotlib.
import seaborn as sns
# Clear the plot area
plt.clf()
# Set the visual theme for seaborn
sns.set_theme(style="whitegrid")
# Create a bar chart
ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()
This code uses the Seaborn library to create a bar plot of the gross revenue by year with a white grid visual theme. Here's a breakdown of each part:
- import seaborn as sns: This imports the Seaborn library.
- plt.clf(): This clears the current plot, if any.
- sns.set_theme(style="whitegrid"): This sets the visual theme for Seaborn to "whitegrid", which will display a white background with grid lines.
- ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales): This creates a bar plot using Seaborn's barplot function. It takes the DataFrame df_sales as input and specifies the "OrderYear" column for the x-axis and the "GrossRevenue" column for the y-axis.
- plt.show(): This displays the bar plot with the specified visual theme.
Setting the visual theme with sns.set_theme(style="whitegrid") affects the overall appearance of the plot, making it visually consistent with the specified theme.
import seaborn as sns
# Clear the plot area
plt.clf()
# Create a line chart
ax = sns.lineplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()
This code uses the Seaborn library to create a line plot of the gross revenue by year. Here's a breakdown of each part:
- import seaborn as sns: This imports the Seaborn library.
- plt.clf(): This clears the current plot, if any.
- ax = sns.lineplot(x="OrderYear", y="GrossRevenue", data=df_sales): This creates a line plot using Seaborn's lineplot function. It takes the DataFrame df_sales as input and specifies the "OrderYear" column for the x-axis and the "GrossRevenue" column for the y-axis.
- plt.show(): This displays the line plot.
Seaborn's lineplot function is used to visualize trends over time or any other continuous variable. In this case, it's showing the trend of gross revenue over the years.
Hope you found this useful.
Be on the lookout for the next article in this series.
Disclaimer: The summaries in this article were provided with the assistance of AI. AI is not perfect and it can make mistakes. It's always recommended to double-check the content with reliable resources.