Bing News Search - End-to-End Azure Data Engineering Project using Microsoft Fabric.
Musili Adebayo
Data Engineer | Fabric Analytics Engineer | Data Analyst | Azure | Power BI | Python | SQL
This project uses the Bing News Search API within Azure to build an end-to-end data engineering project in Microsoft Fabric and determine the sentiment of each Olympic news article.
Project Overview
For this project, I analyzed the sentiment of news about the recently concluded Olympic Games Paris 2024. I used the Bing Search v7 API to pull all Olympic news from the Microsoft Bing search engine and determined the sentiment of each article within a specific period.
I picked the Bing API because working with APIs will always be part of the data engineering process at some point in your career. For this project, I carried out the steps outlined in the sections below.
This is a project I am excited about because we get to wear the hats of a Data Analyst, Data Engineer, and Data Scientist all at once. Sounds interesting, right?
So, let's get started.
Prerequisites for this project
1. Create a Bing Search Resource in Azure.
I created a resource group called AzureXfabric_rg and used the Azure Marketplace to search for Bing Search v7 and create a Bing Search resource. I picked the F1 pricing tier (3 calls per second, 1,000 calls per month) because it is free. See the other pricing tiers.
2. Data Ingestion.
For our data lake, I created a Microsoft Fabric Lakehouse called bing_olympic_news_db. We will connect to the Bing Search resource API we created earlier in Azure through a Data Pipeline (olympic-news-ingestion-pipeline). We will also configure our data source and destination.
2.1 Configuring the Data Source
To connect to the source data, we will use the REST connector available in Fabric to reach the Bing Search resource API in Azure. Click Source > Create new connection > REST API and set up the connection as seen below.
To connect successfully to the Bing News Search API v7, we need to supply the following parameters for the source data (a quick sanity-check call is sketched after this list):
1. The Endpoint: the base URL used to connect to Azure (https://api.bing.microsoft.com/v7.0/news/search).
2. The Query parameter based on our search term: ?q=olympic+news&count=100&freshness=Week. If you are familiar with parameters, you can make this dynamic.
3. The Header (Ocp-Apim-Subscription-Key): the subscription key you received in Azure when creating the Bing Search resource.
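Before wiring these into the pipeline, you can sanity-check the endpoint, query, and header with a small Python call. This is a minimal sketch, assuming the requests library and a placeholder subscription key:
import requests

# Base URL of the Bing News Search v7 endpoint (same one used in the Fabric REST connection)
endpoint = "https://api.bing.microsoft.com/v7.0/news/search"

# Query parameters mirroring the pipeline: search term, result count, and freshness window
params = {"q": "olympic news", "count": 100, "freshness": "Week"}

# The subscription key from the Azure Bing Search resource (placeholder)
headers = {"Ocp-Apim-Subscription-Key": "<your-subscription-key>"}

response = requests.get(endpoint, params=params, headers=headers)
response.raise_for_status()

# The articles come back in the "value" array of the JSON payload
articles = response.json().get("value", [])
print(f"Fetched {len(articles)} articles")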
2.2 Setting the Data Destination.
For the destination, we point the pipeline at the bing_olympic_news_db Lakehouse we created earlier. I also created a file path called olympic-news.json and set the file format to JSON.
Our data was successfully ingested and saved to the Files section of our Lakehouse as olympic-news.json.
3. Data Transformation with Incremental Loading.
It's time to clean the olympic-news.json file in our Lakehouse using a Spark job within Microsoft Fabric. We will also implement a Type 1 SCD to load our data into the Lakehouse incrementally.
Download the notebook I used to Extract, Transform, and Load (ETL) the olympic-news data into the olympic_news_updated table from my GitHub > Download Here.
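The full ETL code lives in the notebook, but the heart of the transformation is flattening the raw JSON: the Bing response keeps each article inside a value array. A minimal sketch of that step is below; the exact field mapping (for example, pulling the thumbnail URL as image) is an assumption based on the Bing News Search response shape, so treat it as illustrative rather than the notebook verbatim.
from pyspark.sql.functions import explode, col, to_date, date_format

# Read the raw JSON file that the pipeline dropped into the Lakehouse Files section
raw_df = spark.read.option("multiline", "true").json("Files/olympic-news.json")

# Each news article sits inside the "value" array of the Bing response
news_df = raw_df.select(explode(col("value")).alias("news"))

# Flatten the nested fields into the columns used downstream
df = news_df.select(
    col("news.name").alias("title"),
    col("news.description").alias("description"),
    col("news.image.thumbnail.contentUrl").alias("image"),
    col("news.url").alias("link"),
    col("news.datePublished").alias("datePublished"),
    col("news.provider")[0]["name"].alias("provider"),
)

# Split the publication timestamp into separate date and time columns
df = (df.withColumn("published_date", to_date(col("datePublished")))
        .withColumn("published_time", date_format(col("datePublished"), "HH:mm:ss")))

display(df.limit(10))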
4. Sentiment Analysis with Incremental Loading.
For the sentiment analysis, I will be using SynapseML (formerly MMLSpark), an open-source library available within Microsoft Fabric. We will use one of its pre-built intelligent models for our machine learning task and predict the Olympic news sentiment based on the news description column, which contains a detailed description of each article, as seen above.
We will predict the sentiment from the text in the description column, using the AnalyzeText prebuilt model from SynapseML as seen below.
# Import the required packages
import synapse.ml.core
from synapse.ml.services import AnalyzeText

# Initialize the model and configure the input and output columns
model = (AnalyzeText()
         .setTextCol("description")     # the column we want to run sentiment analysis on
         .setKind("SentimentAnalysis")  # the kind of text analysis to perform
         .setOutputCol("response")
         .setErrorCol("error"))

# Apply the model to our dataframe
result = model.transform(df)
display(result.limit(10))

# Extract the sentiment column from the nested response column
from pyspark.sql.functions import col
sentiment_df = result.withColumn("sentiment", col("response.documents.sentiment"))
display(sentiment_df.limit(7))

# Drop the error and response columns now that sentiment is extracted
sentiment_df_final = sentiment_df.drop("error", "response")
display(sentiment_df_final.limit(10))
After successfully running the code blocks above, we have a new column called sentiment that categorizes each Olympic news article as Positive, Negative, Neutral, or Mixed.
Next, we implement a Type 1 Slowly Changing Dimension (SCD) to avoid loading duplicate data during incremental loads. I opted for Type 1 because I am not interested in keeping a history of the existing data.
# Adopting Type 1 SCD incremental loading for our data.
'''In a Type 1 SCD, new data overwrites the existing data without duplicates. The existing
data is lost, as it is not stored anywhere else. This is typically used when there is no
need to keep a history of the data.'''
from pyspark.sql.utils import AnalysisException

# Exception handling: create the table on the first run, merge on subsequent runs
try:
    table_name = "bing_olympic_news_db.sentiment_analysis"
    sentiment_df_final.write.format("delta").saveAsTable(table_name)
except AnalysisException:
    print("Table already exists")

    # Use the final dataframe (without the response/error columns) as the merge source
    sentiment_df_final.createOrReplaceTempView("vw_sentiment")

    spark.sql(f"""MERGE INTO {table_name} target_table
                  USING vw_sentiment source_view
                  ON source_view.link = target_table.link
                  WHEN MATCHED AND (
                       source_view.title <> target_table.title OR
                       source_view.description <> target_table.description OR
                       source_view.image <> target_table.image OR
                       source_view.datePublished <> target_table.datePublished OR
                       source_view.provider <> target_table.provider OR
                       source_view.published_date <> target_table.published_date OR
                       source_view.published_time <> target_table.published_time)
                  THEN UPDATE SET *
                  WHEN NOT MATCHED THEN INSERT *
               """)
Before visualizing the report in Power BI, the picture below shows that the pipeline was successfully scheduled to run daily, with the schedule ending on the 31st of August 2024.
Below is the successful run based on the Olympic news search parameter.
5. Data Visualization in Power BI.
Below is the data visualization and report of the Olympic news for the 7 days preceding this analysis. Over 100 news articles were published, and only 17% of them carried positive sentiment.
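If you want to verify the 17% figure outside Power BI, a small PySpark aggregation over the Delta table gives the same breakdown; a sketch, assuming the table and column names used earlier:
from pyspark.sql import functions as F

# Load the sentiment table and compute the share of each sentiment class
df = spark.table("bing_olympic_news_db.sentiment_analysis")
total = df.count()

display(
    df.groupBy("sentiment")
      .agg(F.count("*").alias("articles"))
      .withColumn("pct", F.round(F.col("articles") / total * 100, 1))
)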
6. Set up Alerts with Data Activator and Notifications in Microsoft Teams.
In the visual above, click on the card visual and select Set alert. The Set alert pane will pop up as seen below. I created an alert called Positive Alert Item, and I would like to receive a Teams message whenever the positive sentiment rises above 17%.
Below is the preview from Data Activator showing that the alert was successfully created for the visual. As a recipient of the alert, I will receive a Teams message whenever the condition is met.
Thank you for reading.
Remember to subscribe and follow me for more insightful articles and you can reach me via my socials below if you want to discuss further.
LinkedIn: Musili Adebayo
Twitter: Musili_Adebayo
P.S. Enjoy this article? Support my work directly.