Bing News Search - End-to-End Azure Data Engineering Project using Microsoft Fabric.
AI-generated Image, Painting by @Shutabug on X


This project uses the Bing News Search API in Azure and Microsoft Fabric to build an end-to-end data engineering pipeline that determines the sentiment of each Olympic news article.

End-to-End Azure Data Engineering using Microsoft Fabric Architecture.

Project Overview

For this project, I analyzed the sentiment of news about the recently concluded Olympic Games Paris 2024. I used the Bing Search v7 resource to pull the news from the Microsoft Bing search engine and determined the sentiment of each piece of news within a specific period.

I picked the Bing API because working with APIs will be part of any data engineering process at some point in your career. For this project, I carried out the following:

  1. Create a Bing Search Resource in Azure.
  2. Data Ingestion.
  3. Data Transformation with Incremental Loading.
  4. Sentiment Analysis with Incremental Loading.
  5. Data Visualization and Reporting in Power BI.
  6. Set up Alerts with Data Activator with notifications on Teams.

This is a project I am excited about as we get to wear the hat of a Data Analyst + Data Engineer + Data Scientist. Sounds interesting right?

So, let's get started;

Prerequisites for this project

  1. You need a Microsoft Fabric Trial account. Create one here.
  2. You need an Azure Free Trial account or a Student account.

1. Create a Bing Search Resource in Azure.

I created a resource group called AzureXfabric_rg and used the Azure Marketplace to search for Bing Search v7 and create a Bing Search resource. I picked the F1 pricing tier (3 calls per second and 1,000 calls per month) because it is free. See the other pricing tiers.

Create a Bing Search Resource in Azure.

2. Data Ingestion.

For our data lake, I created a Microsoft Fabric Lakehouse called bing_olympic_news_db. I connected to the Bing Search resource API we created earlier in Azure through a Data Pipeline (olympic-news-ingestion-pipeline), then configured the data source and destination.


2.1 Configuring the Data Source

To connect to the source data, we will use the REST connector available in Fabric to reach the Bing Search resource API in Azure. Click on Source > Create new connection > REST API and set up the connection as seen below.

REST API connection using the endpoint.

To successfully connect to the Bing Search resource News Search API v7, we need to supply the following parameters for the source:

  1. The endpoint, i.e. the base URL used to connect to Azure (https://api.bing.microsoft.com/v7.0/news/search).
  2. The query parameters based on our search term (?q=olympic+news&count=100&freshness=Week). If you are familiar with using parameters, you can make them dynamic.
  3. The header carrying the required Ocp-Apim-Subscription-Key. This is the subscription key you received in Azure while creating the Bing resource.

Data ingestion from the Bing News Search API.
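To sanity-check these three settings outside Fabric, the same request can be reproduced in plain Python. This is a minimal sketch using only the standard library; the function names are mine and the subscription key is a placeholder:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://api.bing.microsoft.com/v7.0/news/search"

def build_news_request(subscription_key, query="olympic news",
                       count=100, freshness="Week"):
    """Assemble the same URL, query string, and header that the
    Fabric REST connection sends to the Bing News Search v7 API."""
    params = urlencode({"q": query, "count": count, "freshness": freshness})
    return Request(f"{ENDPOINT}?{params}",
                   headers={"Ocp-Apim-Subscription-Key": subscription_key})

def fetch_olympic_news(subscription_key):
    """Call the API and return the list of raw news articles."""
    with urlopen(build_news_request(subscription_key), timeout=30) as resp:
        return json.load(resp)["value"]
```

If the endpoint, query string, or header is wrong, the API answers with a 401 or 404 here just as it would inside the pipeline, which makes this a quick way to debug the connection settings.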

2.2 Setting the Data Destination.

For the Destination, we are going to set our destination to the bing_olympic_news_db Lakehouse we created earlier. I also created a file path called olympic-news.json and set the file format to JSON.

Setting up the destination to the Lakehouse and the file path and file format.
Our data was successfully ingested and saved to the Files section of our Lakehouse as olympic-news.json.

3. Data Transformation with Incremental Loading.

It's time to clean the olympic-news.json file in our Lakehouse using a Spark job within Microsoft Fabric. We will also implement a Type 1 SCD to load our data into the Lakehouse incrementally.

Download the notebook I used to extract, transform, and load (ETL) the olympic-news data into the olympic_news_updated table from my GitHub > Download Here.
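The notebook does this flattening in Spark; to show the shape of the transformation, here is the same field extraction for a single raw article in plain Python. The input field names follow the Bing News Search response, the output columns match the table used later in the merge, and the helper name is mine:

```python
def flatten_article(article):
    """Pull the columns used downstream out of one raw Bing news item."""
    published = article.get("datePublished", "")   # e.g. "2024-08-11T18:05:00Z"
    date_part, _, time_part = published.partition("T")
    provider = article.get("provider") or [{}]
    thumbnail = article.get("image", {}).get("thumbnail", {})
    return {
        "title": article.get("name"),
        "description": article.get("description"),
        "image": thumbnail.get("contentUrl"),
        "link": article.get("url"),
        "provider": provider[0].get("name"),
        "datePublished": published,
        "published_date": date_part,                            # "2024-08-11"
        "published_time": time_part.rstrip("Z").split(".")[0],  # "18:05:00"
    }
```

Splitting datePublished into separate published_date and published_time columns makes the daily reporting in Power BI easier, and link gives us a natural key for the incremental merge later.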

A snapshot of the Transformed data of the Olympic News.

4. Sentiment Analysis with Incremental Loading.

For the sentiment analysis, I will be using SynapseML (formerly MMLSpark), an open-source library available within Microsoft Fabric. We will use one of its pre-built intelligent models to predict the Olympic news sentiment based on the description column, which contains a detailed description of each news article as seen above.

We will predict the news sentiment from the text of the description column, using the AnalyzeText pre-built model from SynapseML as seen below.

# Import the required packages
import synapse.ml.core
from synapse.ml.services import AnalyzeText

# Initialize the model and configure the input and output columns
model = (AnalyzeText()
        .setTextCol("description")     # column to run sentiment analysis on
        .setKind("SentimentAnalysis")  # the text-analysis task to perform
        .setOutputCol("response")
        .setErrorCol("error"))

# Apply the model to our DataFrame
result = model.transform(df)
display(result.limit(10))

# Extract the sentiment from the response column
from pyspark.sql.functions import col

sentiment_df = result.withColumn("sentiment", col("response.documents.sentiment"))
display(sentiment_df.limit(7))

# Drop the error and response columns
sentiment_df_final = sentiment_df.drop("error", "response")
display(sentiment_df_final.limit(10))
After successfully running the code block above, we have a new column called sentiment that categorizes each Olympic news article as Positive, Negative, Neutral, or Mixed.

Next, I implemented a Type 1 Slowly Changing Dimension (SCD) to avoid loading duplicate data during incremental loads. I opted for Type 1 because I am not interested in keeping a history of the existing data.

# Adopting Type 1 SCD incremental loading for our data.

'''In a Type 1 SCD, new data overwrites the existing data, so no duplicates
are kept and the previous values are lost. This is typically used when there
is no need to keep a history of the data.'''

from pyspark.sql.utils import AnalysisException

table_name = "bing_olympic_news_db.sentiment_analysis"

# Exception handling: create the table on the first run, merge afterwards.
try:
    sentiment_df_final.write.format("delta").saveAsTable(table_name)

except AnalysisException:
    print("Table already exists")

    sentiment_df_final.createOrReplaceTempView("vw_sentiment")

    # Match articles on their link and overwrite any changed rows.
    spark.sql(f"""  MERGE INTO {table_name} target_table
                    USING vw_sentiment source_view
                    ON source_view.link = target_table.link

                    WHEN MATCHED AND (
                        source_view.title <> target_table.title OR
                        source_view.description <> target_table.description OR
                        source_view.image <> target_table.image OR
                        source_view.datePublished <> target_table.datePublished OR
                        source_view.provider <> target_table.provider OR
                        source_view.published_date <> target_table.published_date OR
                        source_view.published_time <> target_table.published_time)
                    THEN UPDATE SET *

                    WHEN NOT MATCHED THEN INSERT *
                """)

Before visualizing the report in Power BI, I scheduled the pipeline to run daily until the 31st of August 2024, as the picture below shows.

Daily schedule of the Pipeline Run of the Olympic news.

Below is the successful run based on the Olympic news search parameter.

Successful Pipeline run of the Olympic news.

5. Data Visualization in Power BI.

Below is the data visualization and report of the Olympic news for the seven days preceding this analysis. Over 100 news articles were published, and only 17% of them carried positive sentiment.

The Sentiment Analysis Dashboard of the Olympic Games Paris 2024.
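The positive-sentiment card on the dashboard boils down to a simple percentage share over the sentiment column. A minimal sketch in plain Python, with illustrative counts rather than the real figures:

```python
from collections import Counter

def sentiment_share(sentiments):
    """Return the percentage share of each sentiment label."""
    counts = Counter(sentiments)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Illustrative values only -- not the real article counts.
labels = (["Positive"] * 17 + ["Negative"] * 33
          + ["Neutral"] * 40 + ["Mixed"] * 10)
shares = sentiment_share(labels)
```

In the report itself this is a Power BI measure over the sentiment_analysis table, but the arithmetic behind the 17% card is exactly this share.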

6. Set up Alerts with Data Activator with notifications on Microsoft Teams.

In the report above, click on the card visual and select Set alert. The Set alert pane will pop up as seen below. I created an alert called Positive Alert Item that sends me a Teams message whenever the Positive Sentiment value rises above 17%.

Setting up the Positive Sentiment Alert with Data Activator.

Below is the preview from Data Activator showing that the alert was successfully created for the visual. As the recipient, I will receive a Teams message whenever the condition is met.

Preview of the alert in Data Activator.

Thank you for reading.

Remember to subscribe and follow me for more insightful articles, and reach me via my socials below if you want to discuss further.

LinkedIn: Musili Adebayo

Twitter: Musili_Adebayo

P.S. Enjoy this article? Support my work directly.
