Bing News Search - End-to-End Azure Data Engineering Project using Microsoft Fabric.
Musili Adebayo
Data Engineer | Fabric Analytics Engineer | Data Analyst | Azure | Power BI | Python | SQL
This project uses the Bing News Search API within Azure to build an end-to-end data engineering project in Microsoft Fabric and determine the sentiment of each Olympic news article.
Project Overview
For this project, I analyzed the sentiment of news about the recently concluded Olympic Games Paris 2024. I used the Bing Search v7 API to pull all Olympic news from the Microsoft Bing search engine and determined the sentiment of each article within a specific period.
I picked the Bing API because working with APIs will always be part of the data engineering process at some point in your career. For this project, I carried out the steps outlined in the sections below.
This is a project I am excited about because we get to wear the hats of a Data Analyst, Data Engineer, and Data Scientist all at once. Sounds interesting, right?
So, let's get started.
Prerequisites for this project
1. Create a Bing Search Resource in Azure.
I created a resource group called AzureXfabric_rg and used the Azure Marketplace to search for Bing Search v7 and create a Bing Search resource. I picked the F1 pricing tier (3 calls per second, 1,000 calls per month) because it is free. See the other pricing tiers.
2. Data Ingestion.
For our data lake, I created a Microsoft Fabric Lakehouse called bing_olympic_news_db. We will connect to the Bing Search resource API we created earlier in Azure through a Data Pipeline (olympic-news-ingestion-pipeline). We will also configure our data source and destination.
2.1 Configuring the Data Source
To connect to the source data, we will use the REST connector available in Fabric to reach the Bing Search resource API in Azure. Click Source > Create new connection > REST API and set up the connection as seen below.
To connect successfully to the Bing News Search API v7, we need to supply the following parameters for the source data (a quick sanity-check call is sketched after this list):
1. The Endpoint: the base URL used to connect to Azure (https://api.bing.microsoft.com/v7.0/news/search).
2. The Query parameter based on our search term: ?q=olympic+news&count=100&freshness=Week. If you are familiar with parameters, you can make this dynamic.
3. The Header (Ocp-Apim-Subscription-Key): the subscription key you received in Azure when creating the Bing Search resource.
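Before wiring these into the pipeline, you can sanity-check the endpoint, query, and header with a small Python call. This is a minimal sketch, assuming the requests library and a placeholder subscription key:
import requests

# Base URL of the Bing News Search v7 endpoint (same one used in the Fabric REST connection)
endpoint = "https://api.bing.microsoft.com/v7.0/news/search"

# Query parameters mirroring the pipeline: search term, result count, and freshness window
params = {"q": "olympic news", "count": 100, "freshness": "Week"}

# The subscription key from the Azure Bing Search resource (placeholder)
headers = {"Ocp-Apim-Subscription-Key": "<your-subscription-key>"}

response = requests.get(endpoint, params=params, headers=headers)
response.raise_for_status()

# The articles come back in the "value" array of the JSON payload
articles = response.json().get("value", [])
print(f"Fetched {len(articles)} articles")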
2.2 Setting the Data Destination.
For the destination, we point the pipeline at the bing_olympic_news_db Lakehouse we created earlier. I also created a file path called olympic-news.json and set the file format to JSON.
Our data was successfully ingested and saved to the Files section of our Lakehouse as olympic-news.json.
3. Data Transformation with Incremental Loading.
It's time to clean the olympic-news.json file in our Lakehouse using a Spark job within Microsoft Fabric. We will also implement a Type 1 SCD to load our data into the Lakehouse incrementally.
Download the notebook I used to Extract, Transform, and Load (ETL) the olympic-news data into the olympic_news_updated table from my GitHub > Download Here.
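The full ETL code lives in the notebook, but the heart of the transformation is flattening the raw JSON: the Bing response keeps each article inside a value array. A minimal sketch of that step is below; the exact field mapping (for example, pulling the thumbnail URL as image) is an assumption based on the Bing News Search response shape, so treat it as illustrative rather than the notebook verbatim.
from pyspark.sql.functions import explode, col, to_date, date_format

# Read the raw JSON file that the pipeline dropped into the Lakehouse Files section
raw_df = spark.read.option("multiline", "true").json("Files/olympic-news.json")

# Each news article sits inside the "value" array of the Bing response
news_df = raw_df.select(explode(col("value")).alias("news"))

# Flatten the nested fields into the columns used downstream
df = news_df.select(
    col("news.name").alias("title"),
    col("news.description").alias("description"),
    col("news.image.thumbnail.contentUrl").alias("image"),
    col("news.url").alias("link"),
    col("news.datePublished").alias("datePublished"),
    col("news.provider")[0]["name"].alias("provider"),
)

# Split the publication timestamp into separate date and time columns
df = (df.withColumn("published_date", to_date(col("datePublished")))
        .withColumn("published_time", date_format(col("datePublished"), "HH:mm:ss")))

display(df.limit(10))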
4. Sentiment Analysis with Incremental Loading.
For the sentiment analysis, I will be using SynapseML (formerly MMLSpark), an open-source library available within Microsoft Fabric. We will use one of its pre-built intelligent models for our machine learning task and predict the Olympic news sentiment based on the news description column, which contains a detailed description of each article, as seen above.
We will predict the sentiment from the text in the description column, using the AnalyzeText prebuilt model from SynapseML as seen below.
# Import the required packages
import synapse.ml.core
from synapse.ml.services import AnalyzeText

# Initialize the model and configure the input and output columns
model = (AnalyzeText()
         .setTextCol("description")     # the column we want to run sentiment analysis on
         .setKind("SentimentAnalysis")  # the kind of text analysis to perform
         .setOutputCol("response")
         .setErrorCol("error"))

# Apply the model to our dataframe
result = model.transform(df)
display(result.limit(10))

# Extract the sentiment column from the nested response column
from pyspark.sql.functions import col
sentiment_df = result.withColumn("sentiment", col("response.documents.sentiment"))
display(sentiment_df.limit(7))

# Drop the error and response columns now that sentiment is extracted
sentiment_df_final = sentiment_df.drop("error", "response")
display(sentiment_df_final.limit(10))
After successfully running the code blocks above, we have a new column called sentiment that categorizes each Olympic news article as Positive, Negative, Neutral, or Mixed.
Next, we implement a Type 1 Slowly Changing Dimension (SCD) to avoid loading duplicate data during incremental loads. I opted for Type 1 because I am not interested in keeping a history of the existing data.
# Adopting Type 1 SCD incremental loading for our data.
'''In a Type 1 SCD, new data overwrites the existing data without duplicates. The existing
data is lost, as it is not stored anywhere else. This is typically used when there is no
need to keep a history of the data.'''
from pyspark.sql.utils import AnalysisException

# Exception handling: create the table on the first run, merge on subsequent runs
try:
    table_name = "bing_olympic_news_db.sentiment_analysis"
    sentiment_df_final.write.format("delta").saveAsTable(table_name)
except AnalysisException:
    print("Table already exists")

    # Use the final dataframe (without the response/error columns) as the merge source
    sentiment_df_final.createOrReplaceTempView("vw_sentiment")

    spark.sql(f"""MERGE INTO {table_name} target_table
                  USING vw_sentiment source_view
                  ON source_view.link = target_table.link
                  WHEN MATCHED AND (
                       source_view.title <> target_table.title OR
                       source_view.description <> target_table.description OR
                       source_view.image <> target_table.image OR
                       source_view.datePublished <> target_table.datePublished OR
                       source_view.provider <> target_table.provider OR
                       source_view.published_date <> target_table.published_date OR
                       source_view.published_time <> target_table.published_time)
                  THEN UPDATE SET *
                  WHEN NOT MATCHED THEN INSERT *
               """)
Before visualizing the report in Power BI, the picture below shows that the pipeline was successfully scheduled to run daily, with the schedule ending on the 31st of August 2024.
Below is the successful run based on the Olympic news search parameter.
5. Data Visualization in Power BI.
Below is the data visualization and report of the Olympic news for the 7 days preceding this analysis. Over 100 news articles were published, and only 17% of them carried positive sentiment.
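If you want to verify the 17% figure outside Power BI, a small PySpark aggregation over the Delta table gives the same breakdown; a sketch, assuming the table and column names used earlier:
from pyspark.sql import functions as F

# Load the sentiment table and compute the share of each sentiment class
df = spark.table("bing_olympic_news_db.sentiment_analysis")
total = df.count()

display(
    df.groupBy("sentiment")
      .agg(F.count("*").alias("articles"))
      .withColumn("pct", F.round(F.col("articles") / total * 100, 1))
)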
6. Set up Alerts with Data Activator and Notifications in Microsoft Teams.
In the visual above, click on the card visual and select Set alert. The Set alert pane will pop up as seen below. I created an alert called Positive Alert Item, and I would like to receive a Teams message whenever the positive sentiment rises above 17%.
Below is the preview from Data Activator showing that the alert was successfully created for the visual. As a recipient of the alert, I will receive a Teams message whenever the condition is met.
Thank you for reading.
Remember to subscribe and follow me for more insightful articles and you can reach me via my socials below if you want to discuss further.
LinkedIn: Musili Adebayo
Twitter: Musili_Adebayo
P.S. Enjoy this article? Support my work directly.