Simple scheduled sentiment analysis using Jupyter Notebooks, NBFire, Google Sheets and NLTK

Following the popularity of my article on scheduled sentiment analysis using Jupyter Notebooks, NBFire and Textblob, a couple of people asked me for a similar guide using NLTK instead. So, in this article I am going to show you how easy it is to schedule some sentiment analysis using this popular Python library. We are going to:

  • Use NBFire to do a daily scrape of the Guardian and Telegraph websites to get a list of headlines to analyse.
  • Pull this news data into our Jupyter Notebook.
  • Process the data using Beautiful Soup.
  • Use NLTK to create a compound sentiment score for our headlines
  • Save the processed data to a Google Sheet.
  • Pull the data back into Jupyter and create a simple graphic to help understand the sentiment of the articles over time and compare the two newspapers.
  • Easy as that!

The aim here is to provide you with a simple guide on how to get going with scheduled sentiment analysis using NLTK and to encourage you to explore further. Most of the work you need to do to get this going is configuration, which is a bit tiresome, but once it is done you won't need to repeat it for other projects. So, let's crack on.

I am assuming you have Jupyter installed somewhere already; if not, take a look at this guide.

You can also find the complete code for the notebook here.

Create an Account on NBFIRE

Go to NBFire and sign up for a FREE account. Then head over to Google Cloud and create an account there.

Set up our Google Project

We need to set up our Google APIs and create a way for our Jupyter Notebook to authenticate to them. This is the most complicated part of this article so let’s step through it carefully.

To create a new project in your Google account, click on projects in the top left-hand corner, then choose NEW PROJECT.

New Project

Give your project a name and click CREATE. You will be taken to a new screen. Once Google has created your project, under Quick Access choose APIs and Services.

Choose API and Services

You will next be taken to a new screen where you can enable APIs and Services

Choose ENABLE APIS AND SERVICES

The two services we want to enable are the GOOGLE DRIVE API and the GOOGLE SHEETS API. Click through the forms to enable them.

They should now be listed in your APIs and Services

Google Drive and Google Sheets enabled

Now we need to create some credentials that we can use in our notebook to access our Google Drive. Click on CREDENTIALS in the left-hand toolbar, then choose CREATE CREDENTIALS and select SERVICE ACCOUNT as the type of credential.

Create a SERVICE ACCOUNT NAME and ID and then make sure you copy the email address that is generated and save it somewhere you can use it in future. This is the email address we will share our Google Sheet with so that our Google Drive API can access it.

Create our Service Account Details and copy the email address and save it

Finally, click on DONE and the Grants page will open up; just click DONE again and it will take you back to the Service Accounts page.

Service accounts page

Next, click on the three dots under ACTIONS and choose MANAGE KEYS, then ADD KEY, then CREATE NEW KEY. You will be asked to choose the format; select JSON and hit CREATE. The JSON credentials file will automatically be downloaded. Remember where you have stored this JSON credentials file; we will be using it in our notebook to authenticate to our Google Drive.

Create your private key and save it so you can use it later

Create our Google Sheet

Create a new Google Sheet and name it News Headlines. Copy the Google sheet ID from the URL and save it somewhere. In the example below the sheet ID begins with 1NHg and ends with jvag.
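For reference, the sheet ID is the long string between /d/ and /edit in the sheet's URL, so with a placeholder ID the URL looks like this:

https://docs.google.com/spreadsheets/d/<YOUR_SHEET_ID>/edit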

Google sheet ID

Then create 3 sheets named Daily_Guardian_Sentiments, Daily_Telegraph_Sentiments and Daily_Mean_Sentiments (the underscores matter, as these names are used in the code below). Your Google Sheet should now look like this.

Set up your Google Sheet

Share our Google Sheet with our service account

Now click on “Share” on the top right hand side and enter the Google service account email address you saved earlier to share this sheet with our service account.

Create our Jupyter Notebook

That is it, all set up. Now for the fun part: let's write some Python in our Jupyter Notebook.

Create a new notebook, and in the first cell let's just make sure all the Python packages we need are installed. Add the following code to the first cell:

!pip install beautifulsoup4
!pip install matplotlib
!pip install requests
!pip install nltk
!pip install pandas
!pip install numpy
!pip install gspread

In the next cell we can import all the packages we need and initialise the NLTK sentiment analyser

import requests
from bs4 import BeautifulSoup as bs
import gspread
from datetime import datetime
from nltk.sentiment import SentimentIntensityAnalyzer
import pandas as pd
import numpy as np
import json
import nltk

nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()        
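If you want to sanity-check the analyser before scraping anything, you can run a quick throwaway cell like the one below (my own example sentence, not part of the pipeline). VADER returns a dictionary of neg, neu, pos and compound scores; the compound value, between -1 and 1, is the number we store for every headline later on.

# Optional sanity check: VADER returns a dict of neg/neu/pos/compound scores.
# The 'compound' value (between -1 and 1) is the score we keep for each headline.
print(analyzer.polarity_scores("Markets rally as economy shows signs of recovery"))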

In a new cell below we are going to bring in the Google service account credentials we need to access the Google Sheet. We don't want to add them to the notebook itself, because this is secret information that we don't want to share with anybody else. So, instead, we are going to add them as a parameter on NBFire.

To do this we need to add a 'parameters' tag to this cell. Click into the new cell, choose VIEW > CELL TOOLBAR > TAGS, and in the new box add a tag called 'parameters', then press ENTER. Your cell should now look like this.

Create a cell for parameters to be added at runtime.

In this cell add the following

# To make it easier to update and share this notebook we will be entering sensitive keys at runtime using NBFire.
# This cell is a placeholder for these values. This is why the cell has a 'parameters' tag added. For more
# information please read https://papermill.readthedocs.io/en/latest/usage-parameterize.html
credentials = ""        

Your new cell should look like this

Now, create a new cell below and enter the following code. Here we are just making sure we handle the credentials correctly: if they are entered as a parameter on NBFire they will arrive as a string, but if you want to paste them into the cell above, remove the “” and they will be processed as a dict. Basically, don’t worry about this cell, or, if you are worried, post me a comment below.

# We need to handle both Google API credentials added as a string entered as a parameter on NBFire and as a dict
# entered directly in the cell above
if isinstance(credentials, dict):
    sa = gspread.service_account_from_dict(credentials)
elif isinstance(credentials, str):
    creds = json.loads(credentials)
    sa = gspread.service_account_from_dict(creds)        
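For reference, if you do decide to paste the credentials directly into the parameters cell as a dict, they are just the contents of the JSON key file you downloaded earlier. With placeholder values (nothing below is a real key), the dict looks roughly like this:

credentials = {
    "type": "service_account",
    "project_id": "your-project-id",
    "private_key_id": "xxxxxxxxxxxxxxxx",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "your-service-account@your-project-id.iam.gserviceaccount.com",
    "client_id": "123456789012345678901",
    "token_uri": "https://oauth2.googleapis.com/token"
}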

Now, create a new cell below and enter the following code. You need to replace 'enter the id from the Google Sheet URL here' with the sheet ID you saved earlier.

#We open the Google Sheet via the ID in the URL and load in the worksheet we want to work with
sheet = sa.open_by_key('enter the id from the Google Sheet URL here')
daily_guardian_datasheet = sheet.worksheet("Daily_Guardian_Sentiments")
daily_telegraph_datasheet = sheet.worksheet("Daily_Telegraph_Sentiments")
mean_datasheet=sheet.worksheet("Daily_Mean_Sentiments")        

Here we are creating 4 references to our Google Sheet.

  • sheet — the main Google Sheet
  • daily_guardian_datasheet — the worksheet where we are going to store our Guardian news headlines
  • daily_telegraph_datasheet — the worksheet where we are going to store our Telegraph headlines
  • mean_datasheet — the worksheet where we are going to store our daily average sentiments (see the note on its header row below)
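One thing worth flagging: the plotting code at the end of this article reads the mean worksheet with get_all_records(), which treats the first row as column headers, and it plots columns called Date, Mean_Guardian_Sentiment and Mean_Telegraph_Sentiment. So make sure the first row of Daily_Mean_Sentiments contains those three headings. Either type them in by hand, or run a one-off cell like this sketch against the empty worksheet:

# One-off: add a header row to the empty Daily_Mean_Sentiments worksheet so that
# get_all_records() and the plot at the end can find the columns by name
mean_datasheet.insert_row(["Date", "Mean_Guardian_Sentiment", "Mean_Telegraph_Sentiment"], 1)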

Next, let’s set up the rest of the parameters we need: the URLs we want to get the news items from and a date stamp we can add to every item we retrieve.

guardian_url = 'https://www.guardian.co.uk'
telegraph_url = 'https://telegraph.co.uk'
current_datetime = datetime.now()
now = current_datetime.strftime("%Y-%m-%d")        

Right, let’s get stuck in with the actual scraping of headlines. In a new cell enter the following code. This is the function we are going to use to get our Guardian headlines and save them to our daily_guardian_datasheet.

def get_guardian_titles(url):
    article = requests.get(url)
    soup = bs(article.text, 'html.parser')
    headlines = soup.find('body').find_all('h3')
    for x in headlines[:15]:
        try:
            if x.div.text != 'Live':
                title= x.div.text
                description= x.span.text
                complete_text = title + " - " + description
                scores=analyzer.polarity_scores(complete_text)
                new_entry=[now,complete_text,scores["compound"]]
                print(new_entry)
                daily_guardian_datasheet.append_row(new_entry)
        except: print('error')        

If we break this down

def get_guardian_titles(url):
    article = requests.get(url)
    soup = bs(article.text, 'html.parser')
    headlines = soup.find('body').find_all('h3')        

Here we are using the request library to get the html from the Guardian, parsing it, and then finding all the h3 tags in the body to give us just the html of the headlines.

for x in headlines[:15]:
        try:
            if x.div.text != 'Live':
                title= x.div.text
                description= x.span.text
                complete_text = title + " - " + description
                scores=analyzer.polarity_scores(complete_text)
                new_entry=[now,complete_text,scores["compound"]]
                print(new_entry)
                daily_guardian_datasheet.append_row(new_entry)
        except: print('error')        

Next, we are taking the first 15 Guardian headlines and iterating through them. If the headline's text is not 'Live' we:

  • Build a string called "complete_text" from the headline's div.text and span.text (the assumed markup is sketched below)
  • Pass this text to the NLTK analyser to get a compound sentiment polarity score
  • Create a new row made up of the datestamp, the complete text and the compound score
  • Print that row and append it to our daily_guardian_datasheet
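To make the div and span lookups concrete, here is a minimal sketch of the markup shape the function assumes for each headline. The Guardian's real markup is more elaborate and can change at any time, so treat this purely as an illustration:

from bs4 import BeautifulSoup as bs

# Assumed shape of one headline block: an <h3> containing a <div> (title) and a <span> (description)
sample = '<h3><div>Example headline</div><span>Example description</span></h3>'
x = bs(sample, 'html.parser').h3
print(x.div.text + " - " + x.span.text)   # Example headline - Example description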

For more information on the NLTK package go here.

Let’s do the same for The Telegraph.

def get_telegraph_titles(url):
    article = requests.get(url)
    soup = bs(article.text, 'html.parser')
    headlines = soup.find_all("span", {"class": ""})
    for x in headlines[4:19]:
        try:
            complete_text = x.text
            scores=analyzer.polarity_scores(complete_text)
            new_entry=[now,complete_text,scores["compound"]]
            print(new_entry)
            daily_telegraph_datasheet.append_row(new_entry)   
        except: print('error')        

The structure of the Telegraph is a bit different to the Guardian. The headlines are all enclosed in a span with a class=””. We also need to discard the first 4 items as they are not headlines.

The rest of the function is the same.

Now let's call our two functions, passing the URLs we defined earlier.

get_guardian_titles(guardian_url)
get_telegraph_titles(telegraph_url)        

Now that we have added our headlines to our datasheets we can grab the sentiment data and calculate the mean sentiment for the headlines we stored.

Here we go to the Guardian datasheet and get the last 15 items from the sentiment column.

guardian_values = daily_guardian_datasheet.col_values(3)
guardian_sentiments=guardian_values[-15:]        

And do the same for the Telegraph sentiment data

telegraph_values = daily_telegraph_datasheet.col_values(3)
telegraph_sentiments=telegraph_values[-15:]        

Next we are going to use Numpy to calculate the mean of the sentiment data we just added to (and retrieved from) our daily guardian datasheet, and then append that mean value to our mean_datasheet.

x = np.array(guardian_sentiments)
y = x.astype(float)
mean = np.mean(y)        
new_mean_entry=[now, mean]
mean_datasheet.append_row(new_mean_entry)        

And again, do the same for the Telegraph sentiment data. Note that the Telegraph mean needs to go into its own column in the same row, hence the update_cell(last_row, 3, mean) call.

x = np.array(telegraph_sentiments)
y = x.astype(float)
mean = np.mean(y)        
last_row = len(mean_datasheet.get_all_values())
mean_datasheet.update_cell(last_row, 3, mean)        

Plot our Analysis

Now that all our data has been gathered, processed and stored, we can use it to plot the relative sentiments of the headlines of both newspapers.

First, we can grab all the mean values we have created.

df = pd.DataFrame(mean_datasheet.get_all_records())        

Then plot ourselves a graph.

ax = df.plot(x="Date", y="Mean_Guardian_Sentiment", figsize=(10, 10))
df.plot('Date', 'Mean_Telegraph_Sentiment', secondary_y=True, ax=ax)        

Boom!

If you run this every day for a while it will look something like this.

Schedule the notebook to run on NBFire

But it is a real pain to have to remember to run this from your local machine every day. So now let's load it up on to NBFire and schedule it to run overnight.

Go to the application and upload the Jupyter Notebook from your local drive. You can find the complete code here.

To schedule it to run use the NBFire Scheduler. In the example below we are choosing to run it at 1 am every morning London Time.

Notebook will run at 1am London time every day

Enter values for our Google Drive credentials as parameters. Simply retrieve the JSON file you saved earlier and paste its contents in as the credentials parameter. See the image below for some sample JSON added to our credentials cell.

Then choose ‘Fire Schedule’ and there you are! Your notebook will run every day in the cloud at 1.00am, London time. What could be easier? Each time the notebook runs you will get an output notebook and your Google Sheet will be updated.

If you don’t want to wait until tomorrow to see it work go down to the ‘Configure Run’ section and again enter your Google Drive Credentials. This time choose ‘Fire Notebook’ and your notebook will run right away. Download and view the output notebook and you will see your notebook interacting with The Guardian, The Telegraph and Google Sheets.

Let me know how you get on by posting comments below. You can download the complete notebook here.
