The Battle of Neighbourhoods:
London's Crime Rate Analysis and Clustering of the Safest Neighbourhoods of London


Introduction

London is one of the most multicultural cities in the world. It is a melting pot of cultures, where one can meet people from all parts of the world and taste the best of world cuisine. It is a major centre for banking and finance, insurance, world trade, media, advertising, tourism, theatre, fashion, the arts and more. Fusing gritty, historic pomp with shimmering modernity, world-class culture and fashion-forward shopping, the UK’s capital has it all and there is something for everyone. The vibrancy of the city extends across all 32 of its boroughs, each home to a plethora of unique neighbourhoods.


Business Problem

The decision to move to a new city, or a new country altogether, is a harrowing one. But having decided to move to London, the next challenge is deciding where in London to live. A look at the map of London reveals a haphazard cluster of neighbourhoods and villages, each with its own distinct features and identity. Some of London’s best neighbourhoods sit on the typical tourist trail, while others are constantly evolving, taking turns to emerge as the new cool hotspot. The following questions then arise:

  • Which neighbourhood is right for us?
  • Which part of the city has the best parks and playgrounds?
  • Which schools fall in the neighbourhood?
  • What area has the best craft beer scene or all-night eateries?
  • Where can one find the hippest bookstores or outdoor yoga?

And above all these doubts, the most pressing questions anyone would face are:

  • What is the crime rate in the area?
  • Is it a secure neighbourhood?
  • Is it safe to venture out at night?

All these questions and more plague our minds, and so the quest to find the answers begins.


Objective of the Capstone Project

The objective of this assignment is to give an insight into what some of the safest London neighbourhoods can offer their residents and tourists.

To help uncover the best that London has to offer, this project aims to do the following:

  • Identify the safest boroughs and wards in London based on the latest crime data
  • Find the Latitude & the Longitude coordinates of the preferred neighbourhoods by using their Postcodes
  • Plot the safest neighbourhoods on the Map of London using the geographical coordinates obtained
  • Locate the most common venues in the vicinity of 500 metres from these neighbourhoods
  • Cluster these neighbourhoods based on the common venues using a Machine Learning algorithm (K-Means Clustering)
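The clustering step in the last bullet can be sketched in advance. The venue-frequency matrix and the choice of k = 3 below are illustrative assumptions made for the sake of the example, not the project's actual inputs or final parameters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical one-hot venue frequencies for six neighbourhoods
# (columns: Café, Park, Pub) -- placeholder data, not real Foursquare output
venue_freq = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
])

# Group neighbourhoods with similar venue profiles into k clusters
kmeans = KMeans(n_clusters = 3, random_state = 0, n_init = 10).fit(venue_freq)
labels = kmeans.labels_  # one cluster label per neighbourhood
```

Each label can then be joined back onto the neighbourhood dataframe so that the clusters can be coloured on the map.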


Interested Parties

The objective of this project is to identify and recommend the best and safest neighbourhoods in London to anyone who wants to visit or relocate to the city. The interested parties could be anyone from the list below:

  • Young couples
  • Families with children
  • Executives
  • Tourists, etc.


Description of Data

1. MPS Ward Level Crime Data for London:

  • This dataset has been extracted from the Metropolitan Police Service’s “Recorded Crime: Geographic Breakdown” Data available on the London Datastore, https://data.london.gov.uk/dataset/recorded_crime_summary
  • This data provides the number of crimes recorded per month according to crime type at the geographic level of London’s Wards for the period July 2019 to June 2021

2. List of London Boroughs:

  • This dataset has been extracted from the Wikipedia.org page: https://en.wikipedia.org/wiki/List_of_London_boroughs
  • It has been used to fetch more information on the different Boroughs of London, such as each borough's local authority, the political party controlling that authority, the authority's headquarters, the Borough's area, its population, its coordinates, and its designated number on the map of London
  • With this information we can get more insight into the various Boroughs of London

3. London Postcodes:

  • This dataset has been extracted from Doogal.co.uk: https://www.doogal.co.uk/london_postcodes.php
  • The dataset has a complete list of London postcode districts
  • Even though this dataset already had the Latitude and the Longitude data available, I have used the ArcGIS API to re-fetch the coordinates of the preferred locations

4. ArcGIS API Data:

  • ArcGIS (https://www.arcgis.com) is an online GIS platform whose API enables us to connect people, locations, and data using interactive maps
  • We use the ArcGIS API to get the geographical coordinates (Latitude and Longitude) of the neighbourhoods of London by providing the Postcodes of the desired locations
  • The following information is obtained for each Postcode,
      - Latitude: Latitude of the Postcode
      - Longitude: Longitude of the Postcode


5. Foursquare API Data:

  • Foursquare (https://foursquare.com) is a location data provider with information about different venues and events within an area of interest
  • The information obtained from the Foursquare API includes venue names, locations, menus, reviews, photos, etc.
  • The Foursquare location platform is thus used as our data source, since all the required information about the different venues in the various neighbourhoods of the desired Borough or Ward can be obtained through its API
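As a rough sketch, a request to the (legacy) Foursquare v2 "explore" endpoint can be built as below; the credential placeholders and the `version` date are assumptions for illustration, and the real notebook would substitute registered API keys:

```python
# Build an "explore" request URL for venues within `radius` metres of a point.
# CLIENT_ID / CLIENT_SECRET are placeholders for real Foursquare credentials.
def build_explore_url(lat, lng, radius = 500, limit = 100,
                      client_id = "CLIENT_ID", client_secret = "CLIENT_SECRET",
                      version = "20210701"):
    return ("https://api.foursquare.com/v2/venues/explore"
            f"?client_id={client_id}&client_secret={client_secret}"
            f"&v={version}&ll={lat},{lng}&radius={radius}&limit={limit}")

# Coordinates here are central London, purely for illustration
url = build_explore_url(51.5074, -0.1278)
# The JSON response would then be fetched with requests.get(url).json()
```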


Methodology

A] Importing Libraries:

The libraries used in this project are:

  • Pandas: For creating and manipulating dataframes
  • NumPy: For scientific computation
  • JSON: To handle JSON files
  • Requests: To handle HTTP requests
  • Matplotlib: A data visualisation and graphical plotting library
  • Plotly: Another visualisation library, for creating interactive and publication-quality charts / graphs
  • Folium: For visualising geospatial data and plotting interactive maps
  • Geocoder: To retrieve location data
  • Scikit-learn: For K-Means Clustering, a Machine Learning algorithm

import pandas as pd
import numpy as np


import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated
import requests


%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors


import plotly.express as px
import plotly.graph_objects as go


import folium


import geocoder
from geopy.geocoders import Nominatim


from arcgis.geocoding import geocode
from arcgis.gis import GIS


# patch_sklearn() must run before the scikit-learn imports for the patch to take effect
from sklearnex import patch_sklearn
patch_sklearn()

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


B] Extracting, Scraping, Exploring, Cleaning and Processing the Datasets:

After importing all the required libraries, we will extract the data from different sources and clean it so that it is ready for processing and analysing

Dataset 1: Metropolitan Police Service Ward Level Crime Data for London

  • The extracted data is the most recent data available updated till June 2021
  • This data counts the number of crimes per month according to crime type at the geographic level of London’s Wards for the period July 2019 to June 2021
  • In March 2019, the Metropolitan Police Service started to provide offences grouped as per the updated Home Office crime classifications, which have been incorporated in the dataset
  • Before cleaning the data, the dataset contained 29 columns
  • Post cleaning and processing the data, 5 columns have been renamed and 1 has been added
  • The dataset now contains 30 columns
  • While exploring the dataset, it was found that there were two Wards named "Belmont", one in Harrow and one in Sutton. To keep them separate and avoid any confusion during analysis, they were renamed "Belmont Harrow" and "Belmont Sutton"
  • Further, to maintain consistency, the names of these two Wards were also changed in the fourth dataset, which holds the London Postcodes
  • The original dataset contained a total of 22,403 records
  • Once the dataset was processed to include only the Top 5 safest Boroughs of London, the number of records reduced to 3,007 from 22,403 records
  • After the dataset was processed further, to include only the Top 50 safest Wards of London, the number of records reduced to 1,549 from 3,007 records

crime_df = pd.read_csv("MPS Ward Level Crime (most recent 24 months).csv")

columns = ["Crime Head", "Crime Sub-Head", "Ward", "Ward Code", "Borough", 201907, 201908, 201909, 201910, 201911, 201912, 202001, 202002, 202003, 202004, 202005, 202006, 202007, 202008, 202009, 202010, 202011, 202012, 202101, 202102, 202103, 202104, 202105, 202106]

crime_df.columns = columns

crime_df = crime_df.reindex(["Ward Code", "Ward", "Borough", "Crime Head", "Crime Sub-Head", 201907, 201908, 201909, 201910, 201911, 201912, 202001, 202002, 202003, 202004, 202005, 202006, 202007, 202008, 202009, 202010, 202011, 202012, 202101, 202102, 202103, 202104, 202105, 202106], axis = 1)

crime_df["Total"] = crime_df.sum(numeric_only = True, axis = 1)

crime_df.loc[((crime_df["Ward"] == "Belmont") & (crime_df["Borough"] == "Harrow")), "Ward"] = "Belmont Harrow"

crime_df.loc[((crime_df["Ward"] == "Belmont") & (crime_df["Borough"] == "Sutton")), "Ward"] = "Belmont Sutton"        

Dataset 2: List of London Boroughs

  • The dataset, “List of London Boroughs”, has been extracted from Wikipedia.org
  • It has been used to fetch more information on the different Boroughs of London, such as each borough's local authority, the political party controlling that authority, the authority's headquarters, the Borough's area, its population, its coordinates, and its designated number on the map of London
  • With this information we can get more insight into the various Boroughs of London
  • Post cleaning and processing the data, 2 columns have been dropped and 5 columns have been renamed
  • The dataset now contains 8 columns
  • The dataset contains a total of 32 records, which is the total number of London Boroughs, excluding the City of London

london_bor_list_url = "https://en.wikipedia.org/wiki/List_of_London_boroughs"

london_bor_list = pd.read_html(london_bor_list_url)

london_bor_df = london_bor_list[0]

london_bor_df.columns=["Borough", "Inner", "Status", "Local Authority", "Political Control", "Head Quarters", "Area (sq mi)", "Population (2013 estimate)",
"Co-ordinates", "Borough No. on Map"]

london_bor_df = london_bor_df.replace("note 1", "", regex=True)
london_bor_df = london_bor_df.replace("note 2", "", regex=True)
london_bor_df = london_bor_df.replace("note 3", "", regex=True)
london_bor_df = london_bor_df.replace("note 4", "", regex=True)
london_bor_df = london_bor_df.replace("note 5", "", regex=True)

london_bor_df["Borough"].replace({
    "Barking and Dagenham[]" : "Barking and Dagenham",
    "Greenwich []" : "Greenwich",
    "Hammersmith and Fulham[]" : "Hammersmith and Fulham"
}, inplace = True)

london_bor_df = london_bor_df.drop(["Inner", "Status"], axis = 1)        

Dataset 3: Merged Dataset of Dataset 1 and Dataset 2

  • The third dataset has been created by merging the first two datasets, i.e., by merging the datasets, “MPS Ward Level Crime (most recent 24 months)” and “List of London Boroughs”
  • The two datasets have been merged on the common column present in both the datasets, i.e., the “Borough” column
  • After merging and reindexing the columns, the dataset contains 37 columns
  • The dataset contains a total of 22,403 records

london_crime_df = pd.merge(crime_df, london_bor_df , on = 'Borough')

london_crime_df = london_crime_df.reindex(["Ward Code", "Ward", "Borough", "Local Authority", "Political Control", "Head Quarters", "Area (sq mi)", "Population (2013 estimate)", "Co-ordinates", "Borough No. on Map", "Crime Head", "Crime Sub-Head", 201907, 201908, 201909, 201910, 201911, 201912, 202001, 202002, 202003, 202004, 202005, 202006, 202007, 202008, 202009, 202010, 202011, 202012, 202101, 202102, 202103, 202104, 202105, 202106, "Total"], axis = 1)

  • The merged dataset provides more information on the different Boroughs of London, like the local authority of the borough, the political party controlling the local authority, the address of the local authority, the area of the Borough, its population, its coordinates, and its designated number on the map of London
  • Thus, with this information we can get more insight into the various Boroughs of London

Dataset 4: London Postcodes

  • As the name suggests, this dataset has been used to fetch the Postcodes of the different neighbourhoods in London
  • This dataset has been created to find Venues in the Neighbourhood of London using the Foursquare API
  • It has been extracted from Doogal.co.uk and has a complete list of London postcode districts
  • The original dataset had 49 columns, but during cleaning 28 columns were dropped as they were not required for the analysis
  • Post cleaning and processing, the dataset contains 21 columns
  • While exploring the first dataset, i.e., “London Crime”, it was found that there were two Wards named "Belmont", one in Harrow and one in Sutton. To keep them separate and avoid any confusion during analysis, they were renamed "Belmont Harrow" and "Belmont Sutton"
  • In order to maintain consistency, the names of these two Wards were also changed in this dataset
  • Even though this dataset already had the Latitude and the Longitude data available, I have used the ArcGIS API to re-fetch the coordinates of the preferred locations
  • Before cleaning, the dataset contained a total of 324,634 records
  • The number of records was reduced from 324,634 to 179,704 after removing the Postcodes that were no longer in use
  • Once the dataset was processed to include only the Top 5 safest Boroughs of London, the number of records reduced from 179,704 to 20,249
  • After the dataset was processed further, to include only the Top 50 safest Wards of London, the number of records reduced from 20,249 to 10,083
  • Now, we could have used these 10,083 Postcodes of the Top 50 Wards of London to find their coordinates, but the process of fetching the coordinates for so many postcodes would have taken a lot of time. Hence, it was necessary to reduce the number of records further.
  • Therefore, in order to reduce the dataset further, I selected the location that was nearest to the Station
  • After processing the dataset, it was found that there was a total of 81 locations in the Top 50 safest Wards of London that were nearest to the Stations
  • Thus, the number of Postcodes were reduced from 10,083 to 81

london_postcodes_df = pd.read_csv("London Postcodes.csv", low_memory = False)

london_postcodes_df = london_postcodes_df.drop(["County", "Country", "County Code", "Introduced", "Terminated", "Parish", "National Park", "Population", "Households", "Built up area", "Built up sub-division", "Rural/urban", "Region", "Altitude", "Local authority", "Parish Code", "Census output area", "Index of Multiple Deprivation", "Quality", "User Type", "Last updated", "Police force", "Water company", "Plus Code", "Average Income", "Sewage Company", "Travel To Work Area"], axis = 1)

london_postcodes_df = london_postcodes_df[london_postcodes_df["In Use?"] == "Yes"]

london_postcodes_df = london_postcodes_df.drop(["In Use?"], axis = 1)

postcode_cols = ["Postcode Data", "Latitude Data", "Longitude Data", "Easting", "Northing", "Grid Ref", "Borough", "Ward", "Borough Code", "Ward Code", "Constituency", "Lower Layer Super Output Area", "London Zone", "LSOA Code", "MSOA Code", "Middle Layer Super Output Area", "Constituency Code", "Nearest Station", "Distance To Station", "Postcode Area", "Postcode District"]

london_postcodes_df.columns = postcode_cols

postcode_cols_new = ["Postcode Data", "Latitude Data", "Longitude Data", "Nearest Station", "Distance To Station", "Ward Code", "Ward", "Borough Code", "Borough", "Constituency Code", "Constituency", "LSOA Code", "Lower Layer Super Output Area", "MSOA Code", "Middle Layer Super Output Area", "London Zone", "Postcode Area", "Postcode District", "Easting", "Northing", "Grid Ref"]

london_postcodes_df = london_postcodes_df.reindex(postcode_cols_new, axis = 1)

london_postcodes_df.loc[((london_postcodes_df["Ward"] == "Belmont") & (london_postcodes_df["Borough"] == "Harrow")), "Ward"] = "Belmont Harrow"

london_postcodes_df.loc[((london_postcodes_df["Ward"] == "Belmont") & (london_postcodes_df["Borough"] == "Sutton")), "Ward"] = "Belmont Sutton"        


C] Understanding the Dataset Using Groupby Function and Charts:

We will then use the Groupby Function and Charts to understand the data better

  • During this process, the dataset will be used to find the Boroughs with the highest and the lowest crime rates
  • After finding the boroughs with the lowest crime rates, the data will be sorted and the 5 safest Boroughs in London identified
  • Though these 5 Boroughs could easily serve our purpose, being the safest compared to the other Boroughs of London, we will go further and try to eliminate areas with crime so as to find the most secure venues for our target audience
  • If we took all 92 Wards from the shortlisted 5 safe Boroughs, some of the Venues could still fall in an "unsafe" Ward of an otherwise safe Borough
  • Therefore, to avoid such a scenario and to ensure that the Venues found are in the most secure areas of London, another layer of safety will be added by identifying the 10 safest Wards within each of the 5 safest Boroughs
  • Thus, out of a total of 615 Wards in the whole of London, we will shortlist only the 50 safest
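The two-stage shortlisting described above can be sketched on toy data; the borough names, ward names and totals below are placeholders, and the cut-offs (2 boroughs, 2 wards each) stand in for the project's 5 and 10:

```python
import pandas as pd

# Placeholder crime totals: 3 boroughs ("A", "B", "C") with 4 wards each
df = pd.DataFrame({
    "Borough": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "Ward": [f"W{i}" for i in range(12)],
    "Total": [10, 20, 30, 40, 5, 15, 25, 35, 50, 60, 70, 80],
})

# Stage 1: the boroughs with the lowest overall crime
safest_boroughs = df.groupby("Borough")["Total"].sum().nsmallest(2).index

# Stage 2: within each safe borough, the wards with the lowest totals
safest_wards = (df[df["Borough"].isin(safest_boroughs)]
                .sort_values("Total")
                .groupby("Borough")
                .head(2))
```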

(i) Group the Dataframe by “Borough”

bor_crime_df = crime_df.groupby("Borough").sum()

bor_crime_df.sort_values(by = ["Total", "Borough"], inplace = True)        

  • Plotting the Bar Chart of the Total Crimes Recorded During the Period July 2019 to June 2021

bar_chart = px.bar(
    bor_crime_df,
    title = "Total Crimes Recorded in London Boroughs During the Period July 2019 to June 2021",
    color = "Total",
    color_continuous_scale = [(0, "cyan"), (0.25, "yellow"), (0.5, "red"), (0.75, "red"), (1, "maroon")],
    width = 1000,
    height = 700
)

bar_chart.show()
Total Crimes Recorded During the Period July 2019 to June 2021

  • Plotting the Horizontal Bar Chart of the Total Crimes Recorded During the Period July 2019 to June 2021

h_bar_chart = px.bar(
    bor_crime_df,
    title = "Total Crimes Recorded in London Boroughs During the Period July 2019 to June 2021",
    color = "Total",
    orientation = "h",
    color_continuous_scale = [(0, "cyan"), (0.25, "yellow"), (0.5, "red"), (0.75, "red"), (1, "maroon")],
    width = 1000,
    height = 750
)

h_bar_chart.show()
Total Crimes Recorded During the Period July 2019 to June 2021

(ii) Group the Dataframe by “Ward”

  • Plotting the Horizontal Bar Chart of the Total Crimes Recorded in the Top 20 Safest London Wards During the Period July 2019 to June 2021

top_ward_crime_df = crime_df.groupby("Ward").sum()

top_ward_crime_df.sort_values(by = ["Total", "Ward"], ascending = True, inplace = True)

top_ward_crime_df = top_ward_crime_df.head(20)

h_bar_chart = px.bar(
    top_ward_crime_df,
    title = "Total Crimes Recorded in the Top 20 Safest London Wards During the Period July 2019 to June 2021",
    color = "Total",
    orientation = "h",
    color_continuous_scale = [(0, "cyan"), (0.25, "lightgreen"), (0.5, "yellow"), (0.75, "orange"), (1, "red")]
)

h_bar_chart.show()
Total Crimes Recorded in the Top 20 Safest London Wards During the Period July 2019 to June 2021

  • Plotting the Horizontal Bar Chart of the Total Crimes Recorded in the 20 Most Dangerous London Wards During the Period July 2019 to June 2021

worst_ward_crime_df = crime_df.groupby("Ward").sum()

worst_ward_crime_df.sort_values(by = ["Total", "Ward"], ascending = True, inplace = True)

worst_ward_crime_df = worst_ward_crime_df.tail(20)

h_bar_chart = px.bar(
    worst_ward_crime_df,
    title = "Total Crimes Recorded in the 20 Most Dangerous London Wards During the Period July 2019 to June 2021",
    color = "Total",
    orientation = "h",
    color_continuous_scale = [(0, "red"), (0.25, "darkred"), (0.5, "maroon"), (0.75, "maroon"), (1, "indigo")]
)

h_bar_chart.show()
Total Crimes Recorded in the 20 Most Dangerous London Wards During the Period July 2019 to June 2021

(iii) Group the Dataframe by “Crime Head”

  • Plotting the Horizontal Bar Chart for the Types of Crimes Recorded in London During the Period July 2019 to June 2021

type_crimes_crime_df = crime_df.groupby(["Crime Head"]).sum()

type_crimes_crime_df.sort_values(by = ["Total", "Crime Head"], ascending = True, inplace = True)

type_crimes_crime_bar = px.bar(
    type_crimes_crime_df,
    title = "Types of Crimes Recorded in London During the Period July 2019 to June 2021",
    color = "Total",
    orientation = "h",
    color_continuous_scale = [(0, "cyan"), (0.25, "yellow"), (0.5, "orange"), (0.75, "red"), (1, "maroon")]
)

type_crimes_crime_bar.show()
Types of Crimes Recorded in London During the Period July 2019 to June 2021

(iv) Group the Dataframe by “Crime Sub-Head”

  • Plotting the Horizontal Bar Chart for the Top 20 Crimes Recorded in London During the Period July 2019 to June 2021

type_sub_crimes_crime_df = crime_df.groupby(["Crime Sub-Head"]).sum()

type_sub_crimes_crime_df.sort_values(by = ["Total", "Crime Sub-Head"], ascending = False, inplace = True)

type_sub_crimes_crime_df = type_sub_crimes_crime_df.head(20)

type_sub_crimes_crime_df.sort_values(by = ["Total", "Crime Sub-Head"], ascending = True, inplace = True)

type_sub_crimes_crime_bar = px.bar(
    type_sub_crimes_crime_df,
    title = "Top 20 Crimes Recorded in London During the Period July 2019 to June 2021",
    color = "Total",
    orientation = "h",
    color_continuous_scale = [(0, "cyan"), (0.25, "orange"), (0.5, "red"), (0.75, "maroon"), (1, "purple")]
)

type_sub_crimes_crime_bar.show()
Top 20 Crimes Recorded in London During the Period July 2019 to June 2021

(v) Top 10 Safest and 10 Most Dangerous Boroughs of London

  • Plotting the Bar Chart of the Top 10 Safest Boroughs of London

T10S_bor_crime_df = bor_crime_df.head(10)

T10S_bor_bar = px.bar(
    T10S_bor_crime_df,
    title = "Top 10 Safest Boroughs of London",
    color = "Total",
    color_continuous_scale = [(0, "cyan"), (0.25, "lightgreen"), (0.5, "yellow"), (0.75, "orange"), (1, "red")]
)

T10S_bor_bar.show()
Top 10 Safest Boroughs of London

  • Plotting the Bar Chart of the 10 Most Dangerous Boroughs of London

W10D_bor_crime_df = bor_crime_df.sort_values(by = "Total", ascending = False).head(10)

W10D_bor_bar = px.bar(
    W10D_bor_crime_df,
    title = "10 Most Dangerous Boroughs of London",
    color = "Total",
    color_continuous_scale = [(0, "yellow"), (0.25, "red"), (0.5, "red"), (0.75, "maroon"), (1, "maroon")]
)

W10D_bor_bar.show()
10 Most Dangerous Boroughs of London

(vi) Top 5 Crimes in the 5 Safest Boroughs of London Grouped By "Borough"

  • Plotting the Grouped Bar Chart of the Top 5 Crimes in the 5 Safest Boroughs of London, Grouped By "Borough"

df = [N01SB_T05C_df, N02SB_T05C_df, N03SB_T05C_df, N04SB_T05C_df, N05SB_T05C_df]

Top05SB_df = pd.concat(df, ignore_index = True)  # DataFrame.append is deprecated; concatenate instead

Top05SB_df.set_index("Borough", inplace = True)

Top05SB_df_new = Top05SB_df[["Crime Head", "Total"]]

Top05SB_bar = px.bar(
    Top05SB_df_new,
    title = "Top 5 Crimes of the 5 Safest Boroughs of London",
    color = "Crime Head",
    barmode = "group",
    width = 900,
    height = 600
)

Top05SB_bar.show()
Top 5 Crimes in the 5 Safest Boroughs of London Grouped By "Borough"

(vii) Top 10 Crimes in the 5 Safest Boroughs of London Grouped By "Crime Head"

  • Plotting the Grouped Bar Chart of the Top 10 Crimes in the 5 Safest Boroughs of London, Grouped By "Crime Head"

T10C_df = pd.DataFrame()

for i in range(5):
    data_df = bor_ch_crime_df[bor_ch_crime_df["Borough"] == T10S_boroughs[i]].sort_values(by = "Total", ascending = False).head(10)
    T10C_df = pd.concat([T10C_df, data_df], ignore_index = True)  # replaces the deprecated DataFrame.append

T10C_Top05SB_df = T10C_df[["Borough", "Crime Head", "Total"]]

T10C_Top05SB_df = T10C_Top05SB_df.reindex(["Crime Head", "Borough", "Total"], axis = 1)

T10C_Top05SB_df.sort_values(["Total"], ascending = False, inplace = True)

T10C_Top05SB_df.sort_values(["Crime Head", "Total", "Borough"], ascending = False, inplace = True)

T10C_bct_df = T10C_df[["Borough", "Crime Head", "Total"]]

T10C_bct_df = T10C_bct_df.groupby("Crime Head").sum()

T10C_bct_df.sort_values(["Total"], ascending = False, inplace = True)

T10C_bct_list = [str(i) for i in list(T10C_bct_df.index)]

T10C_df_new = pd.DataFrame()

for i in range(10):
    data_df_new = T10C_Top05SB_df[T10C_Top05SB_df["Crime Head"] == T10C_bct_list[i]].sort_values(by = "Total", ascending = False)
    T10C_df_new = pd.concat([T10C_df_new, data_df_new], ignore_index = True)

T10C_df_new.set_index("Crime Head", inplace = True)

T10C_Top05SB_bar = px.bar(
    T10C_df_new,
    title = "Top 10 Crimes of the 5 Safest Boroughs of London",
    color = "Borough",
    barmode = "group",
    width = 1000,
    height = 700
)

T10C_Top05SB_bar.show()
Top 10 Crimes in the 5 Safest Boroughs of London Grouped By "Crime Head"

(viii) Top 50 Safest Wards in the 5 Safest Boroughs of London

  • Plotting the Pie Chart of the Top 50 Safest Wards in the 5 Safest Boroughs of London

bor_ward_crime_df = crime_df.copy(deep = True)

bor_ward_crime_df = bor_ward_crime_df[["Borough", "Ward", "Total"]]

bor_ward_crime_df = bor_ward_crime_df.groupby(["Borough", "Ward"]).sum()

bor_ward_crime_df.reset_index(inplace = True)

T10W_df = pd.DataFrame()

for i in range(5):
    data_df_updated = bor_ward_crime_df[bor_ward_crime_df["Borough"] == T10S_boroughs[i]].sort_values(by = "Total", ascending = True).head(10)
    T10W_df = pd.concat([T10W_df, data_df_updated], ignore_index = True)

T10W_df_bar = px.pie(
    T10W_df,
    title = "Top 50 Safest Wards in the 5 Safest Boroughs of London",
    values = "Total",
    names = "Ward",
    color = "Borough",
    width = 900,
    height = 700
)

T10W_df_bar.show()
Top 50 Safest Wards in the 5 Safest Boroughs of London


D] Collecting the Coordinates and Plotting them on the Map of London:

Once we have identified the safest Boroughs and Wards of London, we will extract the Postcodes of the different neighbourhoods in London

  • It should be noted that even though the dataset already has the Latitude and the Longitude data available as a part of the originally downloaded dataset, we will still be using the ArcGIS API to re-fetch the coordinates of the preferred locations
  • The dataset was processed to identify the postcodes of the Top 50 safest Wards of London
  • After cleaning the data, we find that there are still 10,083 records of the Postcodes for the Top 50 safest Wards of London
  • Now, we could have used these 10,083 Postcodes of the Top 50 Wards of London to find their coordinates, but the process of fetching the coordinates for so many postcodes would have taken a lot of time
  • Hence, it is necessary to reduce the number of records further
  • To further reduce the number of locations, only the postcode closest to each station within these safe Wards will be selected
  • This not only reduces the number of locations, but also benefits the target audience: venues nearer to stations mean shorter, more convenient journeys
  • Since 81 stations fall within these safe Wards, the number of Postcodes reduces from 10,083 to 81
  • Further, since we have selected only the locations nearest to a station, we will rename the column "Nearest Station" to "Neighbourhood", as each of these neighbourhoods is effectively identified by its Station
  • This dataset of Postcodes will then be used to fetch the geographical coordinates, i.e., the Latitude and Longitude, of the different neighbourhoods within the Top 50 safest Wards of London
  • As discussed earlier, we will use the ArcGIS API to collect the Latitude and Longitude coordinates of the neighbourhoods based on their postcodes
  • These coordinates will then be used to plot these locations on the Map of London
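The renaming step described above is a one-liner; the rows below are placeholder examples standing in for the real `nearest_to_station_df`:

```python
import pandas as pd

# Placeholder rows standing in for the real nearest-to-station dataset
nearest_to_station_df = pd.DataFrame({
    "Postcode Data": ["AA1 1AA", "BB2 2BB"],
    "Nearest Station": ["Station One", "Station Two"],
    "Distance To Station": [120.0, 90.0],
})

# Each remaining postcode is the closest one to its station, so the station
# name effectively identifies the neighbourhood
neighbourhood_df = nearest_to_station_df.rename(
    columns = {"Nearest Station": "Neighbourhood"})
```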

(i) Merging the Crime Dataframe for the Top 50 Safest Wards

all_crime_df = crime_df[["Ward Code", "Ward", "Borough", "Crime Head", "Crime Sub-Head", "Total"]]

Top5_bor_crime_df = pd.DataFrame()

for i in range(5):
    data_df_updated = all_crime_df[all_crime_df["Borough"] == T10S_boroughs[i]]
    Top5_bor_crime_df = pd.concat([Top5_bor_crime_df, data_df_updated], ignore_index = True)

T10W_df_updated = T10W_df_new.copy(deep = True)

T10W_df_updated.drop(["Total"], axis = 1, inplace = True)

T10W_crime_df = pd.merge(T10W_df_updated, Top5_bor_crime_df, on = 'Ward')

T10W_crime_df.drop(["Borough_y"], axis = 1, inplace = True)

T10W_crime_df = T10W_crime_df.reindex(["Ward Code", "Ward", "Borough_x", "Crime Head", "Crime Sub-Head", "Total"], axis = 1)

T10W_crime_df.columns = ["Ward Code", "Ward", "Borough", "Crime Head", "Crime Sub-Head", "Total"]        

(ii) Extracting the Postcodes of the 50 Safest Wards of London and Selecting the Locations Nearest to a Station

top5_bor_postcode_df = london_postcodes_df.copy(deep = True)

postcode_top5_bor_df = pd.DataFrame()

for i in range(5):
    dataset_new = top5_bor_postcode_df[top5_bor_postcode_df["Borough"] == T10S_boroughs[i]]
    postcode_top5_bor_df = pd.concat([postcode_top5_bor_df, dataset_new], ignore_index = True)

postcode_top50_ward_df = pd.DataFrame()

for i in range(len(T10W_df_updated)):
? ? dataset_updated = pd.DataFrame()
? ? dataset_updated = postcode_top5_bor_df[postcode_top5_bor_df["Ward"] == T10W_df_updated["Ward"][i]]
? ? postcode_top50_ward_df = postcode_top50_ward_df.append(dataset_updated, ignore_index = True)

min_dist_station_df = postcode_top50_ward_df.groupby(["Nearest Station", "Distance To Station"]).min()

min_dist_station_df.reset_index(inplace = True)

nearest_to_station_df = min_dist_station_df.drop_duplicates(subset = ["Nearest Station"], keep = "first")

nearest_to_station_df.reset_index(inplace = True)

nearest_to_station_df.drop(["index"], axis = 1, inplace = True)        

(iii) Fetching Coordinates by Using the ArcGIS API

gis = GIS()  # anonymous connection to ArcGIS Online; add credentials here if required


def get_coordinates_uk(address):
    g = geocode(address = "{}, London, England, GBR".format(address))[0]
    longitude_coordinates = g["location"]["x"]
    latitude_coordinates = g["location"]["y"]
    return str(latitude_coordinates) + "," + str(longitude_coordinates)

london_postcodes = nearest_to_station_df.loc[ : , "Postcode Data"]

london_postcodes_dfnew = pd.DataFrame(london_postcodes)

london_postcodes_dfnew.columns = ["Postcodes"]

london_coordinates = []

for postcode in london_postcodes:
    london_coordinates.append(get_coordinates_uk(postcode))

london_latitude = []
london_longitude = []

for coordinate in london_coordinates:
    lat, long = coordinate.split(",")
    london_latitude.append(round(float(lat), 5))
    london_longitude.append(round(float(long), 5))

london_latitude_df = pd.DataFrame(london_latitude, columns = ["Latitude"])

london_longitude_df = pd.DataFrame(london_longitude, columns = ["Longitude"])

london_pc_df = pd.concat([london_postcodes_dfnew, london_latitude_df, london_longitude_df], axis=1)

nearest_to_station_coordinates_df = pd.concat([nearest_to_station_df, london_pc_df], join = "outer", axis=1)

postcode_cols_new = ["Nearest Station", "Distance To Station", "Postcodes", "Latitude", "Longitude", "Ward Code", "Ward", "Borough Code", "Borough", "Constituency Code", "Constituency", "LSOA Code", "Lower Layer Super Output Area", "MSOA Code", "Middle Layer Super Output Area", "London Zone", "Postcode Area", "Postcode District", "Easting", "Northing", "Grid Ref", "Postcode Data", "Latitude Data", "Longitude Data"]

nearest_to_station_coordinates_df = nearest_to_station_coordinates_df.reindex(postcode_cols_new, axis = 1)

postcode_cols_updated = ["Neighbourhood", "Distance To Station", "Postcodes", "Latitude", "Longitude", "Ward Code", "Ward", "Borough Code", "Borough", "Constituency Code", "Constituency", "LSOA Code", "Lower Layer Super Output Area", "MSOA Code", "Middle Layer Super Output Area", "London Zone", "Postcode Area", "Postcode District", "Easting", "Northing", "Grid Ref", "Postcode Data", "Latitude Data", "Longitude Data"]

nearest_to_station_coordinates_df.columns = postcode_cols_updated

neighbourhood_df = nearest_to_station_coordinates_df.reindex(postcode_cols_updated, axis = 1)        

(iv) Plotting All Stations on the Map of London

address = "London, England"

geolocator = Nominatim(user_agent = "london_explorer")

location = geolocator.geocode(address)

latitude = location.latitude

longitude = location.longitude

print("The coordinates of London are {}, {}.".format(latitude, longitude))
min_dist_all_station_df = london_postcodes_df.groupby(["Nearest Station", "Distance To Station"]).min()

min_dist_all_station_df.reset_index(inplace = True)

min_dist_all_station_df.to_csv("Minimum Distance to All Stations.csv")

station_df = min_dist_all_station_df.drop_duplicates(subset = ["Nearest Station"], keep = "first")

station_df.reset_index(inplace = True)

station_df.drop(["index"], axis = 1, inplace = True)        
# Creating the map of London
map_London_all_stations = folium.Map(location = [latitude, longitude], zoom_start = 10)

# Adding markers to map (loop variables renamed so that they do not
# overwrite the map-centre latitude / longitude defined above)
for lat, lng, borough, ward, neighbourhood in zip(station_df["Latitude Data"], station_df["Longitude Data"], station_df["Borough"], station_df["Ward"], station_df["Nearest Station"]):
    label = "{}, {}, {}".format(neighbourhood, ward, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = "red",
        fill = True
    ).add_to(map_London_all_stations)

map_London_all_stations        
Plotting All Stations of London on the Map of London

(v) Plotting Stations in the Safest Wards of London on the Map of London

# Creating the map of London
map_London_safe_neigh = folium.Map(location = [latitude, longitude], zoom_start = 10)

# Adding markers to map
for lat, lng, borough, ward, neighbourhood in zip(neighbourhood_df["Latitude"], neighbourhood_df["Longitude"], neighbourhood_df["Borough"], neighbourhood_df["Ward"], neighbourhood_df["Neighbourhood"]):
    label = "{}, {}, {}".format(neighbourhood, ward, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = "blue",
        fill = True
    ).add_to(map_London_safe_neigh)

map_London_safe_neigh        
Plotting Stations in the Safest Wards of London on the Map of London


E] Identifying Venues Around the Safest Neighbourhoods of London:

  • The Latitude and Longitude coordinates will be linked with the Foursquare API to identify the different venues near these neighbourhoods
  • In order to get the required information, we provide the Foursquare API with the Latitude and Longitude coordinates of the preferred neighbourhood
  • Based on the Latitude and Longitude coordinates, the Foursquare API acquires information about different venues within each neighbourhood
  • The data retrieved from the Foursquare API contains information on venues within a radius of 500 metres of each postcode's latitude and longitude
  • The following information is obtained for each venue,

- Neighbourhood: Name of the Neighbourhood

- Neighbourhood Latitude: Latitude of the Neighbourhood

- Neighbourhood Longitude: Longitude of the Neighbourhood

- Venue: Name of the Venue

- Venue Category: Category of the Venue

- Venue Latitude: Latitude of the Venue

- Venue Longitude: Longitude of the Venue

  • In order to understand this information better, we will analyse the data using the Groupby function
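The 500-metre radius mentioned above is a great-circle distance, which can be verified with the standard haversine formula. The sketch below is illustrative; the coordinates are example values, not taken from the project's data:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    R = 6371000  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

# Illustrative check: a venue a few hundred metres from a neighbourhood centre
neigh_lat, neigh_lon = 51.5074, -0.1278   # central London (example values)
venue_lat, venue_lon = 51.5100, -0.1290
d = haversine_m(neigh_lat, neigh_lon, venue_lat, venue_lon)
print(round(d), "metres ->", "within 500 m" if d <= 500 else "outside 500 m")
```

A check like this is a quick way to confirm that the venues returned by the API really do fall inside the requested radius.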

CLIENT_ID = "xxxxxxxxxxxxx" # Enter your Foursquare ID

CLIENT_SECRET = "xxxxxxxxxxxxx" # Enter your Foursquare Secret

VERSION = "20180605" # Foursquare API version

LIMIT = 100 # A default Foursquare API limit value        

  • Function to Get the Nearby Venues

def getNearbyVenues(names, wards, boroughs, latitudes, longitudes, radius = 500):
    venues_list = []
    for name, ward, borough, lat, lng in zip(names, wards, boroughs, latitudes, longitudes):
        print(name)

        # create the API request URL (LIMIT caps the number of venues returned)
        url = "https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
                CLIENT_ID,
                CLIENT_SECRET,
                VERSION,
                lat,
                lng,
                radius,
                LIMIT
                )

        # make the GET request
        results = requests.get(url).json()["response"]["groups"][0]["items"]

        # keep only the relevant information for each nearby venue
        venues_list.append([(
                name,
                ward,
                borough,
                lat,
                lng,
                v["venue"]["name"],
                v["venue"]["categories"][0]["name"],
                v["venue"]["location"]["lat"],
                v["venue"]["location"]["lng"]
            ) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])

    nearby_venues.columns = ["Neighbourhood", "Ward", "Borough", "Neighbourhood Latitude", "Neighbourhood Longitude", "Venue", "Venue Category", "Venue Latitude", "Venue Longitude"]

    return nearby_venues

venues_london = getNearbyVenues(neighbourhood_df["Neighbourhood"], neighbourhood_df["Ward"], neighbourhood_df["Borough"], neighbourhood_df["Latitude"], neighbourhood_df["Longitude"])        


F] Segmenting Neighbourhoods of London by Common Venue Categories:

(i) One Hot Encoding

  • Here, we will apply One Hot Encoding to the column Venue Category
  • This converts every distinct value in Venue Category into its own indicator column

venues_london_ohe = pd.get_dummies(venues_london[["Venue Category"]], prefix = "", prefix_sep = "")

venues_london_ohe["Neighbourhood"] = venues_london["Neighbourhood"] # This adds the "Neighbourhood" column in the end

# Moving the Neighbourhood Column to the First Column
columns = [venues_london_ohe.columns[-1]] + list(venues_london_ohe.columns[ : -1])

venues_london_ohe = venues_london_ohe[columns]

neighbourhood_group_ohe = venues_london_ohe.groupby("Neighbourhood").sum()

neighbourhood_group_ohe.reset_index(inplace = True)        
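As a quick sanity check, the same `get_dummies` + `groupby` pattern can be tried on a toy frame. The neighbourhood and category names below are illustrative only:

```python
import pandas as pd

# Toy version of the venues table (names illustrative)
venues = pd.DataFrame({
    "Neighbourhood": ["Angel", "Angel", "Victoria"],
    "Venue Category": ["Pub", "Café", "Pub"],
})

# One indicator column per distinct Venue Category value
ohe = pd.get_dummies(venues[["Venue Category"]], prefix = "", prefix_sep = "")
ohe["Neighbourhood"] = venues["Neighbourhood"]

# Summing the indicator columns per neighbourhood gives venue counts
counts = ohe.groupby("Neighbourhood").sum()
print(counts)
```

Each row of `counts` is then a small "venue profile" of one neighbourhood, which is the representation the clustering step below operates on.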

(ii) Printing Each Neighbourhood Along with the Top 8 Most Common Venues

  • We will then print each Neighbourhood along with the Top 8 Most Common Venues in that Neighbourhood

num_top_venues = 8


for neigh in neighbourhood_group_ohe["Neighbourhood"]:
    print("---------" + neigh + "---------")
    temp = neighbourhood_group_ohe[neighbourhood_group_ohe["Neighbourhood"] == neigh].T.reset_index()
    temp.columns = ["Venue", "Frequency"]
    temp = temp.iloc[1 : ]
    temp["Frequency"] = temp["Frequency"].astype(float)
    temp = temp.round({"Frequency" : 2})
    print(temp.sort_values("Frequency", ascending = False).reset_index(drop = True).head(num_top_venues))
    print("\n")

(iii) Transferring the Venues into a Pandas Dataframe

  • After this, we will create a dataframe having the columns Neighbourhood and the Top 8 Most Common Venues in those Neighbourhoods

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1 : ]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    return row_categories_sorted.index.values[0 : num_top_venues]


indicators = ["st", "nd", "rd"]

# Create columns according to the number of top Venues
columns = ["Neighbourhood"]

for ind in np.arange(num_top_venues):
    try:
        columns.append("{}{} Most Common Venue".format(ind + 1, indicators[ind]))
    except IndexError:
        columns.append("{}th Most Common Venue".format(ind + 1))

# Create a new Dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns = columns)

neighbourhoods_venues_sorted["Neighbourhood"] = neighbourhood_group_ohe["Neighbourhood"]

for ind in np.arange(neighbourhood_group_ohe.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1 : ] = return_most_common_venues(neighbourhood_group_ohe.iloc[ind, : ], num_top_venues)


G] Clustering Neighbourhoods by Common Venues (K-Means Clustering):

(i) Building a Model to Cluster the Neighbourhoods

  • In order to assist our Target Audience to find venues of their choice in the safest neighbourhoods of London, we will be clustering the neighbourhoods using the K-Means Clustering Algorithm
  • The K-Means Clustering Algorithm will cluster neighbourhoods with similar venues into different clusters
  • We will first use the Elbow Method to identify the Optimal Number of Clusters
  • The Elbow method suggests a good number of clusters k based on the sum of squared errors (SSE) between data points and their assigned clusters' centroids
  • As per this method, the optimal number of clusters is reached where the slope of the SSE curve changes sharply and then levels off
  • Thus, we pick k at the spot where the SSE starts to flatten out, forming an "elbow"
  • After we have identified the optimal number of clusters, we will run the Machine Learning Algorithm to get the Cluster Labels
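The "sharpest bend" idea above can be expressed numerically: on a decreasing SSE curve, the elbow is roughly where the second difference of the SSE is largest. The sketch below uses a synthetic, illustrative SSE curve (not the project's actual inertia values):

```python
import numpy as np

# Hypothetical SSE (inertia) values for k = 1..8, flattening after k = 3
K = np.arange(1, 9)
distortions = np.array([2000.0, 1520.0, 1120.0, 1060.0, 1030.0, 1010.0, 1000.0, 995.0])

# One simple heuristic: the elbow is where the curve bends most sharply,
# i.e. where the second difference of the SSE is largest
second_diff = np.diff(distortions, n = 2)     # defined for k = 2 .. 7
elbow_k = K[1 + np.argmax(second_diff)]
print("Suggested k:", elbow_k)
```

On real data this heuristic should only complement, not replace, eyeballing the elbow plot, since noisy inertia values can shift the maximum second difference.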

neighbourhood_group_cluster = neighbourhood_group_ohe.drop(labels = "Neighbourhood", axis = 1)

distortions = []

K = range(1,20)

for k in K:
    kmean = KMeans(init = "k-means++", n_clusters = k, random_state = 0, n_init = 50, max_iter = 500)
    kmean.fit(neighbourhood_group_cluster)
    distortions.append(kmean.inertia_)

plt.figure(figsize = (10, 5))

plt.plot(K, distortions, "bx-")

plt.xlabel("k")

plt.ylabel("Distortion")

plt.title("The Elbow Method")

plt.show()        
Elbow Method
# set number of clusters
k_num_clusters = 3

kmeans = KMeans(init = "k-means++", n_clusters = k_num_clusters, random_state = 0)

kmeans.fit(neighbourhood_group_cluster)

# check the cluster labels generated for each row in the dataframe
labels = kmeans.labels_

neighbourhoods_venues_sorted.insert(1, "Cluster Labels", kmeans.labels_)

neighbour_df = neighbourhood_df[["Neighbourhood", "Distance To Station", "Ward", "Borough", "Postcodes", "Latitude", "Longitude"]].copy(deep = True)

london_merged = neighbour_df

london_merged = london_merged.join(neighbourhoods_venues_sorted.set_index("Neighbourhood"), on = "Neighbourhood")

london_merged = london_merged.dropna(subset = ["Cluster Labels"])        

(ii) Principal Component Analysis (PCA)

  • Applying Dimensionality Reduction Techniques helps in visualising how the Clusters are related in the original high dimensional space
  • Hence, in order to see how the Clusters are related in the original space, we will use Principal Component Analysis (PCA) to visualise the high dimensional data
  • PCA also helps in finding if the features of the data are linearly related to each other
  • It can be seen that the first eight components, i.e., about 10% of the total components, preserve about 74% of the original variance, thus reducing the dimensionality of our data
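The explained-variance computation itself can be sketched without scikit-learn: the variance ratios follow directly from the singular values of the centred data matrix. The sketch below uses synthetic data as a stand-in for the neighbourhood-by-venue-category matrix; sizes and the random structure are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 80 rows x 20 columns, with most variance in 3 directions
# (a stand-in for the neighbourhood x venue-category count matrix)
base = rng.normal(size = (80, 3)) @ rng.normal(size = (3, 20))
X = base + 0.1 * rng.normal(size = (80, 20))

# Explained variance ratios from the singular values of the centred data
Xc = X - X.mean(axis = 0)
s = np.linalg.svd(Xc, compute_uv = False)
explained = s**2 / np.sum(s**2)

# How many components are needed to reach a 74% variance threshold
cumulative = np.cumsum(explained)
n_components = int(np.searchsorted(cumulative, 0.74)) + 1
print("Components needed for 74% of the variance:", n_components)
```

Because the synthetic data has only three strong directions, a handful of components suffices here; on the real venue-count matrix, the notebook's PCA needs eight components for roughly the same threshold.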

pca = PCA().fit(neighbourhood_group_cluster)

pca_neigh = pca.transform(neighbourhood_group_cluster)

print("Variance Explained by Each Component (%): ")

for i in range(len(pca.explained_variance_ratio_)):
    print("\nComponent", i + 1, ": " + str(round(pca.explained_variance_ratio_[i] * 100, 2)) + "%")

print("\nTotal Sum: " + str(round(sum(pca.explained_variance_ratio_) * 100, 2)) + "%")

print("\nExplained Variance of the First Eight Components, i.e. 10% of the Total Components: " + str(round(sum(pca.explained_variance_ratio_[0 : 8]) * 100, 2)) + "%")        
c1 = []
c2 = []
c3 = []

for i in range(len(pca_neigh)):
    if kmeans.labels_[i] == 0:
        c1.append(pca_neigh[i])
    elif kmeans.labels_[i] == 1:
        c2.append(pca_neigh[i])
    elif kmeans.labels_[i] == 2:
        c3.append(pca_neigh[i])

c1 = np.array(c1)
c2 = np.array(c2)
c3 = np.array(c3)

plt.figure(figsize = (10, 8))
plt.scatter(c1[ : , 0], c1[ : , 1], c = "red", label = "Cluster 1")
plt.scatter(c2[ : , 0], c2[ : , 1], c = "blue", label = "Cluster 2")
plt.scatter(c3[ : , 0], c3[ : , 1], c = "green", label = "Cluster 3")

plt.legend()
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Low Dimensional Visualisation (PCA) - Neighbourhoods")        
Principal Component Analysis (PCA)

(iii) Visualising the Resulting Clusters on the Map of London

# Create Map of London
map_clusters = folium.Map(location = [latitude, longitude], zoom_start = 10)


# Set color scheme for the clusters
x = np.arange(k_num_clusters)
ys = [i + x + (i * x) ** 2 for i in range(k_num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


for lat, lon, mcv, poi, ward, bor, cluster in zip(london_merged["Latitude"], london_merged["Longitude"], london_merged["1st Most Common Venue"], london_merged["Neighbourhood"], london_merged["Ward"], london_merged["Borough"], london_merged["Cluster Labels"]):
    label = folium.Popup("Cluster " + str(int(cluster) + 1) + ":\n" + str(mcv) + ",\n" + str(poi) + ",\n" + str(ward) + ",\n" + str(bor), parse_html = True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[int(cluster)],
        fill = True,
        fill_color = rainbow[int(cluster)],
        fill_opacity = 0.5
    ).add_to(map_clusters)

map_clusters        
Resulting Clusters on the Map of London

(iv) Examining the Clusters

  • Cluster 1

cluster_1 = london_merged.loc[london_merged["Cluster Labels"] == 0, london_merged.columns[[0] + [2] + [3] + list(range(7, london_merged.shape[1]))]]

cluster_1.to_csv("Venues in the Neighbourhood of London - Cluster 1.csv")

cluster_1        

  • Cluster 2

cluster_2 = london_merged.loc[london_merged["Cluster Labels"] == 1, london_merged.columns[[0] + [2] + [3] + list(range(7, london_merged.shape[1]))]]

cluster_2.to_csv("Venues in the Neighbourhood of London - Cluster 2.csv")

cluster_2        

  • Cluster 3

cluster_3 = london_merged.loc[london_merged["Cluster Labels"] == 2, london_merged.columns[[0] + [2] + [3] + list(range(7, london_merged.shape[1]))]]

cluster_3.to_csv("Venues in the Neighbourhood of London - Cluster 3.csv")

cluster_3        


Links to Jupyter Notebook

  • If, for some reason, you are unable to view / open the Jupyter Notebook, Charts or Maps on GitHub, you may access the .ipynb file from the links mentioned below
  • Link to the Jupyter Notebook on IBM Cloud:

https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/ac0cc32a-2a4b-44c9-babc-2506856bace4/view?access_token=4bd1f1dea43b545d2abaa7391720e9b947f6a990af10990b92fe5490ec212b3a

  • Link to the Jupyter Notebook on Binder:

https://mybinder.org/v2/gh/vincyspereira/Coursera_Capstone/cd96eb73058886ece38132d0265b443e5aaecb58

Note:

  1. First click on the “Week 5 – The Battle of Neighbourhoods (Part 2)” folder
  2. Next, click on the “Capstone Project - The Battle of Neighbourhoods - London's Crime Rate Analysis and Clustering of the Safest Neighbourhoods of London.ipynb” file to access the Jupyter Notebook
  3. Lastly, click the ‘File’ Menu and then select ‘Trust Notebook’ to view the charts and maps

  • Link to the Jupyter Notebook using ‘nbviewer’:

https://nbviewer.jupyter.org/github/vincyspereira/Coursera_Capstone/blob/cd96eb73058886ece38132d0265b443e5aaecb58/Week%205%20-%20The%20Battle%20of%20Neighborhoods%20(Part%202)/Capstone%20Project%20-%20The%20Battle%20of%20Neighbourhoods%20-%20London's%20Crime%20Rate%20Analysis%20and%20Clustering%20of%20the%20Safest%20Neighbourhoods%20of%20London.ipynb

  • Link to the Jupyter Notebook on GitHub:

https://github.com/vincyspereira/Coursera_Capstone/blob/cd96eb73058886ece38132d0265b443e5aaecb58/Week%205%20-%20The%20Battle%20of%20Neighborhoods%20(Part%202)/Capstone%20Project%20-%20The%20Battle%20of%20Neighbourhoods%20-%20London's%20Crime%20Rate%20Analysis%20and%20Clustering%20of%20the%20Safest%20Neighbourhoods%20of%20London.ipynb

Note:

If you are unable to view the code / charts properly on GitHub, then you may either:

  • Click on the “Circle with Horizontal Line” symbol on the top right-hand corner to view the Jupyter Notebook with “nbviewer”

OR

  • Click on the “Download” button to download the .ipynb file


Link to Report

  • Link to the Report on GitHub:

https://github.com/vincyspereira/Coursera_Capstone/blob/320adf5974202e43d9bd7f45e6fb631b5d2647de/Week%205%20-%20The%20Battle%20of%20Neighborhoods%20(Part%202)/Report%20-%20Capstone%20Project%20-%20The%20Battle%20of%20Neighbourhoods.pdf


Results and Discussion

  • The aim of this project is to help Migrants and Tourists who wish to explore the safest neighbourhoods of London
  • They can decide to stay or visit a specific neighbourhood based on their preferred cluster
  • Based on the type of clusters, different people, i.e., families with children, young couples, executives, or tourists, can decide which neighbourhood is best suited for them
  • Cluster 1:

- This cluster is mostly made up of Hotels, Pubs, Theatres, Art Galleries, Art Museums, Outdoor Sculptures and Plazas

- Thus, this cluster is most suitable for Tourists

  • Cluster 2:

- This cluster is mostly made up of Pubs, Coffee Shops, Cafés, Multicultural Restaurants, Bars, Gyms, Sports Clubs, Supermarkets, Grocery Stores, Shopping Plazas, Fast-Food Joints, etc.

- Thus, this cluster is most suitable for young couples and executives

  • Cluster 3:

- This is the biggest cluster from our Dataset

- It is mostly made up of Supermarkets, Bakeries, Pharmacies, Auto Garages, Parks, Playgrounds, Sports Complexes, Multicultural Restaurants, Ice Cream Parlours, Fish & Chips Shops, Pubs, Train Stations, and various stores (grocery, convenience, clothing, furniture, pet, optical, electronics, warehouse, etc.)

- It has almost everything that a family requires

- Thus, this cluster seems to be most suitable for families with children

  • This segmentation is also supported by the PCA chart
  • According to the PCA, Cluster 2 and Cluster 3 appear to be linearly related, while Cluster 1 is distinct from the other two clusters
  • As can be seen above, Clusters 2 & 3 seem to suit Migrants, who intend to stay in neighbourhoods falling in those clusters, while Cluster 1 seems to suit Tourists, who intend to visit neighbourhoods falling in that cluster


Conclusion

  • This Capstone Project will help families with children, young couples, executives, and tourists, to understand,

- which are the safe Boroughs, Wards and Neighbourhoods of London

- the most common venues in those neighbourhoods

- the different types of neighbourhoods based on the cluster of venue categories

- which neighbourhoods to choose as per their preference

  • As can be seen from the data on clusters, the aim of the project seems to have been fulfilled.


Thank You



