The Battle of Neighbourhoods: London's Crime Rate Analysis and Clustering of the Safest Neighbourhoods of London
Vincent Pereira
MBA (Financial & Computer Management) | Interested in Data Analytics, Data Science, Machine Learning and Deep Learning
Introduction
London is one of the most multicultural cities in the world. It is a melting pot of cultures, where one can meet people from all parts of the world and taste the best of world cuisine. It is a major centre for banking and finance, insurance, world trade, media, advertising, tourism, theatre, fashion, the arts and more. Fusing gritty, historic pomp with shimmering modernity, world-class culture and fashion-forward shopping, the UK's capital has it all, and there is something for everyone. The vibrancy of the city extends across all 32 of its boroughs, each of which is home to a plethora of unique neighbourhoods.
Business Problem
The decision to move to a new city, or to a new country altogether, is a harrowing one. But after having decided to move to London, the next challenge is deciding where in London to live. If one looks at a map of London, one finds a haphazard cluster of neighbourhoods and villages, each with its own distinct features and identity. Some of London's best neighbourhoods sit on the typical tourist trail, while others are constantly evolving, taking turns to emerge as the new cool hotspot. The following questions then arise:
And at the top of all these doubts sit the most intriguing questions anyone would face.
All these questions and more plague our minds, and then the quest to find the answers begins.
Objective of the Capstone Project
The objective of this assignment is to give an insight into what some of the safest London neighbourhoods can offer their residents and tourists.
To help uncover the best that London has to offer, this project aims to do the following:
Interested Parties
The objective of this project is to identify and recommend the best and safest neighbourhoods in London to anyone who wants to visit or relocate to the city. The interested parties could include:
Description of Data
1. MPS Ward Level Crime Data for London:
2. List of London Boroughs:
3. London Postcodes:
4. ArcGIS API Data:
- Latitude: Latitude of the Postcode
- Longitude: Longitude of the Postcode
5. Foursquare API Data:
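As a quick first look (a sketch, not part of the original notebook, assuming the two CSV files named in the Methodology below sit in the working directory), the raw datasets can be previewed before any cleaning:
import pandas as pd

# Preview the raw CSV sources used later in the Methodology
crime_raw = pd.read_csv("MPS Ward Level Crime (most recent 24 months).csv")
postcodes_raw = pd.read_csv("London Postcodes.csv", low_memory = False)
print(crime_raw.shape, postcodes_raw.shape)   # number of rows and columns in each file
print(crime_raw.head(3))                      # first few rows of the ward-level crime counts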
Methodology
A] Importing Libraries:
The libraries used in this project are:
# Data handling and HTTP requests
import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize
import requests
# Plotting and charting
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import plotly.express as px
import plotly.graph_objects as go
# Maps and geocoding
import folium
import geocoder
from geopy.geocoders import Nominatim
from arcgis.geocoding import geocode
from arcgis.gis import GIS
# Machine learning (the Intel extension patch_sklearn() speeds up scikit-learn)
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearnex import patch_sklearn
patch_sklearn()
B] Extracting, Scraping, Exploring, Cleaning and Processing the Datasets:
After importing all the required libraries, we extract the data from the different sources and clean it so that it is ready for processing and analysis.
Dataset 1: Metropolitan Police Service Ward Level Crime Data for London
crime_df = pd.read_csv("MPS Ward Level Crime (most recent 24 months).csv")
columns = ["Crime Head", "Crime Sub-Head", "Ward", "Ward Code", "Borough", 201907, 201908, 201909, 201910, 201911, 201912, 202001, 202002, 202003, 202004, 202005, 202006, 202007, 202008, 202009, 202010, 202011, 202012, 202101, 202102, 202103, 202104, 202105, 202106]
crime_df.columns = columns
crime_df = crime_df.reindex(["Ward Code", "Ward", "Borough", "Crime Head", "Crime Sub-Head", 201907, 201908, 201909, 201910, 201911, 201912, 202001, 202002, 202003, 202004, 202005, 202006, 202007, 202008, 202009, 202010, 202011, 202012, 202101, 202102, 202103, 202104, 202105, 202106], axis = 1)
crime_df["Total"] = crime_df.sum(numeric_only = True, axis = 1)
# Two boroughs (Harrow and Sutton) each have a ward named "Belmont"; rename them so the wards stay distinct
crime_df.loc[((crime_df["Ward"] == "Belmont") & (crime_df["Borough"] == "Harrow")), "Ward"] = "Belmont Harrow"
crime_df.loc[((crime_df["Ward"] == "Belmont") & (crime_df["Borough"] == "Sutton")), "Ward"] = "Belmont Sutton"
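As a quick sanity check (a sketch, not part of the original notebook), we can confirm that the new Total column equals the sum of the 24 monthly columns and that the two Belmont wards are now distinguishable:
# The monthly columns were renamed to integer labels above, so they are easy to select
month_cols = [c for c in crime_df.columns if isinstance(c, int)]
print((crime_df[month_cols].sum(axis = 1) == crime_df["Total"]).all())   # should print True
print(crime_df.loc[crime_df["Ward"].str.startswith("Belmont"), ["Ward", "Borough"]].drop_duplicates())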
Dataset 2: List of London Boroughs
london_bor_list_url = "https://en.wikipedia.org/wiki/List_of_London_boroughs"
london_bor_list = pd.read_html(london_bor_list_url)
london_bor_df = london_bor_list[0]
london_bor_df.columns=["Borough", "Inner", "Status", "Local Authority", "Political Control", "Head Quarters", "Area (sq mi)", "Population (2013 estimate)",
"Co-ordinates", "Borough No. on Map"]
# Strip the Wikipedia footnote markers ("note 1" to "note 5") from the scraped table
for note in ["note 1", "note 2", "note 3", "note 4", "note 5"]:
    london_bor_df = london_bor_df.replace(note, "", regex = True)
london_bor_df["Borough"].replace({
    "Barking and Dagenham[]" : "Barking and Dagenham",
    "Greenwich []" : "Greenwich",
    "Hammersmith and Fulham[]" : "Hammersmith and Fulham"
}, inplace = True)
london_bor_df = london_bor_df.drop(["Inner", "Status"], axis = 1)
Dataset 3: Merged Dataset of Dataset 1 and Dataset 2
london_crime_df = pd.merge(crime_df, london_bor_df , on = 'Borough')
london_crime_df = london_crime_df.reindex(["Ward Code", "Ward", "Borough", "Local Authority", "Political Control", "Head Quarters", "Area (sq mi)", "Population (2013 estimate)", "Co-ordinates", "Borough No. on Map", "Crime Head", "Crime Sub-Head", 201907, 201908, 201909, 201910, 201911, 201912, 202001, 202002, 202003, 202004, 202005, 202006, 202007, 202008, 202009, 202010, 202011, 202012, 202101, 202102, 202103, 202104, 202105, 202106, "Total"], axis = 1)
Dataset 4: London Postcodes
london_postcodes_df = pd.read_csv("London Postcodes.csv", low_memory = False)
london_postcodes_df = london_postcodes_df.drop(["County", "Country", "County Code", "Introduced", "Terminated", "Parish", "National Park", "Population", "Households", "Built up area", "Built up sub-division", "Rural/urban", "Region", "Altitude", "Local authority", "Parish Code", "Census output area", "Index of Multiple Deprivation", "Quality", "User Type", "Last updated", "Police force", "Water company", "Plus Code", "Average Income", "Sewage Company", "Travel To Work Area"], axis = 1)
london_postcodes_df = london_postcodes_df[london_postcodes_df["In Use?"] == "Yes"]
london_postcodes_df = london_postcodes_df.drop(["In Use?"], axis = 1)
postcode_cols = ["Postcode Data", "Latitude Data", "Longitude Data", "Easting", "Northing", "Grid Ref", "Borough", "Ward", "Borough Code", "Ward Code", "Constituency", "Lower Layer Super Output Area", "London Zone", "LSOA Code", "MSOA Code", "Middle Layer Super Output Area", "Constituency Code", "Nearest Station", "Distance To Station", "Postcode Area", "Postcode District"]
london_postcodes_df.columns = postcode_cols
postcode_cols_new = ["Postcode Data", "Latitude Data", "Longitude Data", "Nearest Station", "Distance To Station", "Ward Code", "Ward", "Borough Code", "Borough", "Constituency Code", "Constituency", "LSOA Code", "Lower Layer Super Output Area", "MSOA Code", "Middle Layer Super Output Area", "London Zone", "Postcode Area", "Postcode District", "Easting", "Northing", "Grid Ref"]
london_postcodes_df = london_postcodes_df.reindex(postcode_cols_new, axis = 1)
london_postcodes_df.loc[((london_postcodes_df["Ward"] == "Belmont") & (london_postcodes_df["Borough"] == "Harrow")), "Ward"] = "Belmont Harrow"
london_postcodes_df.loc[((london_postcodes_df["Ward"] == "Belmont") & (london_postcodes_df["Borough"] == "Sutton")), "Ward"] = "Belmont Sutton"
C] Understanding the Dataset Using the Groupby Function and Charts:
We then use the groupby function and charts to understand the data better.
(i) Group the Dataframe by “Borough”
bor_crime_df = crime_df.groupby("Borough").sum()
bor_crime_df.sort_values(by = ["Total", "Borough"], inplace = True)
bar_chart = px.bar(
    bor_crime_df,
    title = "Total Crimes Recorded in London Boroughs During the Period July 2019 to June 2021",
    color = "Total",
    color_continuous_scale = [(0, "cyan"), (0.25, "yellow"), (0.5, "red"), (0.75, "red"), (1, "maroon")],
    width = 1000,
    height = 700
)
bar_chart.show()
h_bar_chart = px.bar(
    bor_crime_df,
    title = "Total Crimes Recorded in London Boroughs During the Period July 2019 to June 2021",
    color = "Total",
    orientation = "h",
    color_continuous_scale = [(0, "cyan"), (0.25, "yellow"), (0.5, "red"), (0.75, "red"), (1, "maroon")],
    width = 1000,
    height = 750
)
h_bar_chart.show()
(ii) Group the Dataframe by “Ward”
top_ward_crime_df = crime_df.groupby("Ward").sum()
top_ward_crime_df.sort_values(by = ["Total", "Ward"], ascending = True, inplace = True)
top_ward_crime_df = top_ward_crime_df.head(20)
h_bar_chart = px.bar(
    top_ward_crime_df,
    title = "Total Crimes Recorded in the 20 Safest London Wards During the Period July 2019 to June 2021",
    color = "Total",
    orientation = "h",
    color_continuous_scale = [(0, "cyan"), (0.25, "lightgreen"), (0.5, "yellow"), (0.75, "orange"), (1, "red")]
)
h_bar_chart.show()
worst_ward_crime_df = crime_df.groupby("Ward").sum()
worst_ward_crime_df.sort_values(by = ["Total", "Ward"], ascending = True, inplace = True)
worst_ward_crime_df = worst_ward_crime_df.tail(20)
h_bar_chart = px.bar(
    worst_ward_crime_df,
    title = "Total Crimes Recorded in the 20 Most Dangerous London Wards During the Period July 2019 to June 2021",
    color = "Total",
    orientation = "h",
    color_continuous_scale = [(0, "red"), (0.25, "darkred"), (0.5, "maroon"), (0.75, "maroon"), (1, "indigo")]
)
h_bar_chart.show()
(iii) Group the Dataframe by “Crime Head”
type_crimes_crime_df = crime_df.groupby(["Crime Head"]).sum()
type_crimes_crime_df.sort_values(by = ["Total", "Crime Head"], ascending = True, inplace = True)
type_crimes_crime_bar = px.bar(
    type_crimes_crime_df,
    title = "Types of Crimes Recorded in London During the Period July 2019 to June 2021",
    color = "Total",
    orientation = "h",
    color_continuous_scale = [(0, "cyan"), (0.25, "yellow"), (0.5, "orange"), (0.75, "red"), (1, "maroon")]
)
type_crimes_crime_bar.show()
(iv) Group the Dataframe by “Crime Sub-Head”
type_sub_crimes_crime_df = crime_df.groupby(["Crime Sub-Head"]).sum()
type_sub_crimes_crime_df.sort_values(by = ["Total", "Crime Sub-Head"], ascending = False, inplace = True)
type_sub_crimes_crime_df = type_sub_crimes_crime_df.head(20)
type_sub_crimes_crime_df.sort_values(by = ["Total", "Crime Sub-Head"], ascending = True, inplace = True)
type_sub_crimes_crime_bar = px.bar(
    type_sub_crimes_crime_df,
    title = "Top 20 Crimes Recorded in London During the Period July 2019 to June 2021",
    color = "Total",
    orientation = "h",
    color_continuous_scale = [(0, "cyan"), (0.25, "orange"), (0.5, "red"), (0.75, "maroon"), (1, "purple")]
)
type_sub_crimes_crime_bar.show()
(v) The 10 Safest and 10 Most Dangerous Boroughs of London
T10S_bor_crime_df = bor_crime_df.head(10)
T10S_bor_bar = px.bar(
    T10S_bor_crime_df,
    title = "The 10 Safest Boroughs of London",
    color = "Total",
    color_continuous_scale = [(0, "cyan"), (0.25, "lightgreen"), (0.5, "yellow"), (0.75, "orange"), (1, "red")]
)
T10S_bor_bar.show()
W10D_bor_crime_df = bor_crime_df.sort_values(by = "Total", ascending = False).head(10)
W10D_bor_bar = px.bar(
    W10D_bor_crime_df,
    title = "The 10 Most Dangerous Boroughs of London",
    color = "Total",
    color_continuous_scale = [(0, "yellow"), (0.25, "red"), (0.5, "red"), (0.75, "maroon"), (1, "maroon")]
    #color_continuous_scale = [(0, "orange"), (0.5, "magenta"), (1, "red")]
)
W10D_bor_bar.show()
(vi) Top 5 Crimes in the 5 Safest Boroughs of London, Grouped by "Borough"
# N01SB_T05C_df to N05SB_T05C_df (built in earlier notebook cells) each hold the top 5 crimes for one of the 5 safest boroughs
df = [N01SB_T05C_df, N02SB_T05C_df, N03SB_T05C_df, N04SB_T05C_df, N05SB_T05C_df]
Top05SB_df = pd.DataFrame()
Top05SB_df = Top05SB_df.append(df, ignore_index = True)
Top05SB_df.set_index("Borough", inplace = True)
Top05SB_df_new = Top05SB_df[["Crime Head", "Total"]]
Top05SB_bar = px.bar(
    Top05SB_df_new,
    title = "Top 5 Crimes of the 5 Safest Boroughs of London",
    color = "Crime Head",
    barmode = "group",
    width = 900,
    height = 600
)
Top05SB_bar.show()
(vii) Top 10 Crimes in the 5 Safest Boroughs of London, Grouped by "Crime Head"
T10C_df = pd.DataFrame()
# bor_ch_crime_df (crimes grouped by Borough and Crime Head) and T10S_boroughs (ordered list of the 10 safest boroughs) come from earlier cells
for i in range(5):
    data_df = pd.DataFrame()
    data_df = bor_ch_crime_df[bor_ch_crime_df["Borough"] == T10S_boroughs[i]].sort_values(by = "Total", ascending = False).head(10)
    T10C_df = T10C_df.append(data_df, ignore_index = True)
T10C_Top05SB_df = T10C_df[["Borough", "Crime Head", "Total"]]
T10C_Top05SB_df = T10C_Top05SB_df.reindex(["Crime Head", "Borough", "Total"], axis = 1)
T10C_Top05SB_df.sort_values(["Total"], ascending = False, inplace = True)
T10C_Top05SB_df.sort_values(["Crime Head", "Total", "Borough"], ascending = False, inplace = True)
T10C_bct_df = T10C_df[["Borough", "Crime Head", "Total"]]
T10C_bct_df = T10C_bct_df.groupby("Crime Head").sum()
T10C_bct_df.sort_values(["Total"], ascending = False, inplace = True)
T10C_bct_list = [str(i) for i in list(T10C_bct_df.index)]
T10C_df_new = pd.DataFrame()
for i in range(10):
    data_df_new = pd.DataFrame()
    data_df_new = T10C_Top05SB_df[T10C_Top05SB_df["Crime Head"] == T10C_bct_list[i]].sort_values(by = "Total", ascending = False)
    T10C_df_new = T10C_df_new.append(data_df_new, ignore_index = True)
T10C_df_new.set_index("Crime Head", inplace = True)
T10C_Top05SB_bar = px.bar(
    T10C_df_new,
    title = "Top 10 Crimes of the 5 Safest Boroughs of London",
    color = "Borough",
    barmode = "group",
    width = 1000,
    height = 700
)
T10C_Top05SB_bar.show()
(viii) The 50 Safest Wards in the 5 Safest Boroughs of London
bor_ward_crime_df = crime_df.copy(deep = True)
bor_ward_crime_df = bor_ward_crime_df[["Borough", "Ward", "Total"]]
bor_ward_crime_df = bor_ward_crime_df.groupby(["Borough", "Ward"]).sum()
bor_ward_crime_df.reset_index(inplace = True)
T10W_df = pd.DataFrame()
for i in range(5):
    data_df_updated = pd.DataFrame()
    data_df_updated = bor_ward_crime_df[bor_ward_crime_df["Borough"] == T10S_boroughs[i]].sort_values(by = "Total", ascending = True).head(10)
    T10W_df = T10W_df.append(data_df_updated, ignore_index = True)
T10W_df_bar = px.pie(
    T10W_df,
    title = "The 50 Safest Wards in the 5 Safest Boroughs of London",
    values = "Total",
    names = "Ward",
    color = "Borough",
    width = 900,
    height = 700
)
T10W_df_bar.show()
D] Collecting the Coordinates and Plotting them on the Map of London:
Once we have identified the safest Boroughs and Wards of London, we extract the Postcodes of the different neighbourhoods within them.
(i) Merging the Crime Dataframe of the 50 Safest Wards
all_crime_df = crime_df[["Ward Code", "Ward", "Borough", "Crime Head", "Crime Sub-Head", "Total"]]
Top5_bor_crime_df = pd.DataFrame()
for i in range(5):
    data_df_updated = pd.DataFrame()
    data_df_updated = all_crime_df[all_crime_df["Borough"] == T10S_boroughs[i]]
    Top5_bor_crime_df = Top5_bor_crime_df.append(data_df_updated, ignore_index = True)
T10W_df_updated = T10W_df_new.copy(deep = True)
T10W_df_updated.drop(["Total"], axis = 1, inplace = True)
T10W_crime_df = pd.merge(T10W_df_updated, Top5_bor_crime_df, on = 'Ward')
T10W_crime_df.drop(["Borough_y"], axis = 1, inplace = True)
T10W_crime_df = T10W_crime_df.reindex(["Ward Code", "Ward", "Borough_x", "Crime Head", "Crime Sub-Head", "Total"], axis = 1)
T10W_crime_df.columns = ["Ward Code", "Ward", "Borough", "Crime Head", "Crime Sub-Head", "Total"]
(ii) Extracting the Postcodes of the 50 Safest Wards of London and Selecting the Locations Nearest to a Station
top5_bor_postcode_df = london_postcodes_df.copy(deep = True)
postcode_top5_bor_df = pd.DataFrame()
for i in range(5):
    dataset_new = pd.DataFrame()
    dataset_new = top5_bor_postcode_df[top5_bor_postcode_df["Borough"] == T10S_boroughs[i]]
    postcode_top5_bor_df = postcode_top5_bor_df.append(dataset_new, ignore_index = True)
postcode_top50_ward_df = pd.DataFrame()
for i in range(len(T10W_df_updated)):
    dataset_updated = pd.DataFrame()
    dataset_updated = postcode_top5_bor_df[postcode_top5_bor_df["Ward"] == T10W_df_updated["Ward"][i]]
    postcode_top50_ward_df = postcode_top50_ward_df.append(dataset_updated, ignore_index = True)
min_dist_station_df = postcode_top50_ward_df.groupby(["Nearest Station", "Distance To Station"]).min()
min_dist_station_df.reset_index(inplace = True)
nearest_to_station_df = min_dist_station_df.drop_duplicates(subset = ["Nearest Station"], keep = "first")
nearest_to_station_df.reset_index(inplace = True)
nearest_to_station_df.drop(["index"], axis = 1, inplace = True)
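Because the groupby keys are sorted, keeping the first duplicate per station retains the postcode closest to that station. A quick check (a sketch, not in the original notebook) confirms that each station now appears exactly once:
# Each Nearest Station should now be represented by a single, closest postcode
print(nearest_to_station_df["Nearest Station"].is_unique)    # should print True
print(nearest_to_station_df[["Nearest Station", "Distance To Station", "Postcode Data"]].head())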
(iii) Fetching Coordinates by Using the ArcGIS API
# Connect to ArcGIS; an anonymous connection is shown here - add your own credentials if your ArcGIS account requires them
gis = GIS()

def get_coordinates_uk(address):
    latitude_coordinates = 0
    longitude_coordinates = 0
    g = geocode(address = "{}, London, England, GBR".format(address))[0]
    longitude_coordinates = g["location"]["x"]
    latitude_coordinates = g["location"]["y"]
    return str(latitude_coordinates) + "," + str(longitude_coordinates)
london_postcodes = nearest_to_station_df.loc[ : , "Postcode Data"]
london_postcodes_dfnew = pd.DataFrame(london_postcodes)
post_cols = ["Postcodes"]
london_postcodes_dfnew.columns = post_cols
london_coordinates = []
for i in range(len(london_postcodes)):
    london_coordinates.append(get_coordinates_uk(london_postcodes[i]))
london_latitude = []
for i in range(len(london_coordinates)):
    lat = london_coordinates[i].split(",")[0]
    lat = round(float(lat), 5)
    london_latitude.append(lat)
london_latitude_df = pd.DataFrame(london_latitude)
lat_cols = ["Latitude"]
london_latitude_df.columns = lat_cols
london_longitude = []
for i in range(len(london_coordinates)):
    long = london_coordinates[i].split(",")[1]
    long = round(float(long), 5)
    london_longitude.append(long)
london_longitude_df = pd.DataFrame(london_longitude)
long_cols = ["Longitude"]
london_longitude_df.columns = long_cols
london_pc_df = pd.concat([london_postcodes_dfnew, london_latitude_df, london_longitude_df], axis=1)
nearest_to_station_coordinates_df = pd.concat([nearest_to_station_df, london_pc_df], join = "outer", axis=1)
postcode_cols_new = ["Nearest Station", "Distance To Station", "Postcodes", "Latitude", "Longitude", "Ward Code", "Ward", "Borough Code", "Borough", "Constituency Code", "Constituency", "LSOA Code", "Lower Layer Super Output Area", "MSOA Code", "Middle Layer Super Output Area", "London Zone", "Postcode Area", "Postcode District", "Easting", "Northing", "Grid Ref", "Postcode Data", "Latitude Data", "Longitude Data"]
nearest_to_station_coordinates_df = nearest_to_station_coordinates_df.reindex(postcode_cols_new, axis = 1)
postcode_cols_updated = ["Neighbourhood", "Distance To Station", "Postcodes", "Latitude", "Longitude", "Ward Code", "Ward", "Borough Code", "Borough", "Constituency Code", "Constituency", "LSOA Code", "Lower Layer Super Output Area", "MSOA Code", "Middle Layer Super Output Area", "London Zone", "Postcode Area", "Postcode District", "Easting", "Northing", "Grid Ref", "Postcode Data", "Latitude Data", "Longitude Data"]
nearest_to_station_coordinates_df.columns = postcode_cols_updated
neighbourhood_df = nearest_to_station_coordinates_df.reindex(postcode_cols_updated, axis = 1)
(iv) Plotting All Stations on the Map of London
address = "London, England"
geolocator = Nominatim(user_agent = "london_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("The coordinates of London are {}, {}.".format(latitude, longitude))
min_dist_all_station_df = london_postcodes_df.groupby(["Nearest Station", "Distance To Station"]).min()
min_dist_all_station_df.reset_index(inplace = True)
min_dist_all_station_df.to_csv("Minimum Distance to All Stations.csv")
station_df = min_dist_all_station_df.drop_duplicates(subset = ["Nearest Station"], keep = "first")
station_df.reset_index(inplace = True)
station_df.drop(["index"], axis = 1, inplace = True)
# Creating the map of London
map_London_all_stations = folium.Map(location = [latitude, longitude], zoom_start = 10)
# Adding markers to map
for latitude, longitude, borough, ward, neighbourhood in zip(station_df["Latitude Data"], station_df["Longitude Data"], station_df["Borough"], station_df["Ward"], station_df["Nearest Station"]):
    label = "{}, {}, {}".format(neighbourhood, ward, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [latitude, longitude],
        radius = 5,
        popup = label,
        color = "red",
        fill = True
    ).add_to(map_London_all_stations)
map_London_all_stations
(v) Plotting Stations in the Safest Wards of London on the Map of London
# Creating the map of London
map_London_safe_neigh = folium.Map(location = [latitude, longitude], zoom_start = 10)
# Adding markers to map
for latitude, longitude, borough, ward, neighbourhood in zip(neighbourhood_df["Latitude"], neighbourhood_df["Longitude"], neighbourhood_df["Borough"], neighbourhood_df["Ward"], neighbourhood_df["Neighbourhood"]):
    label = "{}, {}, {}".format(neighbourhood, ward, borough)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [latitude, longitude],
        radius = 5,
        popup = label,
        color = "blue",
        fill = True
    ).add_to(map_London_safe_neigh)
map_London_safe_neigh
E] Identifying Venues Around the Safest Neighbourhoods of London:
- Neighbourhood: Name of the Neighbourhood
- Neighbourhood Latitude: Latitude of the Neighbourhood
- Neighbourhood Longitude: Longitude of the Neighbourhood
- Venue: Name of the Venue
- Venue Category: Category of the Venue
- Venue Latitude: Latitude of the Venue
- Venue Longitude: Longitude of the Venue
CLIENT_ID = "xxxxxxxxxxxxx" # Enter your Foursquare ID
CLIENT_SECRET = "xxxxxxxxxxxxx" # Enter your Foursquare Secret
VERSION = "20180605" # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
def getNearbyVenues(names, wards, boroughs, latitudes, longitudes, radius = 500):
    venues_list = []
    for name, ward, borough, lat, lng in zip(names, wards, boroughs, latitudes, longitudes):
        print(name)

        # create the API request URL (LIMIT caps the number of venues returned per neighbourhood)
        url = "https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
                    CLIENT_ID,
                    CLIENT_SECRET,
                    VERSION,
                    lat,
                    lng,
                    radius,
                    LIMIT
                    )

        # make the GET request
        results = requests.get(url).json()["response"]["groups"][0]["items"]

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            ward,
            borough,
            lat,
            lng,
            v["venue"]["name"],
            v["venue"]["categories"][0]["name"],
            v["venue"]["location"]["lat"],
            v["venue"]["location"]["lng"]
        ) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ["Neighbourhood", "Ward", "Borough", "Neighbourhood Latitude", "Neighbourhood Longitude", "Venue", "Venue Category", "Venue Latitude", "Venue Longitude"]

    return nearby_venues
venues_london = getNearbyVenues(neighbourhood_df["Neighbourhood"], neighbourhood_df["Ward"], neighbourhood_df["Borough"], neighbourhood_df["Latitude"], neighbourhood_df["Longitude"])
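A quick look at the result (a sketch, not in the original notebook) shows how many venues were returned in total, how many distinct categories they span, and which neighbourhoods are the most venue-dense:
# Size of the venues dataframe, number of distinct categories, and venue counts per neighbourhood
print(venues_london.shape)
print("Unique venue categories:", venues_london["Venue Category"].nunique())
print(venues_london.groupby("Neighbourhood")["Venue"].count().sort_values(ascending = False).head())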
F] Segmenting Neighbourhoods of London by Common Venue Categories:
(i) One Hot Encoding
venues_london_ohe = pd.get_dummies(venues_london[["Venue Category"]], prefix = "", prefix_sep = "")
venues_london_ohe["Neighbourhood"] = venues_london["Neighbourhood"] # This adds the "Neighbourhood" column in the end
# Moving the Neighbourhood Column to the First Column
columns = [venues_london_ohe.columns[-1]] + list(venues_london_ohe.columns[ : -1])
venues_london_ohe = venues_london_ohe[columns]
neighbourhood_group_ohe = venues_london_ohe.groupby("Neighbourhood").sum()
neighbourhood_group_ohe.reset_index(inplace = True)
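After grouping, the one-hot frame has one row per neighbourhood and one column per venue category, each cell holding the number of venues of that category found within the 500 m search radius. A minimal sketch (not in the original notebook) to confirm this:
# One row per neighbourhood, one column per venue category (plus the Neighbourhood column itself)
print(neighbourhood_group_ohe.shape)
print(neighbourhood_group_ohe.iloc[ : 3, : 6])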
(ii) Printing Each Neighbourhood Along with the Top 8 Most Common Venues
num_top_venues = 8
for neigh in neighbourhood_group_ohe["Neighbourhood"]:
    print("---------" + neigh + "---------")
    temp = neighbourhood_group_ohe[neighbourhood_group_ohe["Neighbourhood"] == neigh].T.reset_index()
    temp.columns = ["Venue", "Frequency"]
    temp = temp.iloc[1 : ]
    temp["Frequency"] = temp["Frequency"].astype(float)
    temp = temp.round({"Frequency" : 2})
    print(temp.sort_values("Frequency", ascending = False).reset_index(drop = True).head(num_top_venues))
    print("\n")
(iii) Transferring the Venues into a Pandas Dataframe
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1 : ]
    row_categories_sorted = row_categories.sort_values(ascending = False)

    return row_categories_sorted.index.values[0 : num_top_venues]
indicators = ["st", "nd", "rd"]
# Create columns according to number of top Venues
columns = ["Neighbourhood"]
for ind in np.arange(num_top_venues):
    try:
        columns.append("{}{} Most Common Venue".format(ind + 1, indicators[ind]))
    except:
        columns.append("{}th Most Common Venue".format(ind + 1))
# Create a new Dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns = columns)
neighbourhoods_venues_sorted["Neighbourhood"] = neighbourhood_group_ohe["Neighbourhood"]
for ind in np.arange(neighbourhood_group_ohe.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1 : ] = return_most_common_venues(neighbourhood_group_ohe.iloc[ind, : ], num_top_venues)
G] Clustering Neighbourhoods by Common Venues (K-Means Clustering):
(i) Building a Model to Cluster the Neighbourhoods
neighbourhood_group_cluster = neighbourhood_group_ohe.drop(labels = "Neighbourhood", axis = 1)
distortions = []
K = range(1,20)
for k in K:
    kmean = KMeans(init = "k-means++", n_clusters = k, random_state = 0, n_init = 50, max_iter = 500)
    kmean.fit(neighbourhood_group_cluster)
    distortions.append(kmean.inertia_)
plt.figure(figsize = (10, 5))
plt.plot(K, distortions, "bx-")
plt.xlabel("k")
plt.ylabel("Distortion")
plt.title("The Elbow Method")
plt.show()
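The elbow in the curve is not very sharp, so as an optional cross-check (a sketch, not part of the original notebook) the silhouette score can be computed for the same range of k; higher scores indicate better-separated clusters:
from sklearn.metrics import silhouette_score

# Silhouette score for k = 2 to 9 on the same venue-count matrix
for k in range(2, 10):
    km = KMeans(init = "k-means++", n_clusters = k, random_state = 0, n_init = 50, max_iter = 500)
    labels_k = km.fit_predict(neighbourhood_group_cluster)
    print(k, round(silhouette_score(neighbourhood_group_cluster, labels_k), 3))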
# set number of clusters
k_num_clusters = 3
kmeans = KMeans(init = "k-means++", n_clusters = k_num_clusters, random_state = 0)
kmeans.fit(neighbourhood_group_cluster)
# check cluster labels generated for each row in the dataframe
labels = kmeans.labels_[0 : 80]
neighbourhoods_venues_sorted.insert(1, "Cluster Labels", kmeans.labels_)
neighbour_df = neighbourhood_df[["Neighbourhood", "Distance To Station", "Ward", "Borough", "Postcodes", "Latitude", "Longitude"]].copy(deep = True)
london_merged = neighbour_df
london_merged = london_merged.join(neighbourhoods_venues_sorted.set_index("Neighbourhood"), on = "Neighbourhood")
london_merged = london_merged.dropna(subset = ["Cluster Labels"])
(ii) Principal Component Analysis (PCA)
pca = PCA().fit(neighbourhood_group_cluster)
pca_neigh = pca.transform(neighbourhood_group_cluster)
print("Variance Explained by Each Component (%): ")
for i in range(len(pca.explained_variance_ratio_)):
    print("\n", i + 1, ": " + str(round(pca.explained_variance_ratio_[i] * 100, 2)) + "%")
print("\nTotal Sum: " + str(round(sum(pca.explained_variance_ratio_) * 100, 2)) + "%")
print("\nExplained Variance of the First Eight Components, i.e. 10% of the Total Components: " + str(round(sum(pca.explained_variance_ratio_[0 : 8]) * 100, 2)) + "%")
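To see how quickly the explained variance accumulates across components, a cumulative curve can be plotted (a sketch, not part of the original notebook):
# Cumulative explained variance of the principal components
cum_var = np.cumsum(pca.explained_variance_ratio_) * 100
plt.figure(figsize = (8, 4))
plt.plot(range(1, len(cum_var) + 1), cum_var, "b.-")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Explained Variance (%)")
plt.title("Cumulative Explained Variance - Venue Category PCA")
plt.show()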
c1 = []
c2 = []
c3 = []
for i in range(len(pca_neigh)):
    if kmeans.labels_[i] == 0:
        c1.append(pca_neigh[i])
    if kmeans.labels_[i] == 1:
        c2.append(pca_neigh[i])
    if kmeans.labels_[i] == 2:
        c3.append(pca_neigh[i])
c1 = np.array(c1)
c2 = np.array(c2)
c3 = np.array(c3)
plt.figure(figsize = (10, 8))
plt.scatter(c1[ : , 0], c1[ : , 1], c = "red", label = "Cluster 1")
plt.scatter(c2[ : , 0], c2[ : , 1], c = "blue", label = "Cluster 2")
plt.scatter(c3[ : , 0], c3[ : , 1], c = "green", label = "Cluster 3")
plt.legend()
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Low Dimensional Visualisation (PCA) - Neighbourhoods")
(iii) Visualising the Resulting Clusters on the Map of London
# Create Map of London
map_clusters = folium.Map(location = [latitude, longitude], zoom_start = 10)
# Set color scheme for the clusters
x = np.arange(k_num_clusters)
ys = [i + x + (i * x) ** 2 for i in range(k_num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
for lat, lon, mcv, poi, ward, bor, cluster in zip(london_merged["Latitude"], london_merged["Longitude"], london_merged["1st Most Common Venue"], london_merged["Neighbourhood"], london_merged["Ward"], london_merged["Borough"], london_merged["Cluster Labels"]):
    label = folium.Popup("Cluster " + str(int(cluster) + 1) + ":\n" + str(mcv) + ",\n" + str(poi) + ",\n" + str(ward) + ",\n" + str(bor), parse_html = True)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = rainbow[int(cluster)],
        fill = True,
        fill_color = rainbow[int(cluster)],
        fill_opacity = 0.5
    ).add_to(map_clusters)
map_clusters
(iv) Examining the Clusters
cluster_1 = london_merged.loc[london_merged["Cluster Labels"] == 0, london_merged.columns[[0] + [2] + [3] + list(range(7, london_merged.shape[1]))]]
cluster_1.to_csv("Venues in the Neighbourhood of London - Cluster 1.csv")
cluster_1
cluster_2 = london_merged.loc[london_merged["Cluster Labels"] == 1, london_merged.columns[[0] + [2] + [3] + list(range(7, london_merged.shape[1]))]]
cluster_2.to_csv("Venues in the Neighbourhood of London - Cluster 2.csv")
cluster_2
cluster_3 = london_merged.loc[london_merged["Cluster Labels"] == 2, london_merged.columns[[0] + [2] + [3] + list(range(7, london_merged.shape[1]))]]
cluster_3.to_csv("Venues in the Neighbourhood of London - Cluster 3.csv")
cluster_3
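Before discussing the clusters, a short summary (a sketch, not in the original notebook) of how many neighbourhoods each cluster contains and which first venue dominates each cluster helps characterise them:
# Cluster sizes and the most frequent "1st Most Common Venue" per cluster
print(london_merged["Cluster Labels"].value_counts().sort_index())
print(london_merged.groupby("Cluster Labels")["1st Most Common Venue"].agg(lambda s: s.mode().iloc[0]))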
Links to Jupyter Notebook
Note:
If you are unable to view the code / charts properly on GitHub, you may either:
- Click on the "Circle with Horizontal Line" symbol in the top right-hand corner to view the Jupyter Notebook with "nbviewer"
OR
- Click on the "Download" button to download the .ipynb file
Link to Report
Results and Discussion
- This cluster is mostly made up of Hotels, Pubs, Theatres, Art Galleries, Art Museums, Outdoor Sculptures and Plazas
- Thus, this cluster is most suitable for Tourists
- This cluster is mostly made up of Pubs, Coffee Shops, Cafés, Multi-Cultural Restaurants, Bars, Gyms, Sports Clubs, Supermarkets, Grocery Stores, Shopping Plazas, Fast-Food Joints, etc.
- Thus, this cluster is most suitable for young couples and executives
- This is the biggest cluster in our dataset
- It is mostly made up of Supermarkets, Bakeries, Pharmacies, Auto Garages, Parks, Playgrounds, Sports Complexes, Multi-Cultural Restaurants, Ice Cream Parlours, Fish & Chips Shops, Pubs, Train Stations and various stores (grocery, convenience, clothing, furniture, pet, optical, electronics, warehouse, etc.)
- It has almost everything that a family requires
- Thus, this cluster seems to be most suitable for families with children
Conclusion
Through this analysis, anyone planning to visit or relocate to London can now see:
- which are the safest Boroughs, Wards and Neighbourhoods of London
- the most common venues in those neighbourhoods
- the different types of neighbourhoods based on the cluster of venue categories
- which neighbourhoods to choose as per their preference
Thank You