Housing Rent Prices and Venues Data Analysis of London - A geographical clustering approach for house seekers.

Understanding the environment - London Population 2019

9,176,530

The latest official estimate of the population of London comes from the Office for National Statistics. According to their data, the estimated population of Greater London in 2016 was 8,787,892. The metro population in 2019 is estimated to be as much as 9.18 million.

The Census in the United Kingdom takes place every ten years. The most recent one was completed in 2011, which means that we are close to the next demographic data collection.

London's population makes it by far the largest city in the United Kingdom. The second-largest city in the UK - Birmingham - has a population of 1.1 million, which is only 11.98% of the population of the capital. London is also the largest city in the European Union, twice the size of Dublin and three times the size of Rome.

It is the third-largest city in Europe, behind Istanbul (14.8 million) and Moscow (10.3 million), and the 27th most populous metro area in the world.

Ethnicity

London as a city is considerably more diverse than the rest of the United Kingdom. Across England and Wales, 86% of the population is white based on the 2011 Census, but in London that number falls to 59.79%. The white proportion of London's population increases when travelling away from the city centre.

It is important to understand how the ethnic diversity of the population changes the way we analyze the housing market. People have different needs and priorities to satisfy and, when it's time to rent a flat, they are more likely to choose an area with specific venues within walking distance.

Introduction and Business problem presentation

I believe there are 3 main reasons why a flat doesn't fit the customer's needs:

  • The flat looks old and decrepit
  • The neighbourhood lacks the expected amenities nearby
  • The price is too high for that particular flat or out of budget

My goal for this research is to have a systematic way to analyze the offers posted on RightMove.co.uk and to produce a map of the best opportunities in the city. If you are looking for a new flat and you like your current neighbourhood, this project can provide you with a list of the best properties on the market according to your preferences.

For this project, I'm going to create simple software that scrapes RightMove to gather an updated list of flats available for rent, collects and analyzes the main venues within a radius of 500 metres of each property using Foursquare, and clusters the properties to divide the housing market into 20 groups by venue similarity.

Methodology

Let's start with the data gathering process.

To do so, I initially planned to invest time in developing a web scraping application using Beautiful Soup 4, but then I discovered a repository on GitHub, offered by toby-p and available for download, that provides an easy way to scrape Rightmove!

This script collects the following information:

  • price
  • type
  • address
  • URL
  • agent_url
  • postcode
  • number_bedrooms
  • search_date

The dataset will look like the following:

The address has the format "Street, City, Postcode" and is an unstructured field, but for our purpose we can leave it as it is. The PostCode field, instead, has a "limited" format because it holds only the first two or three characters. This is not accurate enough to collect meaningful data about the venues around the flats.

To solve this problem, I'm going to use the OpenCage Geocoder API to look up the coordinates from the postal address. This is a case where an unstructured field becomes helpful.

To associate each rental offer with a District, I'm going to join the data table with a second dataset which has two columns:

  • District Name
  • PostCode

This dataset was created by scraping a Wikipedia table of London postal districts (reference [2]), which holds the data I need for this analysis.
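
As a minimal sketch of this join, the snippet below attaches a district name to each listing through its short postcode. The sample rows, the key names and the helper add_district are illustrative assumptions, not the actual data. Plain Python is used to keep the example free of dependencies; with pandas the same step would be a left merge on the postcode column.

```python
# Hypothetical sample data: the real tables come from Rightmove and Wikipedia.
listings = [
    {"address": "Example Road, London, SW3", "postcode": "SW3", "price": 3200},
    {"address": "Sample Street, London, SW7", "postcode": "SW7", "price": 2900},
    {"address": "Unknown Lane, London, ZZ9", "postcode": "ZZ9", "price": 900},
]
districts = [
    {"District Name": "Chelsea", "PostCode": "SW3"},
    {"District Name": "South Kensington", "PostCode": "SW7"},
]


def add_district(listings, districts):
    """Left-join each listing with its district name, using the postcode as key."""
    lookup = {row["PostCode"]: row["District Name"] for row in districts}
    # Listings whose postcode is missing from the lookup get None,
    # so they can be inspected or dropped later.
    return [{**flat, "district": lookup.get(flat["postcode"])} for flat in listings]


merged = add_district(listings, districts)
for row in merged:
    print(row["postcode"], row["district"])
```

Keeping unmatched listings with an empty district, instead of silently discarding them, makes it easier to spot postcodes that the district table does not cover.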

Once the data are collected and merged into a single data frame, I cluster them using the K-Means algorithm. To visualize geographic details and the distribution of the offers in London, I plot 2 maps using the folium Python library:

  • Clusters map: this map shows the distribution of the clusters using colours to identify each cluster.
  • Heat map: this map shows the areas with a higher number of offers.

To better understand the market, it is important to plot bar charts that make it easy to identify the average price by district for a studio flat and for 1, 2, 3 and 4 bedroom flats.

One of the goals is to quantify the magnitude of the impact that the location (District) has on the average price for each apartment category and to identify the number of bedrooms that minimizes the geographic influence on the monthly fee.

Finally, I conclude the project by asking the user to input the following data:

  • Your address: this input is used to analyze the neighbourhood you are living in and to use this information to find the cluster you belong to.
  • The number of bedrooms you are looking for: this input is used to filter the results of the cluster you belong to.
  • The Maximum monthly fee (budget)

The output of this analysis is a data frame with a list of filtered results based on your preferences.

Results

As expected, the price of a flat can't be forecast from the venues around it alone, but the market can be filtered effectively to help house-seekers find the best properties available. Nevertheless, it is possible to develop a pricing model based on the characteristics of the flat (number of bedrooms, number of bathrooms etc.), the District the flat belongs to, and the presence of some key venues near the flat. Examples of key factors are the presence of supermarkets with a high reputation, public transport stops, schools or universities, and hospitals. The correlation between price and these categories is low, but it becomes important when weighed against the preferences of the final users.

The main goal of this project is to provide everyone with a tool to analyze the housing market and to identify the best offers that fit their personal needs.

Data Exploration

First of all, I explore the data by visualizing frequency distributions through different queries to start understanding the market.

The following bar chart shows the number of offers listed by the number of bedrooms.

As we can see, the most frequent offers on the market are 2 bedroom apartments, followed by 1 bedroom apartments.

Which district has the most offers in London?

The following bar chart explores the distribution of the offers by postcode.

Data Manipulation

To make the results easier to read and interpret, I decided to map each postcode to its corresponding District Name. This step is of primary importance to make the results readable for the final user.

To define the location of each District, I am going to identify the latitude and longitude of the centre of each District using OpenCageData.

This API makes it possible to search for and retrieve geographical information based on latitude and longitude or on an address. I'm going to use the postcode of each District to expand our data frame.

Having the latitude and longitude of each District is useful for future, more in-depth analysis, such as the distance from the city centre, which is an independent variable that could impact the price of the flat.
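
To illustrate that possible follow-up, here is a small sketch that computes the great-circle distance between two coordinate pairs with the haversine formula. The reference point chosen for the city centre and the district coordinates are assumptions made for the example; this distance feature is not part of the analysis presented here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lng2 - lng1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Assumed reference point for the city centre (approximate, central London).
CITY_CENTRE = (51.5074, -0.1278)

# Hypothetical district centre, for illustration only.
district_centre = (51.5054, -0.0235)

distance_km = haversine_km(*CITY_CENTRE, *district_centre)
print(round(distance_km, 2))
```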

For the purpose of this project, however, it is even more important to expand the database of the properties by merging each apartment with its geographical coordinates.

Using OpenCage to extend the record of each flat

I follow the same process to extend the geographic information for each flat. OpenCage Data makes it possible to collect geographical details using the address as a query.

In order to have accurate data to work with, it is better to gather the following details:

  • latitude of the flat
  • longitude of the flat
  • county
  • complete postcode
  • state district
  • suburb
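
The details listed above could be pulled out of a geocoding result roughly as follows. This is a sketch under stated assumptions: it expects a list of results where each item has a "geometry" entry (lat, lng) and a "components" entry, which is how I understand the OpenCage response to be shaped. The sample below is made up and no API call is performed; the helper extract_flat_details and the output key names are hypothetical.

```python
def extract_flat_details(results):
    """Pick the fields of interest from the first geocoding result, if any."""
    if not results:
        return None  # the address could not be geocoded
    best = results[0]
    components = best.get("components", {})
    geometry = best.get("geometry", {})
    return {
        "latitude": geometry.get("lat"),
        "longitude": geometry.get("lng"),
        "county": components.get("county"),
        "postcode_complete": components.get("postcode"),
        "state_district": components.get("state_district"),
        "suburb": components.get("suburb"),
    }


# Made-up response fragment in the assumed shape, not real API output.
sample_results = [
    {
        "components": {
            "postcode": "SW3 0XX",
            "state_district": "Greater London",
            "suburb": "Chelsea",
        },
        "geometry": {"lat": 51.4875, "lng": -0.1687},
    }
]

details = extract_flat_details(sample_results)
print(details)
```

Using .get for every field means that a missing component (the county in this sample) becomes None instead of raising an error, which fits the later step of dropping records where key data are missing.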

After this important step, the dataset looks like the following:

To be sure that each address has been merged with the correct series of new information, I used the 'i' column as a "flag".

This let me double-check that the previous index matched the new index, and I dropped the records where key data were missing.

Columns explanation:

  • District Name
  • Latitude and Longitude: lat and lng of the "District Name"
  • Address_y: a copy of the original address that we'll drop. I used it to be sure that the DataFrames have been merged correctly
  • Latitude_a and Longitude_a: lat and lng of the flat.
  • County
  • Postcode_complete: an extension of the original PostCode
  • State_district
  • suburb

Foursquare API - Find the most common venues near each flat

This step is crucial for clustering the market with a K-Means model.

We are going to use the Foursquare API to collect the first 100 most common venues within a radius of 500 metres around each flat posted on Rightmove.
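
A hedged sketch of how the venue names and categories might be extracted from such a response is shown below. It assumes the nested layout of the legacy Foursquare "explore" endpoint as I recall it (response, groups, items, venue, categories); the payload is invented for the example and no request is sent. The function name venues_from_explore is hypothetical.

```python
# Search settings taken from the text.
RADIUS_METRES = 500
MAX_VENUES = 100


def venues_from_explore(payload):
    """Return (venue name, category name) pairs from an explore-style response."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    venues = []
    for item in items[:MAX_VENUES]:
        venue = item.get("venue", {})
        # A venue without categories yields None as its category name.
        categories = venue.get("categories") or [{}]
        venues.append((venue.get("name"), categories[0].get("name")))
    return venues


# Made-up payload in the assumed shape, not real API output.
sample_payload = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Cafe", "categories": [{"name": "Cafe"}]}},
                    {"venue": {"name": "Example Pub", "categories": [{"name": "Pub"}]}},
                    {"venue": {"name": "No Category Venue", "categories": []}},
                ]
            }
        ]
    }
}

nearby = venues_from_explore(sample_payload)
print(nearby)
```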

Address and Venues

The following table shows the number of venues collected for the first three records from the dataset:

In this case, the first 3 records are not particularly representative of the process but, believe me, some addresses have a larger number of venues nearby.

It is important to note that we are now working with categorical variables. A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property.

The new dataset has a record for each venue category related to each flat (this means that each row has a single "1" among the 439 unique venue categories). For this reason, the shape of the table is 44,637 rows and 443 columns.

To make the analysis faster, I'm going to consider only the top 20 most common venues, and those will be used to develop the clustering analysis.
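
The encoding and ranking steps can be sketched in a few lines. The example below uses made-up (address, category) pairs and keeps the top 2 categories instead of 20 so that the output stays readable; the function names are hypothetical and the tie-breaking rule (alphabetical) is my own choice.

```python
from collections import Counter, defaultdict

# Hypothetical (address, venue category) pairs, one row per venue found.
venue_rows = [
    ("Flat A", "Pub"), ("Flat A", "Cafe"), ("Flat A", "Pub"), ("Flat A", "Park"),
    ("Flat B", "Gym"), ("Flat B", "Cafe"), ("Flat B", "Cafe"),
]


def one_hot(rows):
    """One row per venue: a single 1 in the column of its category."""
    categories = sorted({category for _, category in rows})
    encoded = [
        (address, [int(category == c) for c in categories])
        for address, category in rows
    ]
    return categories, encoded


def category_shares(rows):
    """Average the one-hot rows per address: the share of each category near a flat."""
    categories, encoded = one_hot(rows)
    sums = defaultdict(lambda: [0] * len(categories))
    counts = Counter()
    for address, vector in encoded:
        sums[address] = [s + v for s, v in zip(sums[address], vector)]
        counts[address] += 1
    return {
        address: dict(zip(categories, [s / counts[address] for s in total]))
        for address, total in sums.items()
    }


def top_venues(rows, n):
    """The n most common categories for each address, most frequent first."""
    top = {}
    for address, by_category in category_shares(rows).items():
        present = [(c, s) for c, s in by_category.items() if s > 0]
        # Sort by descending share, breaking ties alphabetically.
        present.sort(key=lambda pair: (-pair[1], pair[0]))
        top[address] = [c for c, _ in present[:n]]
    return top


top = top_venues(venue_rows, n=2)
print(top)
```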

Finally, I obtain a dataset ready to be used for my purpose!

Clustering using K-means

Now I have all the information I need to subdivide the market by geographic similarity. It's time to split it into clusters.

I'm going to use the K-Means algorithm.

The K-Means algorithm is one of the most popular unsupervised machine learning algorithms. Unsupervised algorithms make inferences from unlabelled datasets. The goal of this technique is to find similarities among the data points and group similar data points together. K-Means aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

One method to decide the optimal k is known as "the elbow method".

The “elbow” method helps data scientists to select the optimal number of clusters by fitting the model with a range of values for K. If the line chart resembles an arm, then the “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.

Unfortunately, the elbow is not clearly defined, and choosing a high value for "k" would be counterproductive for the goal of the project. After several tests, I decided to use k = 20 because the number of offers in each cluster looks more evenly distributed, even though some clusters contain a low number of properties.

London Clusters Distribution Map

To show the clusters, which are not identified by density or geographic proximity, I plot a map of the market. Each colour identifies a specific cluster and each dot represents an available property and its geographic location.

Housing market - Heat Map

It is meaningful to plot a heat map to identify the areas that present a higher volume of offers. This particular view changes daily because of the high activity of the market.

A dataset with a historical time series of the records would be interesting for identifying any seasonality or peaks of activity.

Another interesting analysis would be to investigate the relation between the housing market and Brexit announcements. Did Brexit accelerate the cycles of the housing market, promoting a larger number of short-term tenancy agreements to cope with the higher uncertainty that foreign citizens face about their future in the country?

In my opinion, looking at the heat map, there are some areas in the east of London that are very cold. This is probably due to the algorithm Rightmove uses to show the offers to specific users. In fact, when querying the website for offers in London (without any additional specifications), it reports 30,805 results.

In reality, there are 25 offers per page across the 42 pages readable by the user, which means 1,050 available offers in total (the same number of offers I'm analyzing for this report).

The average price for a studio flat by district

After a brief analysis of the market, I believe it is helpful for readers to have an overview of the average price of the 5 most common types of flat in relation to the district.

The standard deviation of the price of a studio flat in London is 378.1329 and the average price is £1,233.09. I calculate the coefficient of variation to compare the price dispersion across flat types.

The coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of the dispersion of a probability distribution or frequency distribution. It is often expressed as a percentage and is defined as the ratio of the standard deviation to the mean (or its absolute value).

In this case, the coefficient of variation is equal to 0.306653.

The average price for a 1 bedroom flat by district

The standard deviation of the price of a 1 bedroom flat in London is 561.9852 and the average price is £1,468.28.

The coefficient of variation is equal to 0.382750.

The average price for a 2 bedroom flat by district

The standard deviation of the price of a 2 bedroom flat in London is 949.9799 and the average price is £2,050.92.

The coefficient of variation is equal to 0.463196.

The average price for a 3 bedroom flat by district

The standard deviation of the price of a 3 bedroom flat in London is 2,449.1497 and the average price is £2,892.65.

The coefficient of variation is equal to 0.846678.

The average price for a 4 bedroom flat by district

The standard deviation of the price of a 4 bedroom flat in London is 5,176.5294 and the average price is £4,891.59.

The coefficient of variation is equal to 1.058249.
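
The five coefficients reported in these sections can be re-derived from the standard deviations and averages quoted in this article. The snippet below is only a quick arithmetic check; the dictionary keys and the function name are my own labels.

```python
# Standard deviation and mean monthly rent per flat type, as quoted in the text.
price_stats = {
    "studio": (378.1329, 1233.09),
    "1 bedroom": (561.9852, 1468.28),
    "2 bedrooms": (949.9799, 2050.92),
    "3 bedrooms": (2449.1497, 2892.65),
    "4 bedrooms": (5176.5294, 4891.59),
}


def coefficient_of_variation(std_dev, mean):
    """Ratio of the standard deviation to the absolute mean (dimensionless)."""
    return std_dev / abs(mean)


cv_by_type = {
    flat_type: coefficient_of_variation(std_dev, mean)
    for flat_type, (std_dev, mean) in price_stats.items()
}

for flat_type, cv in cv_by_type.items():
    print(f"{flat_type}: {cv:.3f}")
```

Because the coefficient is dimensionless, it allows the spread of a studio flat's rent to be compared directly with that of a 4 bedroom flat, even though their average prices differ by a factor of about four.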

The tendency of the price by the number of bedrooms

In the charts above, I would like to highlight that the variation of the price increases more than proportionally with the number of bedrooms. This means that the impact of the district on the monthly rental fee is weaker for small apartments and studio flats. Of course, this is just a summary analysis of the phenomenon because, as shown previously, the flat types are not equally distributed among Districts, but it still gives a good insight into the tendency of the market.

In the boxplots below, showing the distribution of the price by number of bedrooms, it is easy to identify the presence of outliers, typically apartments located in the Districts of South Kensington, Chelsea or Mayfair.

There may be several reasons for such market behaviour:

  • the price of a studio flat or a one-bedroom flat is too high compared to the average salary, so people find it hard to afford, which makes it less appealing on the market
  • the flats are old and poorly maintained
  • renters prefer other channels to post this specific category of flat

There may be other reasons as well.

Wealth distribution

The analysis of the housing market mirrors the wealth distribution in London. Wealth ownership in London is much more unevenly distributed than income. Social inequality is usually quantified with the Gini coefficient.

The coefficient ranges between 0, which indicates a completely equal population (in this case, everyone has the same level of wealth), and 1, which indicates complete inequality, where the entire wealth is owned by a single person. In other words, the higher the Gini coefficient, the higher the level of inequality. The Gini coefficient for wealth in London is 0.67, compared with 0.61 in Great Britain as a whole. The Gini coefficient for income in London is 0.37, much lower.
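
For readers who want to see how such a coefficient is obtained, here is a minimal sketch based on the mean absolute difference between all pairs of values. The sample populations are invented and have nothing to do with the London figures quoted above.

```python
def gini(values):
    """Gini coefficient: sum of absolute pairwise differences
    divided by 2 * n * total (equivalently 2 * n^2 * mean)."""
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one value and a positive total")
    pairwise = sum(abs(a - b) for a in values for b in values)
    return pairwise / (2 * n * total)


# Invented populations, for illustration only.
equal_wealth = [5, 5, 5, 5]      # everyone owns the same amount
one_owner = [0, 0, 0, 1]         # a single person owns everything

print(gini(equal_wealth))
print(gini(one_owner))
```

With a finite population of n people, the maximum of this formula is (n - 1) / n, which approaches 1 as the population grows; that is why the value for four people with a single owner stays below 1.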

Are earnings and incomes in London keeping pace with house price rises?

Demographia’s annual survey of international housing affordability suggests that London presents high multiples by international standards. The median multiple is calculated as the Median House Price divided by the Median Household Income and indicates the affordability or unaffordability of the housing market. Based on national data from Q3 2018, London (Greater London Authority) is rated the tenth least affordable of 91 major metropolitan markets, with an estimated median multiple of 8.1.
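
As a small worked example of the median multiple, the snippet below applies the definition to invented figures, chosen only so that the result matches the 8.1 mentioned above; they are not Demographia's data.

```python
from statistics import median


def median_multiple(house_prices, household_incomes):
    """Median house price divided by median household income."""
    return median(house_prices) / median(household_incomes)


# Invented figures, for illustration only.
house_prices = [300_000, 405_000, 520_000]
household_incomes = [35_000, 50_000, 80_000]

multiple = median_multiple(house_prices, household_incomes)
print(round(multiple, 1))
```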

The Final Input

The goal of this analysis is not only to inform people about the rentals in London but also to provide a way to filter the market based on their needs.

The repository I uploaded on GitHub contains a Jupyter Notebook with the entire market analysis process and, at the end, a cell that lets the user input some preferences: an address with venues the user is interested in, the number of bedrooms and the maximum budget.

After the user has submitted the information, the algorithm analyzes the address and provides a list of offers filtered by venue similarity.
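
The final filtering step could look roughly like the sketch below. It assumes that the cluster of the user's address has already been predicted by the K-Means model and is passed in as a number; the sample offers, the key names and the function filter_offers are hypothetical, and sorting the matches from cheapest to most expensive is my own choice.

```python
def filter_offers(offers, cluster, bedrooms, max_price):
    """Offers in the given cluster with the requested number of bedrooms
    and a price within budget, cheapest first."""
    matches = [
        offer for offer in offers
        if offer["cluster"] == cluster
        and offer["number_bedrooms"] == bedrooms
        and offer["price"] <= max_price
    ]
    return sorted(matches, key=lambda offer: offer["price"])


# Hypothetical clustered offers, for illustration only.
offers = [
    {"address": "Flat 1", "cluster": 4, "number_bedrooms": 2, "price": 2100},
    {"address": "Flat 2", "cluster": 4, "number_bedrooms": 2, "price": 1800},
    {"address": "Flat 3", "cluster": 4, "number_bedrooms": 1, "price": 1400},
    {"address": "Flat 4", "cluster": 7, "number_bedrooms": 2, "price": 1700},
    {"address": "Flat 5", "cluster": 4, "number_bedrooms": 2, "price": 2600},
]

# Assumed user input: cluster 4 (from the address), 2 bedrooms, budget of 2,200.
shortlist = filter_offers(offers, cluster=4, bedrooms=2, max_price=2200)
for offer in shortlist:
    print(offer["address"], offer["price"])
```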

Conclusion

I developed this project to test, once again, my ability to tackle a business problem by implementing a data-based solution. The entire process has been developed using Python, and all the documentation, the screenshots and the data I used are available on GitHub.

People can achieve better outcomes if they can make well-informed decisions. Providing ways to make wise decisions is essential for a future free of wasted resources.

Thank you!

References:

[1] https://www.rightmove.co.uk/

[2] Wikipedia London Postal Districts

[3] Rightmove web scraper

[4] Foursquare API

[5] OpenCageData API
