Predicting Property Market Trends: The Role of Data Science, Sentiment Analysis, and Time-Series Forecasting
Introduction
Full code and sample plots are available on GitHub; see the link at the end of this article.
In today's dynamic and rapidly evolving property market, making informed decisions is more critical than ever. Accurate predictions about property prices can provide invaluable insights for investors, homeowners, and policymakers alike. Traditionally, forecasting property market trends relied heavily on historical data and economic indicators. However, the advent of data science has revolutionized this process, enabling more sophisticated and accurate predictions.
This article delves into the combined use of data science techniques, sentiment analysis, and time-series forecasting to predict property market trends. By leveraging these advanced methodologies, we can uncover patterns and insights that were previously inaccessible.
We will explore a case study focusing on Kensington and Chelsea, a prime real estate area in London, to illustrate how these tools can be applied to real-world scenarios. Our analysis covers historical data from 2020 to 2023, utilizing land registry data for accurate tracking. Beyond 2023, we employ advanced forecasting methods to predict future trends, providing a comprehensive view of the market.
Data Preparation
To source our data, we utilized the extensive datasets available from the UK Land Registry, which provides a wealth of information in the form of downloadable CSV files. The most recent dataset, titled UK-HPI-full-file-2024-01, was obtained from the official website. This file includes a derived back series for the new UK House Price Index (HPI). The UK HPI data is available from 1995 for England and Wales, 2004 for Scotland, and 2005 for Northern Ireland. Additionally, a longer back series has been derived by using the historic path of the Office for National Statistics HPI to construct a series back to 1968.
The CSV file was downloaded and stored in an HDFS (Hadoop Distributed File System) sub-directory. Apache Spark was employed as an ETL (Extract, Transform, Load) tool to load the CSV data into a Hive table, facilitating efficient data processing and querying. The Hive table contained 139,770 rows, with 349 records specifically pertaining to the Kensington and Chelsea region.
Each entry in the CSV file represents the average price for a given month. Thus, over a year, there are 12 entries for each property type. The property types included in the dataset are:
- Detached
- Semi-detached
- Terraced
- Flat
Additionally, the dataset includes an overall average price for each month, providing a comprehensive view of the property market trends for the specified period. This granular level of detail allows for robust analysis and accurate forecasting, forming the foundation of our predictive modelling efforts.
The load process using Spark is shown below.
import sys
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.functions import when
# Initialize SparkSession
appName = "DS"
spark = SparkSession.builder \
    .appName(appName) \
    .enableHiveSupport() \
    .getOrCreate()
# Set the log level to ERROR to reduce verbosity
sc = spark.sparkContext
sc.setLogLevel("ERROR")
##
## Get a DF first
##
#csvlocation="hdfs://rhes75:9000/ds/UK-HPI-full-file-2020-01.csv"
#csvlocation="hdfs://rhes75:9000/ds/UK-HPI-full-file-2023-06.csv"
csvlocation="hdfs://rhes75:9000/ds/UK-HPI-full-file-2024-01.csv"
rows = spark.read.csv(csvlocation, header="true").count()
print("\nnumber of rows from csv file is ",rows)
if rows == 0:
    print("Empty CSV directory, aborting!")
    sys.exit(1)
house_df = spark.read.csv(csvlocation, header="true")
house_df.printSchema()
# Replace null values with 0 before casting
for col_name in house_df.columns:
    house_df = house_df.withColumn(col_name, when(house_df[col_name].isNull(), 0).otherwise(house_df[col_name]))
# Cast columns to appropriate data types
for col_name in house_df.columns:
    if col_name in ("Date", "RegionName", "AreaCode"):
        house_df = house_df.withColumn(col_name, F.col(col_name).cast("string"))
    else:
        house_df = house_df.withColumn(col_name, F.col(col_name).cast("double"))
house_df.printSchema()
house_df.createOrReplaceTempView("tmp")
## Drop and recreate the staging table
DB = "DS"
tableName = "ukhouseprices"
fullyQualifiedTableName = DB + '.' + tableName
regionname = "Kensington and Chelsea"
spark.sql(f"""DROP TABLE IF EXISTS {fullyQualifiedTableName}_staging""")
sqltext = f"""
CREATE TABLE {fullyQualifiedTableName}_staging(
Datetaken string
, RegionName string
, AreaCode string
, AveragePrice double
, Index double
, IndexSA double
, oneMonthPercentChange double
, twelveMonthPercentChange double
, AveragePriceSA double
, SalesVolume double
, DetachedPrice double
, DetachedIndex double
, Detached1mPercentChange double
, Detached12mPercentChange double
, SemiDetachedPrice double
, SemiDetachedIndex double
, SemiDetached1mPercentChange double
, SemiDetached12mPercentChange double
, TerracedPrice double
, TerracedIndex double
, Terraced1mPercentChange double
, Terraced12mPercentChange double
, FlatPrice double
, FlatIndex double
, Flat1mPercentChange double
, Flat12mPercentChange double
, CashPrice double
, CashIndex double
, Cash1mPercentChange double
, Cash12mPercentChange double
, MortgagePrice double
, MortgageIndex double
, Mortgage1mPercentChange double
, Mortgage12mPercentChange double
, FTBPrice double
, FTBIndex double
, FTB1mPercentChange double
, FTB12mPercentChange double
, FOOPrice double
, FOOIndex double
, FOO1mPercentChange double
, FOO12mPercentChange double
, NewPrice double
, NewIndex double
, New1mPercentChange double
, New12mPercentChange double
, OldPrice double
, OldIndex double
, Old1mPercentChange double
, Old12mPercentChange double
)
COMMENT 'Staging table for the UK HPI CSV data'
STORED AS PARQUET
"""
spark.sql(sqltext)
sqltext = f"""
INSERT INTO {fullyQualifiedTableName}_staging
SELECT
--TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(date,'dd/MM/yyyy'),'yyyy-MM-dd')) AS datetaken
date AS datetaken,
RegionName,
AreaCode,
AveragePrice,
Index,
IndexSA,
`1m%Change`, -- Column names containing % must be escaped with backticks (here and below)
`12m%Change`,
AveragePriceSA,
SalesVolume,
DetachedPrice,
DetachedIndex,
`Detached1m%Change`,
`Detached12m%Change`,
SemiDetachedPrice,
SemiDetachedIndex,
`SemiDetached1m%Change`,
`SemiDetached12m%Change`,
TerracedPrice,
TerracedIndex,
`Terraced1m%Change`,
`Terraced12m%Change`,
FlatPrice,
FlatIndex,
`Flat1m%Change`,
`Flat12m%Change`,
CashPrice,
CashIndex,
`Cash1m%Change`,
`Cash12m%Change`,
MortgagePrice,
MortgageIndex,
`Mortgage1m%Change`,
`Mortgage12m%Change`,
FTBPrice,
FTBIndex,
`FTB1m%Change`,
`FTB12m%Change`,
FOOPrice,
FOOIndex,
`FOO1m%Change`,
`FOO12m%Change`,
NewPrice,
NewIndex,
`New1m%Change`,
`New12m%Change`,
OldPrice,
OldIndex,
`Old1m%Change`,
`Old12m%Change`
FROM tmp
"""
spark.sql(sqltext)
# now create and populate the main table
spark.sql(f"""DROP TABLE IF EXISTS {fullyQualifiedTableName}""")
sqltext = f"""
CREATE TABLE {fullyQualifiedTableName}(
Datetaken DATE COMMENT 'The year and month to which the monthly statistics apply'
, RegionName string COMMENT 'Name of geography (Country, Regional, County/Unitary/District Authority and London Borough)'
, AreaCode string COMMENT 'Code of geography (Country, Regional, County/Unitary/District Authority and London Borough)'
, AveragePrice double COMMENT 'Average house price for a geography in a particular period'
, Index double COMMENT 'House price index for a geography in a particular period (January 2015=100).'
, IndexSA double COMMENT 'Seasonally adjusted house price for a geography in a particular period (January 2015=100).'
, oneMonthPercentChange double COMMENT 'The percentage change in the Average Price compared to the previous month'
, twelveMonthPercentChange double COMMENT 'The percentage change in the Average Price compared to the same period twelve months earlier.'
, AveragePriceSA double COMMENT 'Seasonally adjusted Average Price for a geography in a particular period'
, SalesVolume double COMMENT 'Number of registered transactions for a geography in a particular period'
, DetachedPrice double COMMENT 'Average house price for a particular property type (such as detached houses), for a geography in a particular period.'
, DetachedIndex double COMMENT 'House price index for a particular property type (such as detached houses), for a geography in a particular period (January 2015=100).'
, Detached1mPercentChange double COMMENT 'The percentage change in the [Property Type Price (such as detached houses) compared to the previous month'
, Detached12mPercentChange double COMMENT 'The percentage change in the [Property Type Price (such as detached houses) compared to the same period twelve months earlier.'
, SemiDetachedPrice double
, SemiDetachedIndex double
, SemiDetached1mPercentChange double
, SemiDetached12mPercentChange double
, TerracedPrice double
, TerracedIndex double
, Terraced1mPercentChange double
, Terraced12mPercentChange double
, FlatPrice double
, FlatIndex double
, Flat1mPercentChange double
, Flat12mPercentChange double
, CashPrice double COMMENT 'Average house price by funding status (such as cash), for a geography in a particular period.'
, CashIndex double COMMENT 'House price index by funding status (such as cash), for a geography in a particular period (January 2015=100).'
, Cash1mPercentChange double
, Cash12mPercentChange double
, MortgagePrice double COMMENT 'Average house price by funding status (such as cash), for a geography in a particular period.'
, MortgageIndex double COMMENT 'House price index by funding status (such as cash), for a geography in a particular period (January 2015=100).'
, Mortgage1mPercentChange double
, Mortgage12mPercentChange double
, FTBPrice double COMMENT 'Average house price by buyer status (such as first time buyer/former owner occupier), for a geography in a particular period.'
, FTBIndex double COMMENT 'House price index by buyer status (such as first time buyer/former owner occupier), for a geography in a particular period. (January 2015=100).'
, FTB1mPercentChange double
, FTB12mPercentChange double
, FOOPrice double COMMENT 'Average house price by buyer status (such as first time buyer/former owner occupier), for a geography in a particular period.'
, FOOIndex double COMMENT 'House price index by buyer status (such as first time buyer/former owner occupier), for a geography in a particular period. (January 2015=100).'
, FOO1mPercentChange double
, FOO12mPercentChange double
, NewPrice double COMMENT 'Average house price by property status (such as new or existing property), for a geography in a particular period.'
, NewIndex double COMMENT 'House price index by property status (such as new or existing property), for a geography in a particular period. (January 2015=100).'
, New1mPercentChange double
, New12mPercentChange double
, OldPrice double COMMENT 'Average house price by property status (such as new or existing property), for a geography in a particular period.'
, OldIndex double COMMENT 'House price index by property status (such as new or existing property), for a geography in a particular period. (January 2015=100).'
, Old1mPercentChange double
, Old12mPercentChange double
)
COMMENT 'from csv file UK-HPI-full-file-2024-01.csv through the staging table'
STORED AS PARQUET
TBLPROPERTIES ( "parquet.compression"="SNAPPY" )
"""
spark.sql(sqltext)
sqltext = f"""
INSERT INTO {fullyQualifiedTableName}
SELECT
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(datetaken,'dd/MM/yyyy'),'yyyy-MM-dd')) AS datetaken
, RegionName
, AreaCode
, AveragePrice
, Index
, IndexSA
, oneMonthPercentChange
, twelveMonthPercentChange
, AveragePriceSA
, SalesVolume
, DetachedPrice
, DetachedIndex
, Detached1mPercentChange
, Detached12mPercentChange
, SemiDetachedPrice
, SemiDetachedIndex
, SemiDetached1mPercentChange
, SemiDetached12mPercentChange
, TerracedPrice
, TerracedIndex
, Terraced1mPercentChange
, Terraced12mPercentChange
, FlatPrice
, FlatIndex
, Flat1mPercentChange
, Flat12mPercentChange
, CashPrice
, CashIndex
, Cash1mPercentChange
, Cash12mPercentChange
, MortgagePrice
, MortgageIndex
, Mortgage1mPercentChange
, Mortgage12mPercentChange
, FTBPrice
, FTBIndex
, FTB1mPercentChange
, FTB12mPercentChange
, FOOPrice
, FOOIndex
, FOO1mPercentChange
, FOO12mPercentChange
, NewPrice
, NewIndex
, New1mPercentChange
, New12mPercentChange
, OldPrice
, OldIndex
, Old1mPercentChange
, Old12mPercentChange
FROM {fullyQualifiedTableName}_staging
"""
spark.sql(sqltext)
spark.sql(f"""DROP TABLE IF EXISTS {fullyQualifiedTableName}_staging""")
spark.sql(f"""DROP TABLE IF EXISTS {DB}.summary""")
sqltext = f"""
CREATE TABLE {DB}.summary
COMMENT 'summary table with non null columns'
STORED AS PARQUET
TBLPROPERTIES ( "parquet.compression"="SNAPPY" )
AS
SELECT
datetaken
, regionname
, areacode
, averageprice
, index
, salesvolume
, detachedprice
, detachedindex
, semidetachedprice
, semidetachedindex
, terracedprice
, terracedindex
, flatprice
, flatindex
, cashprice
, cashindex
, mortgageprice
, mortgageindex
, ftbprice
, ftbindex
, fooprice
, fooindex
, newindex
, oldprice
, oldindex
FROM {fullyQualifiedTableName}
WHERE regionname = '{regionname}'
"""
spark.sql(sqltext)
rows = spark.sql(f"""SELECT COUNT(1) FROM {fullyQualifiedTableName}""").collect()[0][0]
print(f"number of rows from {fullyQualifiedTableName} is {rows}")
rows = spark.sql(f"""SELECT COUNT(1) FROM {DB}.summary""").collect()[0][0]
print(f"number of rows in {DB}.summary table is {rows}")
sys.exit()
Once the CSV file has been loaded into the Hive table, it can be read using a class method in Spark, as shown below.
# Load data from Hive table
def load_data(self):
    DSDB = "DS"
    tableName = "ukhouseprices"
    fullyQualifiedTableName = f"{DSDB}.{tableName}"
    if self.spark.sql(f"SHOW TABLES IN {DSDB} LIKE '{tableName}'").count() == 1:
        self.spark.sql(f"ANALYZE TABLE {fullyQualifiedTableName} COMPUTE STATISTICS")
        rows = self.spark.sql(f"SELECT COUNT(1) FROM {fullyQualifiedTableName}").collect()[0][0]
        print(f"\nTotal number of rows in table {fullyQualifiedTableName} is {rows}\n")
    else:
        print(f"No such table {fullyQualifiedTableName}")
        sys.exit(1)
    # Create a DataFrame from the loaded data
    house_df = self.spark.sql(f"SELECT * FROM {fullyQualifiedTableName}")
    house_df.printSchema()
    return house_df
From the loaded data, we selected the entries for Kensington and Chelsea from the beginning of 2020 until the end of 2023 (the last entries in the CSV file). This gives a monthly series for each property type (flat, detached, semi-detached and terraced), plus the derived overall average for each month. An example for average price is shown below.
RegionName = 'Kensington and Chelsea'
new_string = RegionName.replace(" ", "")
StartDate = '2020-01-01'
EndDate = '2023-12-31'
print(f"\nAnalysis for London Borough of {RegionName} for period {StartDate} till {EndDate}\n")
# Filter, select, and order by AveragePrice in descending order
df3 = house_df_final.filter(
(F.col('RegionName') == RegionName) &
(F.col('AveragePrice').isNotNull()) &
(F.col('DateTaken') >= StartDate) &
(F.col('DateTaken') <= EndDate)
).select(
F.col('AveragePrice').alias("Average Price")
).orderBy(F.col('AveragePrice').desc())
df3.show(100,False)
The above can then be repeated for every property type, with the results stored in a JSON file for ease of access and plotting. The JSON file content is shown below.
cat property_data.json
{
"DateTaken": ["2021-02-01", "2021-03-01", "2021-04-01", "2021-05-01", "2021-06-01",
"2021-07-01", "2021-08-01", "2021-09-01", "2021-10-01", "2021-11-01",
"2021-12-01", "2022-01-01", "2022-02-01", "2022-03-01", "2022-04-01",
"2022-05-01", "2022-06-01", "2022-07-01", "2022-08-01", "2022-09-01",
"2022-10-01", "2022-11-01", "2022-12-01", "2023-01-01", "2023-02-01",
"2023-03-01", "2023-04-01", "2023-05-01", "2023-06-01", "2023-07-01",
"2023-08-01", "2023-09-01", "2023-10-01", "2023-11-01", "2023-12-01"],
"Average Price": [1265469.0, 1224768.0, 1198159.0, 1223380.0, 1196736.0,
1321844.0, 1445633.0, 1515740.0, 1527601.0, 1447296.0,
1410458.0, 1371047.0, 1374360.0, 1441513.0, 1513572.0,
1518352.0, 1520758.0, 1503688.0, 1517723.0, 1527039.0,
1410576.0, 1376685.0, 1304428.0, 1364872.0, 1350849.0,
1384497.0, 1334814.0, 1363774.0, 1400045.0, 1376043.0,
1366026.0, 1315895.0, 1375884.0, 1268656.0, 1199227.0],
"Detached Price": [3339643.0, 3210883.0, 3180273.0, 3261376.0, 3209506.0,
3609718.0, 4017462.0, 4257942.0, 4319779.0, 4105235.0,
3991142.0, 3860414.0, 3850122.0, 4126667.0, 4324868.0,
4397984.0, 4346722.0, 4320823.0, 4369358.0, 4494141.0,
4125761.0, 4024638.0, 3769144.0, 4021058.0, 3956980.0,
4086739.0, 3917525.0, 3954089.0, 4041221.0, 3980686.0,
4016724.0, 3854223.0, 3982877.0, 3628941.0, 3301387.0],
"Semi Detached Price": [3457594.0, 3330070.0, 3269217.0, 3341125.0, 3271765.0,
3667087.0, 4100628.0, 4342912.0, 4403758.0, 4171473.0,
4059804.0, 3944470.0, 3920283.0, 4178471.0, 4393418.0,
4455597.0, 4451900.0, 4415196.0, 4458491.0, 4513078.0,
4158258.0, 4058165.0, 3821416.0, 4035780.0, 3984685.0,
4111878.0, 3953592.0, 4021585.0, 4125853.0, 4067773.0,
4082121.0, 3923799.0, 4069382.0, 3719860.0, 3420758.0],
"Terraced Price": [2347194.0, 2260217.0, 2218038.0, 2265759.0, 2227464.0,
2475419.0, 2741704.0, 2887687.0, 2919685.0, 2765163.0,
2687816.0, 2616501.0, 2605946.0, 2756785.0, 2909453.0,
2940172.0, 2953079.0, 2912599.0, 2941838.0, 2960397.0,
2741789.0, 2682415.0, 2540540.0, 2661903.0, 2620349.0,
2680795.0, 2572512.0, 2632418.0, 2716269.0, 2671069.0,
2665975.0, 2561674.0, 2662450.0, 2436004.0, 2259569.0],
"Flat Price": [1099165.0, 1065405.0, 1041050.0, 1062748.0, 1037927.0,
1143756.0, 1245289.0, 1303428.0, 1312071.0, 1243202.0,
1212598.0, 1178268.0, 1183809.0, 1237634.0, 1297378.0,
1297847.0, 1299009.0, 1285290.0, 1296957.0, 1304349.0,
1204022.0, 1174193.0, 1113022.0, 1163575.0, 1153696.0,
1182815.0, 1142170.0, 1166603.0, 1195780.0, 1174990.0,
1164021.0, 1122277.0, 1176026.0, 1087318.0, 1034924.0]
}
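For reference, here is a minimal sketch of how such a file could be assembled from the Spark DataFrame. The helper function, its name, and the output path are illustrative assumptions rather than the exact production code:

import json
from pyspark.sql import functions as F

def build_property_json(house_df_final, region, start_date, end_date, out_path):
    # Illustrative helper: collect one price column as an ordered monthly series
    def monthly_series(col):
        return (house_df_final
                .filter((F.col('RegionName') == region) &
                        (F.col(col).isNotNull()) &
                        (F.col('DateTaken').between(start_date, end_date)))
                .orderBy('DateTaken')
                .select('DateTaken', col)
                .collect())

    base = monthly_series('AveragePrice')
    data = {
        "DateTaken": [str(r['DateTaken']) for r in base],
        "Average Price": [r['AveragePrice'] for r in base],
        "Detached Price": [r['DetachedPrice'] for r in monthly_series('DetachedPrice')],
        "Semi Detached Price": [r['SemiDetachedPrice'] for r in monthly_series('SemiDetachedPrice')],
        "Terraced Price": [r['TerracedPrice'] for r in monthly_series('TerracedPrice')],
        "Flat Price": [r['FlatPrice'] for r in monthly_series('FlatPrice')],
    }
    with open(out_path, 'w') as f:
        json.dump(data, f, indent=2)

# Example usage (assumed values):
# build_property_json(house_df_final, 'Kensington and Chelsea', '2020-01-01', '2023-12-31', 'property_data.json')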
This file can then be read in the main Python method at the beginning of the analysis process and passed to the relevant classes to be used in their constructors, as shown below:
# Load data from JSON file
import json

with open('/python/genai/data/property_data.json', 'r') as f:
    data = json.load(f)
# pass this file to every class that needs it
predictor = PropertyPricePredictor(data)
Analysis of data
To start analysing the data, we need to deploy visualisation and plotting tools, such as pandas and matplotlib, among others:
import matplotlib.pyplot as plt
import pandas as pd
We then create a Pandas DataFrame from the Spark DataFrame as shown below:
p_dfm = df3.toPandas() # converting spark DF to Pandas DF
Once that is done, we can use the Pandas DataFrame for plotting. There are several ways to get information about the columns in a Pandas DataFrame; for example:
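A minimal sketch using standard pandas calls on the p_dfm DataFrame created above:

# Quick ways to inspect a Pandas DataFrame's columns
print(p_dfm.columns.tolist())   # column names as a list
print(p_dfm.dtypes)             # data type of each column
p_dfm.info()                    # non-null counts and dtypes in one summary
print(p_dfm.describe())         # descriptive statistics for numeric columns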
Plotting the distribution of average prices for Kensington and Chelsea from 2020-2023
# Plot the distribution of AveragePrice
plt.figure(figsize=(10, 6))
plt.hist(p_dfm['Average Price'], bins=30, edgecolor='black')
plt.xlabel('Average Price', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title(f'Distribution of Average price in {RegionName} from {StartDate} to {EndDate}', fontsize=16)
plt.grid(True)
plt.savefig(f'{new_string}_AveragePricePaid.png')
# Show the plot
#plt.show()
plt.close()
The time-series plot is shown below.
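The code for this single-series plot was not listed above; here is a minimal sketch, assuming the p_dfm DataFrame and the RegionName, StartDate, EndDate and new_string variables defined earlier:

# Minimal sketch: average price over time as a single series
plt.figure(figsize=(12, 6))
plt.plot(p_dfm['DateTaken'], p_dfm['Average Price'], marker='o')
plt.xlabel('Date')
plt.ylabel('Average Price (£)')
plt.title(f'Average House Price in {RegionName} from {StartDate} to {EndDate}')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(f'{new_string}_AveragePriceTimeSeries.png')  # assumed file name
plt.close()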
A combined plot for all property types for the same period is shown below, together with the associated code:
plt.figure(figsize=(12, 6))
# Plot each house type price
plt.plot(p_dfm['DateTaken'], p_dfm['Average Price'], label='Average Price', marker='o')
plt.plot(p_dfm['DateTaken'], p_dfm['Detached Price'], label='Detached Price', marker='o')
plt.plot(p_dfm['DateTaken'], p_dfm['Semi Detached Price'], label='Semi Detached Price', marker='o')
plt.plot(p_dfm['DateTaken'], p_dfm['Terraced Price'], label='Terraced Price', marker='o')
plt.plot(p_dfm['DateTaken'], p_dfm['Flat Price'], label='Flat Price', marker='o')
# Adding titles and labels
plt.xlabel('Date')
plt.ylabel('Price (£)')
plt.title(f"UK Registered House Prices in {RegionName} From {StartDate} until {EndDate}")
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(f'{new_string}_CombinedPricePaid_from_{StartDate}_until_{EndDate}.png')
Interpretation of data
The plot displays the trends in registered house prices for different types of properties in Kensington and Chelsea from January 1, 2020, to December 31, 2023. Here are the key points to note:
1. Property Types:
- Average Price (Blue)
- Detached Price (Orange)
- Semi Detached Price (Green)
- Terraced Price (Red)
- Flat Price (Purple)
2. Y-Axis (Price in £): Prices are shown in millions of pounds (£). Flat prices are the lowest, with the overall average just above them, while detached and semi-detached prices are far higher.
3. X-Axis (Date): The time period spans from early 2020 to the end of 2023. The x-axis labels are dates with a format showing the year and month.
4. General Trends:
- Semi Detached Price: This has been consistently the highest among all property types. There is an observable peak around mid-2022.
- Detached Price: Closely follows the semi-detached prices but is slightly lower. This also peaks around the same time.
- Terraced Price: Lower than semi-detached and detached prices but higher than flats and average prices. It shows some fluctuations but a generally stable trend.
- Flat Price: This has the lowest price point among the specific property types, showing slight fluctuations but maintaining a relatively stable trend.
- Average Price: This represents the overall average across all property types and sits just above the flat prices, reflecting their combined trend.
5. Observations:
- There is a noticeable peak in property prices around mid-2022, particularly for semi-detached and detached properties.
- Post mid-2022, there is a downward trend in prices for all property types.
- The flat prices and average prices have the least fluctuation, indicating less volatility compared to detached and semi-detached properties.
- Detached and semi-detached properties follow similar trends, indicating they might be influenced by similar market factors.
6. Legend and Annotations:
- The legend in the top left corner clarifies which colour corresponds to which property type.
- The title of the plot specifies the geographical location (Kensington and Chelsea) and the time frame.
In summary, this plot shows that semi-detached and detached properties in Kensington and Chelsea are the most expensive and volatile in price. There was a significant price peak around mid-2022, followed by a general decline. Flats and the average prices are more stable and lower compared to other property types.
Average Prices for Each Property Type in Kensington and Chelsea from 2020 until the End of 2023
Property Type      Average Price
Average            £1,378,783.06
Detached           £3,921,601.46
Semi Detached      £3,991,464.06
Terraced           £2,642,818.49
Flat               £1,182,931.77
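These figures can be reproduced directly from the Pandas DataFrame; a minimal sketch, assuming the p_dfm DataFrame built from the JSON data above:

# Mean price per property type over the whole period
price_columns = ['Average Price', 'Detached Price', 'Semi Detached Price',
                 'Terraced Price', 'Flat Price']
for col in price_columns:
    print(f"{col}: £{p_dfm[col].mean():,.2f}")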
Future Trends: Justification for Using Multiple Techniques to Predict 2024 Data
In the context of forecasting property prices for Kensington and Chelsea in 2024, employing multiple prediction techniques offers several advantages over relying on a single model. Here is a detailed justification for the use of Linear Regression, Gradient Boosting, ARIMA, and Combined Predictions:
1. Diversity in Modelling Assumptions and Strengths
Each predictive model has its own set of assumptions, strengths, and weaknesses. By combining different models, we can leverage the unique advantages of each, leading to a more comprehensive and reliable forecast.
- Linear Regression:
- Assumption: Assumes a linear relationship between time (independent variable) and property prices (dependent variable).
- Strength: Simple and interpretable, provides a baseline trend.
- Weakness: May not capture non-linear patterns or complex interactions.
- Gradient Boosting:
- Assumption: Flexible, does not assume a specific form of relationship between variables.
- Strength: Capable of modelling complex, non-linear relationships and interactions in the data.
- Weakness: Can be sensitive to overfitting, especially with small datasets.
- ARIMA:
- Assumption: Captures temporal dependencies; assumes stationarity after differencing (a quick stationarity check is sketched after this list).
- Strength: Effective for time-series data, adapts to trends and seasonality.
- Weakness: Requires careful parameter tuning, may struggle with sudden changes or non-linear trends.
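As a hedged illustration of checking the stationarity assumption, the augmented Dickey-Fuller (ADF) test from statsmodels can be applied to the price series before and after differencing. The data variable refers to the JSON data loaded earlier; the rest is standard statsmodels API:

import pandas as pd
from statsmodels.tsa.stattools import adfuller

# ADF test: a low p-value (< 0.05) suggests the series is stationary
prices = pd.Series(data['Average Price'])
for label, series in [('levels', prices), ('first difference', prices.diff().dropna())]:
    stat, pvalue = adfuller(series)[:2]
    print(f"ADF on {label}: statistic={stat:.3f}, p-value={pvalue:.3f}")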
2. Enhanced Predictive Accuracy
Combining multiple models can improve predictive accuracy. Each model contributes to the overall prediction based on its ability to capture different aspects of the data.
- Linear Regression provides a foundational trend.
- Gradient Boosting captures intricate patterns and interactions.
- ARIMA accounts for temporal dependencies and recent trends.
By averaging these predictions, the combined model mitigates the individual weaknesses of each approach, leading to more robust and accurate forecasts.
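A minimal sketch of this averaging step, using the prediction arrays produced later in this article (the weighted variant is shown only as an illustrative alternative; the weights are assumptions, not tuned values):

import numpy as np

# Equal-weight ensemble, as used in the combined predictions below
combined = (linear_predictions + gb_predictions + np.asarray(arima_predictions)) / 3

# Illustrative alternative: unequal weights favouring the time-series model
weights = np.array([0.25, 0.25, 0.50])  # hypothetical weights
stacked = np.vstack([linear_predictions, gb_predictions, np.asarray(arima_predictions)])
weighted = weights @ stacked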
3. Risk Mitigation
Using multiple models reduces the risk of relying on a single method that may fail to capture important data characteristics. For instance, if the property market undergoes unexpected changes or anomalies, one model might be better suited to adapt than others. This diversification ensures that the overall prediction is less susceptible to errors specific to any one model.
4. Confidence Intervals and Uncertainty Quantification
The ARIMA model provides confidence intervals, which quantify the uncertainty in the predictions. This is crucial for decision-making, as it offers a range within which the true property prices are expected to lie. By combining this with other models, stakeholders get not only a point estimate but also a sense of the reliability of these estimates.
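For reference, a short sketch of how these intervals are obtained from the fitted SARIMAX results object used later in this article (the 95% level is an explicit assumption, matching the statsmodels default):

# 95% confidence intervals for a 12-step ARIMA forecast
forecast = arima_results.get_forecast(steps=12)
mean_forecast = forecast.predicted_mean
conf_int_95 = forecast.conf_int(alpha=0.05)  # lower and upper bounds per step
print(conf_int_95.head())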
5. Data Complexity and Non-Stationarity
Property prices are influenced by a multitude of factors including economic conditions, policy changes, and local market dynamics. Such complexity often leads to non-stationarity and non-linear patterns in the data. No single model is likely to capture all these aspects adequately. By using a combination of Linear Regression, Gradient Boosting, and ARIMA, we can better address these complexities and provide a more comprehensive forecast.
6. Stakeholder Confidence and Decision Support
Providing forecasts from multiple models can enhance the confidence of stakeholders (e.g., investors, policymakers, and real estate professionals) in the predictions. The rationale for using a combination of models demonstrates a thorough and thoughtful approach, which is likely to be more persuasive and credible.
The decision to use multiple prediction techniques for forecasting property prices in Kensington and Chelsea is driven by the need for accuracy, robustness, and comprehensive analysis. By leveraging the strengths of Linear Regression, Gradient Boosting, and ARIMA models, and combining their predictions, we achieve a balanced forecast that accounts for linear trends, complex patterns, and temporal dependencies. This approach not only enhances the reliability of the predictions but also provides stakeholders with a clear understanding of potential uncertainties, ultimately supporting more informed decision-making in the property market. The code used is shown below.
# Required imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from statsmodels.tsa.statespace.sarimax import SARIMAX

class PropertyPricePredictor:
    def __init__(self, data):
        self.scaler = StandardScaler()  # Initialize a scaler for normalizing data
        self.linear_model = LinearRegression()  # Initialize Linear Regression model
        self.gb_model = GradientBoostingRegressor()  # Initialize Gradient Boosting model
        self.data = data  # Store the input data

    def train_and_predict(self, RegionName, column_name):
        """
        Workflow for Linear Regression, Gradient Boosting Regressor, and ARIMA
        """
        new_string = RegionName.replace(" ", "")  # Remove spaces from region name for file naming
        df = pd.DataFrame(self.data)  # Convert the input data to a Pandas DataFrame
        # Check if 'DateTaken' is in the DataFrame
        if 'DateTaken' not in df.columns:
            raise KeyError("'DateTaken' column not found in the data")
        # Convert DateTaken to datetime format
        df['DateTaken'] = pd.to_datetime(df['DateTaken'])
        # Convert DateTaken to a numerical format (year and month as integer)
        df['year_month'] = df['DateTaken'].dt.year * 100 + df['DateTaken'].dt.month
        # Extract feature (year_month) and target (property price) columns
        X = df[['year_month']].values
        y = df[column_name]
        # Scale the feature data
        X_scaled = self.scaler.fit_transform(X)
        # Train the Linear Regression model
        self.linear_model.fit(X_scaled, y)
        # Train the Gradient Boosting model
        self.gb_model.fit(X_scaled, y)
        # Prepare future dates for prediction (next 12 months starting from January 2024)
        future_dates = pd.date_range(start='2024-01-01', periods=12, freq='MS')
        future_year_month = future_dates.year * 100 + future_dates.month
        future_year_month_scaled = self.scaler.transform(future_year_month.values.reshape(-1, 1))
        # Predict future prices using Linear Regression
        linear_predictions = self.linear_model.predict(future_year_month_scaled)
        # Predict future prices using Gradient Boosting
        gb_predictions = self.gb_model.predict(future_year_month_scaled)
        # Fit the ARIMA model on the historical data
        arima_model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
        arima_results = arima_model.fit()
        # Forecast future prices using the ARIMA model for the next 12 months
        forecast = arima_results.get_forecast(steps=12)
        arima_predictions = forecast.predicted_mean
        arima_conf_int = forecast.conf_int()  # Confidence intervals for the ARIMA predictions
        # Combine the predictions from Linear Regression, Gradient Boosting, and ARIMA
        combined_predictions = (linear_predictions + gb_predictions + arima_predictions) / 3
        # Print the predictions for each month
        display_name = column_name.replace(" Price", "")  # Clean the column name for display
        print(f"Property type: {display_name}")
        for date, lin_pred, gb_pred, arima_pred, combined_pred in zip(future_dates, linear_predictions, gb_predictions, arima_predictions, combined_predictions):
            print(f"Predicted {column_name} for {date.strftime('%Y-%m')} - Linear: {lin_pred:.2f}, Gradient Boosting: {gb_pred:.2f}, ARIMA: {arima_pred:.2f}, Combined: {combined_pred:.2f}")
        # Plot the historical data and predictions
        plt.figure(figsize=(14, 7))
        plt.plot(df['DateTaken'], df[column_name], label='Historical Data')  # Historical data
        # Plot predictions from each model
        plt.plot(future_dates, linear_predictions, label='Linear Regression Predictions', linestyle='--', marker='o')
        plt.plot(future_dates, gb_predictions, label='Gradient Boosting Predictions', linestyle='--', marker='x')
        plt.plot(future_dates, arima_predictions, label='ARIMA Predictions', linestyle='--', marker='d')
        plt.plot(future_dates, combined_predictions, label='Combined Predictions', linestyle='--', marker='^')
        # Add confidence intervals for ARIMA predictions
        plt.fill_between(future_dates, arima_conf_int.iloc[:, 0], arima_conf_int.iloc[:, 1], color='k', alpha=0.1, label='ARIMA Confidence Interval')
        # Customize the plot
        plt.xlabel('Date')
        plt.ylabel(column_name)
        plt.title(f'{RegionName} Historical and Predicted {column_name}')
        plt.legend()
        plt.grid(True)
        plt.savefig(f'{new_string}_combined_predictions_with_confidence_intervals.png')  # Save the plot as a PNG file
        # plt.show()  # Uncomment this line to display the plot interactively
Key Points Highlighted in the Comments in the Code Above:
- Initialization: The code initializes the necessary models and scaler, and stores the input data.
- Data Preparation: The train_and_predict method converts the input data to a DataFrame, checks for the necessary columns, and converts the date information into a numerical format for modelling.
- Feature Scaling: The feature data (year_month) is scaled using StandardScaler.
- Model Training: Linear Regression and Gradient Boosting models are trained on the scaled data.
- Future Predictions: Future dates are generated and scaled for prediction, and predictions are made using the Linear Regression and Gradient Boosting models.
- ARIMA (AutoRegressive Integrated Moving Average) Model: The ARIMA model is fitted to the historical data and future prices are forecast; confidence intervals for the ARIMA predictions are also obtained.
- Combining Predictions: Predictions from all three models are averaged to obtain combined predictions.
- Output and Visualization: Predictions are printed and plotted along with the historical data; the plot includes confidence intervals for the ARIMA predictions and is saved as a PNG file.
The plot is shown below
Interpretation of Data for Predictions as Shown on the Plot
The plot of the historical and predicted average property prices for Kensington and Chelsea, as shown in the provided diagram, presents a comprehensive view of past trends and future forecasts. Here’s a detailed interpretation of the data:
Historical Data (2021-2023)
1. Trends and Fluctuations
- 2021: Average prices started at about £1.27 million in February, dipped to just under £1.2 million by mid-year, then rose sharply to a peak of around £1.53 million in October.
- 2022: Prices recovered to a second peak above £1.5 million between April and September, before declining to around £1.3 million by the end of the year.
- 2023: The prices continued to fluctuate throughout 2023, showing some volatility but generally trending downward towards the end of the year, closing below £1.2 million.
2. Seasonality and Market Dynamics
- The historical data reflects typical market dynamics, including seasonal variations, economic conditions, and possibly policy changes affecting the property market.
Predicted Data (2024)
1. Linear Regression Predictions
- The Linear Regression model forecasts a stable trend for 2024, with average prices hovering around £1.375 million consistently across all months. This indicates a steady market without significant upswings or downturns according to the linear trend observed from historical data.
2. Gradient Boosting Predictions
- The Gradient Boosting model predicts a lower average price, approximately £1.201 million, which remains constant throughout 2024. This model, being more sensitive to complex patterns, suggests a stabilization at a lower price point, possibly reflecting the model's recognition of more recent downward trends or other complex interactions within the data.
3. ARIMA Predictions
- The ARIMA model provides a fluctuating forecast, starting at £1.261 million in January and peaking at £1.351 million in March before gradually declining to £1.200 million by December 2024. The ARIMA model captures temporal dependencies and seasonality, indicating potential periodic increases and decreases in prices.
4. Combined Predictions
- The Combined Predictions, an average of the outputs from the three models, suggest a moderate outlook. Starting at £1.279 million in January, the combined forecast peaks at £1.306 million in March and then shows slight fluctuations, ending at £1.259 million in December. This combined approach smooths out extreme predictions and provides a balanced forecast.
Confidence Intervals
1. ARIMA Confidence Interval
- The shaded area represents the confidence intervals for the ARIMA predictions, indicating the range within which the true average prices are expected to lie with a certain probability.
- The confidence interval widens over time, reflecting increasing uncertainty in the predictions as the forecast horizon extends. This is typical in time-series forecasting as the model's accuracy diminishes with time due to potential unforeseen market changes.
Key Insights
1. Stabilization vs. Fluctuation
- The Linear Regression and Gradient Boosting models suggest a more stable trend compared to the ARIMA model, which anticipates more fluctuations. This difference highlights the various aspects each model captures from the historical data.
2. Moderate Growth with Uncertainty
- The combined predictions suggest moderate growth with some level of uncertainty, as indicated by the confidence intervals. Stakeholders should be aware of the potential range of prices and not rely solely on point estimates.
3. Model Complementarity
- Using multiple models provides a comprehensive view. While Linear Regression offers a simple trend, Gradient Boosting captures complex interactions, and ARIMA models temporal dependencies. The combined prediction leverages these strengths, offering a balanced forecast.
4. Market Implications
- Investors, policymakers, and real estate professionals can use these insights to make informed decisions. The predictions suggest a relatively stable market in 2024, but the confidence intervals highlight the importance of considering potential variability.
Conclusion
The forecasted average property prices for Kensington and Chelsea in 2024, derived from Linear Regression, Gradient Boosting, and ARIMA models, along with their combined predictions, provide a good understanding of future market trends. This multi-model approach ensures a robust and reliable forecast, accounting for different aspects of the historical data and potential future variations. Stakeholders should use these insights to navigate the property market with a balanced perspective on expected trends and associated uncertainties.
Technical Artifacts and Python Packages Used
In this project, a range of technical artifacts and Python packages were employed to efficiently handle, process, and analyze the property price data. Below is an overview of the tools and libraries used:
Technical Artifacts
Apache Spark: Utilized for distributed data processing and analysis. Spark's ability to handle large datasets efficiently was crucial for this project.
HDFS (Hadoop Distributed File System): Used for storing the large CSV files containing the property price data.
Apache Hive: Employed to manage and query the data stored in HDFS. Hive provided a SQL-like interface to interact with the data.
Yarn (Yet Another Resource Negotiator): Managed the cluster resources for running Apache Spark jobs.
Jenkins: Implemented for continuous integration and continuous deployment (CI/CD) of the data processing pipelines.
Airflow: Used for orchestrating the ETL workflows and scheduling the data processing tasks (an illustrative DAG sketch follows this list).
Visual Studio Code (VSC): The primary development environment for writing and debugging the Python code.
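The article does not include the actual DAG definition; below is a minimal, hypothetical sketch of what an Airflow DAG orchestrating the Spark load job could look like. The dag_id, schedule, and application path are all assumptions for illustration:

from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical DAG: submits the Spark CSV-to-Hive load job once a month
with DAG(
    dag_id="ukhpi_property_load",  # assumed name
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",  # Airflow 2.4+ syntax; use schedule_interval on older versions
    catchup=False,
) as dag:
    load_hpi = SparkSubmitOperator(
        task_id="load_ukhpi_csv_to_hive",
        application="/python/genai/src/load_ukhpi.py",  # assumed path to the load script
        conn_id="spark_default",
    )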
Python Packages
Python: The main programming language used for data processing, analysis, and model implementation.
PySpark: A Python API for Apache Spark, used to perform data transformations and machine learning tasks on the Spark cluster.
PySpark ML: The machine learning library in PySpark, utilized for building and training the prediction models.
Pandas: A data manipulation and analysis library, used for handling and transforming data within the Python environment.
Matplotlib: A plotting library for creating static, animated, and interactive visualizations in Python. Used here to generate the prediction plots.
PyTorch: A deep learning library used for more advanced machine learning tasks and model development.
scikit-learn (sklearn): The machine learning library whose linear_model, ensemble, and preprocessing modules were used for the linear regression model, the gradient boosting model, and feature scaling respectively.
statsmodels: A library for estimating and testing statistical models. Utilized here for the ARIMA model implementation.
lmfit: A non-linear least-squares minimization and curve-fitting library, used for model fitting.
SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors): A model used for time series forecasting in the statsmodels library.
Disclaimer
Great care has been taken to make sure that the technical information presented in this article is accurate, but any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on its content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
The full code, plots, and instructions on how to run it are available at https://github.com/michTalebzadeh/property_market_trends. Please check the README file on GitHub.