Predicting Property Market Trends: The Role of Data Science, Sentiment Analysis, and Time-Series Forecasting
Introduction
Full code and sample plots are available on GitHub; see the link at the end of this article.
In today's dynamic and rapidly evolving property market, making informed decisions is more critical than ever. Accurate predictions about property prices can provide invaluable insights for investors, homeowners, and policymakers alike. Traditionally, forecasting property market trends relied heavily on historical data and economic indicators. However, the advent of data science has revolutionized this process, enabling more sophisticated and accurate predictions.
This article delves into the combined use of data science techniques, sentiment analysis, and time-series forecasting to predict property market trends. By leveraging these advanced methodologies, we can uncover patterns and insights that were previously inaccessible.
We will explore a case study focusing on Kensington and Chelsea, a prime real estate area in London, to illustrate how these tools can be applied to real-world scenarios. Our analysis covers historical data from 2020 to 2023, utilizing land registry data for accurate tracking. Beyond 2023, we employ advanced forecasting methods to predict future trends, providing a comprehensive view of the market.
Data Preparation
To source our data, we utilized the extensive datasets available from the UK Land Registry, which provides a wealth of information in the form of downloadable CSV files. The most recent dataset, titled UK-HPI-full-file-2024-01, was obtained from the official website. This file includes a derived back series for the new UK House Price Index (HPI). The UK HPI data is available from 1995 for England and Wales, 2004 for Scotland, and 2005 for Northern Ireland. Additionally, a longer back series has been derived by using the historic path of the Office for National Statistics HPI to construct a series back to 1968.
The CSV file was downloaded and stored in an HDFS (Hadoop Distributed File System) sub-directory. Apache Spark was employed as an ETL (Extract, Transform, Load) tool to load the CSV data into a Hive table, facilitating efficient data processing and querying. The Hive table contained 139,770 rows, with 349 records specifically pertaining to the Kensington and Chelsea region.
Each entry in the CSV file represents the average price for a given month. Thus, over a year, there are 12 entries for each property type. The property types included in the dataset are:
- Detached
- Semi-detached
- Terraced
- Flat
Additionally, the dataset includes an overall average price for each month, providing a comprehensive view of the property market trends for the specified period. This granular level of detail allows for robust analysis and accurate forecasting, forming the foundation of our predictive modelling efforts.
The load process using Spark is shown below.
import sys
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
from pyspark.sql.functions import when
# Initialize SparkSession
appName = "DS"
spark = SparkSession.builder \
    .appName(appName) \
    .enableHiveSupport() \
    .getOrCreate()
# Set the log level to ERROR to reduce verbosity
sc = spark.sparkContext
sc.setLogLevel("ERROR")
##
## Get a DF first
##
#csvlocation="hdfs://rhes75:9000/ds/UK-HPI-full-file-2020-01.csv"
#csvlocation="hdfs://rhes75:9000/ds/UK-HPI-full-file-2023-06.csv"
csvlocation="hdfs://rhes75:9000/ds/UK-HPI-full-file-2024-01.csv"
rows = spark.read.csv(csvlocation, header="true").count()
print("\nnumber of rows from csv file is ",rows)
if rows == 0:
    print("Empty CSV directory, aborting!")
    sys.exit(1)
house_df = spark.read.csv(csvlocation, header="true")
house_df.printSchema()
# Replace null values with 0 before casting
for col_name in house_df.columns:
    house_df = house_df.withColumn(col_name, when(house_df[col_name].isNull(), 0).otherwise(house_df[col_name]))
# Cast columns to appropriate data types
for col_name in house_df.columns:
    if col_name in ("Date", "RegionName", "AreaCode"):
        house_df = house_df.withColumn(col_name, F.col(col_name).cast("string"))
    else:
        house_df = house_df.withColumn(col_name, F.col(col_name).cast("double"))
house_df.printSchema()
house_df.createOrReplaceTempView("tmp")
## Drop and recreate the staging table
DB = "DS"
tableName = "ukhouseprices"
fullyQualifiedTableName = DB + '.' + tableName
regionname = "Kensington and Chelsea"
spark.sql(f"""DROP TABLE IF EXISTS {fullyQualifiedTableName}_staging""")
sqltext = f"""
CREATE TABLE {fullyQualifiedTableName}_staging(
Datetaken string
, RegionName string
, AreaCode string
, AveragePrice double
, Index double
, IndexSA double
, oneMonthPercentChange double
, twelveMonthPercentChange double
, AveragePriceSA double
, SalesVolume double
, DetachedPrice double
, DetachedIndex double
, Detached1mPercentChange double
, Detached12mPercentChange double
, SemiDetachedPrice double
, SemiDetachedIndex double
, SemiDetached1mPercentChange double
, SemiDetached12mPercentChange double
, TerracedPrice double
, TerracedIndex double
, Terraced1mPercentChange double
, Terraced12mPercentChange double
, FlatPrice double
, FlatIndex double
, Flat1mPercentChange double
, Flat12mPercentChange double
, CashPrice double
, CashIndex double
, Cash1mPercentChange double
, Cash12mPercentChange double
, MortgagePrice double
, MortgageIndex double
, Mortgage1mPercentChange double
, Mortgage12mPercentChange double
, FTBPrice double
, FTBIndex double
, FTB1mPercentChange double
, FTB12mPercentChange double
, FOOPrice double
, FOOIndex double
, FOO1mPercentChange double
, FOO12mPercentChange double
, NewPrice double
, NewIndex double
, New1mPercentChange double
, New12mPercentChange double
, OldPrice double
, OldIndex double
, Old1mPercentChange double
, Old12mPercentChange double
)
COMMENT 'Staging table for the UK HPI CSV data'
STORED AS PARQUET
"""
spark.sql(sqltext)
sqltext = f"""
INSERT INTO {fullyQualifiedTableName}_staging
SELECT
--TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(date,'dd/MM/yyyy'),'yyyy-MM-dd')) AS datetaken
date AS datetaken,
RegionName,
AreaCode,
AveragePrice,
Index,
IndexSA,
`1m%Change`, -- Column names containing % must be escaped with backticks (here and below)
`12m%Change`,
AveragePriceSA,
SalesVolume,
DetachedPrice,
DetachedIndex,
`Detached1m%Change`,
`Detached12m%Change`,
SemiDetachedPrice,
SemiDetachedIndex,
`SemiDetached1m%Change`,
`SemiDetached12m%Change`,
TerracedPrice,
TerracedIndex,
`Terraced1m%Change`,
`Terraced12m%Change`,
FlatPrice,
FlatIndex,
`Flat1m%Change`,
`Flat12m%Change`,
CashPrice,
CashIndex,
`Cash1m%Change`,
`Cash12m%Change`,
MortgagePrice,
MortgageIndex,
`Mortgage1m%Change`,
`Mortgage12m%Change`,
FTBPrice,
FTBIndex,
`FTB1m%Change`,
`FTB12m%Change`,
FOOPrice,
FOOIndex,
`FOO1m%Change`,
`FOO12m%Change`,
NewPrice,
NewIndex,
`New1m%Change`,
`New12m%Change`,
OldPrice,
OldIndex,
`Old1m%Change`,
`Old12m%Change`
FROM tmp
"""
spark.sql(sqltext)
# now create and populate the main table
spark.sql(f"""DROP TABLE IF EXISTS {fullyQualifiedTableName}""")
sqltext = f"""
CREATE TABLE {fullyQualifiedTableName}(
Datetaken DATE COMMENT 'The year and month to which the monthly statistics apply'
, RegionName string COMMENT 'Name of geography (Country, Regional, County/Unitary/District Authority and London Borough)'
, AreaCode string COMMENT 'Code of geography (Country, Regional, County/Unitary/District Authority and London Borough)'
, AveragePrice double COMMENT 'Average house price for a geography in a particular period'
, Index double COMMENT 'House price index for a geography in a particular period (January 2015=100).'
, IndexSA double COMMENT 'Seasonally adjusted house price for a geography in a particular period (January 2015=100).'
, oneMonthPercentChange double COMMENT 'The percentage change in the Average Price compared to the previous month'
, twelveMonthPercentChange double COMMENT 'The percentage change in the Average Price compared to the same period twelve months earlier.'
, AveragePriceSA double COMMENT 'Seasonally adjusted Average Price for a geography in a particular period'
, SalesVolume double COMMENT 'Number of registered transactions for a geography in a particular period'
, DetachedPrice double COMMENT 'Average house price for a particular property type (such as detached houses), for a geography in a particular period.'
, DetachedIndex double COMMENT 'House price index for a particular property type (such as detached houses), for a geography in a particular period (January 2015=100).'
, Detached1mPercentChange double COMMENT 'The percentage change in the [Property Type Price (such as detached houses) compared to the previous month'
, Detached12mPercentChange double COMMENT 'The percentage change in the [Property Type Price (such as detached houses) compared to the same period twelve months earlier.'
, SemiDetachedPrice double
, SemiDetachedIndex double
, SemiDetached1mPercentChange double
, SemiDetached12mPercentChange double
, TerracedPrice double
, TerracedIndex double
, Terraced1mPercentChange double
, Terraced12mPercentChange double
, FlatPrice double
, FlatIndex double
, Flat1mPercentChange double
, Flat12mPercentChange double
, CashPrice double COMMENT 'Average house price by funding status (such as cash), for a geography in a particular period.'
, CashIndex double COMMENT 'House price index by funding status (such as cash), for a geography in a particular period (January 2015=100).'
, Cash1mPercentChange double
, Cash12mPercentChange double
, MortgagePrice double COMMENT 'Average house price by funding status (such as cash), for a geography in a particular period.'
, MortgageIndex double COMMENT 'House price index by funding status (such as cash), for a geography in a particular period (January 2015=100).'
, Mortgage1mPercentChange double
, Mortgage12mPercentChange double
, FTBPrice double COMMENT 'Average house price by buyer status (such as first time buyer/former owner occupier), for a geography in a particular period.'
, FTBIndex double COMMENT 'House price index by buyer status (such as first time buyer/former owner occupier), for a geography in a particular period. (January 2015=100).'
, FTB1mPercentChange double
, FTB12mPercentChange double
, FOOPrice double COMMENT 'Average house price by buyer status (such as first time buyer/former owner occupier), for a geography in a particular period.'
, FOOIndex double COMMENT 'House price index by buyer status (such as first time buyer/former owner occupier), for a geography in a particular period. (January 2015=100).'
, FOO1mPercentChange double
, FOO12mPercentChange double
, NewPrice double COMMENT 'Average house price by property status (such as new or existing property), for a geography in a particular period.'
, NewIndex double COMMENT 'House price index by property status (such as new or existing property), for a geography in a particular period. (January 2015=100).'
, New1mPercentChange double
, New12mPercentChange double
, OldPrice double COMMENT 'Average house price by property status (such as new or existing property), for a geography in a particular period.'
, OldIndex double COMMENT 'House price index by property status (such as new or existing property), for a geography in a particular period. (January 2015=100).'
, Old1mPercentChange double
, Old12mPercentChange double
)
COMMENT 'from csv file UK-HPI-full-file-2024-01.csv through the staging table'
STORED AS PARQUET
TBLPROPERTIES ( "parquet.compression"="SNAPPY" )
"""
spark.sql(sqltext)
sqltext = f"""
INSERT INTO {fullyQualifiedTableName}
SELECT
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(datetaken,'dd/MM/yyyy'),'yyyy-MM-dd')) AS datetaken
, RegionName
, AreaCode
, AveragePrice
, Index
, IndexSA
, oneMonthPercentChange
, twelveMonthPercentChange
, AveragePriceSA
, SalesVolume
, DetachedPrice
, DetachedIndex
, Detached1mPercentChange
, Detached12mPercentChange
, SemiDetachedPrice
, SemiDetachedIndex
, SemiDetached1mPercentChange
, SemiDetached12mPercentChange
, TerracedPrice
, TerracedIndex
, Terraced1mPercentChange
, Terraced12mPercentChange
, FlatPrice
, FlatIndex
, Flat1mPercentChange
, Flat12mPercentChange
, CashPrice
, CashIndex
, Cash1mPercentChange
, Cash12mPercentChange
, MortgagePrice
, MortgageIndex
, Mortgage1mPercentChange
, Mortgage12mPercentChange
, FTBPrice
, FTBIndex
, FTB1mPercentChange
, FTB12mPercentChange
, FOOPrice
, FOOIndex
, FOO1mPercentChange
, FOO12mPercentChange
, NewPrice
, NewIndex
, New1mPercentChange
, New12mPercentChange
, OldPrice
, OldIndex
, Old1mPercentChange
, Old12mPercentChange
FROM {fullyQualifiedTableName}_staging
"""
spark.sql(sqltext)
spark.sql(f"""DROP TABLE IF EXISTS {fullyQualifiedTableName}_staging""")
spark.sql(f"""DROP TABLE IF EXISTS {DB}.summary""")
sqltext = f"""
CREATE TABLE {DB}.summary
COMMENT 'summary table with non null columns'
STORED AS PARQUET
TBLPROPERTIES ( "parquet.compression"="SNAPPY" )
AS
SELECT
datetaken
, regionname
, areacode
, averageprice
, index
, salesvolume
, detachedprice
, detachedindex
, semidetachedprice
, semidetachedindex
, terracedprice
, terracedindex
, flatprice
, flatindex
, cashprice
, cashindex
, mortgageprice
, mortgageindex
, ftbprice
, ftbindex
, fooprice
, fooindex
, newindex
, oldprice
, oldindex
FROM {fullyQualifiedTableName}
WHERE regionname = '{regionname}'
"""
spark.sql(sqltext)
rows = spark.sql(f"""SELECT COUNT(1) FROM {fullyQualifiedTableName}""").collect()[0][0]
print(f"number of rows from {fullyQualifiedTableName} is {rows}")
rows = spark.sql(f"""SELECT COUNT(1) FROM {DB}.summary""").collect()[0][0]
print(f"number of rows in {DB}.summary table is {rows}")
sys.exit()
Once the CSV file has been loaded into the Hive table, it can be read using a class method in Spark, as shown below.
# Load data from Hive table
def load_data(self):
    DSDB = "DS"
    tableName = "ukhouseprices"
    fullyQualifiedTableName = f"{DSDB}.{tableName}"
    if self.spark.sql(f"SHOW TABLES IN {DSDB} LIKE '{tableName}'").count() == 1:
        self.spark.sql(f"ANALYZE TABLE {fullyQualifiedTableName} COMPUTE STATISTICS")
        rows = self.spark.sql(f"SELECT COUNT(1) FROM {fullyQualifiedTableName}").collect()[0][0]
        print(f"\nTotal number of rows in table {fullyQualifiedTableName} is {rows}\n")
    else:
        print(f"No such table {fullyQualifiedTableName}")
        sys.exit(1)
    # Create a DataFrame from the loaded data
    house_df = self.spark.sql(f"SELECT * FROM {fullyQualifiedTableName}")
    house_df.printSchema()
    return house_df
From the loaded data, we selected the entries for Kensington and Chelsea from the beginning of 2020 until the end of 2023 (the last entries in the CSV file). This gives a monthly series for each property type (flat, detached, semi-detached and terraced), plus the derived overall average for each month. An example for average price is shown below.
RegionName = 'Kensington and Chelsea'
new_string = RegionName.replace(" ", "")
StartDate = '2020-01-01'
EndDate = '2023-12-31'
print(f"\nAnalysis for London Borough of {RegionName} for period {StartDate} till {EndDate}\n")
# Filter, select, and order by AveragePrice in descending order
df3 = house_df_final.filter(
(F.col('RegionName') == RegionName) &
(F.col('AveragePrice').isNotNull()) &
(F.col('DateTaken') >= StartDate) &
(F.col('DateTaken') <= EndDate)
).select(
F.col('AveragePrice').alias("Average Price")
).orderBy(F.col('AveragePrice').desc())
df3.show(100,False)
The above can then be repeated for every property type, with the results stored in a JSON file for ease of access and plotting. The JSON file content is shown below.
cat property_data.json
{
"DateTaken": ["2021-02-01", "2021-03-01", "2021-04-01", "2021-05-01", "2021-06-01",
"2021-07-01", "2021-08-01", "2021-09-01", "2021-10-01", "2021-11-01",
"2021-12-01", "2022-01-01", "2022-02-01", "2022-03-01", "2022-04-01",
"2022-05-01", "2022-06-01", "2022-07-01", "2022-08-01", "2022-09-01",
"2022-10-01", "2022-11-01", "2022-12-01", "2023-01-01", "2023-02-01",
"2023-03-01", "2023-04-01", "2023-05-01", "2023-06-01", "2023-07-01",
"2023-08-01", "2023-09-01", "2023-10-01", "2023-11-01", "2023-12-01"],
"Average Price": [1265469.0, 1224768.0, 1198159.0, 1223380.0, 1196736.0,
1321844.0, 1445633.0, 1515740.0, 1527601.0, 1447296.0,
1410458.0, 1371047.0, 1374360.0, 1441513.0, 1513572.0,
1518352.0, 1520758.0, 1503688.0, 1517723.0, 1527039.0,
1410576.0, 1376685.0, 1304428.0, 1364872.0, 1350849.0,
1384497.0, 1334814.0, 1363774.0, 1400045.0, 1376043.0,
1366026.0, 1315895.0, 1375884.0, 1268656.0, 1199227.0],
"Detached Price": [3339643.0, 3210883.0, 3180273.0, 3261376.0, 3209506.0,
3609718.0, 4017462.0, 4257942.0, 4319779.0, 4105235.0,
3991142.0, 3860414.0, 3850122.0, 4126667.0, 4324868.0,
4397984.0, 4346722.0, 4320823.0, 4369358.0, 4494141.0,
4125761.0, 4024638.0, 3769144.0, 4021058.0, 3956980.0,
4086739.0, 3917525.0, 3954089.0, 4041221.0, 3980686.0,
4016724.0, 3854223.0, 3982877.0, 3628941.0, 3301387.0],
"Semi Detached Price": [3457594.0, 3330070.0, 3269217.0, 3341125.0, 3271765.0,
3667087.0, 4100628.0, 4342912.0, 4403758.0, 4171473.0,
4059804.0, 3944470.0, 3920283.0, 4178471.0, 4393418.0,
4455597.0, 4451900.0, 4415196.0, 4458491.0, 4513078.0,
4158258.0, 4058165.0, 3821416.0, 4035780.0, 3984685.0,
4111878.0, 3953592.0, 4021585.0, 4125853.0, 4067773.0,
4082121.0, 3923799.0, 4069382.0, 3719860.0, 3420758.0],
"Terraced Price": [2347194.0, 2260217.0, 2218038.0, 2265759.0, 2227464.0,
2475419.0, 2741704.0, 2887687.0, 2919685.0, 2765163.0,
2687816.0, 2616501.0, 2605946.0, 2756785.0, 2909453.0,
2940172.0, 2953079.0, 2912599.0, 2941838.0, 2960397.0,
2741789.0, 2682415.0, 2540540.0, 2661903.0, 2620349.0,
2680795.0, 2572512.0, 2632418.0, 2716269.0, 2671069.0,
2665975.0, 2561674.0, 2662450.0, 2436004.0, 2259569.0],
"Flat Price": [1099165.0, 1065405.0, 1041050.0, 1062748.0, 1037927.0,
1143756.0, 1245289.0, 1303428.0, 1312071.0, 1243202.0,
1212598.0, 1178268.0, 1183809.0, 1237634.0, 1297378.0,
1297847.0, 1299009.0, 1285290.0, 1296957.0, 1304349.0,
1204022.0, 1174193.0, 1113022.0, 1163575.0, 1153696.0,
1182815.0, 1142170.0, 1166603.0, 1195780.0, 1174990.0,
1164021.0, 1122277.0, 1176026.0, 1087318.0, 1034924.0]
}
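For reference, here is a minimal sketch of how such a file could be assembled from the Spark DataFrame. The helper function, its name, and the output path are illustrative assumptions rather than the exact production code:

import json
from pyspark.sql import functions as F

def build_property_json(house_df_final, region, start_date, end_date, out_path):
    # Illustrative helper: collect one price column as an ordered monthly series
    def monthly_series(col):
        return (house_df_final
                .filter((F.col('RegionName') == region) &
                        (F.col(col).isNotNull()) &
                        (F.col('DateTaken').between(start_date, end_date)))
                .orderBy('DateTaken')
                .select('DateTaken', col)
                .collect())

    base = monthly_series('AveragePrice')
    data = {
        "DateTaken": [str(r['DateTaken']) for r in base],
        "Average Price": [r['AveragePrice'] for r in base],
        "Detached Price": [r['DetachedPrice'] for r in monthly_series('DetachedPrice')],
        "Semi Detached Price": [r['SemiDetachedPrice'] for r in monthly_series('SemiDetachedPrice')],
        "Terraced Price": [r['TerracedPrice'] for r in monthly_series('TerracedPrice')],
        "Flat Price": [r['FlatPrice'] for r in monthly_series('FlatPrice')],
    }
    with open(out_path, 'w') as f:
        json.dump(data, f, indent=2)

# Example usage (assumed values):
# build_property_json(house_df_final, 'Kensington and Chelsea', '2020-01-01', '2023-12-31', 'property_data.json')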
This file can then be read in the main Python method at the beginning of the analysis process and passed to the relevant classes to be used in their constructors, as shown below:
# Load data from JSON file
import json

with open('/python/genai/data/property_data.json', 'r') as f:
    data = json.load(f)
# pass this file to every class that needs it
predictor = PropertyPricePredictor(data)
Analysis of data
To start analysing the data, we need to deploy visualisation and plotting tools, such as pandas and matplotlib, among others:
import matplotlib.pyplot as plt
import pandas as pd
We then create a Pandas DataFrame from the Spark DataFrame as shown below:
p_dfm = df3.toPandas() # converting spark DF to Pandas DF
Once that is done, we can use the Pandas DataFrame for plotting. There are several ways to get information about the columns in a Pandas DataFrame; for example:
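A minimal sketch using standard pandas calls on the p_dfm DataFrame created above:

# Quick ways to inspect a Pandas DataFrame's columns
print(p_dfm.columns.tolist())   # column names as a list
print(p_dfm.dtypes)             # data type of each column
p_dfm.info()                    # non-null counts and dtypes in one summary
print(p_dfm.describe())         # descriptive statistics for numeric columns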
Plotting the distribution of average prices for Kensington and Chelsea from 2020-2023
# Plot the distribution of AveragePrice
plt.figure(figsize=(10, 6))
plt.hist(p_dfm['Average Price'], bins=30, edgecolor='black')
plt.xlabel('Average Price', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title(f'Distribution of Average price in {RegionName} from {StartDate} to {EndDate}', fontsize=16)
plt.grid(True)
plt.savefig(f'{new_string}_AveragePricePaid.png')
# Show the plot
#plt.show()
plt.close()
The time-series plot is shown below.
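The code for this single-series plot was not listed above; here is a minimal sketch, assuming the p_dfm DataFrame and the RegionName, StartDate, EndDate and new_string variables defined earlier:

# Minimal sketch: average price over time as a single series
plt.figure(figsize=(12, 6))
plt.plot(p_dfm['DateTaken'], p_dfm['Average Price'], marker='o')
plt.xlabel('Date')
plt.ylabel('Average Price (£)')
plt.title(f'Average House Price in {RegionName} from {StartDate} to {EndDate}')
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(f'{new_string}_AveragePriceTimeSeries.png')  # assumed file name
plt.close()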
A combined plot for all property types for the same period is shown below, together with the associated code:
plt.figure(figsize=(12, 6))
# Plot each house type price
plt.plot(p_dfm['DateTaken'], p_dfm['Average Price'], label='Average Price', marker='o')
plt.plot(p_dfm['DateTaken'], p_dfm['Detached Price'], label='Detached Price', marker='o')
plt.plot(p_dfm['DateTaken'], p_dfm['Semi Detached Price'], label='Semi Detached Price', marker='o')
plt.plot(p_dfm['DateTaken'], p_dfm['Terraced Price'], label='Terraced Price', marker='o')
plt.plot(p_dfm['DateTaken'], p_dfm['Flat Price'], label='Flat Price', marker='o')
# Adding titles and labels
plt.xlabel('Date')
plt.ylabel('Price (£)')
plt.title(f"UK Registered House Prices in {RegionName} From {StartDate} until {EndDate}")
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig(f'{new_string}_CombinedPricePaid_from_{StartDate}_until_{EndDate}.png')
Interpretation of data
The plot displays the trends in registered house prices for different types of properties in Kensington and Chelsea from January 1, 2020, to December 31, 2023. Here are the key points to note:
1. Property Types:
- Average Price (Blue)
- Detached Price (Orange)
- Semi Detached Price (Green)
- Terraced Price (Red)
- Flat Price (Purple)
2. Y-Axis (Price in £): Prices are shown in millions of pounds (£). Flat prices are the lowest, with the overall average just above them, while detached and semi-detached prices are far higher.
3. X-Axis (Date): The time period spans from early 2020 to the end of 2023. The x-axis labels are dates with a format showing the year and month.
4. General Trends:
- Semi Detached Price: This has been consistently the highest among all property types. There is an observable peak around mid-2022.
- Detached Price: Closely follows the semi-detached prices but is slightly lower. This also peaks around the same time.
- Terraced Price: Lower than semi-detached and detached prices but higher than flats and average prices. It shows some fluctuations but a generally stable trend.
- Flat Price: This has the lowest price point among the specific property types, showing slight fluctuations but maintaining a relatively stable trend.
- Average Price: This represents the overall average across all property types and sits just above the flat prices, reflecting their combined trend.
5. Observations:
- There is a noticeable peak in property prices around mid-2022, particularly for semi-detached and detached properties.
- Post mid-2022, there is a downward trend in prices for all property types.
- The flat prices and average prices have the least fluctuation, indicating less volatility compared to detached and semi-detached properties.
- Detached and semi-detached properties follow similar trends, indicating they might be influenced by similar market factors.
6. Legend and Annotations:
- The legend in the top left corner clarifies which colour corresponds to which property type.
- The title of the plot specifies the geographical location (Kensington and Chelsea) and the time frame.
In summary, this plot shows that semi-detached and detached properties in Kensington and Chelsea are the most expensive and volatile in price. There was a significant price peak around mid-2022, followed by a general decline. Flats and the average prices are more stable and lower compared to other property types.
Average Prices for Each Property Type in Kensington and Chelsea from 2020 until the End of 2023
Property Type      Average Price
Average            £1,378,783.06
Detached           £3,921,601.46
Semi Detached      £3,991,464.06
Terraced           £2,642,818.49
Flat               £1,182,931.77
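These figures can be reproduced directly from the Pandas DataFrame; a minimal sketch, assuming the p_dfm DataFrame built from the JSON data above:

# Mean price per property type over the whole period
price_columns = ['Average Price', 'Detached Price', 'Semi Detached Price',
                 'Terraced Price', 'Flat Price']
for col in price_columns:
    print(f"{col}: £{p_dfm[col].mean():,.2f}")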
Future Trends: Justification for Using Multiple Techniques to Predict 2024 Data
In the context of forecasting property prices for Kensington and Chelsea in 2024, employing multiple prediction techniques offers several advantages over relying on a single model. Here is a detailed justification for the use of Linear Regression, Gradient Boosting, ARIMA, and Combined Predictions:
1. Diversity in Modelling Assumptions and Strengths
Each predictive model has its own set of assumptions, strengths, and weaknesses. By combining different models, we can leverage the unique advantages of each, leading to a more comprehensive and reliable forecast.
- Linear Regression:
- Assumption: Assumes a linear relationship between time (independent variable) and property prices (dependent variable).
- Strength: Simple and interpretable, provides a baseline trend.
- Weakness: May not capture non-linear patterns or complex interactions.
- Gradient Boosting:
- Assumption: Flexible, does not assume a specific form of relationship between variables.
- Strength: Capable of modelling complex, non-linear relationships and interactions in the data.
- Weakness: Can be sensitive to overfitting, especially with small datasets.
- ARIMA:
- Assumption: Captures temporal dependencies; assumes stationarity after differencing (a quick stationarity check is sketched after this list).
- Strength: Effective for time-series data, adapts to trends and seasonality.
- Weakness: Requires careful parameter tuning, may struggle with sudden changes or non-linear trends.
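As a hedged illustration of checking the stationarity assumption, the augmented Dickey-Fuller (ADF) test from statsmodels can be applied to the price series before and after differencing. The data variable refers to the JSON data loaded earlier; the rest is standard statsmodels API:

import pandas as pd
from statsmodels.tsa.stattools import adfuller

# ADF test: a low p-value (< 0.05) suggests the series is stationary
prices = pd.Series(data['Average Price'])
for label, series in [('levels', prices), ('first difference', prices.diff().dropna())]:
    stat, pvalue = adfuller(series)[:2]
    print(f"ADF on {label}: statistic={stat:.3f}, p-value={pvalue:.3f}")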
2. Enhanced Predictive Accuracy
Combining multiple models can improve predictive accuracy. Each model contributes to the overall prediction based on its ability to capture different aspects of the data.
- Linear Regression provides a foundational trend.
- Gradient Boosting captures intricate patterns and interactions.
- ARIMA accounts for temporal dependencies and recent trends.
By averaging these predictions, the combined model mitigates the individual weaknesses of each approach, leading to more robust and accurate forecasts.
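A minimal sketch of this averaging step, using the prediction arrays produced later in this article (the weighted variant is shown only as an illustrative alternative; the weights are assumptions, not tuned values):

import numpy as np

# Equal-weight ensemble, as used in the combined predictions below
combined = (linear_predictions + gb_predictions + np.asarray(arima_predictions)) / 3

# Illustrative alternative: unequal weights favouring the time-series model
weights = np.array([0.25, 0.25, 0.50])  # hypothetical weights
stacked = np.vstack([linear_predictions, gb_predictions, np.asarray(arima_predictions)])
weighted = weights @ stacked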
3. Risk Mitigation
Using multiple models reduces the risk of relying on a single method that may fail to capture important data characteristics. For instance, if the property market undergoes unexpected changes or anomalies, one model might be better suited to adapt than others. This diversification ensures that the overall prediction is less susceptible to errors specific to any one model.
4. Confidence Intervals and Uncertainty Quantification
The ARIMA model provides confidence intervals, which quantify the uncertainty in the predictions. This is crucial for decision-making, as it offers a range within which the true property prices are expected to lie. By combining this with other models, stakeholders get not only a point estimate but also a sense of the reliability of these estimates.
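For reference, a short sketch of how these intervals are obtained from the fitted SARIMAX results object used later in this article (the 95% level is an explicit assumption, matching the statsmodels default):

# 95% confidence intervals for a 12-step ARIMA forecast
forecast = arima_results.get_forecast(steps=12)
mean_forecast = forecast.predicted_mean
conf_int_95 = forecast.conf_int(alpha=0.05)  # lower and upper bounds per step
print(conf_int_95.head())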
5. Data Complexity and Non-Stationarity
Property prices are influenced by a multitude of factors including economic conditions, policy changes, and local market dynamics. Such complexity often leads to non-stationarity and non-linear patterns in the data. No single model is likely to capture all these aspects adequately. By using a combination of Linear Regression, Gradient Boosting, and ARIMA, we can better address these complexities and provide a more comprehensive forecast.
6. Stakeholder Confidence and Decision Support
Providing forecasts from multiple models can enhance the confidence of stakeholders (e.g., investors, policymakers, and real estate professionals) in the predictions. The rationale for using a combination of models demonstrates a thorough and thoughtful approach, which is likely to be more persuasive and credible.
The decision to use multiple prediction techniques for forecasting property prices in Kensington and Chelsea is driven by the need for accuracy, robustness, and comprehensive analysis. By leveraging the strengths of Linear Regression, Gradient Boosting, and ARIMA models, and combining their predictions, we achieve a balanced forecast that accounts for linear trends, complex patterns, and temporal dependencies. This approach not only enhances the reliability of the predictions but also provides stakeholders with a clear understanding of potential uncertainties, ultimately supporting more informed decision-making in the property market. The code used is shown below.
# Required imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from statsmodels.tsa.statespace.sarimax import SARIMAX

class PropertyPricePredictor:
    def __init__(self, data):
        self.scaler = StandardScaler()  # Initialize a scaler for normalizing data
        self.linear_model = LinearRegression()  # Initialize Linear Regression model
        self.gb_model = GradientBoostingRegressor()  # Initialize Gradient Boosting model
        self.data = data  # Store the input data

    def train_and_predict(self, RegionName, column_name):
        """
        Workflow for Linear Regression, Gradient Boosting Regressor, and ARIMA
        """
        new_string = RegionName.replace(" ", "")  # Remove spaces from region name for file naming
        df = pd.DataFrame(self.data)  # Convert the input data to a Pandas DataFrame
        # Check if 'DateTaken' is in the DataFrame
        if 'DateTaken' not in df.columns:
            raise KeyError("'DateTaken' column not found in the data")
        # Convert DateTaken to datetime format
        df['DateTaken'] = pd.to_datetime(df['DateTaken'])
        # Convert DateTaken to a numerical format (year and month as integer)
        df['year_month'] = df['DateTaken'].dt.year * 100 + df['DateTaken'].dt.month
        # Extract feature (year_month) and target (property price) columns
        X = df[['year_month']].values
        y = df[column_name]
        # Scale the feature data
        X_scaled = self.scaler.fit_transform(X)
        # Train the Linear Regression model
        self.linear_model.fit(X_scaled, y)
        # Train the Gradient Boosting model
        self.gb_model.fit(X_scaled, y)
        # Prepare future dates for prediction (next 12 months starting from January 2024)
        future_dates = pd.date_range(start='2024-01-01', periods=12, freq='MS')
        future_year_month = future_dates.year * 100 + future_dates.month
        future_year_month_scaled = self.scaler.transform(future_year_month.values.reshape(-1, 1))
        # Predict future prices using Linear Regression
        linear_predictions = self.linear_model.predict(future_year_month_scaled)
        # Predict future prices using Gradient Boosting
        gb_predictions = self.gb_model.predict(future_year_month_scaled)
        # Fit the ARIMA model on the historical data
        arima_model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
        arima_results = arima_model.fit()
        # Forecast future prices using the ARIMA model for the next 12 months
        forecast = arima_results.get_forecast(steps=12)
        arima_predictions = forecast.predicted_mean
        arima_conf_int = forecast.conf_int()  # Confidence intervals for the ARIMA predictions
        # Combine the predictions from Linear Regression, Gradient Boosting, and ARIMA
        combined_predictions = (linear_predictions + gb_predictions + arima_predictions) / 3
        # Print the predictions for each month
        display_name = column_name.replace(" Price", "")  # Clean the column name for display
        print(f"Property type: {display_name}")
        for date, lin_pred, gb_pred, arima_pred, combined_pred in zip(future_dates, linear_predictions, gb_predictions, arima_predictions, combined_predictions):
            print(f"Predicted {column_name} for {date.strftime('%Y-%m')} - Linear: {lin_pred:.2f}, Gradient Boosting: {gb_pred:.2f}, ARIMA: {arima_pred:.2f}, Combined: {combined_pred:.2f}")
        # Plot the historical data and predictions
        plt.figure(figsize=(14, 7))
        plt.plot(df['DateTaken'], df[column_name], label='Historical Data')  # Historical data
        # Plot predictions from each model
        plt.plot(future_dates, linear_predictions, label='Linear Regression Predictions', linestyle='--', marker='o')
        plt.plot(future_dates, gb_predictions, label='Gradient Boosting Predictions', linestyle='--', marker='x')
        plt.plot(future_dates, arima_predictions, label='ARIMA Predictions', linestyle='--', marker='d')
        plt.plot(future_dates, combined_predictions, label='Combined Predictions', linestyle='--', marker='^')
        # Add confidence intervals for ARIMA predictions
        plt.fill_between(future_dates, arima_conf_int.iloc[:, 0], arima_conf_int.iloc[:, 1], color='k', alpha=0.1, label='ARIMA Confidence Interval')
        # Customize the plot
        plt.xlabel('Date')
        plt.ylabel(column_name)
        plt.title(f'{RegionName} Historical and Predicted {column_name}')
        plt.legend()
        plt.grid(True)
        plt.savefig(f'{new_string}_combined_predictions_with_confidence_intervals.png')  # Save the plot as a PNG file
        # plt.show()  # Uncomment this line to display the plot interactively
Key Points Highlighted in the Comments in the Code Above:
- Initialization: The code initializes the necessary models and scaler, and stores the input data.
- Data Preparation: The train_and_predict method converts the input data to a DataFrame, checks for the necessary columns, and converts the date information into a numerical format for modelling.
- Feature Scaling: The feature data (year_month) is scaled using StandardScaler.
- Model Training: Linear Regression and Gradient Boosting models are trained on the scaled data.
- Future Predictions: Future dates are generated and scaled for prediction, and predictions are made using the Linear Regression and Gradient Boosting models.
- ARIMA (AutoRegressive Integrated Moving Average) Model: The ARIMA model is fitted to the historical data and future prices are forecast; confidence intervals for the ARIMA predictions are also obtained.
- Combining Predictions: Predictions from all three models are averaged to obtain combined predictions.
- Output and Visualization: Predictions are printed and plotted along with the historical data; the plot includes confidence intervals for the ARIMA predictions and is saved as a PNG file.
The plot is shown below
Interpretation of Data for Predictions as Shown on the Plot
The plot of the historical and predicted average property prices for Kensington and Chelsea, as shown in the provided diagram, presents a comprehensive view of past trends and future forecasts. Here’s a detailed interpretation of the data:
Historical Data (2021-2023)
1. Trends and Fluctuations
- 2021: Average prices started at about £1.27 million in February, dipped to just under £1.2 million by mid-year, then rose sharply to a peak of around £1.53 million in October.
- 2022: Prices recovered to a second peak above £1.5 million between April and September, before declining to around £1.3 million by the end of the year.
- 2023: The prices continued to fluctuate throughout 2023, showing some volatility but generally trending downward towards the end of the year, closing below £1.2 million.
2. Seasonality and Market Dynamics
- The historical data reflects typical market dynamics, including seasonal variations, economic conditions, and possibly policy changes affecting the property market.
Predicted Data (2024)
1. Linear Regression Predictions
- The Linear Regression model forecasts a stable trend for 2024, with average prices hovering around £1.375 million consistently across all months. This indicates a steady market without significant upswings or downturns according to the linear trend observed from historical data.
2. Gradient Boosting Predictions
- The Gradient Boosting model predicts a lower average price, approximately £1.201 million, which remains constant throughout 2024. This model, being more sensitive to complex patterns, suggests a stabilization at a lower price point, possibly reflecting the model's recognition of more recent downward trends or other complex interactions within the data.
3. ARIMA Predictions
- The ARIMA model provides a fluctuating forecast, starting at £1.261 million in January and peaking at £1.351 million in March before gradually declining to £1.200 million by December 2024. The ARIMA model captures temporal dependencies and seasonality, indicating potential periodic increases and decreases in prices.
4. Combined Predictions
- The Combined Predictions, an average of the outputs from the three models, suggest a moderate outlook. Starting at £1.279 million in January, the combined forecast peaks at £1.306 million in March and then shows slight fluctuations, ending at £1.259 million in December. This combined approach smooths out extreme predictions and provides a balanced forecast.
Confidence Intervals
1. ARIMA Confidence Interval
- The shaded area represents the confidence intervals for the ARIMA predictions, indicating the range within which the true average prices are expected to lie with a certain probability.
- The confidence interval widens over time, reflecting increasing uncertainty in the predictions as the forecast horizon extends. This is typical in time-series forecasting as the model's accuracy diminishes with time due to potential unforeseen market changes.
Key Insights
1. Stabilization vs. Fluctuation
- The Linear Regression and Gradient Boosting models suggest a more stable trend compared to the ARIMA model, which anticipates more fluctuations. This difference highlights the various aspects each model captures from the historical data.
2. Moderate Growth with Uncertainty
- The combined predictions suggest moderate growth with some level of uncertainty, as indicated by the confidence intervals. Stakeholders should be aware of the potential range of prices and not rely solely on point estimates.
3. Model Complementarity
- Using multiple models provides a comprehensive view. While Linear Regression offers a simple trend, Gradient Boosting captures complex interactions, and ARIMA models temporal dependencies. The combined prediction leverages these strengths, offering a balanced forecast.
4. Market Implications
- Investors, policymakers, and real estate professionals can use these insights to make informed decisions. The predictions suggest a relatively stable market in 2024, but the confidence intervals highlight the importance of considering potential variability.
Conclusion
The forecasted average property prices for Kensington and Chelsea in 2024, derived from Linear Regression, Gradient Boosting, and ARIMA models, along with their combined predictions, provide a good understanding of future market trends. This multi-model approach ensures a robust and reliable forecast, accounting for different aspects of the historical data and potential future variations. Stakeholders should use these insights to navigate the property market with a balanced perspective on expected trends and associated uncertainties.
Technical Artifacts and Python Packages Used
In this project, a range of technical artifacts and Python packages were employed to efficiently handle, process, and analyze the property price data. Below is an overview of the tools and libraries used:
Technical Artifacts
Apache Spark: Utilized for distributed data processing and analysis. Spark's ability to handle large datasets efficiently was crucial for this project.
HDFS (Hadoop Distributed File System): Used for storing the large CSV files containing the property price data.
Apache Hive: Employed to manage and query the data stored in HDFS. Hive provided a SQL-like interface to interact with the data.
Yarn (Yet Another Resource Negotiator): Managed the cluster resources for running Apache Spark jobs.
Jenkins: Implemented for continuous integration and continuous deployment (CI/CD) of the data processing pipelines.
Airflow: Used for orchestrating the ETL workflows and scheduling the data processing tasks (an illustrative DAG sketch follows this list).
Visual Studio Code (VSC): The primary development environment for writing and debugging the Python code.
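The article does not include the actual DAG definition; below is a minimal, hypothetical sketch of what an Airflow DAG orchestrating the Spark load job could look like. The dag_id, schedule, and application path are all assumptions for illustration:

from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Hypothetical DAG: submits the Spark CSV-to-Hive load job once a month
with DAG(
    dag_id="ukhpi_property_load",  # assumed name
    start_date=datetime(2024, 1, 1),
    schedule="@monthly",  # Airflow 2.4+ syntax; use schedule_interval on older versions
    catchup=False,
) as dag:
    load_hpi = SparkSubmitOperator(
        task_id="load_ukhpi_csv_to_hive",
        application="/python/genai/src/load_ukhpi.py",  # assumed path to the load script
        conn_id="spark_default",
    )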
Python Packages
Python: The main programming language used for data processing, analysis, and model implementation.
PySpark: A Python API for Apache Spark, used to perform data transformations and machine learning tasks on the Spark cluster.
PySpark ML: The machine learning library in PySpark, utilized for building and training the prediction models.
Pandas: A data manipulation and analysis library, used for handling and transforming data within the Python environment.
Matplotlib: A plotting library for creating static, animated, and interactive visualizations in Python. Used here to generate the prediction plots.
PyTorch: A deep learning library used for more advanced machine learning tasks and model development.
scikit-learn (sklearn): The machine learning library whose linear_model, ensemble, and preprocessing modules were used for the linear regression model, the gradient boosting model, and feature scaling respectively.
statsmodels: A library for estimating and testing statistical models. Utilized here for the ARIMA model implementation.
lmfit: A non-linear least-squares minimization and curve-fitting library, used for model fitting.
SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors): A model used for time series forecasting in the statsmodels library.
Disclaimer
Great care has been taken to make sure that the technical information presented in this article is accurate, but any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on its content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
The full code, plots, and instructions on how to run it are available at https://github.com/michTalebzadeh/property_market_trends. Please check the README file on GitHub.