
Classical ML with Classical Data: Linear Regression on Real Estate Price Prediction in Bengaluru

Rukshar A.

AI/ML/Data/Python Engineer, Data Scientist, JavaScript Developer

Published: March 21, 2024

Code repository: https://github.com/rukshar69/bengaluru-house-prices

For tabular data, classical ML methods like Linear Regression can work quite well. In this article, I apply this model to a classic problem: predicting property prices in Bengaluru (India). There are 2 steps:

Data preparation
Training

We get the data on Bengaluru house prices from Kaggle. The data contains about 13k rows and 9 columns about property prices. The columns are:

area_type, availability, location, size, society, total_sqft, bath, balcony, and price

The price is the target variable here.

Data Processing

The aim is to create a simplified version of the data for linear regression.

5 columns are kept: location, size, total_sqft, bath, and price.
Any row with NaN values is dropped.
The size column, which refers to the number of bedrooms, is used to construct a new column, bhk. The size column contains string values like 2 BHK. We take only the leading number and store it in the bhk column. The size column is then dropped.
Some values in total_sqft are ranges like 1133 - 1384, so the column is converted to hold only float values. Each range is replaced with the average of its two endpoints. Values with units, like 34.46Sq. Meter, are dropped to keep things simple.
A new feature, price_per_sqft, is created by dividing the price column by total_sqft.
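
The processing steps above can be sketched with pandas roughly as follows. This is a minimal sketch, not the code from the repository: the helper names sqft_to_float and prepare, and the tiny sample frame, are made up for illustration.

```python
import numpy as np
import pandas as pd


def sqft_to_float(value):
    """Convert a total_sqft entry to float.

    Ranges such as "1133 - 1384" become the average of the two endpoints;
    anything that cannot be parsed (e.g. "34.46Sq. Meter") becomes NaN.
    """
    parts = str(value).split("-")
    try:
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
        return float(parts[0])
    except ValueError:
        return np.nan


def prepare(df):
    """Hypothetical helper that applies the steps described in the text."""
    cols = ["location", "size", "total_sqft", "bath", "price"]
    out = df[cols].dropna().copy()
    # "2 BHK" -> 2: keep only the leading number of the size string
    out["bhk"] = out["size"].str.split().str[0].astype(int)
    out = out.drop(columns="size")
    # Ranges are averaged, unparseable values become NaN and are dropped
    out["total_sqft"] = out["total_sqft"].apply(sqft_to_float)
    out = out.dropna(subset=["total_sqft"])
    out["price_per_sqft"] = out["price"] / out["total_sqft"]
    return out.reset_index(drop=True)


# Made-up rows, only to exercise each case
sample = pd.DataFrame({
    "location": ["Hebbal", "Hebbal", "Whitefield", None],
    "size": ["2 BHK", "3 Bedroom", "4 BHK", "2 BHK"],
    "total_sqft": ["1000", "1133 - 1384", "34.46Sq. Meter", "900"],
    "bath": [2.0, 3.0, 4.0, 2.0],
    "price": [50.0, 100.0, 200.0, 45.0],
})
clean = prepare(sample)
print(clean)
```

If the price column is stored in a larger unit than the one wanted for price_per_sqft, a scale factor would be needed in the division; that detail is left out here.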

Dimensionality Reduction in the Location Column

There are 1287 unique locations in the location column. The distribution of location values is heavily skewed: many locations appear in only a handful of rows.

Since a large number of locations have very few datapoints, we apply a dimensionality reduction step to reduce the number of locations. Locations with fewer than 10 rows are tagged as other. This cuts the number of categories substantially, which means fewer dummy columns when we apply one-hot encoding. The number of unique locations is now 241.
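
One possible way to do this grouping with pandas is shown below. It is a sketch under the assumptions that the threshold is exactly "fewer than 10 rows" and that the label is the string other; the function name group_rare_locations is hypothetical.

```python
import pandas as pd


def group_rare_locations(df, min_rows=10, other_label="other"):
    """Replace locations that occur in fewer than min_rows rows with other_label."""
    out = df.copy()
    counts = out["location"].value_counts()
    rare = counts[counts < min_rows].index
    out.loc[out["location"].isin(rare), "location"] = other_label
    return out


# Made-up data: one frequent location and two rare ones
demo = pd.DataFrame({"location": ["Hebbal"] * 12 + ["Rare A"] * 3 + ["Rare B"]})
grouped = group_rare_locations(demo)
print(grouped["location"].value_counts())
```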

Outlier Removal

We take 300 sqft as the typical floor area per bedroom and remove properties with less than 300 sqft per bhk. About 12.5k rows remain.
The price per sqft data reveals a significant price disparity, ranging from a minimum of 267 rupees to a maximum of around 175,000 rupees. To address this variation, we identify and remove outliers within each location using the mean and standard deviation. We keep a property only if its price per sqft is within 1 standard deviation of the mean for its location. About 10.2k rows remain.
Next, we apply another condition: within the same location, the price per sqft of an n BHK apartment should not be lower than the mean price per sqft of (n-1) BHK apartments. Datapoints that fail this condition are treated as outliers. To check it, for each location we build a dictionary of price_per_sqft statistics (mean, std, etc.) per bhk value.
We then remove the n BHK apartments whose price_per_sqft is less than the mean price_per_sqft of the (n-1) BHK apartments in the same location. About 7.3k rows remain.
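
The three filters described so far could be written along these lines. This is an illustrative sketch rather than the repository code: the function names are hypothetical, a groupby and merge stand in for the per-location dictionary of stats, and the standard deviation is the pandas default (sample standard deviation), which may differ slightly from what the original code uses.

```python
import pandas as pd


def drop_small_bhk(df, min_sqft_per_bhk=300):
    """Remove rows with less than min_sqft_per_bhk square feet per bedroom."""
    return df[df["total_sqft"] / df["bhk"] >= min_sqft_per_bhk]


def drop_pps_outliers(df):
    """Keep rows whose price_per_sqft is within 1 std of their location mean.

    Locations with a single row have an undefined sample std (NaN),
    so their rows are dropped by this comparison.
    """
    grp = df.groupby("location")["price_per_sqft"]
    mean = grp.transform("mean")
    std = grp.transform("std")
    return df[(df["price_per_sqft"] - mean).abs() <= std]


def drop_bhk_outliers(df):
    """Remove n BHK rows priced below the (n-1) BHK mean in the same location."""
    stats = (
        df.groupby(["location", "bhk"])["price_per_sqft"]
        .mean()
        .rename("prev_mean")
        .reset_index()
    )
    # Shift the key so the (n-1) BHK mean lines up with n BHK rows
    stats["bhk"] = stats["bhk"] + 1
    merged = df.merge(stats, on=["location", "bhk"], how="left")
    # Rows with no (n-1) BHK group in their location are kept as-is
    keep = merged["prev_mean"].isna() | (merged["price_per_sqft"] >= merged["prev_mean"])
    return merged[keep].drop(columns="prev_mean")


# Made-up example: the 2 BHK mean is 110, so the 3 BHK row at 90 is removed
demo = pd.DataFrame({
    "location": ["Hebbal"] * 5,
    "bhk": [2, 2, 2, 3, 3],
    "total_sqft": [1000.0, 1100.0, 1200.0, 1500.0, 1600.0],
    "price_per_sqft": [100.0, 110.0, 120.0, 90.0, 130.0],
})
filtered = drop_bhk_outliers(demo)
print(filtered)
```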

For Rajaji Nagar and Hebbal, for example, some of the 3 BHK properties whose per sqft price was below the mean per sqft price of 2 BHK properties have been removed.
Finally, we require that an apartment with n bhk has no more than n+2 bathrooms. An apartment with more than n+2 baths for n bedrooms is most likely unusual or a data error. Such apartments are considered outliers and are removed from the dataset.
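
The bathroom rule is a one-line filter. A minimal sketch, with a hypothetical function name and made-up rows:

```python
import pandas as pd


def drop_extra_bath(df, max_extra=2):
    """Keep rows where bath is no more than bhk + max_extra."""
    return df[df["bath"] <= df["bhk"] + max_extra]


demo = pd.DataFrame({"bhk": [2, 2, 3], "bath": [2.0, 5.0, 5.0]})
print(drop_extra_bath(demo))  # the 2 bhk / 5 bath row is removed
```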

After such thorough cleaning, we move on to training a Linear Regression model using this clean and slimmed-down dataset.

Training

There are 4 features: location, total_sqft, bath, and bhk. The price column is the target variable.
One-hot encoding is performed for location, a categorical string feature. Each location gets a new binary column whose value is 1 if the datapoint belongs to that location and 0 otherwise. We drop the column for the other category, since a datapoint in that category is already identified by having 0 in every remaining location column.
The test set is 20% of all datapoints.
The coefficient of determination (R^2) of the trained model is about 0.86 (86%) on the test set, which is pretty good for a simplified dataset.
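
The training steps can be sketched with pandas and scikit-learn as below. The data here is synthetic and generated only so the example runs on its own, so the printed score says nothing about the real dataset. The helper name build_features and the random_state value are assumptions, not taken from the repository.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def build_features(df):
    """One-hot encode location, drop the 'other' column, and split off the target."""
    dummies = pd.get_dummies(df["location"], dtype=int)
    if "other" in dummies.columns:
        # Rows in 'other' are encoded as all zeros in the remaining columns
        dummies = dummies.drop(columns="other")
    X = pd.concat([df[["total_sqft", "bath", "bhk"]], dummies], axis=1)
    y = df["price"]
    return X, y


# Synthetic stand-in for the cleaned dataset
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "location": rng.choice(["Hebbal", "Rajaji Nagar", "other"], size=n),
    "total_sqft": rng.uniform(600, 3000, size=n),
    "bhk": rng.integers(1, 5, size=n),
})
demo["bath"] = demo["bhk"] + rng.integers(0, 2, size=n)
demo["price"] = 0.05 * demo["total_sqft"] + 5 * demo["bhk"] + rng.normal(0, 5, size=n)

X, y = build_features(demo)
# Hold out 20% of the datapoints as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination on the test set
print(f"R^2 on the synthetic test set: {r2:.3f}")
```

For a regressor, model.score returns R^2, the same metric reported in the article.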
