Mastering Outliers: A Data-Driven Journey into Used Car Prices

The Story Behind the Data

Have you ever faced the challenge of making sense of a dataset that didn’t quite add up? Recently, I embarked on a project to predict the prices of used cars listed on Craigslist. The dataset, sourced from Kaggle, gave me a chance to apply machine learning and data analysis while exploring concepts like outlier removal and measures of central tendency.

The objective was simple: build a robust machine learning model to predict car prices. However, as is often the case, the journey was anything but straightforward.


Initial Observations

When I first plotted the price data as a histogram, the results were shocking—prices ranged from $0 to a staggering $500 million!

(1.) Histogram of Used Cars Price
(1.) Boxplot of Used Cars Price

Clearly, this was not a realistic representation of used car prices. My understanding of measures of central tendency immediately raised a red flag: extreme values, or outliers, were heavily skewing the data.

To address this, I decided to explore two outlier removal methods:

  1. Interquartile Range (IQR): flagged any price above $57,400 as an outlier.
  2. Z-Score: flagged only prices above $99,999,999 as outliers.

Given the nature of car pricing, I leaned towards the IQR method, which seemed more aligned with practical expectations.
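To illustrate how the two cutoffs can end up so far apart, here is a minimal sketch using only Python's standard library. The toy prices, the 1.5 x IQR multiplier, and the z-score threshold of 3 are assumptions for illustration, not the values or code used in the original analysis.

```python
import statistics

def iqr_upper_fence(values, k=1.5):
    """Upper Tukey fence: Q3 + k * IQR (k=1.5 is a common convention, assumed here)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

def zscore_upper_cutoff(values, z=3.0):
    """Price whose z-score equals z: mean + z * standard deviation (z=3 is assumed)."""
    return statistics.mean(values) + z * statistics.pstdev(values)

# Hypothetical toy prices, not taken from the Craigslist dataset.
prices = [0, 4500, 7900, 12000, 15500, 18900, 24000, 31000, 500_000_000]

iqr_cut = iqr_upper_fence(prices)
z_cut = zscore_upper_cutoff(prices)

print(f"IQR upper fence: {iqr_cut:,.0f}")
print(f"Z-score cutoff:  {z_cut:,.0f}")
```

The IQR fence depends only on the quartiles, so a single extreme listing barely moves it. The z-score cutoff is built from the mean and standard deviation, both of which are inflated by that same listing, so the cutoff lands far above any realistic car price.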


Cleaning the Data

After trimming prices above $57,400, the histogram became more meaningful. However, another issue surfaced: an overwhelming number of listings (over 40,000) with a price of $0.

(2.) Histogram of Used Car Prices after Trimming to Price < $57,400

This anomaly, likely caused by data parsing errors or placeholder entries, posed a new challenge. I decided to look at the distribution again after removing the $0 entries.

(3.) Histogram of Used Car Prices < $57,400 and Not Equal to $0

Removing $0 entries reduced the dataset by 17%, but the resulting histogram revealed a much clearer picture of price distribution, paving the way for more reliable analysis.
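The two cleaning steps described above can be sketched as a pair of filters. This is a minimal standard-library illustration; the function name `clean_prices` and the toy prices are hypothetical, and only the $57,400 cap comes from the analysis itself.

```python
def clean_prices(prices, upper=57_400):
    """Return (trimmed, cleaned): prices capped at `upper`, then with $0 entries dropped."""
    trimmed = [p for p in prices if p <= upper]   # step 1: drop prices above the IQR cutoff
    cleaned = [p for p in trimmed if p != 0]      # step 2: drop $0 placeholder listings
    return trimmed, cleaned

# Hypothetical toy prices, not taken from the Craigslist dataset.
prices = [0, 0, 4500, 7900, 12000, 15500, 18900, 24000, 31000, 60000, 500_000_000]

trimmed, cleaned = clean_prices(prices)
zero_count = len(trimmed) - len(cleaned)

print(f"rows after trimming:      {len(trimmed)}")
print(f"rows after removing $0:   {len(cleaned)}")
print(f"$0 rows removed:          {zero_count}")
```

Keeping the two filters separate makes it easy to report how many rows each step removes, and to set the excluded rows aside rather than discard them.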


Insights from Central Tendency

Through these steps, I realized how profoundly outliers can impact central tendency measures. Using the mean, for instance, without accounting for extreme values, would have produced a misleading representation of the data. The median, however, proved more robust, offering a truer sense of centrality in this case.
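A small example makes the difference concrete. The sketch below uses made-up prices (not the Craigslist data) and adds a single extreme listing to show how each measure reacts.

```python
import statistics

# Hypothetical toy prices, not taken from the Craigslist dataset.
prices = [4500, 7900, 12000, 15500, 18900, 24000, 31000]
with_outlier = prices + [500_000_000]  # one extreme listing added

mean_clean = statistics.mean(prices)
median_clean = statistics.median(prices)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(f"mean without outlier:   {mean_clean:,.0f}")
print(f"median without outlier: {median_clean:,.0f}")
print(f"mean with outlier:      {mean_outlier:,.0f}")
print(f"median with outlier:    {median_outlier:,.0f}")
```

With one extreme value added, the mean jumps from roughly $16,000 to over $62 million, while the median only moves from $15,500 to $17,200.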


Looking Ahead

This process reaffirmed an important lesson: Data cleaning is as much an art as it is a science. By trimming outliers and addressing anomalies, I’ve laid the groundwork for building a predictive model that reflects reality more accurately.

Next steps? I plan to investigate these anomalies further and use the excluded data as a test set after training the model.

Stay tuned for the next part of this series as I dive deeper into predicting used car prices!


Key Visuals

  1. Initial Histogram - Highlighting the skewed data.
  2. Post-IQR Trimming Histogram - Showing improvement after outlier removal.
  3. Final Histogram (Excluding $0 Prices) - A clean and reliable distribution.

You can find the details of this analysis here.


Conclusion:

Understanding and applying measures of central tendency isn’t just theoretical—it’s a critical skill for meaningful data analysis. This project highlighted how mean and median react to skewed data and demonstrated the importance of careful data preprocessing for actionable insights.


#DataScience #MachineLearning #OutlierRemoval #DataCleaning #Kaggle #DataAnalysis #UsedCars #Statistics

