Mastering Outliers: A Data-Driven Journey into Used Car Prices
Mahmuda Y.
Data Visualization Analyst | Data Scientist | Specializing in Machine Learning, NLP and LLM | Authorized to work in US and Canada
The Story Behind the Data
Have you ever faced the challenge of making sense of a dataset that didn’t quite add up? Recently, I embarked on a project to predict the prices of used cars listed on Craigslist. The dataset, sourced from Kaggle, gave me an opportunity to apply machine learning and data analysis while exploring concepts like outlier removal and measures of central tendency.
The objective was simple: build a robust machine learning model to predict car prices. However, as is often the case, the journey was anything but straightforward.
Initial Observations
When I first plotted the price data as a histogram, the results were shocking—prices ranged from $0 to a staggering $500 million!
Clearly, this was not a realistic representation of used car prices. My understanding of measures of central tendency immediately raised a red flag: extreme values, or outliers, were heavily skewing the data.
To address this, I decided to explore two outlier removal methods. Given the nature of car pricing, I leaned towards the interquartile range (IQR) method, which seemed more aligned with practical expectations.
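As a rough illustration of the IQR approach, here is a minimal Python sketch. It is not the code used in this project: the function name `iqr_bounds`, the conventional multiplier of 1.5, and the sample prices are all assumptions made for the example, so the fences it produces will not match the thresholds reported in this article.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional multiplier and is an assumption here.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up prices for illustration only, not taken from the Craigslist data.
prices = [0, 4500, 7900, 12000, 15500, 21000, 30000, 500_000_000]

lower, upper = iqr_bounds(prices)
trimmed = [p for p in prices if lower <= p <= upper]

print(f"fences: {lower:,.0f} to {upper:,.0f}")
print(f"kept {len(trimmed)} of {len(prices)} prices")
```

In this toy sample the lower fence is negative, so the $0 entry survives the IQR filter. Low-end anomalies like that may need a separate rule.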
Cleaning the Data
After trimming prices above $57,400, the histogram became more meaningful. However, another issue surfaced: an overwhelming number of listings (over 40,000) with a price of $0.
This anomaly, likely caused by data parsing errors or placeholder entries, posed a new challenge. I decided to look at the distribution again after removing the $0 entries.
Removing $0 entries reduced the dataset by 17%, but the resulting histogram revealed a much clearer picture of price distribution, paving the way for more reliable analysis.
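The zero-price filter can be sketched in a few lines of Python. This is an assumed shape for the step, not the project's actual code: the helper `drop_zero_prices`, the `price` key, and the sample listings are hypothetical, and the share removed in this toy sample has nothing to do with the figure reported above.

```python
def drop_zero_prices(rows, price_key="price"):
    """Keep rows with a positive price and report the share that was removed."""
    kept = [row for row in rows if row[price_key] > 0]
    removed_share = 1 - len(kept) / len(rows) if rows else 0.0
    return kept, removed_share

# Hypothetical listings for illustration only.
listings = [
    {"price": 0},
    {"price": 8500},
    {"price": 0},
    {"price": 13200},
    {"price": 4999},
    {"price": 21000},
]

kept, removed_share = drop_zero_prices(listings)
print(f"kept {len(kept)} listings, removed {removed_share:.0%}")
```

Reporting the removed share alongside the filtered rows makes it easy to check how much data a cleaning rule costs before committing to it.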
Insights from Central Tendency
Through these steps, I realized how profoundly outliers can impact central tendency measures. Using the mean, for instance, without accounting for extreme values, would have produced a misleading representation of the data. The median, however, proved more robust, offering a truer sense of centrality in this case.
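A small example makes the difference concrete. The prices below are invented for illustration and are not from the dataset. Adding one extreme value drags the mean into the tens of millions while the median barely moves.

```python
import statistics

# Invented prices for illustration only.
prices = [4500, 7900, 12000, 15500, 21000]
with_outlier = prices + [500_000_000]  # one extreme listing

mean_clean = statistics.mean(prices)
median_clean = statistics.median(prices)
mean_skewed = statistics.mean(with_outlier)
median_skewed = statistics.median(with_outlier)

print(f"without outlier: mean={mean_clean:,.0f}, median={median_clean:,.0f}")
print(f"with outlier:    mean={mean_skewed:,.0f}, median={median_skewed:,.0f}")
```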
Looking Ahead
This process reaffirmed an important lesson: Data cleaning is as much an art as it is a science. By trimming outliers and addressing anomalies, I’ve laid the groundwork for building a predictive model that reflects reality more accurately.
Next steps? I plan to investigate these anomalies further and use the excluded data as a test set after training the model.
Stay tuned for the next part of this series as I dive deeper into predicting used car prices!
You can find the details of this analysis here.
Conclusion:
Understanding and applying measures of central tendency isn’t just theoretical—it’s a critical skill for meaningful data analysis. This project highlighted how mean and median react to skewed data and demonstrated the importance of careful data preprocessing for actionable insights.
#DataScience #MachineLearning #OutlierRemoval #DataCleaning #Kaggle #DataAnalysis #UsedCars #Statistics