Mastering Outliers: A Data-Driven Journey into Used Car Prices

Mahmuda Y.

Data Visualization Analyst | Data Scientist | Specializing in Machine Learning, NLP and LLM | Authorized to work in US and Canada

Published: November 25, 2024

The Story Behind the Data

Have you ever faced the challenge of making sense of a dataset that didn’t quite add up? Recently, I embarked on a project to predict the prices of used cars listed on Craigslist. The dataset, sourced from Kaggle, gave me a chance to apply machine learning and data analysis while exploring concepts like outlier removal and measures of central tendency.

The objective was simple: build a robust machine learning model to predict car prices. However, as is often the case, the journey was anything but straightforward.

Initial Observations

When I first plotted the price data as a histogram, the results were shocking—prices ranged from $0 to a staggering $500 million!

Clearly, this was not a realistic representation of used car prices. My understanding of measures of central tendency immediately raised a red flag: extreme values, or outliers, were heavily skewing the data.

To address this, I decided to explore two outlier removal methods:

Interquartile Range (IQR): Flagged any price above $57,400 as an outlier.
Z-Score: Flagged values above $99,999,999 as outliers.

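Given the nature of car pricing, I leaned towards the IQR method. Its cutoff falls within the range of realistic used car prices, whereas the z-score cutoff depends on the mean and standard deviation, which are themselves inflated by the extreme values.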
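To make the comparison concrete, here is a minimal sketch of both cutoffs using only Python's standard library. The toy prices, the fence multiplier `k=1.5`, and the threshold `z=3.0` are assumptions chosen for illustration (they are the conventional defaults, not values confirmed by the original analysis), and the numbers will not match the Craigslist data.

```python
import statistics

def iqr_upper_fence(values, k=1.5):
    """Upper fence Q3 + k * IQR. k=1.5 is the conventional choice (assumed here)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

def zscore_upper_cutoff(values, z=3.0):
    """Price whose z-score equals z, i.e. mean + z * population standard deviation."""
    return statistics.fmean(values) + z * statistics.pstdev(values)

# Hypothetical toy prices: ordinary listings plus one absurd entry.
prices = [0, 4500, 7900, 12000, 15500, 18900, 23000, 31000, 42000, 500_000_000]

iqr_cut = iqr_upper_fence(prices)
z_cut = zscore_upper_cutoff(prices)

print(f"IQR upper fence:  {iqr_cut:,.2f}")
print(f"Z-score cutoff:   {z_cut:,.2f}")
print("Flagged by IQR:    ", [p for p in prices if p > iqr_cut])
print("Flagged by z-score:", [p for p in prices if p > z_cut])
```

In this toy sample the single extreme price pulls the mean and standard deviation up so far that the z-score cutoff lands above the extreme value itself, while the quartile-based fence stays close to ordinary prices. That is the same pattern as the two cutoffs reported above.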

Cleaning the Data

After trimming prices above $57,400, the histogram became more meaningful. However, another issue surfaced: an overwhelming number of listings (over 40,000) with a price of $0.

(2.) Histogram of used car prices after trimming to prices below $57,400

This anomaly, likely caused by data parsing errors or placeholder entries, posed a new challenge. I decided to look at the distribution again after removing the $0 entries.

(3.) Histogram of used car prices below $57,400, excluding $0 entries

Removing $0 entries reduced the dataset by 17%, but the resulting histogram revealed a much clearer picture of price distribution, paving the way for more reliable analysis.
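The two cleaning steps can be sketched as follows. This is an illustration under stated assumptions, not the code used in the project: `clean_prices` is a hypothetical helper, the price list is made up, and only the $57,400 cutoff comes from the analysis above.

```python
def clean_prices(prices, upper=57_400):
    """Return (trimmed, nonzero): prices at or below `upper`, then those that are also > 0."""
    trimmed = [p for p in prices if p <= upper]   # step 1: drop prices above the IQR cutoff
    nonzero = [p for p in trimmed if p > 0]       # step 2: drop $0 placeholder listings
    return trimmed, nonzero

# Hypothetical toy prices, not the Craigslist data.
prices = [0, 0, 3500, 8900, 12500, 19900, 27000, 45000, 61000, 500_000_000]

trimmed, nonzero = clean_prices(prices)
share_zero = (len(trimmed) - len(nonzero)) / len(trimmed)

print(f"Rows after trimming:       {len(trimmed)}")
print(f"Rows after removing $0:    {len(nonzero)}")
print(f"Share removed as $0 rows:  {share_zero:.0%}")
```

Keeping the two steps separate makes it easy to report how many rows each one removes, which is how a figure like the 17% reduction would be measured on the real data.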

Insights from Central Tendency

Through these steps, I realized how profoundly outliers can impact central tendency measures. Using the mean, for instance, without accounting for extreme values, would have produced a misleading representation of the data. The median, however, proved more robust, offering a truer sense of centrality in this case.
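A small example shows the effect. The prices below are hypothetical and chosen only to illustrate how one extreme value moves the mean far more than the median.

```python
import statistics

# Hypothetical toy prices: five ordinary listings plus one absurd entry.
prices_with_outlier = [4500, 7900, 12000, 15500, 18900, 500_000_000]
prices_without_outlier = prices_with_outlier[:-1]

mean_with = statistics.fmean(prices_with_outlier)
median_with = statistics.median(prices_with_outlier)
mean_without = statistics.fmean(prices_without_outlier)
median_without = statistics.median(prices_without_outlier)

print(f"With outlier:    mean = {mean_with:,.2f}, median = {median_with:,.2f}")
print(f"Without outlier: mean = {mean_without:,.2f}, median = {median_without:,.2f}")
```

With the extreme entry included, the mean jumps to more than $83 million, while the median only moves from $12,000 to $13,750.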

Looking Ahead

This process reaffirmed an important lesson: Data cleaning is as much an art as it is a science. By trimming outliers and addressing anomalies, I’ve laid the groundwork for building a predictive model that reflects reality more accurately.

Next steps? I plan to investigate these anomalies further and use the excluded data as a test set after training the model.

Stay tuned for the next part of this series as I dive deeper into predicting used car prices!

Key Visuals

The analysis includes the following visuals:

Initial Histogram - Highlighting the skewed data.
Post-IQR Trimming Histogram - Showing improvement after outlier removal.
Final Histogram (Excluding $0 Prices) - A clean and reliable distribution.

You can find details of this analysis here.

Conclusion:

Understanding and applying measures of central tendency isn’t just theoretical—it’s a critical skill for meaningful data analysis. This project highlighted how mean and median react to skewed data and demonstrated the importance of careful data preprocessing for actionable insights.

#DataScience #MachineLearning #OutlierRemoval #DataCleaning #Kaggle #DataAnalysis #UsedCars #Statistics

Elisabeth Membel

Business Data Analyst | Administrative Assistant @ Life Safety Consultants

3 months ago

This piqued my curiosity - I’ll be following to see what trends you discover! Also reading your NLP Restaurant Reviews capstone project- thanks for posting!

