
Exploratory Data Analysis - Airbnb Bookings Dataset

Pradeepchandra Reddy S C

Data Scientist @ Elfonze Technologies | IESA | 5 ? SQL | Appeared for Civil Services Examination | Goal - is to decrease Elderly poverty in India

Published: February 19, 2023


About Airbnb -

Airbnb, Inc., based in San Francisco, California, operates an online marketplace focused on short-term homestays and experiences. The company acts as a broker and charges a commission from each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk, and Joe Gebbia.

Business Context

Since 2008, guests and hosts have used Airbnb to expand travelling possibilities and to find a more unique, personalized way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognized around the world. Analysis of the millions of listings provided through Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers’ and hosts’ behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.

This dataset has 48,895 observations and 16 columns, with a mix of categorical and numeric values. Let’s explore and analyze the data to discover key insights.

Project Architecture

[Figure: Project Architecture]

Step 1: Basic Dataset Understanding

A quick look at the shape of the DataFrame confirms 48,895 observations and 16 columns.

print(f'This dataset has {booking_df.shape} rows and columns respectively.')

This dataset has (48895, 16) rows and columns respectively.

The info( ) method prints information about the DataFrame.

The information contains the number of columns, column labels, column data types, memory usage, range index, and the number of cells in each column (non-null values).

On basic inspection, each property name belongs to one host, but a single host can have multiple properties in an area. So host_name is a categorical variable here. The columns neighbourhood_group (Manhattan, Brooklyn, Queens, Bronx, Staten Island), neighbourhood and room_type (Private room, Shared room, Entire home/apt) also fall into this category.

Checking for null values, their percentage, and representing them visually -

As we can see from above, the columns ‘name’, ‘host_name’, ‘last_review’ and ‘reviews_per_month’ have missing values.
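The null check described here can be sketched as follows. This is a minimal illustration on a small made-up DataFrame (sample_df and its values are hypothetical, not the real Airbnb data); it counts missing values per column and expresses them as a percentage of all rows.

```python
import pandas as pd

# Hypothetical stand-in for the Airbnb DataFrame (not the real data)
sample_df = pd.DataFrame({
    "name": ["Cozy room", None, "Loft", "Studio"],
    "host_name": ["Ann", "Bob", None, "Dee"],
    "price": [100, 150, 200, 80],
    "reviews_per_month": [0.5, None, None, 1.2],
})

# Number of missing values in each column
null_counts = sample_df.isnull().sum()

# Missing values as a percentage of all rows
null_percent = (null_counts / len(sample_df) * 100).round(2)

null_summary = pd.DataFrame({"null_count": null_counts, "null_percent": null_percent})
print(null_summary)
```

The same two lines (isnull().sum() and the division by the row count) should work unchanged on the full DataFrame, and the resulting table can be passed to any bar chart for the visual check.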

The describe( ) method returns a description of the data in the DataFrame. If the DataFrame contains numerical data, the description contains the following information for each column:

count: The number of non-empty values.

mean: The average (mean) value.

std: The standard deviation.
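As a small illustration of these three statistics, here is a sketch that runs describe( ) on a made-up price column (the prices DataFrame and its values are hypothetical).

```python
import pandas as pd

# Hypothetical prices, for illustration only
prices = pd.DataFrame({"price": [100, 150, 200, 80]})

# describe() returns one row per statistic and one column per numeric column
summary = prices.describe()

# Keep only the three statistics discussed above
print(summary.loc[["count", "mean", "std"]])
```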

Cleaning the Null Values -

We can either drop the null values or fill them, depending on the requirement. I’m going to drop the ‘id’ column, as it’s not needed for this project, and fill the missing values in the ‘name’, ‘host_name’, ‘last_review’ and ‘reviews_per_month’ columns.


# booking_df1 is the working copy of the data (assumed to be created from booking_df earlier)
booking_df1 = booking_df1.drop(columns='id')
# Fill each column with a suitable placeholder in a single call
booking_df1 = booking_df1.fillna({'name': 'No_Name', 'host_name': 'No_Name', 'last_review': 'Not_Reviewed', 'reviews_per_month': 0})

As part of this EDA project, I analyzed the data by asking questions, which is a really important step.

Step 2: Asking Questions

Question 1 - Which ‘neighbourhood_group’ has the highest number of Airbnbs?

As we can see from the bar chart above, Manhattan has the highest number of Airbnbs.

Manhattan
Brooklyn
Queens
Bronx
Staten Island

Manhattan and Brooklyn together have more than 75% of the Airbnbs.


# Share of each neighbourhood group as a percentage of all listings
neighbourhood / (neighbourhood['No of AirBnbs'].sum() / 100)

More precisely, about 85.41% (roughly 85%) of the Airbnbs are in Manhattan and Brooklyn.
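The percentage share can also be computed directly from the listing rows. The sketch below uses a tiny made-up DataFrame (listings and its ten rows are hypothetical) and value_counts(normalize=True), which is one possible alternative to dividing a count table by its total.

```python
import pandas as pd

# Hypothetical listings, one row per listing (not the real data)
listings = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 5 + ["Brooklyn"] * 3 + ["Queens"] * 2
})

# Share of listings per neighbourhood group, in percent
share = listings["neighbourhood_group"].value_counts(normalize=True).mul(100).round(2)

# Combined share of the two largest groups
top_two_share = share.nlargest(2).sum()

print(share)
print(top_two_share)
```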

Question 2 - Which types of properties are there in each neighborhood?

1. Brooklyn has the highest number of Private Rooms.

2. Manhattan has the highest number of Entire Home/Apartment listings.

3. Manhattan has the highest number of Shared Rooms.
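One way to arrive at a breakdown like the one above is a cross-tabulation of neighbourhood group against room type. This is only a sketch on made-up rows (the listings DataFrame is hypothetical and its counts do not reflect the real dataset).

```python
import pandas as pd

# Hypothetical listings (not the real data)
listings = pd.DataFrame({
    "neighbourhood_group": [
        "Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn",
        "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan",
        "Queens",
    ],
    "room_type": [
        "Private room", "Private room", "Private room", "Entire home/apt",
        "Entire home/apt", "Entire home/apt", "Entire home/apt", "Private room",
        "Shared room", "Shared room",
        "Shared room",
    ],
})

# Rows: neighbourhood groups, columns: room types, values: number of listings
room_counts = pd.crosstab(listings["neighbourhood_group"], listings["room_type"])

# For each room type, the neighbourhood group with the most listings
top_group_per_room_type = room_counts.idxmax()

print(room_counts)
print(top_group_per_room_type)
```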

Question 3 - Which properties are the busiest in terms of number of bookings?


# Count how often each property name appears within each neighbourhood group,
# used here as the measure of bookings
highest_bookings = (
    booking_df.groupby(['neighbourhood_group', 'name'])
    .size()
    .reset_index(name='Most_Bookings')
    .sort_values(by='Most_Bookings', ascending=False)
)

top_ten_highest_bookings = highest_bookings.head(10)

top_ten_highest_bookings

As we can see from above, the Hillside Hotel property in Queens had the highest number of bookings, followed by Brooklyn Apartment in Brooklyn and Loft Suite in Brooklyn.

Question 4 - Which hosts are the busiest in terms of number of bookings?


# Count rows per (neighbourhood_group, host_name) pair and keep the top 10.
# Naming the count column in reset_index works regardless of the pandas version.
host = booking_df1[['neighbourhood_group','host_name']].value_counts().reset_index(name='Most_Bookings').head(10)
host

As we can see from above, Sonder (NYC) in Manhattan had the highest number of bookings, followed by Blueground in Manhattan and Michael in Manhattan.

Question 5 - If I choose to live in Brooklyn for 20 days, will it be cheaper to stay there than in the other neighborhoods?

I used Plotly Express to plot the bar chart because it is more interactive.


import plotly.express as px

# Compare four neighbourhood groups (Staten Island is not included in this filter)
brook_df = booking_df1.loc[booking_df1['neighbourhood_group'].isin(['Brooklyn', 'Manhattan', 'Queens', 'Bronx'])]

# Average only the price column; mean() over all columns fails on text columns in recent pandas
avg_price_df = brook_df.groupby('neighbourhood_group', as_index=False)['price'].mean()

fig = px.bar(x = 'neighbourhood_group',y = 'price', data_frame = avg_price_df, text='neighbourhood_group',
             color = 'neighbourhood_group',opacity = .8)
fig.update_traces(textfont=dict(size=15,color='White'))
fig.update_layout(title='Neighborhood based on the property prices',yaxis=dict(showgrid=False,showticklabels=True),autosize=False,width=800,height=500)
fig.show()

From the above bar chart we can conclude:

1. Brooklyn is neither the cheapest nor the costliest neighborhood to stay in.

2. Manhattan is the costliest place to stay.

3. Bronx is the cheapest neighborhood to stay in.

Question 6 - Since the Bronx is the cheapest place to stay, which room type and which area (neighborhood) are the best choice at an affordable cost?

From the above bar chart we can conclude:

1. Riverdale is the most expensive area to stay in for both Private and Shared Rooms.

2. City Island is the most expensive for Entire Home/Apt.

3. The most affordable Private room is available in Van Nest.

4. The most affordable Shared rooms are available in Morris Heights, Pelham Gardens, Schuylerville and Van Nest.

5. The most affordable Entire Home/Apt is available in Woodlawn.
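A comparison like the one above can be sketched with a pivot table of average price by neighborhood and room type. The rows below are made up for illustration (the bronx DataFrame and its prices are hypothetical, only the neighborhood names come from the article), so the numbers are not the real averages.

```python
import pandas as pd

# Hypothetical Bronx listings with invented prices (not the real data)
bronx = pd.DataFrame({
    "neighbourhood": ["Riverdale", "Riverdale", "Van Nest", "Van Nest", "City Island", "Woodlawn"],
    "room_type": ["Private room", "Private room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt"],
    "price": [120, 100, 40, 50, 200, 70],
})

# Average price per neighbourhood (rows) and room type (columns)
avg_price = bronx.pivot_table(index="neighbourhood", columns="room_type",
                              values="price", aggfunc="mean")

# Cheapest and most expensive neighbourhood for each room type (missing values are skipped)
cheapest = avg_price.idxmin()
most_expensive = avg_price.idxmax()

print(avg_price)
print(cheapest)
print(most_expensive)
```

The same pivot can be built for Manhattan or any other neighbourhood group by filtering the DataFrame first.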


Question 7 - If I choose to stay in Manhattan, which room type and which area (neighborhood) are the best choice at an affordable cost?

From the above bar chart we can conclude:

1. Tribeca is the most expensive area to stay in for Entire Home/Apt.

2. Midtown is the most expensive area to stay in for Private Room.

3. Financial District is the most expensive area to stay in for Shared room.

4. The most affordable Private room is available in Washington Heights.

5. The most affordable Shared room is available on Roosevelt Island.

6. The most affordable Entire Home/Apt is available in Marble Hill.

Question 8 - Assume I stayed in Manhattan for 20 days and had a remaining balance of only 5000 $. I then decided to stay another 20 days in Queens. Is this amount sufficient for room expenses alone?


queens_df = booking_df1.loc[booking_df1['neighbourhood_group'].isin(['Queens'])]
price = queens_df['price'].mean()*20
price = round(price, 2)

if price <= 5000:
    print(f'The average amount to stay in Queens for 20 days is {price} $. We think your amount is more than Sufficient to stay there.')
else:
    print("The amount exceeds your budget plan for room in Queens.")

The average amount to stay in Queens for 20 days is 1990.35 $. We think your amount is more than Sufficient to stay there.


Question 9 - Since the money is sufficient to stay for 20 days, which type of luxury property can you select to stay in at Queens?

Entire home/Apt is considered the luxury option in Queens, so staying there for 20 days is both affordable and luxurious at an average price of 147 $.
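The reasoning behind this answer can be sketched as: average the price per room type, multiply by 20, and keep the room types that fit the budget. The example below assumes the price column is a per-night price and uses invented rows (the queens DataFrame, its prices and the budget variable are hypothetical stand-ins).

```python
import pandas as pd

# Hypothetical Queens listings with invented prices (not the real data)
queens = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Private room", "Shared room"],
    "price": [150, 140, 70, 60, 40],
})

budget = 5000   # remaining balance from the question
nights = 20     # planned length of stay

# Average price per room type, scaled to the whole stay (assumes price is per night)
stay_cost = queens.groupby("room_type")["price"].mean() * nights

# Room types that fit the budget, and the most expensive one among them
affordable = stay_cost[stay_cost <= budget]
most_luxurious_affordable = affordable.idxmax()

print(stay_cost)
print(most_luxurious_affordable)
```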

Question 10 - Most visited properties based on number of reviews


# Total number of reviews per property name, top 10
(
    booking_df1.groupby('name')['number_of_reviews']
    .sum()
    .sort_values(ascending=False)
    .head(10)
    .reset_index()
    .rename(columns={'name': 'Property Names'})
)

The top 3 properties visited by people, based on number of reviews, are:

1. Private Bedroom in Manhattan

2. Room near JFK Queen Bed

3. Beautiful Bedroom in Manhattan

Scenario - I’m thinking like a data scientist working at a property development giant such as Brigade Group or Godrej Properties, and they gave me these questions:

Question 11 - Pradeep, can you check who has the potential to open an Airbnb franchise in Queens in the coming days, using the number of reviews as the metric?

import pandas as pd

# Convert last_review to datetime; the text placeholder for missing reviews becomes NaT
property_df = booking_df1.copy()
property_df['last_review'] = pd.to_datetime(property_df['last_review'], errors='coerce')

# Approximate age of the listing in years (listings with no reviews give NaN)
property_df['duration'] = round((property_df['number_of_reviews']/property_df['reviews_per_month']) / 12)

# Estimated first year of activity; this is already a year value, so no further conversion is needed
property_df['possible_year_of_start'] = property_df['last_review'].dt.year - property_df['duration']

# 'number_of_property' is not created in the code shown here and is assumed to exist already
busiest_host_queens = (
    property_df.loc[property_df.neighbourhood_group=='Queens',
                    ['host_id','host_name','number_of_property','reviews_per_month','availability_365','price']]
    .groupby(['host_id','host_name'])
    .agg({'reviews_per_month':'sum','number_of_property':'count','availability_365':'median','price':'min'})
    .sort_values(by=['reviews_per_month','number_of_property','availability_365'],ascending=[False,False,False])
    .reset_index()[:20]
)

busiest_host_queens

# Top 10 reviewed in Queens

avg_review_queens = avg_review_df.sum().sort_values(ascending=False)[:10].reset_index().rename(columns = {'index' : 'Properties Names', 0 : 'Average Reviews'})
avg_review_queens

We observe that the top reviewed neighborhoods are:

Astoria with average reviews of 901.27

East Elmhurst with average reviews of 824.73

Flushing with average reviews of 819.85

Long Island City with average reviews of 615.86

Jamaica with average reviews of 604.56

Ridgewood with average reviews of 430.54

Sunnyside with average reviews of 413.95

Ditmars Steinway with average reviews of 372.97

Springfield Gardens with average reviews of 356.42

Elmhurst with average reviews of 318.89

So, the hosts with property listings in these neighborhoods of Queens are more likely to open a new franchise in the near future.

Question 12 - Pradeep, can you check who has the potential to open an Airbnb franchise in Manhattan in the coming days, using the number of reviews as the metric?


# Same aggregation as for Queens, restricted to Manhattan
busiest_host_manhattan = (
    property_df.loc[property_df.neighbourhood_group=='Manhattan',
                    ['host_id','host_name','reviews_per_month','number_of_property','availability_365','price']]
    .groupby(['host_id','host_name'])
    .agg({'reviews_per_month':'sum','number_of_property':'count','availability_365':'median','price':'min'})
    .sort_values(by=['reviews_per_month','number_of_property','availability_365'],ascending=[False,False,False])
    .reset_index()[:20]
)

busiest_host_manhattan

avg_review_manhattan = avg_review_df.sum().sort_values(ascending=False)[:10].reset_index().rename(columns = {'index' : 'Properties Names', 0 : 'Average Reviews'})

avg_review_manhattan

We observe that the top reviewed neighborhoods are:

Harlem with average reviews of 2956.23

Hell’s Kitchen with average reviews of 2818.79

East Village with average reviews of 1668.17

East Harlem with average reviews of 1579.06

Upper East Side with average reviews of 1523.66

Upper West Side with average reviews of 1487.27

Midtown with average reviews of 1264.09

Chelsea with average reviews of 1038.89

Lower East Side with average reviews of 919.93

Washington Heights with average reviews of 876.70

So, the hosts with property listings in these neighborhoods of Manhattan are more likely to open a new franchise in the near future.

Question 13 - Plot the Airbnb Spatial Data on New York City Map

The above Map shows all the Airbnb Properties of the dataset in New York City.

Question 14 - Plot the 50 Busiest Airbnb properties on the New York City Map

The above map shows the 50 busiest Airbnb properties in New York City.

Question 15 - Plot the top 1000 Airbnb properties that are most visited by people in New York City Map


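# Sum the reviews per property name, but average the coordinates:
# summing latitude/longitude would give invalid positions for repeated names
most_visited_map = (
    booking_df1.groupby('name')
    .agg({'number_of_reviews': 'sum', 'latitude': 'mean', 'longitude': 'mean'})
    .sort_values(by='number_of_reviews', ascending=False)[:1000]
)

# 'name' is the index after the groupby, so rename the index rather than a column
most_visited_map = most_visited_map.rename_axis('Property Names')

most_visited_map.head()

The above map shows the top 1000 Airbnb properties that are most visited by people in New York City.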

Through this exploratory data analysis and visualization, we gained several interesting insights into the Airbnb rental market. This 2019 Airbnb dataset turned out to be very rich, with a variety of columns that allowed deep exploration of each significant column. We worked through questions and scenarios such as which ‘neighbourhood_group’ has the highest number of Airbnbs, which areas are more popular than others, how prices vary, and how availability differs by room type. We also highlighted key findings on which hosts have the most bookings and reviews, which neighborhoods to stay in when low on cash, and how to stay in luxury properties within that budget. Finally, we made good use of the latitude and longitude columns to create a geographical heatmap color-coded by the price of listings.

I used Seaborn, Matplotlib and Plotly Express to create all the visualizations. This is just a glimpse of EDA on the Airbnb dataset, and no predictions are involved. Here’s my GitHub for the full code reference.

https://github.com/soopertramp

If you like this, give some love to this post/article, and please do connect with me on LinkedIn.

https://www.dhirubhai.net/in/pradeepchandra-reddy-s-c/

Thanks a lot for reading. Feel free to give any feedback!
