Exploratory Data Analysis - Airbnb Bookings Dataset

About Airbnb -

Airbnb, Inc., based in San Francisco, California, operates an online marketplace focused on short-term homestays and experiences. The company acts as a broker and charges a commission from each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk, and Joe Gebbia.


Business Context

Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to experience the world in a more unique, personalized way. Today, Airbnb is a one-of-a-kind service that is used and recognized around the world. Analyzing the millions of listings on the platform is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding the behavior and performance of customers and providers (hosts) on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.

This dataset has 48,895 observations and 16 columns, with a mix of categorical and numeric values. Let’s explore and analyze the data to discover key insights.


Project Architecture




[Figure: Project Architecture]

Step 1: Basic Dataset Understanding -

This dataset has 48,895 observations and 16 columns, with a mix of categorical and numeric values.

print(f'This dataset has {booking_df.shape} rows and columns respectively.')

This dataset has (48895, 16) rows and columns respectively.        

The info() method prints a summary of the DataFrame.

The summary includes the number of columns, the column labels, the column data types, the memory usage, the range index, and the number of non-null values in each column.

[Figure: info() output]

By basic inspection, each property name belongs to a single host, but one host can have multiple properties in an area. So host_name is a categorical variable here. The columns neighbourhood_group (Manhattan, Brooklyn, Queens, Bronx, Staten Island), neighbourhood, and room_type (Private room, Shared room, Entire home/apt) are categorical as well.

Checking for null values and their percentages, and representing them visually -

[Figure: Null Values and Their Percentage]

[Figure: Null Values Representation]

As we can see above, the columns ‘name’, ‘host_name’, ‘last_review’, and ‘reviews_per_month’ have missing values.
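The null-value table above appears only as an image, so here is a minimal sketch of how such a table could be produced. The toy DataFrame and the names `toy_df` and `null_summary` are assumptions for illustration, not the code used in the project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bookings data (hypothetical values, not the real dataset)
toy_df = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Quiet studio"],
    "host_name": ["Ana", "Ben", None, "Cy"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
    "price": [120, 80, 95, 150],
})

# Count missing values per column and express them as a share of all rows
null_count = toy_df.isnull().sum()
null_percent = (null_count / len(toy_df) * 100).round(2)

null_summary = pd.DataFrame({"null_count": null_count, "null_percent": null_percent})

# Keep only the columns that actually contain missing values
null_summary = null_summary[null_summary["null_count"] > 0]
print(null_summary)
```

On the real data, the same two lines (`isnull().sum()` and the division by `len(...)`) would be applied to the bookings DataFrame instead of the toy one.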


The describe() method returns a statistical description of the data in the DataFrame. For numerical data, the description includes, among others, the following for each column:

count - The number of non-empty values.

mean - The average (mean) value.

std - The standard deviation.

[Figure: describe() output]

Cleaning the Null Values -

We can either drop the null values or fill them, depending on the requirement. I’m going to drop the ‘id’ column, as it’s not needed for this project, and fill the missing values in the ‘name’, ‘host_name’, ‘last_review’, and ‘reviews_per_month’ columns.


# Assumes booking_df1 is a working copy of booking_df, e.g. booking_df1 = booking_df.copy()
# 'id' is not needed for this analysis
booking_df1.drop('id', axis=1, inplace=True)
# Fill each column that contains missing values with a placeholder
booking_df1.fillna({'name': 'No_Name'}, inplace=True)
booking_df1.fillna({'host_name': 'No_Name'}, inplace=True)
booking_df1.fillna({'last_review': 'Not_Reviewed'}, inplace=True)
booking_df1.fillna({'reviews_per_month': 0}, inplace=True)

[Figure: After cleaning the Data]

As part of this EDA project, I analyzed the data by asking questions, which is a really important step.

Step 2: Asking Questions.

Question 1 - Which ‘neighbourhood_group’ has the highest number of Airbnbs?

[Figure: The highest number of Airbnbs]

As we can see from the bar chart above, Manhattan has the highest number of Airbnbs. The ranking is:

  1. Manhattan
  2. Brooklyn
  3. Queens
  4. Bronx
  5. Staten Island

Together, Manhattan and Brooklyn have more than 75% of the Airbnbs.


# Convert the listing counts to percentages of the total
neighbourhood / (neighbourhood['No of AirBnbs'].sum() / 100)

[Figure: Output of the percentage calculation]

More precisely, about 85.41% of the Airbnbs are in Manhattan and Brooklyn.
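As an alternative view of the same calculation, here is a small sketch that gets the percentage share straight from the raw column with `value_counts(normalize=True)`. The toy data and the names `toy_df`, `share`, and `top_two_share` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the 'neighbourhood_group' column (hypothetical counts)
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 5 + ["Brooklyn"] * 3 + ["Queens"] * 2
})

# normalize=True returns each group's fraction of all rows; multiply by 100 for percent
share = (toy_df["neighbourhood_group"].value_counts(normalize=True) * 100).round(2)

# Combined share of the two largest groups
top_two_share = share.loc[["Manhattan", "Brooklyn"]].sum()

print(share)
print(top_two_share)
```

This skips the intermediate counts table, which is handy when only the percentages are needed.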


Question 2 - Which types of properties are there in each neighborhood group?

[Figure: Type of properties]

1. Brooklyn has the highest number of private rooms.

2. Manhattan has the highest number of entire homes/apartments.

3. Manhattan has the highest number of shared rooms.
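The chart behind these three points is only available as an image, so here is a rough sketch of how the underlying counts could be tabulated with `pd.crosstab`. The toy rows and the names `toy_df`, `room_counts`, and `leaders` are assumptions, not the project's own code.

```python
import pandas as pd

# Toy stand-in for the two relevant columns (hypothetical rows)
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Shared room",
                  "Private room", "Private room", "Entire home/apt"],
})

# Rows are neighbourhood groups, columns are room types, cells are listing counts
room_counts = pd.crosstab(toy_df["neighbourhood_group"], toy_df["room_type"])

# For each room type (column), find the neighbourhood group with the most listings
leaders = room_counts.idxmax()

print(room_counts)
print(leaders)
```

The `room_counts` table can also be passed to a grouped bar chart if a visual comparison is preferred.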


Question 3 - Which properties are the busiest in terms of number of bookings?


# Count rows per (neighbourhood_group, name); the article treats this count as bookings
highest_bookings = booking_df.groupby(['neighbourhood_group', 'name']).size().reset_index(name='Most_Bookings')
highest_bookings = highest_bookings.sort_values(by='Most_Bookings', ascending=False)

top_ten_highest_bookings = highest_bookings[:10]

top_ten_highest_bookings

[Figure: top_ten_highest_bookings (two images)]

As we can see above, the Hillside Hotel property in Queens had the highest number of bookings, followed by Brooklyn Apartment in Brooklyn and Loft Suite in Brooklyn.


Question 4 - Who are the busiest hosts, by host name, in terms of number of bookings?


# reset_index(name=...) names the count column directly, regardless of pandas version
host = booking_df1[['neighbourhood_group', 'host_name']].value_counts().reset_index(name='Most_Bookings').head(10)
host

[Figure: Busiest Host in terms of Number of Bookings (two images)]

As we can see above, Sonder (NYC) in Manhattan had the highest number of bookings, followed by Blueground in Manhattan and Michael in Manhattan.


Question 5 - Suppose I choose to stay in Brooklyn for 20 days. Let’s check whether it is cheaper to stay there compared to other neighborhoods.

I used Plotly Express to plot the bar chart here, since it is more interactive.


import plotly.express as px

# Staten Island is left out of this comparison
brook_df = booking_df1.loc[booking_df1['neighbourhood_group'].isin(['Brooklyn', 'Manhattan', 'Queens', 'Bronx'])]

# Average only the numeric 'price' column for each neighbourhood group
avg_price_df = brook_df.groupby('neighbourhood_group', as_index=False)['price'].mean()

fig = px.bar(x='neighbourhood_group', y='price', data_frame=avg_price_df, text='neighbourhood_group',
             color='neighbourhood_group', opacity=.8)
fig.update_traces(textfont=dict(size=15, color='White'))
fig.update_layout(title='Neighborhood based on the property prices', yaxis=dict(showgrid=False, showticklabels=True), autosize=False, width=800, height=500)
fig.show()

[Figure: Neighborhood based on the property prices]

From the bar chart above, we can conclude:

1. Brooklyn is neither the cheapest nor the costliest neighborhood to stay in.

2. Manhattan is the costliest place to stay.

3. Bronx is the cheapest neighborhood to stay.


Question 6 - As we saw, the Bronx is the cheapest place to stay. So which room type and which area (neighborhood) offer the most affordable stay?

[Figure: Bronx Room type and Properties to Stay in Bronx at Affordable Cost]

From the bar chart above, we can conclude:

1. Riverdale is the most expensive area to stay in for both private and shared rooms.

2. City Island is the most expensive for an entire home/apt.

3. The most affordable private room is available in Van Nest.

4. The most affordable shared rooms are available in Morris Heights, Pelham Gardens, Schuylerville, and Van Nest.

5. The most affordable entire home/apt is available in Woodlawn.
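The conclusions above come from a chart that is only shown as an image. A minimal sketch of the kind of aggregation that could sit behind it is given below; the toy prices and the names `toy_df`, `avg_price`, `cheapest`, and `priciest` are assumptions for illustration, not the project's own code.

```python
import pandas as pd

# Toy stand-in for the listings data (hypothetical prices)
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Bronx"] * 5 + ["Queens"],
    "neighbourhood": ["Riverdale", "Riverdale", "Van Nest", "Van Nest", "Woodlawn", "Astoria"],
    "room_type": ["Private room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt", "Private room"],
    "price": [120, 100, 40, 90, 60, 70],
})

# Restrict to one neighbourhood group, then average price per room type and neighbourhood
bronx = toy_df[toy_df["neighbourhood_group"] == "Bronx"]
avg_price = bronx.groupby(["room_type", "neighbourhood"], as_index=False)["price"].mean()

# For each room type, pick the row with the lowest and the highest mean price
by_room = avg_price.groupby("room_type")["price"]
cheapest = avg_price.loc[by_room.idxmin()].set_index("room_type")["neighbourhood"]
priciest = avg_price.loc[by_room.idxmax()].set_index("room_type")["neighbourhood"]

print(cheapest)
print(priciest)
```

Note that `idxmin` returns only the first match, so ties such as several equally cheap shared-room neighborhoods would need an extra filter on the minimum price.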


Question 7 - If I choose to stay in Manhattan, which room type and which area (neighborhood) offer the most affordable stay?

[Figure: Manhattan Room type and Properties to Stay in Manhattan at Affordable Cost]

From the bar chart above, we can conclude:

1. Tribeca is the most expensive area to stay in for an entire home/apt.

2. Midtown is the most expensive area to stay in for a private room.

3. Financial District is the most expensive area to stay in for a shared room.

4. The most affordable private room is available in Washington Heights.

5. The most affordable shared room is available on Roosevelt Island.

6. The most affordable entire home/apt is available in Marble Hill.


Question 8 - Assume I stayed in Manhattan for 20 days and have only $5,000 left. I then decide to stay another 20 days in Queens. Is this amount sufficient to cover the room expenses alone?


queens_df = booking_df1.loc[booking_df1['neighbourhood_group'].isin(['Queens'])]
price = queens_df['price'].mean()*20
price = round(price, 2)

if price <= 5000:
    print(f'The average amount to stay in Queens for 20 days is {price} $. We think your amount is more than Sufficient to stay there.')
else:
    print("The amount exceeds your budget plan for room in Queens.")

The average amount to stay in Queens for 20 days is 1990.35 $. We think your amount is more than Sufficient to stay there.
        

As the output shows, an average 20-day stay in Queens costs about $1,990.35, which is well within the $5,000 budget.

Question 9 - Since the money is sufficient for a 20-day stay, which type of luxury property can you choose in Queens?

[Figure: Properties that are considered Luxury at Queens to Stay]

Entire home/apt is considered the luxury option in Queens, so staying there for 20 days is both affordable and luxurious at an average listed price of about $147.
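Here is a small sketch of one way to arrive at a "luxury" room type, assuming luxury simply means the room type with the highest average price. That definition, the toy prices, and the names `toy_df`, `avg_by_room`, `luxury_room_type`, and `cost_20_days` are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the listings data (hypothetical prices)
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Queens"] * 4 + ["Bronx"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Shared room", "Private room"],
    "price": [150, 130, 70, 50, 60],
})

# Average price per room type in Queens, most expensive first
queens = toy_df[toy_df["neighbourhood_group"] == "Queens"]
avg_by_room = queens.groupby("room_type")["price"].mean().sort_values(ascending=False)

# Assumption: the priciest room type on average is treated as the luxury option
luxury_room_type = avg_by_room.index[0]

# Estimated cost of a 20-day stay at that average price
cost_20_days = avg_by_room.iloc[0] * 20

print(luxury_room_type, cost_20_days)
```

The 20-day estimate can then be compared against the remaining budget in the same way as in Question 8.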


Question 10 - Most visited properties based on number of reviews


booking_df1.loc[:,['name','number_of_reviews']].groupby(['name'])['number_of_reviews'].sum().sort_values(ascending=False)[:10].reset_index().rename(columns={'name': "Property Names" })        
[Figure: Most visited properties (two images)]

The top 3 properties visited by people, based on number of reviews, are:

1. Private Bedroom in Manhattan

2. Room near JFK Queen Bed

3. Beautiful Bedroom in Manhattan


Scenario - I’m thinking like a data scientist working at a property development giant such as Brigade Group or Godrej Properties, and they gave me this question:

Question 11 - Pradeep, can you check who has the potential to open an Airbnb franchise in Queens in the coming days, using the highest number of reviews as the metric?

# Converting some columns to datetime (placeholders such as 'Not_Reviewed' become NaT)
property_df = booking_df1.copy()
property_df['last_review'] = pd.to_datetime(property_df['last_review'], errors='coerce')

# Approximate listing age in years; listings with no reviews get NaN instead of a division by zero
monthly_reviews = property_df['reviews_per_month'].where(property_df['reviews_per_month'] > 0)
property_df['duration'] = round((property_df['number_of_reviews'] / monthly_reviews) / 12)

property_df['possible_year_of_start'] = property_df['last_review'].dt.year - property_df['duration']

# 'number_of_property' is expected to exist in property_df already; its creation is not shown in this article
busiest_host_queens = property_df.loc[property_df.neighbourhood_group=='Queens',['host_id','host_name','number_of_property','reviews_per_month','availability_365','price']].groupby(['host_id','host_name']).agg({'reviews_per_month':'sum','number_of_property':'count','availability_365':'median','price':'min'}).sort_values(by=['reviews_per_month','number_of_property','availability_365'],ascending=[False,False,False]).reset_index()[:20]

busiest_host_queens

# Top 10 reviewed in Queens (avg_review_df is defined outside the code shown here)
avg_review_queens = avg_review_df.sum().sort_values(ascending=False)[:10].reset_index().rename(columns = {'index' : 'Properties Names', 0 : 'Average Reviews'})
avg_review_queens
[Figure: Last 3 Columns that are added after some changes]
[Figure: busiest_host_queens]
[Figure: avg_review_queens]
[Figure: New Franchises Possible for each Neighbourhood Group]

We observe that the top reviewed neighborhoods are:

Astoria with average reviews of 901.27

East Elmhurst with average reviews of 824.73

Flushing with average reviews of 819.85

Long Island City with average reviews of 615.86

Jamaica with average reviews of 604.56

Ridgewood with average reviews of 430.54

Sunnyside with average reviews of 413.95

Ditmars Steinway with average reviews of 372.97

Springfield Gardens with average reviews of 356.42

Elmhurst with average reviews of 318.89

So, the hosts with property listings in these neighborhoods of Queens are more likely to open a new franchise in the near future.


Question 12 - Pradeep, can you check who has the potential to open an Airbnb franchise in Manhattan in the coming days, using the highest number of reviews as the metric?


busiest_host_manhattan = property_df.loc[property_df.neighbourhood_group=='Manhattan',['host_id','host_name','reviews_per_month','number_of_property','availability_365','price']].groupby(['host_id','host_name']).agg({'reviews_per_month':'sum','number_of_property':'count','availability_365':'median','price':'min'}).sort_values(by=['reviews_per_month','number_of_property','availability_365'],ascending=[False,False,False]).reset_index()[:20]

busiest_host_manhattan

avg_review_manhattan = avg_review_df.sum().sort_values(ascending=False)[:10].reset_index().rename(columns = {'index' : 'Properties Names', 0 : 'Average Reviews'})

avg_review_manhattan        
[Figure: busiest_host_manhattan]
[Figure: avg_review_manhattan]
[Figure: New Franchises Possible for each Neighborhood Group]

We observe that the top reviewed neighborhoods are:

Harlem with average reviews of 2956.23

Hell’s Kitchen with average reviews of 2818.79

East Village with average reviews of 1668.17

East Harlem with average reviews of 1579.06

Upper East Side with average reviews of 1523.66

Upper West Side with average reviews of 1487.27

Midtown with average reviews of 1264.09

Chelsea with average reviews of 1038.89

Lower East Side with average reviews of 919.93

Washington Heights with average reviews of 876.70

So, the hosts with property listings in these neighborhoods of Manhattan are more likely to open a new franchise in the near future.


Question 13 - Plot the Airbnb Spatial Data on New York City Map

[Figure: Airbnb Spatial Data on New York City Map]

The map above shows all the Airbnb properties in the dataset across New York City.


Question 14 - Plot the 50 Busiest Airbnb properties on the New York City Map

[Figure: The 50 Busiest Airbnb properties on the New York City Map]

The map above shows the 50 busiest Airbnb properties in New York City.
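The selection step for this map is not shown in the article, so here is a minimal sketch of how the busiest listings could be picked before plotting. It assumes "busiest" means the highest `number_of_reviews`; that choice, the toy rows, and the names `toy_df`, `TOP_N`, and `busiest` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the listings data (hypothetical values)
toy_df = pd.DataFrame({
    "name": [f"Listing {i}" for i in range(8)],
    "number_of_reviews": [5, 120, 48, 300, 0, 77, 210, 9],
    "latitude": [40.70 + i * 0.01 for i in range(8)],
    "longitude": [-74.00 + i * 0.01 for i in range(8)],
})

# The article plots 50 properties; a small value suits the toy data
TOP_N = 3

# nlargest keeps the rows with the highest review counts, already sorted descending
busiest = toy_df.nlargest(TOP_N, "number_of_reviews")[
    ["name", "number_of_reviews", "latitude", "longitude"]
]

print(busiest)
```

The resulting latitude and longitude columns are what a map plot would take as its point coordinates.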


Question 15 - Plot the top 1000 Airbnb properties that are most visited by people on the New York City map


# Sum reviews per listing name; average the coordinates so each name maps to a single point
most_visited_map = booking_df1.groupby('name').agg({'number_of_reviews': 'sum', 'latitude': 'mean', 'longitude': 'mean'}).sort_values(by='number_of_reviews', ascending=False)[:1000]

# 'name' is the index after groupby, so move it back to a column before renaming
most_visited_map = most_visited_map.reset_index().rename(columns={'name': 'Property Names'})

most_visited_map.head()

[Figure: Top 1000 Airbnb Properties]

The map above shows the top 1000 Airbnb properties that are most visited by people in New York City.


Through this exploratory data analysis and visualization, we gained several interesting insights into the Airbnb rental market. This 2019 Airbnb dataset turned out to be very rich, with a variety of columns that allowed deep exploration of each significant one. We worked through questions and scenarios such as which ‘neighbourhood_group’ has the highest number of Airbnbs, which areas are more popular than others, how prices vary, and how availability differs by room type. We also highlighted key findings such as which hosts have the most bookings and reviews, which neighborhoods to stay in when low on cash, and how to stay in luxury properties on that budget. Finally, we made good use of the latitude and longitude columns to create a geographical heatmap color-coded by the price of listings.

I used Seaborn, Matplotlib, and Plotly Express to create all the visualizations. This is just a glimpse of EDA on the Airbnb dataset, and there are no predictions involved. Also, here’s my GitHub for the full code reference.

https://github.com/soopertramp

If you liked this post, please give it some love and do connect with me on LinkedIn.

https://www.dhirubhai.net/in/pradeepchandra-reddy-s-c/

Thanks a lot for reading. Feel free to give any feedback!
