Exploratory Data Analysis - Airbnb Bookings Dataset
Pradeepchandra Reddy S C
Data Scientist @ Elfonze Technologies | IESA | 5 ? SQL | Appeared for Civil Services Examination | Goal - is to decrease Elderly poverty in India
About Airbnb -
Airbnb, Inc., based in San Francisco, California, operates an online marketplace focused on short-term homestays and experiences. The company acts as a broker and charges a commission from each booking. The company was founded in 2008 by Brian Chesky, Nathan Blecharczyk, and Joe Gebbia.
Business Context
Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalized way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognized around the world. Analyzing the millions of listings offered through Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers’ and hosts’ behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.
This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. Let’s explore and analyze the data to discover key insights.
Project Architecture
Step 1: Basic Dataset Understanding -
The dataset has exactly 48,895 observations and 16 columns, with a mix of categorical and numeric values.
print(f'This dataset has {booking_df.shape} rows and columns respectively.')
This dataset has (48895, 16) rows and columns respectively.
The info( ) method prints information about the DataFrame.
The information contains the number of columns, column labels, column data types, memory usage, range index, and the number of cells in each column (non-null values).
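As a minimal sketch of what info( ) reports, here is a tiny made-up frame (toy_df and its values are hypothetical stand-ins, not the Airbnb data). Capturing the report through the buf argument is just a convenience for inspecting it as text:

```python
import io
import pandas as pd

# Hypothetical three-row frame standing in for booking_df
toy_df = pd.DataFrame({
    "name": ["Cozy room", None, "Loft"],
    "price": [80, 120, 200],
    "reviews_per_month": [0.5, None, 1.2],
})

# info() writes to stdout by default; buf= captures the report as a string
buffer = io.StringIO()
toy_df.info(buf=buffer)
report = buffer.getvalue()
print(report)
```

The report lists each column with its non-null count and dtype, so columns with missing values stand out because their count is lower than the number of rows.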
By basic inspection, each property name belongs to a single host, but one host can have multiple properties in an area. So host_name is a categorical variable here. The columns neighbourhood_group (Manhattan, Brooklyn, Queens, Bronx, and Staten Island), neighbourhood, and room_type (Private room, Shared room, Entire home/apt) also fall into this category.
Checking for the null values and null values percentage & Visually representing it -
As we can see from the above, the columns ‘name’, ‘host_name’, ‘last_review’, and ‘reviews_per_month’ have missing values.
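The null counts and percentages can be computed along these lines. This is a sketch on a hypothetical four-row frame (toy_df is made up for illustration), and the plotting step is left out:

```python
import pandas as pd

# Hypothetical frame with one missing name and one missing host_name
toy_df = pd.DataFrame({
    "name": ["Cozy room", None, "Loft", "Studio"],
    "host_name": ["Ana", "Ben", None, "Cy"],
    "price": [80, 120, 200, 95],
})

# Count of nulls per column
null_counts = toy_df.isnull().sum()
# Share of nulls per column, as a percentage of all rows
null_percent = (toy_df.isnull().mean() * 100).round(2)

null_summary = pd.DataFrame({"null_count": null_counts, "null_percent": null_percent})
print(null_summary)
```

Taking the mean of the boolean mask gives the fraction of missing rows directly, which avoids dividing by the row count by hand.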
The describe( ) method returns a description of the data in the DataFrame. If the DataFrame contains numerical data, the description contains the following information for each column:
count - The number of non-empty values.
mean - The average (mean) value.
std - The standard deviation.
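To make these statistics concrete, here is a small sketch on a hypothetical numeric frame (the column values are made up so the results are easy to check by hand):

```python
import pandas as pd

# Hypothetical numeric columns; price has mean 200 and sample std 100
toy_df = pd.DataFrame({
    "price": [100, 200, 300],
    "minimum_nights": [1, 2, 3],
})

desc = toy_df.describe()
print(desc.loc[["count", "mean", "std"]])
```

Note that std here is the sample standard deviation (it divides by n - 1), which is the pandas default.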
Cleaning the Null Values -
We can either drop the null values or fill them, depending on the requirement. I’m going to drop the ‘id’ column, as it’s not needed for this project, and fill the nulls in the ‘name’, ‘host_name’, ‘last_review’, and ‘reviews_per_month’ columns.
# Drop the identifier column, then fill the remaining nulls with placeholders in one call
booking_df1 = booking_df1.drop(columns='id')
booking_df1 = booking_df1.fillna({'name': 'No_Name', 'host_name': 'No_Name', 'last_review': 'Not_Reviewed', 'reviews_per_month': 0})
As part of this EDA project, I analyzed the data by asking questions (which is really important).
Step 2: Asking Questions.
Question 1 - Which ‘neighbourhood_group’ has the highest number of Airbnbs?
As we can see from the bar chart above, the Manhattan neighbourhood group has the highest number of Airbnbs.
Together, Manhattan and Brooklyn have more than 75% of the Airbnbs.
neighbourhood/(sum(neighbourhood['No of AirBnbs'] / 100))
More precisely, 85.41% (roughly 85%) of the Airbnbs are in Manhattan and Brooklyn.
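One way to get these percentage shares, sketched on a hypothetical ten-row frame (the counts are made up and are not the real dataset's), is value_counts with normalize=True:

```python
import pandas as pd

# Hypothetical listing counts: 5 Manhattan, 3 Brooklyn, 2 Queens
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 5 + ["Brooklyn"] * 3 + ["Queens"] * 2
})

# Percentage of listings in each neighbourhood group
share = toy_df["neighbourhood_group"].value_counts(normalize=True).mul(100).round(2)

# Combined share of the two largest groups
top_two_share = share.nlargest(2).sum()
print(share)
print(top_two_share)
```

On the real data, the same two lines would give the combined Manhattan and Brooklyn share without building an intermediate counts table.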
Question 2 - Which types of properties are there in each neighbourhood group?
1. Brooklyn has the highest number of Private Rooms.
2. Manhattan has the highest number of Entire Home/Apartment listings.
3. Manhattan has the highest number of Shared Rooms.
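Counts like these can be read off a cross-tabulation of neighbourhood group against room type. The sketch below uses a hypothetical frame whose counts are invented to mirror the pattern above, not taken from the dataset:

```python
import pandas as pd

# Hypothetical (neighbourhood_group, room_type) pairs
rows = (
    [("Manhattan", "Entire home/apt")] * 3
    + [("Manhattan", "Private room")] * 1
    + [("Manhattan", "Shared room")] * 2
    + [("Brooklyn", "Entire home/apt")] * 1
    + [("Brooklyn", "Private room")] * 3
    + [("Brooklyn", "Shared room")] * 1
)
toy_df = pd.DataFrame(rows, columns=["neighbourhood_group", "room_type"])

# Rows are neighbourhood groups, columns are room types, cells are listing counts
counts = pd.crosstab(toy_df["neighbourhood_group"], toy_df["room_type"])

# For each room type, the neighbourhood group with the most listings
leaders = counts.idxmax()
print(counts)
print(leaders)
```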
Question 3 - Which properties are the busiest in terms of number of bookings?
# Count the rows for each (neighbourhood_group, name) pair and sort from most to least
highest_bookings = booking_df.groupby(['neighbourhood_group','name']).size().reset_index(name='Most_Bookings').sort_values(by='Most_Bookings', ascending=False)
top_ten_highest_bookings= highest_bookings[:10]
top_ten_highest_bookings
As we can see from the above, the Hillside Hotel property in Queens had the highest number of bookings, followed by Brooklyn Apartment in Brooklyn and Loft Suite in Brooklyn.
Question 4 - Which hosts are the busiest in terms of number of bookings?
# Naming the count column in reset_index avoids relying on the pandas-version-specific default (0 or 'count')
host = booking_df1[['neighbourhood_group','host_name']].value_counts().head(10).reset_index(name='Most_Bookings')
host
As we can see from the above, Sonder (NYC) in Manhattan had the highest number of bookings, followed by Blueground in Manhattan and Michael in Manhattan.
Question 5 - Suppose I choose to live in the Brooklyn neighborhood for 20 days. Let’s check whether it will be cheaper to stay there compared to other neighborhoods.
I used Plotly Express here to plot the bar chart since it is more interactive.
import plotly.express as px
# Compare four of the five neighbourhood groups
brook_df = booking_df1.loc[booking_df1['neighbourhood_group'].isin(['Brooklyn', 'Manhattan', 'Queens', 'Bronx'])]
# Average only the price column so that non-numeric columns do not break mean()
avg_price_df = brook_df.groupby('neighbourhood_group', as_index=False)['price'].mean()
fig = px.bar(avg_price_df, x='neighbourhood_group', y='price', text='neighbourhood_group', color='neighbourhood_group', opacity=0.8)
fig.update_traces(textfont=dict(size=15, color='White'))
fig.update_layout(title='Neighborhood based on the property prices', yaxis=dict(showgrid=False, showticklabels=True), autosize=False, width=800, height=500)
fig.show()
From the above bar chart we can conclude:
1. Brooklyn is neither the cheapest nor the costliest neighborhood to stay in.
2. Manhattan is the costliest place to stay.
3. Bronx is the cheapest neighborhood to stay.
Question 6 - Since the Bronx is the cheapest place to stay, which room type and area (neighborhood) should I prefer for an affordable visit?
From the above bar chart we can conclude:
1. Riverdale is the most expensive area to stay for both Private and Shared Rooms.
2. City Island is the most expensive for Entire Home/Apt.
3. The most affordable Private Room is available at Van Nest.
4. The most affordable Shared Rooms are available at Morris Heights, Pelham Gardens, Schuylerville and Van Nest.
5. The most affordable Entire Home/Apt is available at Woodlawn.
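A comparison like this can be sketched with a pivot table of average price by neighborhood and room type. I'm assuming here that the mean price is the metric behind the chart, and the rows below are hypothetical values chosen only to illustrate the mechanics:

```python
import pandas as pd

# Hypothetical listings; the Manhattan row is there to show the Bronx filter
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Bronx", "Bronx", "Bronx", "Manhattan"],
    "neighbourhood": ["Riverdale", "Riverdale", "Van Nest", "Woodlawn", "City Island", "Tribeca"],
    "room_type": ["Private room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt", "Entire home/apt"],
    "price": [150, 130, 40, 70, 180, 500],
})

bronx = toy_df[toy_df["neighbourhood_group"] == "Bronx"]

# Rows: neighborhoods, columns: room types, cells: average price
price_table = bronx.pivot_table(index="neighbourhood", columns="room_type",
                                values="price", aggfunc="mean")

cheapest = price_table.idxmin()   # most affordable neighborhood per room type
priciest = price_table.idxmax()   # most expensive neighborhood per room type
print(price_table)
print(cheapest)
print(priciest)
```

Swapping the filter value for another neighbourhood group reuses the same table for any other borough.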
Question 7 - If I choose to stay in Manhattan, which room type and area (neighborhood) should I prefer for an affordable visit?
From the above bar chart we can conclude:
1. Tribeca is the most expensive area to stay for Entire Home/Apt.
2. Midtown is the most expensive area to stay for a Private Room.
3. Financial District is the most expensive area to stay for a Shared Room.
4. The most affordable Private Room is available at Washington Heights.
5. The most affordable Shared Room is available at Roosevelt Island.
6. The most affordable Entire Home/Apt is available at Marble Hill.
Question 8 - Assume I stayed in Manhattan for 20 days and have a remaining balance of only $5000. I then decide to stay another 20 days in Queens. Is this amount sufficient for room expenses alone?
queens_df = booking_df1.loc[booking_df1['neighbourhood_group'].isin(['Queens'])]
price = queens_df['price'].mean()*20
price = round(price, 2)
if price <= 5000:
    print(f'The average amount to stay in Queens for 20 days is {price} $. We think your amount is more than Sufficient to stay there.')
else:
    print("The amount exceeds your budget plan for room in Queens.")
The average amount to stay in Queens for 20 days is 1990.35 $. We think your amount is more than Sufficient to stay there.
Question 9 - Since the money is sufficient for a 20-day stay, which type of luxury property can you select in Queens?
Entire home/apt is considered the luxury option in Queens, so staying there for 20 days is both affordable and luxurious at $147.
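The per-room-type check can be sketched as below. I'm assuming the comparison uses the average nightly price per room type multiplied by 20 nights, and the prices are hypothetical rather than the dataset's:

```python
import pandas as pd

# Hypothetical Queens listings with nightly prices
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Queens"] * 5,
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "price": [140, 160, 60, 80, 40],
})

budget = 5000
nights = 20

queens = toy_df[toy_df["neighbourhood_group"] == "Queens"]
avg_nightly = queens.groupby("room_type")["price"].mean()

# Estimated cost of a 20-night stay for each room type
stay_cost = (avg_nightly * nights).round(2)

# Room types whose estimated cost fits within the budget
affordable = stay_cost[stay_cost <= budget]
print(stay_cost)
```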
Question 10 - Most visited properties based on number of reviews
booking_df1.loc[:,['name','number_of_reviews']].groupby(['name'])['number_of_reviews'].sum().sort_values(ascending=False)[:10].reset_index().rename(columns={'name': "Property Names" })
The top 3 properties visited by people, based on number of reviews, are:
1. Private Bedroom in Manhattan
2. Room near JFK Queen Bed
3. Beautiful Bedroom in Manhattan
Scenario - I’m thinking like a data scientist working at a property development giant like Brigade Group or Godrej Properties, and they gave me this question:
Question 11 - Pradeep, can you check who has the potential to open an Airbnb franchise in Queens in the coming days, using the highest number of reviews as the metric?
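# Convert last_review to datetime; placeholder strings filled in earlier become NaT
property_df = booking_df1.copy()
property_df['last_review'] = pd.to_datetime(property_df['last_review'], errors='coerce')
# Approximate years active: total reviews / reviews per month gives months, then divide by 12
property_df['duration'] = round((property_df['number_of_reviews']/property_df['reviews_per_month']) / 12)
property_df['possible_year_of_start'] = property_df['last_review'].dt.year - property_df['duration']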
property_df['possible_year_of_start'] = pd.to_datetime(property_df['possible_year_of_start'], format = '%Y').dt.year
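To see how the start-year estimate works, here is a small worked sketch on two hypothetical listings: one with 48 reviews at 2 reviews per month (24 months, so 2 years before its last review), and one with no reviews and a non-date placeholder in last_review:

```python
import pandas as pd

# Hypothetical rows; "placeholder" stands for any non-date fill value
toy = pd.DataFrame({
    "number_of_reviews": [48, 0],
    "reviews_per_month": [2.0, 0.0],
    "last_review": ["2019-06-15", "placeholder"],
})

# Non-date strings are coerced to NaT instead of raising an error
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")

# 48 reviews / 2 per month = 24 months = 2 years; 0 / 0 yields NaN
toy["duration"] = ((toy["number_of_reviews"] / toy["reviews_per_month"]) / 12).round()
toy["possible_year_of_start"] = toy["last_review"].dt.year - toy["duration"]
print(toy)
```

Listings without reviews end up with a missing start year, which seems reasonable since there is nothing to estimate from.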
busiest_host_queens = property_df.loc[property_df.neighbourhood_group=='Queens',['host_id','host_name','number_of_property','reviews_per_month','availability_365','price']].groupby(['host_id','host_name']).agg({'reviews_per_month':'sum','number_of_property':'count','availability_365':'median','price':'min'}).sort_values(by=['reviews_per_month','number_of_property','availability_365'],ascending=[False,False,False]).reset_index()[:20]
busiest_host_queens
# Top 10 reviewed in Queens
avg_review_queens = avg_review_df.sum().sort_values(ascending=False)[:10].reset_index().rename(columns = {'index' : 'Properties Names', 0 : 'Average Reviews'})
avg_review_queens
We observe that the top reviewed neighborhoods are:
Astoria with average reviews of 901.27
East Elmhurst with average reviews of 824.73
Flushing with average reviews of 819.85
Long Island City with average reviews of 615.86
Jamaica with average reviews of 604.56
Ridgewood with average reviews of 430.54
Sunnyside with average reviews of 413.95
Ditmars Steinway with average reviews of 372.97
Springfield Gardens with average reviews of 356.42
Elmhurst with average reviews of 318.89
So, the hosts with property listings in these neighborhoods of Queens are more likely to open a new franchise in the near future.
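The construction of avg_review_df is not shown above, so the following is only one plausible reading: summing reviews_per_month per neighborhood within Queens and keeping the top entries. The frame and its values are hypothetical:

```python
import pandas as pd

# Hypothetical listings; the Harlem row shows that other boroughs are filtered out
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Queens", "Queens", "Queens", "Queens", "Queens", "Manhattan"],
    "neighbourhood": ["Astoria", "Astoria", "Flushing", "Jamaica", "Jamaica", "Harlem"],
    "reviews_per_month": [1.5, 2.0, 3.0, 0.5, 0.25, 9.0],
})

queens = toy_df[toy_df["neighbourhood_group"] == "Queens"]

# Assumption: the ranking metric is the sum of reviews_per_month per neighborhood
review_totals = (
    queens.groupby("neighbourhood")["reviews_per_month"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
print(review_totals)
```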
Question 12 - Pradeep, can you check who has the potential to open an Airbnb franchise in Manhattan in the coming days, using the highest number of reviews as the metric?
busiest_host_manhattan = property_df.loc[property_df.neighbourhood_group=='Manhattan',['host_id','host_name','reviews_per_month','number_of_property','availability_365','price']].groupby(['host_id','host_name']).agg({'reviews_per_month':'sum','number_of_property':'count','availability_365':'median','price':'min'}).sort_values(by=['reviews_per_month','number_of_property','availability_365'],ascending=[False,False,False]).reset_index()[:20]
busiest_host_manhattan
avg_review_manhattan = avg_review_df.sum().sort_values(ascending=False)[:10].reset_index().rename(columns = {'index' : 'Properties Names', 0 : 'Average Reviews'})
avg_review_manhattan
We observe that the top reviewed neighborhoods are:
Harlem with average reviews of 2956.23
Hell’s Kitchen with average reviews of 2818.79
East Village with average reviews of 1668.17
East Harlem with average reviews of 1579.06
Upper East Side with average reviews of 1523.66
Upper West Side with average reviews of 1487.27
Midtown with average reviews of 1264.09
Chelsea with average reviews of 1038.89
Lower East Side with average reviews of 919.93
Washington Heights with average reviews of 876.70
So, the hosts with property listings in these neighborhoods of Manhattan are more likely to open a new franchise in the near future.
Question 13 - Plot the Airbnb Spatial Data on New York City Map
The above map shows all the Airbnb properties from the dataset in New York City.
Question 14 - Plot the 50 Busiest Airbnb properties on the New York City Map
The above map shows the top 50 busiest Airbnb properties in New York City.
Question 15 - Plot the top 1000 Airbnb properties that are most visited by people in New York City Map
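# Sum reviews per property name, but average the coordinates (summing latitude/longitude would give invalid map points)
most_visited_map = booking_df1.groupby('name').agg({'number_of_reviews': 'sum', 'latitude': 'mean', 'longitude': 'mean'}).sort_values(by='number_of_reviews', ascending=False)[:1000]
# 'name' is the index after groupby, so rename the index rather than a column
most_visited_map = most_visited_map.rename_axis('Property Names')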
most_visited_map.head()
The above map shows the top 1000 Airbnb properties that are most visited by people in New York City.
Through this exploratory data analysis and visualization, we gained several interesting insights into the Airbnb rental market. The 2019 Airbnb dataset proved to be a very rich dataset, with a variety of columns that allowed deep exploration of each significant column. We worked through questions and scenarios such as which ‘neighbourhood_group’ has the highest number of Airbnbs, which areas are more popular than others, how prices vary, and how availability differs by room type. We also highlighted key findings on which hosts have the most bookings and reviews, which neighborhoods to stay in when low on cash, and how to stay in luxury properties even then. Finally, we made good use of the latitude and longitude columns to create a geographical heatmap color-coded by the price of listings.
I used Seaborn, Matplotlib, and Plotly Express to create all the visualizations. This is just a glimpse of EDA on the Airbnb dataset, and no predictions are involved. Also, here’s my for the full code reference.
If you liked this post, please give it some love and connect with me on LinkedIn.
Thanks a lot for reading. Feel free to give any feedback!