Exploratory Data Analysis

Author

Daniel Redel

Published

March 11, 2024

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore')

my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C', '#28AFB0']
file = "D:/Career/Data Science/Portfolios/Inside AirBnB - Netherlands/Amsterdam/"

listings = pd.read_csv("listings_processed.csv") # processed data
calendar = pd.read_csv('calendar_processed.csv') # processed data
reviews_detailed = pd.read_csv('reviews_processed.csv') # processed data
neighbourhoods = pd.read_csv(file + 'neighbourhoods.csv')

In this section, we will conduct an Exploratory Data Analysis (EDA) to gain a deeper understanding of the Amsterdam Airbnb dataset.

This EDA serves as an important step in extracting valuable information and identifying key factors that influence Airbnb pricing in Amsterdam.

Listings Statistics

Here is a quick overview of our listings dataset:

Code
listings.head(2)
id name host_id host_name neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 number_of_reviews_ltm
0 2818 Quiet Garden View Room & Super Fast Wi-Fi 3159 Daniel Oostelijk Havengebied - Indische Buurt 52.36435 4.94358 Private room 69 3 322 2023-02-28 1.90 1 44 37
1 20168 Studio with private bathroom in the centre 1 59484 Alexander Centrum-Oost 52.36407 4.89393 Private room 106 1 339 2020-04-09 2.14 2 0 0

Price

As shown previously, Figure 1 presents the distribution of nightly prices across all units in the sample:

Code
import seaborn as sns
my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C']

# Set up Figure
#fig, ax = plt.subplots(figsize=(8,5))

# Hist + KDE
sns.displot(data=listings, x="price", kde=True, color=my_colors[0], aspect=8/5)

# Labels
plt.xlabel('Price')
plt.ylabel('')
plt.title('Distribution of Price')

# Show the Plot
plt.show()
Figure 1: Distribution of Price

The average price per night for Airbnb listings in Amsterdam is approximately 190 euros.
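Because a few expensive listings can pull the mean upward, it may help to report the median and quartiles next to the average. The following is a minimal sketch on a toy `prices` Series that stands in for `listings['price']`; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical nightly prices in euros; a stand-in for listings['price'].
prices = pd.Series([80, 95, 120, 150, 160, 180, 200, 250, 300, 900], name="price")

# Central tendency: mean and median side by side.
summary = prices.agg(["mean", "median"])

# First and third quartiles.
quantiles = prices.quantile([0.25, 0.75])

# A positive gap between mean and median hints at a right-skewed distribution.
skew_gap = summary["mean"] - summary["median"]

print(summary.round(1))
print(quantiles.round(1))
print(f"mean - median = {skew_gap:.1f}")
```

On the real data, the same calls could be applied to `listings['price']` directly.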

Let’s also check price evolution:

Code
# Strip currency symbols and thousands separators, then cast to float
calendar['price'] = calendar['price'].replace(r'[\$,]', '', regex=True).astype(float)
Code
import seaborn as sns
my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C']

calendar['date'] = pd.to_datetime(calendar['date'])
price_series = calendar.groupby("date")["price"].agg(["mean","median"]).reset_index()

#Filter
filter_date = price_series['date'] <= pd.Timestamp('2024-01-01')

# Set up Figure
fig, ax = plt.subplots(figsize=(8,4))

# Line Plot
sns.lineplot(data=price_series[filter_date], x='date', y='mean', color=my_colors[0], ax=ax)

# Labels
plt.xlabel('Date')
plt.ylabel('Mean Price')
plt.title('Avg. Price Trend')

# Show the Plot
plt.tight_layout()
plt.show()
Figure 2: Average Price Trend
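If the daily series shows a weekly pattern (for example, higher weekend prices), a centered 7-day rolling mean can make the underlying level easier to read. This is a sketch under assumptions: the toy frame below mimics the `date` and `mean` columns of `price_series`, and the weekend bump is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily mean prices with a weekend bump; a stand-in for price_series.
dates = pd.date_range("2023-03-01", periods=28, freq="D")
weekend_bump = np.where(dates.dayofweek >= 4, 20.0, 0.0)  # Friday, Saturday, Sunday
toy_series = pd.DataFrame({"date": dates, "mean": 180.0 + weekend_bump})

# Centered 7-day rolling mean: every window spans exactly one full week,
# so the weekly pattern averages out. The first and last 3 days are NaN.
toy_series["mean_7d"] = (
    toy_series.set_index("date")["mean"]
    .rolling(window=7, center=True)
    .mean()
    .to_numpy()
)

print(toy_series.dropna().head())
```

The smoothed column could then be plotted with `sns.lineplot` in place of the raw `mean`.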

Property Characteristics - Type

We compute summary statistics of prices by room type: entire home/apartment, hotel room, private room, and shared room.

Code
listings.groupby("room_type")['price'].agg(['mean','count']).round(1)
Table 1: Price by Type
                  mean  count
room_type
Entire home/apt  214.4   4294
Hotel room       154.0     58
Private room     140.9   1630
Shared room       95.9     36
Code
import seaborn as sns
my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C']

# Set up Figure
fig, ax = plt.subplots(figsize=(8,4))

# Violin Plot
sns.violinplot(data=listings, x="price", y="room_type", palette=my_colors)

# Labels
plt.xlabel('Price')
plt.ylabel('')
plt.title('Distribution of Prices, by Type of Room')

# Show the Plot
plt.tight_layout()
plt.show()
Figure 3: Distribution of Prices, by Type of Room

Table 1 and Figure 3 show that entire homes command the highest mean price in the Amsterdam Airbnb market (214 euros), followed by hotel rooms (154 euros), private rooms (141 euros), and shared rooms (96 euros). This suggests that guests are willing to pay a premium for the privacy and amenities offered by entire accommodations.

Additionally, we note that private rooms exhibit less dispersion in prices compared to entire homes and hotel rooms. This might imply a more consistent pricing structure within the private room category, possibly influenced by factors such as location, amenities, and room size.
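The dispersion comparison above is read off the violin plot; it could also be quantified. The sketch below computes the standard deviation, interquartile range, and coefficient of variation per room type. The toy data and the helper `iqr` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical listings; values invented for illustration only.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4,
    "price": [120, 200, 260, 400, 110, 130, 150, 170],
})

def iqr(s):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)

dispersion = toy.groupby("room_type")["price"].agg(["mean", "std", iqr])

# Coefficient of variation: spread relative to the group's own mean,
# which keeps cheaper and pricier room types comparable.
dispersion["cv"] = dispersion["std"] / dispersion["mean"]

print(dispersion.round(2))
```

The coefficient of variation is used here because raw standard deviations tend to grow with the price level.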

We also computed the average minimum nights required for booking across different room types in the Amsterdam Airbnb market:

Code
listings.groupby("room_type")['minimum_nights'].agg(['mean']).round(1)
Table 2: Minimum Nights by Type
                 mean
room_type
Entire home/apt   4.5
Hotel room        1.3
Private room      3.3
Shared room       1.5

Table 2 suggests that entire homes and private rooms typically require a longer minimum stay than hotel rooms and shared rooms. One potential factor contributing to this discrepancy is economies of scale: hosts of entire homes and private rooms may require longer minimum stays to offset operational costs and maximize profitability. Conversely, hotel rooms and shared rooms, which typically offer more compact and flexible accommodations, tend to have shorter minimum stay requirements.
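Mean minimum nights can be inflated by a handful of listings with very long minimum stays, so the median and the share of long-stay listings may be worth checking as well. A minimal sketch on invented data; the 7-night threshold is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical listings; one entire home requires a 30-night minimum stay.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 5 + ["Hotel room"] * 3,
    "minimum_nights": [2, 2, 3, 3, 30, 1, 1, 2],
})

stay_stats = toy.groupby("room_type")["minimum_nights"].agg(
    mean="mean",
    median="median",
    # Share of listings whose minimum stay exceeds one week (assumed threshold).
    share_over_7=lambda s: (s > 7).mean(),
)

print(stay_stats.round(2))
```

In this toy example a single 30-night listing lifts the entire-home mean to 8 nights while the median stays at 3.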

Code
listings.groupby("room_type")['number_of_reviews'].agg(['mean','sum']).round(1).sort_values('mean', ascending= False)
Table 3: Number of Reviews by Type
                  mean     sum
room_type
Hotel room       138.9    8610
Private room     120.9  210680
Shared room      115.7    4514
Entire home/apt   22.5  116001
Code
listings.groupby("room_type")['reviews_per_month'].agg(['mean']).round(1).sort_values('mean', ascending= False)
Table 4: Reviews per Month by Type
                 mean
room_type
Shared room       4.5
Hotel room        2.7
Private room      2.4
Entire home/apt   0.6

Hotel rooms have the highest average number of reviews per listing, followed by private rooms and shared rooms, while entire homes/apartments have the lowest average number of reviews.

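This suggests that hotel rooms tend to accumulate more guest feedback per listing than other accommodation types (Table 3). Measured by reviews per month (Table 4), however, shared rooms rank first (4.5), ahead of hotel rooms (2.7) and private rooms (2.4), while entire homes/apartments again rank last (0.6).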
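Total review counts accumulate over a listing's lifetime, whereas reviews per month is a rate, so the two measures can rank room types differently. The sketch below puts both in one table using pandas named aggregation. The data are invented; only the column names follow the `listings` dataset.

```python
import pandas as pd

# Hypothetical listings; values invented for illustration only.
toy = pd.DataFrame({
    "room_type": ["Hotel room", "Hotel room", "Shared room", "Shared room",
                  "Entire home/apt", "Entire home/apt"],
    "number_of_reviews": [100, 200, 90, 110, 10, 30],
    "reviews_per_month": [2.0, 3.0, 4.0, 5.0, 0.5, 0.7],
})

# Named aggregation: output column = (input column, aggregation).
review_stats = toy.groupby("room_type").agg(
    mean_reviews=("number_of_reviews", "mean"),
    total_reviews=("number_of_reviews", "sum"),
    mean_per_month=("reviews_per_month", "mean"),
).sort_values("mean_reviews", ascending=False)

print(review_stats.round(1))
```

Having both columns side by side makes it easier to see when the lifetime count and the monthly rate disagree.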

Prices by Neighbourhood

Code
my_statistics = ['mean','median', 'count']

neigh_df = listings.groupby("neighbourhood")['price'].agg(
    my_statistics).round(1).sort_values(
    "mean", ascending = False)
neigh_df
Table 5: Prices by Neighbourhood
                                         mean  median  count
neighbourhood
De Pijp - Rivierenbuurt                 213.0   200.0    685
Centrum-Oost                            210.2   195.0    584
Zuid                                    205.6   195.0    368
Centrum-West                            202.8   180.0    789
De Baarsjes - Oud-West                  200.8   184.0    979
Westerpark                              193.7   179.0    429
IJburg - Zeeburgereiland                192.3   170.0    135
Oud-Oost                                190.3   175.0    366
Watergraafsmeer                         185.3   177.5    172
Oud-Noord                               175.6   158.5    268
Bos en Lommer                           175.0   160.0    303
Buitenveldert - Zuidas                  174.6   166.0     65
Geuzenveld - Slotermeer                 173.2   149.0     81
Noord-West                              170.9   150.0    158
Oostelijk Havengebied - Indische Buurt  168.2   150.0    233
Noord-Oost                              163.2   146.0     97
Osdorp                                  152.9   122.5     40
Slotervaart                             152.8   137.0    126
De Aker - Nieuw Sloten                  141.8   100.0     40
Bijlmer-Centrum                         135.1   100.0     40
Bijlmer-Oost                            130.0   130.0     20
Gaasperdam - Driemond                   123.9   100.0     40

In our analysis of Amsterdam Airbnb listings, we've identified several neighborhoods commanding higher average prices, notably De Pijp - Rivierenbuurt, Centrum-Oost, Zuid, and Centrum-West. With average prices ranging from about 203 to 213 euros per night (Table 5), these neighborhoods emerge as premium destinations within the Amsterdam accommodation market. Notably, De Pijp - Rivierenbuurt, historically recognized for its affluent reputation, continues to attract visitors seeking upscale experiences.
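Some neighbourhoods in Table 5 have only a few dozen listings, so their means are less precise than those of the larger areas. One way to account for this is to report the standard error of the mean and to set aside groups below a minimum sample size. A sketch on invented data; the neighbourhood labels `A` and `B` and the threshold `min_count` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: neighbourhood "A" has 6 listings, "B" only 2.
toy = pd.DataFrame({
    "neighbourhood": ["A"] * 6 + ["B"] * 2,
    "price": [180, 200, 210, 220, 190, 200, 100, 160],
})

stats = toy.groupby("neighbourhood")["price"].agg(["mean", "std", "count"])

# Standard error of the mean: sample std divided by sqrt(n).
stats["sem"] = stats["std"] / np.sqrt(stats["count"])

# Keep only neighbourhoods with enough listings (assumed threshold).
min_count = 5
reliable = stats[stats["count"] >= min_count]

print(stats.round(2))
print(reliable.round(2))
```

The threshold is a judgment call; on the real data it would be applied to the grouped `listings` frame in the same way.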

Code
# Set up Figure
fig, axs = plt.subplots(1, 2, figsize=(12,4))

# Plot
neigh_df.sort_values("count")['count'].plot(ax=axs[0], color=my_colors[0], kind='barh')
neigh_df.sort_values("mean")['mean'].plot(ax=axs[1], color=my_colors[1], kind='barh')

# Labels
axs[0].set_title('Frequency')
axs[1].set_title('Mean Price')

axs[0].set_xlabel('')
axs[1].set_xlabel('')

axs[0].set_ylabel('neighbourhood')
axs[1].set_ylabel('')

# Show the Plot
plt.tight_layout()
plt.show()
Figure 4: Count and Mean Price by Neighbourhood
Code
import seaborn as sns
my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C']

# Set up Figure
fig, ax = plt.subplots(figsize=(8,4))

# Filter
some_neighs = listings['neighbourhood'].isin(
    ['Centrum-Oost','Zuid', 'De Pijp - Rivierenbuurt', 'Noord-Oost', 'Bijlmer-Centrum'])

# Violin Plot
sns.violinplot(data=listings[some_neighs], x="price", y="neighbourhood", palette=my_colors)

# Labels
plt.xlabel('Price')
plt.ylabel('')
plt.title('Distribution of Prices by Neighbourhood (subsample)')

# Show the Plot
plt.tight_layout()
plt.show()
Figure 5: Distribution of Prices by Neighbourhood (subsample)
Code
import seaborn as sns
my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C']

calendar['date'] = pd.to_datetime(calendar['date'])

#merge
merged_calendar = calendar.merge(listings[["id",'neighbourhood']], left_on='listing_id', right_on='id', how="left")

# group
price_series = merged_calendar.groupby(["date", "neighbourhood"])["price"].agg(["mean","median"]).reset_index()

#Filter
filter_date = price_series['date'] <= pd.Timestamp('2024-01-01')

# Filter
some_neighs = price_series['neighbourhood'].isin(
    ['Centrum-Oost','Zuid', 'De Pijp - Rivierenbuurt', 'Noord-Oost', 'Bijlmer-Centrum'])


## Set up Figure ##
###################

fig, ax = plt.subplots(figsize=(9,4))

# Line Plot
sns.lineplot(data=price_series[filter_date & some_neighs], x='date', y='mean', hue="neighbourhood", palette=my_colors, ax=ax)
sns.move_legend(ax, "lower center", bbox_to_anchor=(0.5, 1), ncol=3, title=None, frameon=False)


# Labels
plt.xlabel('Date')
plt.ylabel('Mean Price')
plt.title('')

# Show the Plot
plt.tight_layout()
plt.show()
Figure 6: Average Price Trend for Selected Neighbourhoods

After plotting the time series of average prices across the top neighborhoods and other neighborhoods in Amsterdam (Figure 6), we observe a consistent and steady evolution of prices over time, with no significant increasing trend.

This stable pricing pattern suggests that the Amsterdam Airbnb market maintains a relatively balanced and predictable pricing environment across different neighborhoods. Despite fluctuations influenced by seasonal variations or occasional events, there is no clear upward trajectory in prices over the observed time period.
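The absence of a trend is judged visually here. A rough numerical check is to fit a straight line to the daily mean price and inspect the slope. The sketch below does this with `np.polyfit` on a toy series that is flat by construction apart from a weekly oscillation; it is an illustration under these assumptions, not a formal trend test.

```python
import numpy as np
import pandas as pd

# Hypothetical daily mean prices: flat level of 190 plus a weekly oscillation.
n_days = 56  # eight full weeks
dates = pd.date_range("2023-03-01", periods=n_days, freq="D")
mean_price = 190.0 + 10.0 * np.sin(2 * np.pi * np.arange(n_days) / 7)
toy = pd.DataFrame({"date": dates, "mean": mean_price})

# Days elapsed since the first date, used as the regressor.
days = (toy["date"] - toy["date"].min()).dt.days.to_numpy()

# Least-squares line: mean price = slope * days + intercept.
slope, intercept = np.polyfit(days, toy["mean"].to_numpy(), deg=1)

print(f"slope: {slope:.3f} euros per day")
print(f"intercept: {intercept:.1f} euros")
```

A slope close to zero relative to the price level is consistent with a stable series; on the real data the fit could be run per neighbourhood on `price_series`.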

Reviews Statistics

We can use the jointplot function to visualize the relationship between price, number of reviews, and room type:

Code
import seaborn as sns
my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C']

# Set up Figure
#fig, ax = plt.subplots(figsize=(8,5))

# Scatter plot with marginal densities by room type
sns.jointplot(data=listings, y="number_of_reviews", x="price", hue='room_type', palette=my_colors)

# Labels
plt.ylabel('Reviews')
plt.xlabel('Price')
#plt.title('Distribution of Price')

# Show the Plot
plt.show()
Figure 7: Reviews and Price, by Type

Private rooms, often priced lower than entire homes, may attract guests looking for budget-friendly accommodations, which could contribute to their higher review counts relative to entire homes (Table 3).
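The relationship between price and review counts can also be summarized numerically per room type. The sketch below computes a Spearman rank correlation as the Pearson correlation of the ranks, which avoids assuming a linear relationship. The data are invented and the helper `rank_corr` is a hypothetical name.

```python
import pandas as pd

# Hypothetical listings; values invented for illustration only.
toy = pd.DataFrame({
    "room_type": ["Private room"] * 5 + ["Entire home/apt"] * 5,
    "price": [60, 80, 100, 120, 140, 150, 180, 210, 240, 300],
    "number_of_reviews": [300, 250, 200, 120, 90, 40, 35, 20, 25, 5],
})

def rank_corr(group):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return group["price"].rank().corr(group["number_of_reviews"].rank())

# One coefficient per room type.
corr_by_type = pd.Series({
    room_type: rank_corr(group) for room_type, group in toy.groupby("room_type")
})

print(corr_by_type.round(2))
```

A negative coefficient within a room type would indicate that cheaper listings tend to collect more reviews; the toy numbers here are built to show that pattern.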

Reviews per Month

Code
import seaborn as sns
my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C']

# Set up Figure
#fig, ax = plt.subplots(figsize=(8,5))

# Scatter plot with marginal densities by room type
sns.jointplot(data=listings, y="reviews_per_month", x="price", hue='room_type', palette=my_colors)

# Labels
plt.ylabel('Reviews per month')
plt.xlabel('Price')
#plt.title('Distribution of Price')

# Show the Plot
plt.show()
Figure 8: Reviews per month and Price, by Type