Pre-Processing Data

Author

Daniel Redel

Published

March 11, 2024

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore')

my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C', '#28AFB0']
file = "D:/Career/Data Science/Portfolios/Inside AirBnB - Netherlands/Amsterdam/"

listings = pd.read_csv(file + "listings.csv")
listings_detailed = pd.read_csv(file + "listings_detailed.csv")
calendar = pd.read_csv(file + 'calendar.csv')
reviews = pd.read_csv(file + 'reviews.csv')
reviews_detailed = pd.read_csv(file + 'reviews_detailed.csv')
neighbourhoods = pd.read_csv(file + 'neighbourhoods.csv')

1. Missing Values

We will start our project by dealing with potential missing values.

`Listings` dataset

Let’s inspect the listings dataset first:

listings.shape

(6998, 18)

We have that the original dataset contains 6,998 observations and 18 variables. Let’s check know the potential missing values:

Code

listings.isna().sum() / len(listings) * 100

id                                  0.000000
name                                0.000000
host_id                             0.000000
host_name                           0.000000
neighbourhood_group               100.000000
neighbourhood                       0.000000
latitude                            0.000000
longitude                           0.000000
room_type                           0.000000
price                               0.000000
minimum_nights                      0.000000
number_of_reviews                   0.000000
last_review                         9.745642
reviews_per_month                   9.745642
calculated_host_listings_count      0.000000
availability_365                    0.000000
number_of_reviews_ltm               0.000000
license                             0.342955
dtype: float64

Since neighbourhood_group has only NA’s and license does not seem to be a relevant variable, we will drop them:

drop_var = ['neighbourhood_group', 'license']
listings = listings.drop(columns = drop_var)
listings.shape

(6998, 16)

What about last_review and reviews_per_month? No more than 9.7% of them are missing values. Seems reasonable to just drop those, instead of applying imputation methods:

na_vars = ['last_review', 'reviews_per_month']
listings.dropna(subset=na_vars, axis=0, inplace=True)
print('Observations:', len(listings) )

Observations: 6316

This lead us to 6,316 observations.

`calendar` dataset

We repeat the process with the calendar dataset:

Code

print( calendar.shape )
print ( calendar.isna().sum() / len(calendar) * 100 )

(2554278, 7)
listing_id        0.000000
date              0.000000
available         0.000000
price             0.029715
adjusted_price    0.029715
minimum_nights    0.000078
maximum_nights    0.000078
dtype: float64

Since the proportion of missing values is low, let’s just drop those observations:

calendar.dropna(axis=0, inplace=True)
print('Observations:', len(calendar) )

Observations: 2553517

`reviews` dataset

Same process with the reviews dataset:

Code

print( reviews_detailed.shape )
print( reviews_detailed.isna().sum() / len(reviews_detailed) * 100 )

(339805, 6)
listing_id       0.000000
id               0.000000
date             0.000000
reviewer_id      0.000000
reviewer_name    0.000000
comments         0.004414
dtype: float64

Code

reviews_detailed.dropna(axis=0, inplace=True)
print('Observations:', len(reviews_detailed) )

Observations: 339790

2. Outliers

Now we can check and remove potential outliers in our dataset. Let’s start by checking our price variable, which will be our main outcome of interest.

`listings` - Price

Plotting an histogram on price already tell of something about having the existence of outliers:

Code

import seaborn as sns
my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C']

# Set up Figure
#fig, ax = plt.subplots(figsize=(8,4))

# Hist + KDE
sns.displot(data=listings, x="price", kde=True, color=my_colors[0], aspect=8/5)

# Labels
plt.xlabel('Price')
plt.ylabel('')
plt.title('Distribution of Prices')

# Show the Plot
plt.show()

Figure 1 reveals a pronounced tail towards the higher end, indicating the presence of outliers—entries with significantly higher prices compared to the majority of listings. These outliers can skew statistical analysis and machine learning models, potentially leading to inaccurate predictions. Employing the Interquartile Range (IQR) approach is a robust method for detecting and managing outliers.

By calculating the IQR, which represents the range between the 25th and 75th percentiles of the data distribution, we can identify values that fall beyond a certain multiple of the IQR from the quartiles. Let’s identify the outliers by calculating the IQR:

# calculate IQR for column Height
Q1 = listings['price'].quantile(0.25)
Q3 = listings['price'].quantile(0.75)
IQR = Q3 - Q1

# identify outliers
thres = 1.5
outliers = listings[(listings['price'] < Q1 - thres * IQR) | (listings['price'] > Q3 + thres * IQR)]

# drop rows containing outliers
listings = listings.drop(outliers.index)

With this we can plot again our histogram without the outliers:

Code

import seaborn as sns
my_colors =['#28AFB0', '#F46036', '#F1E3D3', '#2D1E2F', '#26547C']

# Set up Figure
#fig, ax = plt.subplots(figsize=(8,4))

# Hist + KDE
sns.displot(data=listings, x="price", kde=True, color=my_colors[0], aspect=8/5)

# Labels
plt.xlabel('Price')
plt.ylabel('')
plt.title('Distribution of Prices')

# Show the Plot
plt.show()
print(listings.shape)

Figure 2: Distribution of Price (No Outliers)

(6018, 16)

The histogram in Figure 2 now exhibits a more symmetrical and bell-shaped distribution. This indicates a more normalized spread of prices, suggesting that extreme outliers have been successfully mitigated.

Notably, with outliers removed, the average price per night for Airbnb listings in Amsterdam is approximately 190 euros.

Data Export

We’ll export this finalized dataset. Going forward, all our analyses will rely on this processed dataset, which has undergone outlier removal using the IQR method.

listings.to_csv('listings_processed.csv', index=False)
calendar.to_csv('calendar_processed.csv', index=False)
# reviews_detailed.to_csv('reviews_processed.csv', index=False)

1. Missing Values

Listings dataset

calendar dataset

reviews dataset

2. Outliers

listings - Price

Data Export

`Listings` dataset

`calendar` dataset

`reviews` dataset

`listings` - Price