AI带你省钱旅游！精准预测民宿房源价格！

阅读原文时间：2023年07月08日阅读：2

作者：韩信子@ShowMeAI

数据分析实战系列：https://www.showmeai.tech/tutorials/40

机器学习实战系列：https://www.showmeai.tech/tutorials/41

本文地址：https://www.showmeai.tech/article-detail/316

声明：版权所有，转载请联系平台与作者并注明出处

收藏ShowMeAI查看更多精彩内容

大家出去旅游最关心的问题之一就是住宿，在国外以 Airbnb 为代表的民宿互联网模式彻底改变了酒店业，很多游客更喜欢预订 Airbnb 而不是酒店，而在国内的美团飞猪等平台，也有大量的民宿入驻。

在现在这个信息透明开放的互联网时代，我们能否收集数据信息，开发一个机器学习模型来预测房源价格，为自己的出行提供更智能化的信息呢？肯定是可以的，下面ShowMeAI以Airbnb在大曼彻斯特地区的房源数据为例（截至 2022 年 3 月），来演示数据分析与挖掘建模的全过程，同样的方法模式可以应用在大家熟悉的国内平台上。

下面的项目业务和 Airbnb民宿数据 来源于 Inside Airbnb，包含有关 Airbnb 对住宅社区影响的数据和宣传。数据源可以在上述链接中获取，大家也可以访问ShowMeAI的百度网盘地址，获取我们为大家存储好的项目数据。

实战数据集下载（百度网盘）：公众号『ShowMeAI研究中心』回复『实战』，或者点击这里获取本文 [22]基于Airbnb数据的民宿房价预测模型『Airbnb民宿数据』

ShowMeAI官方GitHub：https://github.com/ShowMeAI-Hub

业务问题

一般我们需要在开始挖掘和建模之前，深入了解我们的业务场景和数据情况，我们先总结了一些在这个业务场景下我们关心的一些业务问题，我们将通过数据分析挖掘来完成这些业务问题的理解。

哪些地区或城镇的 Airbnb 房源最多？
最受欢迎的房型是什么？
大曼彻斯特地区的 Airbnb 房源价格特点是什么？
房源与房东的分布情况？
大曼彻斯特地区有哪些房型可供选择？
机器学习模型预测该地区 Airbnb 房源价格的思路是什么样的？
在预测大曼彻斯特地区 Airbnb 房源的价格时，哪些特征更重要？

数据读取与初探

我们先导入本次需要使用到的分析挖掘与建模工具库

import numpy as np
import pandas as pd
from tqdm.notebook import tqdm, trange
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.inspection import permutation_importance

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

接下来我们读取大曼彻斯特地区的房源数据

gm_listings = pd.read_csv('gm_listings-2.csv')
gm_calendar = pd.read_csv('calendar-2.csv')
gm_reviews = pd.read_csv('reviews-2.csv')

查看数据的基础信息如下

gm_listings.head()

gm_listings.shape
# (3584, 74)
gm_listings.columns

gm_calendar.head()

gm_reviews.head()

我们对数据的初览可以看到，大曼彻斯特地区的房源数据集包含 3584 行和 78 列，包含有关房东、房源类型、区域和评级的信息。

数据清洗

数据清洗是机器学习建模应用的【特征工程】阶段的核心步骤，它涉及的方法技能欢迎大家查阅ShowMeAI对应的教程文章，快学快用。

机器学习实战 | 机器学习特征工程最全解读

因为数据中的字段众多，有些字段比较乱，我们需要做一些数据清洗的工作，数据包含一些带有URL的列，对最后的预测作用不大，我们把它们清洗掉。

# 删除url字段
def drop_function(df):
    df = df.drop(columns=['listing_url', 'description', 'host_thumbnail_url', 'host_picture_url', 'latitude', 'longitude', 'picture_url', 'host_url', 'host_location', 'neighbourhood', 'neighbourhood_cleansed', 'host_about', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'calendar_last_scraped'])

    return df

gm_df = drop_function(gm_listings)

删除过后的数据如下，干净很多

数据中也包含了一些缺失值，我们对它们进行分析处理：

# 查看缺失值百分比
(gm_df.isnull().sum()/gm_df.shape[0])* 100

得到如下结果

id                                                0.000000
scrape_id                                         0.000000
last_scraped                                      0.000000
name                                              0.000000
neighborhood_overview                            41.266741
host_id                                           0.000000
host_name                                         0.000000
host_since                                        0.000000
host_response_time                               10.212054
host_response_rate                               10.212054
host_acceptance_rate                              5.636161
host_is_superhost                                 0.000000
host_neighbourhood                               91.657366
host_listings_count                               0.000000
host_total_listings_count                         0.000000
host_verifications                                0.000000
host_has_profile_pic                              0.000000
host_identity_verified                            0.000000
neighbourhood_group_cleansed                      0.000000
property_type                                     0.000000
room_type                                         0.000000
accommodates                                      0.000000
bathrooms                                       100.000000
bathrooms_text                                    0.306920
bedrooms                                          4.687500
beds                                              2.120536
amenities                                         0.000000
price                                             0.000000
minimum_nights                                    0.000000
maximum_nights                                    0.000000
minimum_minimum_nights                            0.000000
maximum_minimum_nights                            0.000000
minimum_maximum_nights                            0.000000
maximum_maximum_nights                            0.000000
minimum_nights_avg_ntm                            0.000000
maximum_nights_avg_ntm                            0.000000
calendar_updated                                100.000000
number_of_reviews                                 0.000000
number_of_reviews_ltm                             0.000000
number_of_reviews_l30d                            0.000000
first_review                                     19.810268
last_review                                      19.810268
review_scores_rating                             19.810268
review_scores_accuracy                           20.089286
review_scores_cleanliness                        20.089286
review_scores_checkin                            20.089286
review_scores_communication                      20.089286
review_scores_location                           20.089286
review_scores_value                              20.089286
license                                         100.000000
instant_bookable                                  0.000000
calculated_host_listings_count                    0.000000
calculated_host_listings_count_entire_homes       0.000000
calculated_host_listings_count_private_rooms      0.000000
calculated_host_listings_count_shared_rooms       0.000000
reviews_per_month                                19.810268
dtype: float64

我们分几种不同的比例情况对缺失值进行处理：

高缺失比例的字段，如license、calendar_updated、bathrooms、host_neighborhood等包含90%以上的NaN值，包括neighborhood overview是41%的NaN，并且包含文本数据。我们会直接剔除这些字段。
数值型字段，缺失不多的情况下，我们用字段平均值进行填充。这保证了这些值的分布被保留下来。这些列包括bedrooms、beds、review_scores_rating、review_scores_accuracy和其他打分字段。
类别型字段，像bathrooms_text和host_response_time，我们用众数进行填充。

剔除高缺失比例字段

def drop_function_2(df):
df = df.drop(columns=['license', 'calendar_updated', 'bathrooms', 'host_neighbourhood', 'neighborhood_overview'])
```
return df
```
gm_df = drop_function_2(gm_df)

均值填充

def input_mean(df, column_list):
for columns in column_list:
df[columns].fillna(value = df[columns].mean(), inplace=True)
```
return df
```
column_list = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
'review_scores_value', 'reviews_per_month',
'bedrooms', 'beds']
gm_df = input_mean(gm_df, column_list)

众数填充

def input_mode(df, column_list):
for columns in column_list:
df[columns].fillna(value = df[columns].mode()[0], inplace=True)
```
return df
```
column_list = ['first_review', 'last_review', 'bathrooms_text', 'host_acceptance_rate',
'host_response_rate', 'host_response_time']

gm_df = input_mode(gm_df, column_list)

host_is_superhost 和 has_availability 等列对应的字符串含义为 true 或 false，我们对其编码替换为0或1。

gm_df = gm_df.replace({'host_is_superhost': 't', 'host_has_profile_pic': 't', 'host_identity_verified': 't', 'has_availability': 't', 'instant_bookable': 't'}, 1)

gm_df = gm_df.replace({'host_is_superhost': 'f', 'host_has_profile_pic': 'f', 'host_identity_verified': 'f', 'has_availability': 'f', 'instant_bookable': 'f'}, 0)

我们查看下替换后的数据分布

gm_df['host_is_superhost'].value_counts()

价格相关的字段，目前还是字符串类型，包含“$”等符号，我们对其处理并转换为数值型。

def string_to_int(df, column):
    # 字符串替换清理
    df[column] = df[column].str.replace("$", "")
    df[column] = df[column].str.replace(",", "")

    # 转为数值型
    df[column] = pd.to_numeric(df[column]).astype(int)

    return df

gm_df = string_to_int(gm_df, 'price')

像host_verifications和amenities这样的字段，取值为列表格式，我们对其进行编码处理（用哑变量替换）。

# 查看列表型取值字段
gm_df_copy = gm_df.copy()
gm_df_copy['amenities'].head()

gm_df_copy['host_verifications'].head()

# 哑变量编码
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace('"', '')
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace(']', "")
gm_df_copy['amenities'] = gm_df_copy['amenities'].str.replace('[', "")

df_amenities = gm_df_copy['amenities'].str.get_dummies(sep = ",")

gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace("'", "")
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace(']', "")
gm_df_copy['host_verifications'] = gm_df_copy['host_verifications'].str.replace('[', "")

df_host_ver = gm_df_copy['host_verifications'].str.get_dummies(sep = ",")

编码后的结果如下所示

df_amenities.head()
df_host_ver.head()

# 删除原始字段
gm_df = gm_df.drop(['host_verifications', 'amenities'], axis=1)

数据探索

下一步我们要进行更全面一些的探索性数据分析。

EDA数据分析部分涉及的工具库，大家可以参考ShowMeAI制作的工具库速查表和教程进行学习和快速使用。

数据科学工具库速查表 | Pandas 速查表

图解数据分析：从入门到精通系列教程

gm_df['neighbourhood_group_cleansed'].value_counts()

bar_data = gm_df['neighbourhood_group_cleansed'].value_counts().sort_values()

# 从bar_data构建新的dataframe
bar_data = pd.DataFrame(bar_data).reset_index()
bar_data['size'] = bar_data['neighbourhood_group_cleansed']/gm_df['neighbourhood_group_cleansed'].count()

# 排序
bar_data.sort_values(by='size', ascending=False)
bar_data = bar_data.rename(columns={'index' : 'Towns', 'neighbourhood_group_cleansed' : 'number_of_listings',
                        'size':'fraction_of_total'})

#绘图展示
#plt.figure(figsize=(10,10));
bar_data.plot(kind='barh', x ='Towns', y='fraction_of_total', figsize=(8,6))
plt.title('Towns with the Most listings');
plt.xlabel('Fraction of Total Listings');

曼彻斯特镇拥有大曼彻斯特地区的大部分房源，占总房源的 53% (1849)，其次是索尔福德，占总房源的 17% ；特拉福德，占总房源的 9%。

gm_df['price'].mean(), gm_df['price'].min(), gm_df['price'].max(),gm_df['price'].median()
# (143.47600446428572, 8, 7372, 79.0)

Airbnb 房源的均价为 143 美元，中位价为 79 美元，数据集中观察到的最高价格为 7372 美元。

# 划分价格档位区间
labels = ['$0 - $100', '$100 - $200', '$200 - $300', '$300 - $400', '$400 - $500', '$500 - $1000', '$1000 - $8000']
price_cuts = pd.cut(gm_df['price'], bins = [0, 100, 200, 300, 400, 500, 1000, 8000], right=True, labels= labels)

# 从价格档构建dataframe
price_clusters = pd.DataFrame(price_cuts).rename(columns={'price': 'price_clusters'})

# 拼接原始dataframe
gm_df = pd.concat([gm_df, price_clusters], axis=1)

# 分布绘图
def price_cluster_plot(df, column, title):
    plt.figure(figsize=(8,6));
    yx = sb.histplot(data = df[column]);

    total = float(df[column].count())
    for p in yx.patches:
        width = p.get_width()
        height = p.get_height()
        yx.text(p.get_x() + p.get_width()/2.,height+5, '{:1.1f}%'.format((height/total)*100), ha='center')
    yx.set_title(title);
    plt.xticks(rotation=90)

    return yx

price_cluster_plot(gm_df, column='price_clusters',
                   title='Price distribution of Airbnb Listings in the Greater Manchester Area');

从上面的分析和可视化结果可以看出，65.4% 的总房源价格在 0-100 美元之间，而价格在 100-200 美元的房源占总房源的 23.4%。不过我们也观察到数据分布有很明显的长尾特性，也可以把特别高价的部分视作异常值，它们可能会对我们的分析有一些影响。

# 基于评论量统计排序
ax = gm_df.groupby('property_type').agg(
    median_rating=('review_scores_rating', 'median'),number_of_reviews=('number_of_reviews', 'max')).sort_values(
by='number_of_reviews', ascending=False).reset_index()

ax.head()

在评论最多的前 10 种房产类型中， Entire rental unit 评论数量最多，其次是Private room in rental unit。

# 可视化
bx = ax.loc[:10]
bx =sb.boxplot(data =bx, x='median_rating', y='property_type')
bx.set_xlim(4.5, 5)
plt.title('Most Enjoyed Property types');
plt.xlabel('Median Rating');
plt.ylabel('Property Type')

# 持有房源最多的房东
host_df = pd.DataFrame(gm_df['host_name'].value_counts()/gm_df['host_name'].count() *100).reset_index()
host_df = host_df.rename(columns={'index':'name', 'host_name':'perc_count'})
host_df.head(10)

host_df['perc_count'].loc[:10].sum()

从上述分析可以看出，房源最多的前 10 名房东占房源总数的 13.6%。

gm_df['room_type'].value_counts()

# 分布绘图
zx = sb.countplot(data=gm_df, x='room_type')

total = float(gm_df['room_type'].count())
for p in zx.patches:
    width = p.get_width()
    height = p.get_height()
    zx.text(p.get_x() + p.get_width()/2.,height+5, '{:1.1f}%'.format((height/total)*100), ha='center')
    zx.set_title('Plot showing different type of rooms available');
    plt.xlabel('Room')

大部分客房是 整栋房屋/公寓 ，占房源总数的 60%，其次是私人客房，占房源总数的 39%，共享房间 和 酒店房间 分别占房源的 0.7% 和 0.5%。

机器学习建模

下面我们使用回归建模方法来对民宿房源价格进行预估。

关于特征工程，欢迎大家查阅ShowMeAI对应的教程文章，快学快用。

机器学习实战 | 机器学习特征工程最全解读

我们首先对原始数据进行特征工程，得到适合建模的数据特征。

# 查看此时的数据集
gm_df.head()

# 回归数据集
gm_regression_df = gm_df.copy()

# 剔除无用字段
gm_regression_df = gm_regression_df.drop(columns=['id', 'scrape_id', 'last_scraped', 'name', 'host_id', 'host_since', 'first_review', 'last_review', 'price_clusters', 'host_name'])

# 再次查看数据
gm_regression_df.head()

我们发现host_response_rate 和 host_acceptance_rate字段带有百分号，我们再做一点数据清洗。

# 去除百分号并转换为数值型
gm_regression_df['host_response_rate'] =  gm_regression_df['host_response_rate'].str.replace("%", "")

gm_regression_df['host_acceptance_rate'] =  gm_regression_df['host_acceptance_rate'].str.replace("%", "")

# convert to int
gm_regression_df['host_response_rate'] = pd.to_numeric(gm_regression_df['host_response_rate']).astype(int)
gm_regression_df['host_acceptance_rate'] =  pd.to_numeric(gm_regression_df['host_acceptance_rate']).astype(int)

# 查看转换后结果
gm_regression_df['host_response_rate'].head()

bathrooms_text 列包含数字和文本数据的组合，我们对其做一些处理

# 查看原始字段
gm_regression_df['bathrooms_text'].value_counts()

# 切分与数据处理
def split_bathroom(df, column, text, new_column):
    df_2 = df[df[column].str.contains(text, case=False)]
    df.loc[df[column].str.contains(text, case=False), new_column] = df_2[column]
    return df

# 应用上述函数
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='shared', new_column='shared_bath')
gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='private', new_column='private_bath')
# 查看shared_bath字段
gm_regression_df['shared_bath'].value_counts()

# 查看private_bath字段
gm_regression_df['private_bath'].value_counts()

gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("private bath", "pb", case=False)
gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("private baths", "pbs", case=False)
gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("shared bath", "sb", case=False)
gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("shared baths", "sb", case=False)
gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("shared half-bath", "sb", case=False)
gm_regression_df['bathrooms_text'] =  gm_regression_df['bathrooms_text'].str.replace("private half-bath", "sb", case=False)

gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='bath', new_column='bathrooms_new')

gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].str.split(" ", expand=True)
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].str.split(" ", expand=True)
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].str.split(" ", expand=True)

# 填充缺失值为0
gm_regression_df = gm_regression_df.fillna(0)

gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].replace(to_replace='Shared', value=0.5)
gm_regression_df['private_bath'] = gm_regression_df['private_bath'].replace(to_replace='Private', value=0.5)
gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].replace(to_replace='Half-bath', value=0.5)

# 转成数值型
gm_regression_df['shared_bath'] = pd.to_numeric(gm_regression_df['shared_bath']).astype(int)
gm_regression_df['private_bath'] = pd.to_numeric(gm_regression_df['private_bath']).astype(int)
gm_regression_df['bathrooms_new'] =  pd.to_numeric(gm_regression_df['bathrooms_new']).astype(int)

# 查看处理后的字段
gm_regression_df[['shared_bath', 'private_bath', 'bathrooms_new']].head()

下面我们对类别型字段进行编码，根据字段含义的不同，我们使用「序号编码」和「独热向量编码」等方法来完成。

# 序号编码
def encoder(df):
    for column in df[['neighbourhood_group_cleansed', 'property_type']].columns:
        labels = df[column].astype('category').cat.categories.tolist()
        replace_map = {column : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}
        df.replace(replace_map, inplace=True)
        print(replace_map)

    return df 

gm_regression_df = encoder(gm_regression_df)

我们对于host_response_time和room_type字段，使用独热向量编码（哑变量变换）

host_dummy = pd.get_dummies(gm_regression_df['host_response_time'], prefix='host_response')
room_dummy = pd.get_dummies(gm_regression_df['room_type'], prefix='room_type')

# 拼接编码后的字段
gm_regression_df = pd.concat([gm_regression_df, host_dummy, room_dummy], axis=1)

# 剔除原始字段
gm_regression_df = gm_regression_df.drop(columns=['host_response_time', 'room_type'], axis=1)

我们再把之前处理过的df_amenities做一点处理，再拼接到数据特征里

df_3 = pd.DataFrame(df_amenities.sum())
features = df_3['amenities'][:150].to_list()
amenities_updated = df_amenities.filter(items=(features))
gm_regression_df = pd.concat([gm_regression_df, amenities_updated], axis=1)

查看一下最终数据的维度

gm_regression_df.shape
# (3584, 198)

我们最后得到了198个字段，为了避免特征之间的多重共线性，使用方差因子法（VIF）来选择机器学习模型的特征。 VIF 大于 10 的特征被删除，因为这些特征的方差可以由数据集中的其他特征表示和解释。

# 计算VIF
vif_model = gm_regression_df.drop(['price'], axis=1)
vif_df = pd.DataFrame()
vif_df['feature'] = vif_model.columns
vif_df['VIF'] = [variance_inflation_factor(vif_model.values, i) for i in range(len(vif_model.columns))]

# 选出小于10的特征
vif_df_new = vif_df[vif_df['VIF']<=10]
feature_list =  vif_df_new['feature'].to_list()

# 选出这些特征对应的数据
model_df = gm_regression_df.filter(items=(feature_list))
model_df.head()

我们拼接上price目标标签字段，可以构建完整的数据集

price_col = gm_regression_df['price']
model_df = model_df.join(price_col)

我们在这里使用几个典型的回归算法，包括线性回归、RandomForestRegression、Lasso Regression 和 GradientBoostingRegression。

关于机器学习算法的应用方法，欢迎大家查阅ShowMeAI对应的教程与文章，快学快用。

机器学习实战：手把手教你玩转机器学习系列

机器学习实战 | SKLearn入门与简单应用案例

机器学习实战 | SKLearn最全应用指南

线性回归建模

def linear_reg(df, test_size=0.3, random_state=42):
    '''
    构建模型并返回评估结果
    输入: 数据dataframe
    输出: 特征重要度与评估准则（RMSE与R-squared）
    '''

    X = df.drop(columns=['price'])
    y = df[['price']]
    X_columns = X.columns

    # 切分训练集与测试集
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state=random_state)

    # 线性回归分类器
    clf = LinearRegression()

    # 候选参数列表
    parameters = {
                  'n_jobs': [1, 2, 5, 10, 100],
                  'fit_intercept': [True, False]

                  }

    # 网格搜索交叉验证调参
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=3, verbose=3)
    cv.fit(X_train,y_train)

    # 测试集预估
    pred = cv.predict(X_test)

    # 模型评估
    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse **.5

    # 最佳参数
    best_par = cv.best_params_
    coefficients = cv.best_estimator_.coef_

    #特征重要度
    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, columns=X_columns).T
    #feature_importance = feature_importance.T
    feature_importance.columns = ['importance']
    feature_importance = feature_importance.sort_values('importance', ascending=False)

    print("The model performance for testing set")
    print("--------------------------------------")
    print('RMSE is {}'.format(rmse))
    print('R2 score is {}'.format(r2))
    print("\n")

    return feature_importance, rmse, r2

 linear_feat_importance, linear_rmse, linear_r2 = linear_reg(model_df)

随机森林建模

# 随机森林建模
def random_forest(df):
    '''
    构建模型并返回评估结果
    输入: 数据dataframe
    输出: 特征重要度与评估准则（RMSE与R-squared）
    '''

    X = df.drop(['price'], axis=1)
    X_columns = X.columns

    y = df['price']

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # 随机森林模型
    clf = RandomForestRegressor()

    # 候选参数
    parameters = {

                'n_estimators': [50, 100, 200, 300, 400],
                'max_depth': [2, 3, 4, 5],
                 'max_depth': [80, 90, 100]

                     }

    # 网格搜索交叉验证调参
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)
    model = cv
    model.fit(X_train, y_train)

    # 测试集预估
    pred = model.predict(X_test)

    # 模型评估
    mse = mean_squared_error(y_test, pred)
    rmse = mse**.5
    r2 = r2_score(y_test, pred)

    # 最佳超参数
    best_par = model.best_params_

    # 特征重要度
    r = permutation_importance(model, X_test, y_test,
                           n_repeats=10,
                           random_state=0)
    perm = pd.DataFrame(columns=['AVG_Importance'], index=[i for i in X_train.columns])
    perm['AVG_Importance'] = r.importances_mean
    perm = perm.sort_values(by='AVG_Importance', ascending=False);

    return rmse, r2, best_par, perm

# 运行建模
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(model_df)

运行结果如下

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV 1/5] END ..................max_depth=80, n_estimators=50; total time=   2.4s
[CV 2/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 4/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 5/5] END ..................max_depth=80, n_estimators=50; total time=   1.9s
[CV 1/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 2/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 3/5] END .................max_depth=80, n_estimators=100; total time=   3.9s
[CV 4/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 5/5] END .................max_depth=80, n_estimators=100; total time=   3.8s
[CV 1/5] END .................max_depth=80, n_estimators=200; total time=   7.5s
[CV 2/5] END .................max_depth=80, n_estimators=200; total time=   7.7s
[CV 3/5] END .................max_depth=80, n_estimators=200; total time=   7.7s
[CV 4/5] END .................max_depth=80, n_estimators=200; total time=   7.6s
[CV 5/5] END .................max_depth=80, n_estimators=200; total time=   7.6s
[CV 1/5] END .................max_depth=80, n_estimators=300; total time=  11.3s
[CV 2/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 3/5] END .................max_depth=80, n_estimators=300; total time=  11.7s
[CV 4/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 5/5] END .................max_depth=80, n_estimators=300; total time=  11.4s
[CV 1/5] END .................max_depth=80, n_estimators=400; total time=  15.1s
[CV 2/5] END .................max_depth=80, n_estimators=400; total time=  16.4s
[CV 3/5] END .................max_depth=80, n_estimators=400; total time=  15.6s
[CV 4/5] END .................max_depth=80, n_estimators=400; total time=  15.2s
[CV 5/5] END .................max_depth=80, n_estimators=400; total time=  15.6s
[CV 1/5] END ..................max_depth=90, n_estimators=50; total time=   1.9s
[CV 2/5] END ..................max_depth=90, n_estimators=50; total time=   1.9s
[CV 3/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 4/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 5/5] END ..................max_depth=90, n_estimators=50; total time=   2.0s
[CV 1/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 2/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 3/5] END .................max_depth=90, n_estimators=100; total time=   4.0s
[CV 4/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 5/5] END .................max_depth=90, n_estimators=100; total time=   3.9s
[CV 1/5] END .................max_depth=90, n_estimators=200; total time=   8.7s
[CV 2/5] END .................max_depth=90, n_estimators=200; total time=   8.1s
[CV 3/5] END .................max_depth=90, n_estimators=200; total time=   8.1s
[CV 4/5] END .................max_depth=90, n_estimators=200; total time=   7.7s
[CV 5/5] END .................max_depth=90, n_estimators=200; total time=   8.0s
[CV 1/5] END .................max_depth=90, n_estimators=300; total time=  11.6s
[CV 2/5] END .................max_depth=90, n_estimators=300; total time=  11.8s
[CV 3/5] END .................max_depth=90, n_estimators=300; total time=  12.2s
[CV 4/5] END .................max_depth=90, n_estimators=300; total time=  12.0s
[CV 5/5] END .................max_depth=90, n_estimators=300; total time=  13.2s
[CV 1/5] END .................max_depth=90, n_estimators=400; total time=  15.6s
[CV 2/5] END .................max_depth=90, n_estimators=400; total time=  15.9s
[CV 3/5] END .................max_depth=90, n_estimators=400; total time=  16.1s
[CV 4/5] END .................max_depth=90, n_estimators=400; total time=  15.7s
[CV 5/5] END .................max_depth=90, n_estimators=400; total time=  15.8s
[CV 1/5] END .................max_depth=100, n_estimators=50; total time=   1.9s
[CV 2/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 3/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 4/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 5/5] END .................max_depth=100, n_estimators=50; total time=   2.0s
[CV 1/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 2/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 3/5] END ................max_depth=100, n_estimators=100; total time=   4.1s
[CV 4/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 5/5] END ................max_depth=100, n_estimators=100; total time=   4.0s
[CV 1/5] END ................max_depth=100, n_estimators=200; total time=   7.8s
[CV 2/5] END ................max_depth=100, n_estimators=200; total time=   7.9s
[CV 3/5] END ................max_depth=100, n_estimators=200; total time=   8.1s
[CV 4/5] END ................max_depth=100, n_estimators=200; total time=   7.9s
[CV 5/5] END ................max_depth=100, n_estimators=200; total time=   7.8s
[CV 1/5] END ................max_depth=100, n_estimators=300; total time=  11.8s
[CV 2/5] END ................max_depth=100, n_estimators=300; total time=  12.0s
[CV 3/5] END ................max_depth=100, n_estimators=300; total time=  12.8s
[CV 4/5] END ................max_depth=100, n_estimators=300; total time=  11.4s
[CV 5/5] END ................max_depth=100, n_estimators=300; total time=  11.5s
[CV 1/5] END ................max_depth=100, n_estimators=400; total time=  15.1s
[CV 2/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 3/5] END ................max_depth=100, n_estimators=400; total time=  15.6s
[CV 4/5] END ................max_depth=100, n_estimators=400; total time=  15.3s
[CV 5/5] END ................max_depth=100, n_estimators=400; total time=  15.3s

随机森林最后的结果如下

r_forest_rmse, r_forest_r2
# (218.7941962807868, 0.4208644494689676)

GBDT建模

def GBDT_model(df):
    '''
    构建模型并返回评估结果
    输入: 数据dataframe
    输出: 特征重要度与评估准则（RMSE与R-squared）
    '''

    X = df.drop(['price'], axis=1)
    Y = df['price']
    X_columns = X.columns

    X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)

    clf = GradientBoostingRegressor()

    parameters = {

                'learning_rate': [0.1, 0.5, 1],
                'min_samples_leaf': [10, 20, 40 , 60]

                     }
    cv = GridSearchCV(estimator=clf, param_grid=parameters, cv=5, verbose=3)

    model = cv
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    r2 = r2_score(y_test, pred)
    mse = mean_squared_error(y_test, pred)
    rmse = mse**.5

    coefficients = model.best_estimator_.feature_importances_

    importance = np.abs(coefficients)
    feature_importance = pd.DataFrame(importance, index= X_columns,
                                      columns=['importance']).sort_values('importance', ascending=False)[:10]

    return r2, mse, rmse, feature_importance

GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDT_model(model_df)
GBDT_r2, GBDT_rmse
# (0.46352992147034244, 210.58063809645563)

目前随机森林的表现最稳定，而集成模型GradientBoostingRegression 的R²很高，RMSE 值也偏高，Boosting的模型受异常值影响很大，这可能是因为数据集中的异常值引起的。

下面我们来做一下优化，删除数据集中的异常值，看看是否可以提高模型性能。

异常值在早些时候就已经被识别出来了，我们基于统计的方法来对其进行处理。

# 基于统计方法计算价格边界
q3, q1 = np.percentile(model_df['price'], [75, 25])
iqr = q3 - q1
q3 + (iqr*1.5)

# 得到结果245.0

我们把任何高于 245 美元的值都视为异常值并删除。

new_model_df = model_df[model_df['price']<245]

# 绘制此时的价格分布
sb.histplot(new_model_df['price'])
plt.title('New price distribution in the dataset')

重新运行这些算法

linear_feat_importance, linear_rmse, linear_r2 = linear_reg(new_model_df)
r_forest_rmse, r_forest_r2, r_fores_best_params, r_forest_importance = random_forest(new_model_df)
GBDT_r2, GBDT_mse, GBDT_rmse, GBDT_feature_importance = GBDTboost(new_model_df)

得到的新结果如下

归因分析

那么，基于我们的模型来分析，在预测大曼彻斯特地区 Airbnb 房源的价格时，哪些因素更重要？

r_feature_importance = r_forest_importance.reset_index()
r_feature_importance = r_feature_importance.rename(columns={'index':'Feature'})
r_feature_importance[:15]

# 绘制最重要的15个因素
r_feature_importance[:15].sort_values(by='AVG_Importance').plot(kind='barh', x='Feature', y='AVG_Importance', figsize=(8,6));
plt.title('Top 15 Most Imporatant Features');