Random Forest Classification and Regression: Principles and Hands-On Tuning

Original article retrieved: April 21, 2021

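Ensemble algorithms

An ensemble method builds several models on the same dataset and combines their results into one
output, aiming for better classification or regression performance than any single model achieves.

Bagging:
    Builds multiple mutually independent base estimators, then averages their predictions or takes a
    majority vote to decide the ensemble's result. The representative bagging model is the random forest.
Boosting:
    The base estimators are related and are built one after another. The core idea is to combine the
    strength of weak estimators by repeatedly focusing on the samples that are hard to predict, so that
    together they form a strong estimator. Representative boosting models are AdaBoost and gradient
    boosted decision trees (GBDT).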

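Bagging vs. boosting

| | Bagging | Boosting |
|---|---|---|
| Estimators | Independent of each other, can be built at the same time | Related to each other and built in sequence; later models put more weight on the samples that earlier models predicted wrongly |
| Sampling | Random sampling with replacement | Sampling with replacement, but each round gives more weight to the samples that were predicted wrongly |
| How the ensemble result is decided | Averaging or majority vote | Weighted average; models that perform better on the training set get more weight |
| Goal | Reduce variance and improve overall stability | Reduce bias and improve overall accuracy |
| When a single estimator overfits | Can mitigate overfitting to some extent | May make overfitting worse |
| When a single estimator is weak | Not very helpful | May improve model performance |
| Representative algorithms | Random forest | GBDT and AdaBoost |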
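The difference in how the two families combine their estimators can be illustrated with a minimal sketch. The predictions and weights below are hypothetical, and the helper names `majority_vote` and `weighted_vote` are made up for illustration; real boosting algorithms derive the weights during training.

```python
from collections import Counter

def majority_vote(predictions):
    """Bagging-style aggregation: every estimator's vote counts equally."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Boosting-style aggregation: each vote is scaled by its estimator's weight."""
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# Hypothetical predictions from five estimators for one sample
preds = ["A", "B", "B", "A", "A"]
# Hypothetical estimator weights (for example, derived from training performance)
weights = [0.1, 0.9, 0.8, 0.2, 0.1]

bagging_result = majority_vote(preds)            # three votes for A, two for B
boosting_result = weighted_vote(preds, weights)  # A totals 0.4, B totals 1.7
print(bagging_result, boosting_result)
```

With equal votes the majority label wins, while the weighted vote lets two strong estimators overrule three weak ones.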

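Random forest parameters

Parameters that control the base estimators:
    Similar to those of a decision tree.
n_estimators:
    The number of trees. Larger values usually give a better model, but every model has an upper limit:
    once n_estimators is large enough, the accuracy of the random forest typically stops rising or
    starts to fluctuate.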


%matplotlib inline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine


wine = load_wine()

Decision tree vs. random forest

from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data,wine.target,test_size=0.3)

clf = DecisionTreeClassifier(random_state=0)
rfc = RandomForestClassifier(random_state=0)
clf = clf.fit(Xtrain,Ytrain)
rfc = rfc.fit(Xtrain,Ytrain)
score_c = clf.score(Xtest,Ytest)
score_r = rfc.score(Xtest,Ytest)

print("Single Tree:{}".format(score_c)
      ,"Random Forest:{}".format(score_r)
     )


Single Tree:0.9444444444444444 Random Forest:1.0

# Under cross-validation

from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

rfc = RandomForestClassifier(n_estimators=25)
rfc_s = cross_val_score(rfc,wine.data,wine.target,cv=10)

clf = DecisionTreeClassifier()
clf_s = cross_val_score(clf,wine.data,wine.target,cv=10)

plt.plot(range(1,11),rfc_s,label = "RandomForest")
plt.plot(range(1,11),clf_s,label = "Decision Tree")
plt.legend()
plt.show()


# Repeated cross-validation

rfc_l = []
clf_l = []

for i in range(10):
    rfc = RandomForestClassifier(n_estimators=25)
    rfc_s = cross_val_score(rfc,wine.data,wine.target,cv=10).mean()
    rfc_l.append(rfc_s)
    clf = DecisionTreeClassifier()
    clf_s = cross_val_score(clf,wine.data,wine.target,cv=10).mean()
    clf_l.append(clf_s)

plt.plot(range(1,11),rfc_l,label = "Random Forest")
plt.plot(range(1,11),clf_l,label = "Decision Tree")
plt.legend()
plt.show()

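# Notice that the single decision tree's curve fluctuates in step with the random forest's.
# This supports the idea that the more accurate the individual trees are, the more accurate the forest will be.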


Learning curve for n_estimators

#####【TIME WARNING: 2mins 30 seconds】#####
superpa = []
for i in range(200):
    rfc = RandomForestClassifier(n_estimators=i+1,n_jobs=-1)
    rfc_s = cross_val_score(rfc,wine.data,wine.target,cv=10).mean()
    superpa.append(rfc_s)
print(max(superpa),superpa.index(max(superpa))+1)  # best accuracy, and the n_estimators value that achieved it (index + 1)
plt.figure(figsize=[20,5])
plt.plot(range(1,201),superpa)
plt.show()


0.9888888888888889 32


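Why is a random forest so accurate?

A random forest is essentially a bagging ensemble: it averages the predictions of its base estimators,
or takes a majority vote, to decide the ensemble's result.

In the wine example above we built 25 trees. Under majority voting, the forest misclassifies a sample
only when 13 or more trees get it wrong. A single decision tree's accuracy on the wine dataset
fluctuates around 0.85. If we assume each tree errs with probability 0.2, and that the trees err
independently, the probability that 13 or more trees are wrong is: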


import numpy as np
from scipy.special import comb

np.array([comb(25,i)*(0.2**i)*((1-0.2)**(25-i)) for i in range(13,26)]).sum()



0.00036904803455582827
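Written out, the sum evaluated above is a binomial tail probability. The symbol $\varepsilon$ for the single-tree error rate is notation introduced here, and the formula assumes the 25 trees err independently:

```latex
P(\text{forest is wrong})
  = \sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}\,(1-\varepsilon)^{25-i},
  \qquad \varepsilon = 0.2
```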



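This is far smaller than the error rate of a single tree. Note, however, that if every tree in the
forest were identical, one tree's mistake would be the whole forest's mistake, and the forest would be
no better than a single decision tree.

So the base classifiers of a random forest should be independent of each other and different from one another.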
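A small simulation can make this contrast concrete. It is only a sketch under stated assumptions: `forest_error` is a hypothetical helper, each tree errs with probability 0.2, and the trees are either fully independent or exact copies of each other.

```python
import random

def forest_error(n_trees, tree_error, identical, n_trials=20000, seed=0):
    """Estimate the majority-vote error rate by simulation (illustrative helper)."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n_trials):
        if identical:
            # every tree makes the same mistake at the same time
            n_wrong = n_trees if rng.random() < tree_error else 0
        else:
            # each tree errs independently of the others
            n_wrong = sum(rng.random() < tree_error for _ in range(n_trees))
        if n_wrong > n_trees / 2:
            wrong += 1
    return wrong / n_trials

independent = forest_error(25, 0.2, identical=False)
identical = forest_error(25, 0.2, identical=True)
print(independent, identical)
```

With independent trees the simulated forest error stays close to zero, while with identical trees it stays near the single-tree error of 0.2.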

How to generate different trees: two sources of randomness

Parameter: random_state, which controls how the forest is generated

rfc = RandomForestClassifier(n_estimators=25,random_state=2)  # 25 trees, matching the 25 seeds printed below
rfc = rfc.fit(Xtrain, Ytrain)


# Important attribute: estimators_, the list holding every fitted tree
# Print the random_state of each tree
for i in range(len(rfc.estimators_)):
    print(rfc.estimators_[i].random_state)


2056096140
1078729000
602442957
1514174439
1458549053
904046564
1214918618
655921571
139775012
293793817
864952371
2116213231
963777025
861270369
1156416813
2032972974
583060530
1909517413
1341241096
2058549495
533227000
1803803348
2056406370
1190856758
132094869


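With random_state fixed, the forest always generates the same set of trees, yet the trees still differ
from one another. This randomness comes from "choosing features at random for each split", and it can
be shown that the stronger this randomness is, the better bagging generally performs.

This approach has clear limits, though: when we need thousands of trees, the data may not offer
thousands of features from which to build that many sufficiently different trees.

Parameters: bootstrap & oob_score, random sampling with replacement

An easy way to make the base classifiers differ is to train each one on a different training set, and
bagging forms these training sets by random sampling with replacement.

bootstrap defaults to True, so sampling with replacement is used by default.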

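Random sampling with replacement:
    From an original training set of n samples, draw one sample at random and put it back before
    drawing the next, so the same sample may be drawn again. Repeating this n times yields a
    bootstrap set of n samples, the same size as the original training set.

Because sampling is with replacement, some samples appear several times in a bootstrap set while
others are left out. The probability that a given sample is drawn into a particular bootstrap set at least once is: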

$1-(1-\frac{1}{n})^n$

$\displaystyle\lim_{n\to\infty}\left(1-\left(1-\frac{1}{n}\right)^n\right)=1-\frac{1}{e}\approx 0.632$

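So on average a bootstrap set contains about 63% of the original samples. The remaining 37% or so take
no part in building that tree and are called out-of-bag (OOB) data. Besides the test set we split off
at the start, these samples can also serve as a test set for the ensemble.

In other words, with a random forest we can skip the train/test split and evaluate the model on the
out-of-bag data alone.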
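The 63% figure can be checked numerically. The sketch below compares the closed-form probability with one simulated bootstrap draw; the helper names are hypothetical, and 178 is the number of samples in the wine dataset.

```python
import math
import random

def in_bag_probability(n):
    """Probability that one sample appears at least once in a bootstrap set of size n."""
    return 1 - (1 - 1 / n) ** n

def simulated_in_bag_fraction(n, seed=0):
    """Draw one bootstrap set and return the fraction of distinct samples it contains."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

limit = 1 - 1 / math.e                   # the limit as n goes to infinity
p_178 = in_bag_probability(178)          # wine dataset size
frac = simulated_in_bag_fraction(10000)  # one simulated bootstrap set
print(round(limit, 4), round(p_178, 4), round(frac, 4))
```

Even for a dataset as small as wine, the in-bag probability is already very close to the limit.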


rfc = RandomForestClassifier(n_estimators=25,oob_score=True)  # oob_score defaults to False
rfc = rfc.fit(wine.data,wine.target)


# Important attribute: oob_score_, the model's score on the out-of-bag data
rfc.oob_score_


0.9719101123595506



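This is not guaranteed, though: when n and n_estimators are both small, it is quite possible that
little or no data falls out of bag, and then the OOB data cannot be used to test the model.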
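To get a feel for how the number of trees matters here, the sketch below estimates the chance that a single sample is in-bag for every tree and therefore never receives an OOB prediction. It assumes each tree draws its bootstrap set independently; `never_oob_probability` is a hypothetical helper.

```python
def never_oob_probability(n_samples, n_trees):
    """Chance that one sample is in-bag for every tree, so it has no OOB prediction."""
    in_bag = 1 - (1 - 1 / n_samples) ** n_samples
    return in_bag ** n_trees

for n_trees in (1, 5, 25, 100):
    print(n_trees, never_oob_probability(178, n_trees))

p_small = never_oob_probability(178, 5)   # few trees: a sizeable share of samples has no OOB vote
p_large = never_oob_probability(178, 25)  # 25 trees: almost every sample is out of bag somewhere
```

With only a handful of trees, a noticeable fraction of samples is never out of bag, which is why the OOB estimate is unreliable for very small forests.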

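Five important interfaces

fit:
    Trains the model; takes the training features and labels.

score:
    Returns classification accuracy, on either the test set or the training set. The metric cannot
    be changed; to use a different metric, use cross-validation instead of score.

apply:
    Returns the index of the leaf each sample lands in, for every tree; works on test or training data.

predict:
    Returns the predicted class of each sample; works on test or training data.

predict_proba:
    Returns the probability of each sample belonging to each class; works on test or training data.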


rfc = RandomForestClassifier(n_estimators=25,random_state=20)
rfc = rfc.fit(Xtrain, Ytrain)
rfc.score(Xtest,Ytest)


rfc.apply(Xtest)


array([[10, 16, 16, ...,  8,  4,  6],
       [26, 18, 20, ..., 20, 14, 20],
       [10, 16, 16, ...,  8,  4,  2],
       ...,
       [26, 22, 20, ..., 20, 14, 15],
       [15,  4,  4, ..., 10, 11, 14],
       [19, 16, 16, ..., 19,  7,  2]], dtype=int64)


rfc.predict(Xtest)


array([2, 0, 2, 1, 0, 1, 0, 1, 0, 0, 2, 0, 1, 1, 0, 1, 0, 0, 1, 1, 2, 1,
       0, 0, 0, 1, 0, 1, 2, 0, 0, 2, 0, 0, 1, 1, 2, 0, 1, 2, 2, 1, 2, 2,
       2, 1, 0, 1, 1, 0, 0, 0, 1, 2])


rfc.predict_proba(Xtest)


array([[0.  , 0.  , 1.  ],
       [1.  , 0.  , 0.  ],
       [0.  , 0.  , 1.  ],
       [0.  , 1.  , 0.  ],
       [0.84, 0.16, 0.  ],
       [0.  , 0.96, 0.04],
       [0.88, 0.12, 0.  ],
       [0.12, 0.88, 0.  ],
       [0.96, 0.04, 0.  ],
       [1.  , 0.  , 0.  ],
       [0.04, 0.08, 0.88],
       [1.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [0.96, 0.04, 0.  ],
       [0.12, 0.76, 0.12],
       [0.96, 0.04, 0.  ],
       [0.96, 0.04, 0.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 0.  , 1.  ],
       [0.  , 1.  , 0.  ],
       [0.96, 0.04, 0.  ],
       [1.  , 0.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [0.  , 0.88, 0.12],
       [0.  , 0.04, 0.96],
       [1.  , 0.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [0.  , 0.  , 1.  ],
       [1.  , 0.  , 0.  ],
       [0.88, 0.12, 0.  ],
       [0.  , 1.  , 0.  ],
       [0.08, 0.84, 0.08],
       [0.  , 0.04, 0.96],
       [0.96, 0.04, 0.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 0.08, 0.92],
       [0.  , 0.  , 1.  ],
       [0.  , 1.  , 0.  ],
       [0.04, 0.  , 0.96],
       [0.  , 0.08, 0.92],
       [0.  , 0.08, 0.92],
       [0.08, 0.92, 0.  ],
       [0.8 , 0.2 , 0.  ],
       [0.08, 0.92, 0.  ],
       [0.  , 1.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [0.96, 0.04, 0.  ],
       [1.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [0.04, 0.04, 0.92]])
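The relationship between the last two interfaces can be checked directly: for a fitted forest, the class returned by predict is the one with the highest averaged probability in predict_proba. The sketch below uses its own split with random_state=0, an assumption made here so the run is reproducible; it is not the split used above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=0)

rfc = RandomForestClassifier(n_estimators=25, random_state=20).fit(Xtrain, Ytrain)

proba = rfc.predict_proba(Xtest)  # shape (n_samples, n_classes)
labels_from_proba = rfc.classes_[np.argmax(proba, axis=1)]

same = np.array_equal(labels_from_proba, rfc.predict(Xtest))
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)
print(same, rows_sum_to_one)
```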

Attribute: feature_importances_, the feature importance scores

[*zip(wine.feature_names,rfc.feature_importances_)]


[('alcohol', 0.12198679686258679),
 ('malic_acid', 0.02762880277046342),
 ('ash', 0.020429313443862722),
 ('alcalinity_of_ash', 0.04431583357245361),
 ('magnesium', 0.033968954054681415),
 ('total_phenols', 0.038263957377275275),
 ('flavanoids', 0.1408080229991804),
 ('nonflavanoid_phenols', 0.010215119630325406),
 ('proanthocyanins', 0.005887369287630495),
 ('color_intensity', 0.18147980477356618),
 ('hue', 0.0848840620720855),
 ('od280/od315_of_diluted_wines', 0.14228838485786396),
 ('proline', 0.14784357829802486)]

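Necessary conditions for bagging

The base estimators are independent of each other.
Each base estimator has a classification accuracy above 50%.

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb

x = np.linspace(0,1,20)

y = []
for epsilon in np.linspace(0,1,20):
    # probability that 13 or more of 25 independent trees are wrong
    E = np.array([comb(25,i)*(epsilon**i)*((1-epsilon)**(25-i)) for i in range(13,26)]).sum()
    y.append(E)
plt.plot(x,y,"o-",label="when estimators are different")
plt.plot(x,x,"--",color="red",label="if all estimators are same")
plt.xlabel("individual estimator's error")
plt.ylabel("RandomForest's error")
plt.legend()
plt.show()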
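The 50% condition can also be read off numerically. The sketch below evaluates the same binomial tail at a few error rates, assuming independent trees; `ensemble_error` is a hypothetical helper built on the standard library.

```python
from math import comb

def ensemble_error(epsilon, n_trees=25):
    """P(more than half of n_trees independent trees are wrong)."""
    k_min = n_trees // 2 + 1
    return sum(comb(n_trees, i) * epsilon**i * (1 - epsilon)**(n_trees - i)
               for i in range(k_min, n_trees + 1))

for eps in (0.2, 0.4, 0.5, 0.6):
    print(eps, round(ensemble_error(eps), 4))
```

Below an individual error of 0.5 the ensemble beats its members, at exactly 0.5 it gains nothing, and above 0.5 the vote makes things worse.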


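Random forest regression

It differs from the classification forest in two ways:

Evaluation metric: regression trees are evaluated with MSE or R². The default score interface returns R² and cannot be changed; to use MSE, use cross-validation.
Split quality (impurity) criterion: regression trees support three criteria, with MSE as the default.

from sklearn.datasets import load_boston  # a dataset whose label is a continuous variable (removed in scikit-learn 1.2)
from sklearn.model_selection import cross_val_score  # cross-validation
from sklearn.ensemble import RandomForestRegressor  # random forest regressor

boston = load_boston()
regressor = RandomForestRegressor(n_estimators=100,random_state=0)
cross_val_score(regressor, boston.data, boston.target, cv=10
                ,scoring = "neg_mean_squared_error"  # without scoring, regression is scored with R²
               )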

array([-10.72900447, -5.36049859, -4.74614178, -20.84946337,
-12.23497347, -17.99274635, -6.8952756 , -93.78884428,
-29.80411702, -15.25776814])
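The scores above are negative because scikit-learn treats every scorer as "greater is better", so it reports the MSE with its sign flipped. The two regression metrics can be sketched by hand as follows; the data is hypothetical, and `mse_manual` and `r2_manual` are illustrative helpers, not library functions.

```python
def mse_manual(y_true, y_pred):
    """Mean squared error: average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_manual(y_true, y_pred):
    """R squared: 1 minus residual sum of squares over total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical targets and predictions, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse = mse_manual(y_true, y_pred)
neg_mse = -mse  # the sign convention behind "neg_mean_squared_error"
r2 = r2_manual(y_true, y_pred)
print(mse, neg_mse, r2)
```

To recover the ordinary MSE from a cross-validation result, multiply the scores by -1.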

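Evaluation metrics for regression models in sklearn

import sklearn.metrics
sorted(sklearn.metrics.SCORERS.keys())  # every valid value of the scoring parameter; newer scikit-learn versions provide sklearn.metrics.get_scorer_names() instead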


['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'completeness_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'max_error',
 'mutual_info_score',
 'neg_brier_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'neg_root_mean_squared_error',
 'normalized_mutual_info_score',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'roc_auc_ovo',
 'roc_auc_ovo_weighted',
 'roc_auc_ovr',
 'roc_auc_ovr_weighted',
 'v_measure_score']

Example: filling missing values with random forest regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.impute import SimpleImputer  # class for imputing missing values
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


dataset = load_boston()

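dataset.data  # the feature matrix; all values are numeric
dataset.data.shape  # shape of the data: 506*13 = 6578 values in total


(506, 13)

The original data has no missing values, so we need to construct a dataset that does.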

X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]#506
n_features = X_full.shape[1]#13


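# First decide what fraction of the data should be missing. Here we assume 50%, i.e. 3289 missing entries in total

rng = np.random.RandomState(0)  # fix the random seed so the results are reproducible
missing_rate = 0.5
n_missing_samples = int(np.floor(n_samples * n_features * missing_rate))
# np.floor rounds down and returns a float ending in .0


# The missing entries should be scattered randomly over all rows and columns, and each one needs a row index and a column index
# If we create 3289 row indices in [0, 506) and 3289 column indices in [0, 13), we can use them to set those positions in the data to NaN
# We then fill the missing values with 0, with the mean, and with a random forest, and compare the regression results

missing_features = rng.randint(0,n_features,n_missing_samples)
# randint(low, high, n) draws n integers from [low, high)
len(missing_features)  # 3289
missing_samples = rng.randint(0,n_samples,n_missing_samples)
len(missing_samples)  # 3289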


3289


X_missing = X_full.copy()
y_missing = y_full.copy()

# Set the chosen positions to NaN
X_missing[missing_samples,missing_features] = np.nan


# Convert to a DataFrame
X_missing = pd.DataFrame(X_missing)
X_missing.head()

         0     1     2    3      4      5     6       7    8      9    10     11    12
0      NaN  18.0   NaN  NaN  0.538    NaN  65.2  4.0900  1.0  296.0   NaN    NaN  4.98
1  0.02731   0.0   NaN  0.0  0.469    NaN  78.9  4.9671  2.0    NaN   NaN  396.9  9.14
2  0.02729   NaN  7.07  0.0    NaN  7.185  61.1     NaN  2.0  242.0   NaN    NaN   NaN
3      NaN   NaN   NaN  0.0  0.458    NaN  45.8     NaN  NaN  222.0  18.7    NaN   NaN
4      NaN   0.0  2.18  0.0    NaN  7.147   NaN     NaN  NaN    NaN  18.7    NaN  5.33
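One detail worth checking: the row and column indices are drawn independently and with repetition, so some (row, column) pairs occur more than once and the number of cells actually blanked is smaller than 3289. The sketch below repeats the same draws with the same seed and counts the distinct cells.

```python
import numpy as np

n_samples, n_features = 506, 13
n_missing_samples = int(np.floor(n_samples * n_features * 0.5))  # 3289

rng = np.random.RandomState(0)
missing_features = rng.randint(0, n_features, n_missing_samples)
missing_samples = rng.randint(0, n_samples, n_missing_samples)

# Count distinct (row, column) pairs; a repeated pair blanks the same cell again
distinct_cells = len(set(zip(missing_samples.tolist(), missing_features.tolist())))
actual_rate = distinct_cells / (n_samples * n_features)
print(distinct_cells, round(actual_rate, 3))
```

The effective missing rate therefore lands noticeably below the nominal 50%. Sampling cell positions without repetition would hit the target exactly, at the cost of changing the data used in this example.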

Imputing with the mean and with 0

# Impute with the mean
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')  # instantiate
X_missing_mean = imp_mean.fit_transform(X_missing)  # fit_transform = fit + transform in one call

# Check whether any nulls remain
# Booleans sum as False = 0, True = 1
# pd.DataFrame(X_missing_mean).isnull().sum()  # a sum of 0 in every column confirms that no NaN is left

# Impute with 0
imp_0 = SimpleImputer(missing_values=np.nan, strategy="constant",fill_value=0)  # "constant" fills with a fixed value
X_missing_0 = imp_0.fit_transform(X_missing)
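What the two strategies compute can be sketched with plain NumPy on a tiny hypothetical matrix. This is an illustration of the idea, not a claim about how SimpleImputer is implemented internally.

```python
import numpy as np

# Small hypothetical matrix with missing entries
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])

col_means = np.nanmean(X, axis=0)             # per-column mean, ignoring NaN
X_mean = np.where(np.isnan(X), col_means, X)  # NaN -> that column's mean
X_zero = np.where(np.isnan(X), 0.0, X)        # NaN -> 0
print(X_mean)
print(X_zero)
```

Mean imputation keeps each column's average unchanged, while zero imputation can pull it toward 0 when the feature's typical values are far from zero.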

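Imputing with random forest regression

1. If only one feature has missing values

Training features: for the samples where feature T is present, the other n-1 features plus the original label.
Training labels: the values of feature T where it is present.
Test input: for the samples where feature T is missing, the other n-1 features plus the original label.

2. If several features have missing values

Iterate over all features, starting with the one that has the fewest missing values (filling it
requires the least accurate information from the others). When filling one feature, first replace the
missing values in the other features with 0. After each regression, write the predictions back into
the feature matrix, then move on to the next feature.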
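The ordering step can be sketched in isolation: count the gaps in each column, then sort the column indices by that count. The matrix below is hypothetical.

```python
import numpy as np

# Hypothetical 4x3 matrix: column 0 has 2 gaps, column 1 has none, column 2 has 1
X = np.array([[np.nan, 1.0, 5.0],
              [2.0,    1.5, np.nan],
              [np.nan, 2.0, 6.0],
              [4.0,    2.5, 7.0]])

missing_per_column = np.isnan(X).sum(axis=0)
fill_order = np.argsort(missing_per_column)  # column indices from fewest to most missing
print(missing_per_column.tolist(), fill_order.tolist())
```

Columns are then filled in `fill_order`, so each later regression can rely on features that have already been completed.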

Copy the data

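X_missing_reg = X_missing.copy()

# Order the features by their number of missing values, ascending, and keep the resulting column indices
sortindex = np.argsort(X_missing_reg.isnull().sum()).values  # np.argsort returns the indices that sort the values in ascending order

for i in sortindex:

    # Build the new feature matrix (every feature except the one being filled, plus the original label) and the new label (the feature being filled)
    df = X_missing_reg
    fillc = df.iloc[:,i]  # new label
    df = pd.concat([df.iloc[:,df.columns != i],pd.DataFrame(y_full)],axis=1)  # new feature matrix

    # In the new feature matrix, fill the missing values of the remaining columns with 0
    df_0 =SimpleImputer(missing_values=np.nan,strategy='constant',fill_value=0).fit_transform(df)

    # Split into training and test sets
    Ytrain = fillc[fillc.notnull()]  # the non-null values of the feature being filled (now our label)
    Ytest = fillc[fillc.isnull()]  # the null values of the feature being filled; we need their index, not their values
    Xtrain = df_0[Ytrain.index,:]  # rows of the new feature matrix where the feature being filled is present
    Xtest = df_0[Ytest.index,:]  # rows of the new feature matrix where the feature being filled is missing

    # Fill the missing values with random forest regression
    rfc = RandomForestRegressor(n_estimators=100)  # instantiate
    rfc = rfc.fit(Xtrain, Ytrain)  # train on the training set
    Ypredict = rfc.predict(Xtest)  # predictions for Xtest: the values used to fill the gaps

    # Write the filled feature back into the feature matrix
    X_missing_reg.loc[X_missing_reg.iloc[:,i].isnull(),i] = Ypredict

# Check whether any nulls remain
X_missing_reg.isnull().sum()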

0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
dtype: int64

# Compare modeling results in four cases: the original data, mean imputation, zero imputation, and random forest regression imputation

X = [X_full,X_missing_mean,X_missing_0,X_missing_reg]

mse = []
std = []
for x in X:
    estimator = RandomForestRegressor(random_state=0, n_estimators=100)  # instantiate
    scores = cross_val_score(estimator,x,y_full,scoring='neg_mean_squared_error', cv=5).mean()
    mse.append(scores * -1)

# labels are listed in the same order as X
[*zip(['Full data','Mean Imputation','Zero Imputation','Regressor Imputation'],mse)]

[('Full data', 21.62860460743544),
('Mean Imputation', 40.84405476955929),
('Zero Imputation', 49.50657028893417),
('Regressor Imputation', 18.255872365734806)]

Visualization: bar chart

x_labels = ['Full data',
            'Mean Imputation',
            'Zero Imputation',
            'Regressor Imputation']  # same order as X
colors = ['r', 'g', 'b', 'orange']

plt.figure(figsize=(12, 6))  # create the figure
ax = plt.subplot(111)  # add a subplot
for i in np.arange(len(mse)):
    ax.barh(i, mse[i],color=colors[i], alpha=0.6, align='center')  # bar draws vertical bars, barh horizontal ones; alpha sets the bars' transparency
# main settings of the bar chart
ax.set_title('Imputation Techniques with Boston Data')  # subplot setters use the set_ prefix
ax.set_xlim(left=np.min(mse) * 0.9,
             right=np.max(mse) * 1.1)  # x-axis range
ax.set_xlabel('MSE')  # x-axis label
ax.set_yticks(np.arange(len(mse)))  # y ticks: one integer per bar
ax.set_yticklabels(x_labels)  # label for each y tick
plt.show()


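Example: tuning a random forest on the breast cancer data

Generalization error:

Measures how the model performs on unseen data.
It depends on model structure (complexity). If the model is too complex it overfits and generalizes poorly, so the generalization error is large. If the model is too simple it underfits, and the error is also large. Only at the right complexity does the generalization error reach its minimum.
For trees and tree ensembles, the deeper the tree and the more branches and leaves it has, the more complex the model.
Trees and tree ensembles learn very aggressively and usually overfit, so tuning generally moves toward lower model complexity.

Bias and variance

The generalization error of an ensemble model (f) on an unseen dataset (D):

$E(f;D)=bias^2(x)+var(x)+\epsilon^2$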

[Figure: generalization error]

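Bias:
The difference between the model's predictions and the true values, i.e. the distance from each red
point to the blue line. In an ensemble, every base estimator has its own bias, and the ensemble's bias
is the mean of the base estimators' biases. Bias measures accuracy: the more accurate the model, the lower the bias.

Variance:
The error between each individual output of the model and the average level of its predictions, i.e.
the distance from each red point to the red dashed line. Variance measures stability: the more stable
the model, the lower the variance.

Noise:
The part that machine learning cannot influence.

A good model should predict the vast majority of unseen data both accurately and stably.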
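For squared loss, the three terms can be written out explicitly. The notation below is an assumption introduced for illustration: $f(x;D)$ is the model trained on dataset $D$, $\bar f(x)$ is its prediction averaged over training sets, $y$ is the true value for $x$, and $y_D$ is the label observed in the data. The decomposition holds when the label noise has zero mean.

```latex
\begin{aligned}
\bar f(x)   &= \mathbb{E}_D\big[f(x;D)\big] \\
bias^2(x)   &= \big(\bar f(x) - y\big)^2 \\
var(x)      &= \mathbb{E}_D\big[\big(f(x;D) - \bar f(x)\big)^2\big] \\
\epsilon^2  &= \mathbb{E}_D\big[(y_D - y)^2\big] \\
E(f;D)      &= bias^2(x) + var(x) + \epsilon^2
\end{aligned}
```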

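| | High bias | Low bias |
|---|---|---|
| High variance | The model does not suit the data; switch models | Overfitting: the model is very complex, predicting some data very well and other data very badly |
| Low variance | Underfitting: the model is relatively simple and its predictions are stable, but not very accurate on any data | Low generalization error: the ideal model |

[Figure: bias and variance]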

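Based on experience, and ordered by how strongly each parameter affects model complexity, the tuning order is:

n_estimators: in theory the larger the better; raise it until the score plateaus. It does not affect the complexity of the individual trees. ->
max_depth: unlimited by default, i.e. maximum complexity. The difficulty is that the actual depth of the trees is unknown. ->
min_samples_leaf: defaults to 1, i.e. maximum complexity. Increasing it reduces model complexity. ->
min_samples_split: defaults to 2, i.e. maximum complexity. Increasing it reduces model complexity. ->
max_features: defaults to auto, the square root of the number of features, which is a middling complexity. Increasing it raises complexity and decreasing it lowers complexity. ->
criterion: defaults to gini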
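The defaults listed above can be inspected on whichever scikit-learn version is installed. A minimal sketch follows; note that the defaults of n_estimators and max_features have changed between releases, so the printed values may differ from the ones quoted above.

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns the constructor parameters of an (unfitted) estimator
params = RandomForestClassifier().get_params()
for name in ("n_estimators", "max_depth", "min_samples_leaf",
             "min_samples_split", "max_features", "criterion"):
    print(name, "=", params[name])
```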

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = load_breast_cancer()

rfc = RandomForestClassifier(n_estimators=100,random_state=90)
score_pre = cross_val_score(rfc,data.data,data.target,cv=10).mean()  # for classifiers, cross-validation scores with accuracy by default
score_pre

0.9648809523809524

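Parameter: n_estimators

Here we use a learning curve. Could we use grid search instead? Yes, but only a learning curve shows
the trend. My preference is to see where n_estimators starts to plateau and whether it keeps pushing
overall accuracy up. The first learning curve helps us narrow the range: we step through n_estimators
in increments of ten and observe how the overall accuracy changes.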


scorel = []
for i in range(0,200,10):
    rfc = RandomForestClassifier(n_estimators=i+1,
                                 n_jobs=-1,
                                 random_state=90)
    score = cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)
print(max(scorel),(scorel.index(max(scorel))*10)+1)
plt.figure(figsize=[20,5])
plt.plot(range(1,201,10),scorel)
plt.show()


0.9631265664160402 71


scorel = []
for i in range(65,75):
    rfc = RandomForestClassifier(n_estimators=i,
                                 n_jobs=-1,
                                 random_state=90)
    score = cross_val_score(rfc,data.data,data.target,cv=10).mean()
    scorel.append(score)
print(max(scorel),([*range(65,75)][scorel.index(max(scorel))]))
plt.figure(figsize=[20,5])
plt.plot(range(65,75),scorel)
plt.show()


0.9666353383458647 73


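1. Some parameters have no reference point and it is hard to name a range for them. For these we use a learning curve, read the trend, pick a narrower interval from the result, and run the curve again. For example:

param_grid = {'n_estimators':np.arange(1, 201, 10)}
param_grid = {'max_depth':np.arange(1, 20, 1)}
param_grid = {'max_leaf_nodes':np.arange(25,50,1)}
For large datasets you can try starting from 1000: begin with 1000, use intervals of 100 leaves, then gradually narrow the range.

2. For other parameters we can find a range, or we know their possible values and how overall accuracy changes with them. These can go straight into grid search:

param_grid = {'criterion':['gini', 'entropy']}
param_grid = {'min_samples_split':np.arange(2, 2+20, 1)}
param_grid = {'min_samples_leaf':np.arange(1, 1+10, 1)}
param_grid = {'max_features':np.arange(5,30,1)}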

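# Tune max_depth

param_grid = {'max_depth':np.arange(1, 20, 1)}

We usually probe a range based on the size of the data. The breast cancer dataset is small, so ranges such as 1~10 or 1~20 are reasonable.

For a large dataset such as digit recognition, we should try depths of 30~50 (and even that may not be enough).

Better still, plot a learning curve to observe how depth affects the model.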

rfc = RandomForestClassifier(n_estimators=73
,random_state=90
)
GS = GridSearchCV(rfc,param_grid,cv=10)  # grid search
GS.fit(data.data,data.target)

GS.best_params_  # the best parameters found

GS.best_score_  # the accuracy achieved with the best parameters

0.9666353383458647
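Grid search does not have to hide the trend: the fitted object keeps the mean cross-validation score of every candidate in cv_results_, which can be printed or plotted like a learning curve. The sketch below uses a deliberately small forest, a short depth range, and cv=3 so that it runs quickly; these settings are assumptions for illustration, not the ones used in the tuning above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

data = load_breast_cancer()
param_grid = {"max_depth": np.arange(1, 6)}

# A small forest and cv=3 keep this sketch fast
rfc = RandomForestClassifier(n_estimators=10, random_state=90)
GS = GridSearchCV(rfc, param_grid, cv=3)
GS.fit(data.data, data.target)

# One entry per candidate, in the order the grid was searched
depths = [p["max_depth"] for p in GS.cv_results_["params"]]
mean_scores = GS.cv_results_["mean_test_score"]
for depth, score in zip(depths, mean_scores):
    print(depth, round(score, 4))
```

Passing `depths` and `mean_scores` to `plt.plot` gives the same kind of curve as the manual loops used for n_estimators.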

# Tune max_features

param_grid = {'max_features':np.arange(5,30,1)}

rfc = RandomForestClassifier(n_estimators=73
,random_state=90
)
GS = GridSearchCV(rfc,param_grid,cv=10)
GS.fit(data.data,data.target)

GS.best_params_

GS.best_score_

0.9666666666666668

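# Tune min_samples_leaf

param_grid={'min_samples_leaf':np.arange(1, 1+10, 1)}

# For min_samples_split and min_samples_leaf, a common approach is to start at the minimum and search upward by 10 or 20
# For high-dimensional, large-sample data you can go straight to +50 if unsure; very large datasets may need a range of 200~300
# If accuracy refuses to improve during tuning, feel free to try a very large value to constrain model complexity heavily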

rfc = RandomForestClassifier(n_estimators=39
,random_state=90
)
GS = GridSearchCV(rfc,param_grid,cv=10)
GS.fit(data.data,data.target)

GS.best_params_

GS.best_score_

f:\Anaconda3\lib\site-packages\sklearn\model_selection_search.py:841: DeprecationWarning: The default of the iid parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
DeprecationWarning)

0.9718804920913884

# Tune min_samples_split

param_grid={'min_samples_split':np.arange(2, 2+20, 1)}

rfc = RandomForestClassifier(n_estimators=39
,random_state=90
)
GS = GridSearchCV(rfc,param_grid,cv=10)
GS.fit(data.data,data.target)

GS.best_params_

GS.best_score_


0.9718804920913884

# Tune criterion

param_grid = {'criterion':['gini', 'entropy']}

rfc = RandomForestClassifier(n_estimators=39
,random_state=90
)
GS = GridSearchCV(rfc,param_grid,cv=10)
GS.fit(data.data,data.target)

GS.best_params_

GS.best_score_

0.9718804920913884

rfc = RandomForestClassifier(n_estimators=39,random_state=90)
score = cross_val_score(rfc,data.data,data.target,cv=10).mean()
score

score - score_pre

0.005264238181661218
