Ensemble Learning and Random Forests

Methods that update weights

AdaBoost

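AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
- base_estimator: optional; defaults to DecisionTreeClassifier(max_depth=1), a decision stump. Recent scikit-learn releases rename this parameter to estimator.
- algorithm: optional; defaults to SAMME.R in the signature shown above.
Training is sequential: after each round the instance weights are updated so that the next predictor focuses on the instances the previous one got wrong. Unlike gradient descent, which tweaks one model's parameters to minimize a cost function, AdaBoost improves the ensemble by adding more predictors.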
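
The instance re-weighting step can be sketched in plain Python. This is a minimal illustration of the textbook discrete AdaBoost update, not scikit-learn's internal code; the function name `adaboost_reweight` and the toy labels are assumptions made for this example, and the sketch assumes the weighted error is strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, y_true, y_pred, learning_rate=1.0):
    """One round of discrete AdaBoost instance re-weighting (illustrative)."""
    total = sum(weights)
    # Weighted error rate of the current predictor (assumed to be in (0, 1)).
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    # Predictor weight: the more accurate the predictor, the larger alpha.
    alpha = learning_rate * math.log((1.0 - err) / err)
    # Only misclassified instances have their weight boosted.
    boosted = [
        w * math.exp(alpha) if t != p else w
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    # Normalize so the weights sum to 1 again.
    norm = sum(boosted)
    return [w / norm for w in boosted], alpha


weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, 0, 0]
y_pred = [1, 1, 0, 1]  # the last instance is misclassified
new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
```

After this round the misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.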

Gradient Boosting

Each new predictor is fitted to the residual errors left by the previous predictor.
GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
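
To make the residual-fitting idea concrete, here is a minimal sketch that builds a three-tree ensemble by hand with `DecisionTreeRegressor`. The synthetic quadratic dataset and the helper `ensemble_predict` are assumptions for illustration only; this is not how `GradientBoostingRegressor` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data: a noisy quadratic curve.
rng = np.random.RandomState(42)
X = rng.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.randn(100)

# First tree fits the targets directly.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# Second tree fits what the first tree got wrong.
residual1 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residual1)

# Third tree fits what the first two trees together still get wrong.
residual2 = residual1 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residual2)


def ensemble_predict(X_new):
    """Sum the predictions of all three trees."""
    return sum(tree.predict(X_new) for tree in (tree1, tree2, tree3))


mse_first_tree = np.mean((y - tree1.predict(X)) ** 2)
mse_ensemble = np.mean((y - ensemble_predict(X)) ** 2)
```

On the training set, each added tree can only keep or reduce the squared error, which is why `mse_ensemble` is never larger than `mse_first_tree`.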
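- Early stopping
  - After training, measure the validation error at each stage, find the optimal number of trees, then retrain with that number:
  - errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
  - bst_n_estimators = np.argmin(errors) + 1
  - Alternatively, stop training as soon as the validation error has failed to improve for several consecutive iterations.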
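
Both early-stopping strategies can be sketched as follows. The synthetic data, the upper bound of 120 trees, and the patience of 5 rounds are assumptions chosen for this example; `warm_start=True` is what lets scikit-learn keep the existing trees when `fit` is called again.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Assumed toy data: a noisy quadratic curve.
rng = np.random.RandomState(42)
X = rng.rand(200, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Strategy 1: train a large ensemble, find the best stage, retrain.
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)
errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = int(np.argmin(errors)) + 1
gbrt_best = GradientBoostingRegressor(
    max_depth=2, n_estimators=bst_n_estimators, random_state=42)
gbrt_best.fit(X_train, y_train)

# Strategy 2: grow the ensemble one tree at a time and stop once the
# validation error has not improved for `patience` consecutive rounds.
gbrt_ws = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)
patience = 5  # assumed value
min_val_error = float("inf")
rounds_without_improvement = 0
for n_estimators in range(1, 121):
    gbrt_ws.n_estimators = n_estimators
    gbrt_ws.fit(X_train, y_train)  # adds one tree thanks to warm_start
    val_error = mean_squared_error(y_val, gbrt_ws.predict(X_val))
    if val_error < min_val_error:
        min_val_error = val_error
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement == patience:
            break  # early stop
```

The first strategy always pays for the full 120 trees before choosing; the second may stop well before that, at the cost of possibly halting on a temporary plateau.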

xgboost

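from xgboost import XGBClassifier

xgbc = XGBClassifier(max_depth=2,
                     learning_rate=1,
                     n_estimators=2,  # number of boosting rounds, i.e. number of trees
                     verbosity=1,  # log warnings; older xgboost releases used the silent flag instead
                     objective="binary:logistic"
                    )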

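Methods that do not update weights

Voting classifier

A voting classifier aggregates the predictions of several different classifiers: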
voting_clf = VotingClassifier(
    estimators=[
        ('log_clf', LogisticRegression()),
        ('svm_clf', SVC(probability=True)),
        ('dt_clf', DecisionTreeClassifier(random_state=10)),
    ],
    voting='soft',
)
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
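
A self-contained version of the snippet above might look like the following sketch. The `make_moons` dataset, the train/test split, and the `scores` dictionary are assumptions added so the example runs on its own; soft voting needs every member to expose `predict_proba`, which is why `SVC` is created with `probability=True`.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed toy dataset.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("log_clf", LogisticRegression()),
        ("svm_clf", SVC(probability=True, random_state=42)),
        ("dt_clf", DecisionTreeClassifier(random_state=10)),
    ],
    voting="soft",  # average the predicted class probabilities
)
voting_clf.fit(X_train, y_train)

# Compare each fitted member with the ensemble on the test set.
scores = {name: clf.score(X_test, y_test)
          for name, clf in voting_clf.named_estimators_.items()}
scores["voting_clf"] = voting_clf.score(X_test, y_test)
```

Printing `scores` shows how the ensemble compares with its individual members on this particular split.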

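Bagging / pasting

Bagging samples the training set with replacement; pasting samples it without replacement. One model is trained on each sampled dataset, and the final prediction combines the outputs of the N models: for classification the N predictions are put to a vote, and for regression they are averaged.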
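
The sampling and aggregation steps described above can be sketched with the standard library alone. All four function names are hypothetical helpers introduced for this illustration.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(data, size, rng):
    """Bagging: draw `size` items with replacement."""
    return rng.choices(data, k=size)


def paste_sample(data, size, rng):
    """Pasting: draw `size` items without replacement."""
    return rng.sample(data, k=size)


def aggregate_classification(predictions):
    """Majority vote over the N model outputs for one instance."""
    return Counter(predictions).most_common(1)[0][0]


def aggregate_regression(predictions):
    """Average of the N model outputs for one instance."""
    return mean(predictions)


rng = random.Random(0)
data = list(range(10))
bagged = bootstrap_sample(data, 20, rng)  # may exceed len(data): items repeat
pasted = paste_sample(data, 5, rng)       # all items are distinct
vote = aggregate_classification(["cat", "dog", "cat"])
average = aggregate_regression([1.0, 2.0, 3.0])
```

Because bagging draws with replacement, the same instance can appear several times in one sample, whereas a pasted sample never repeats an instance.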
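- bootstrap=False switches to sampling without replacement (pasting).
- n_estimators=500 builds 500 base estimators of the same type.
- max_samples=100 trains each estimator on 100 instances sampled from the training set (with replacement when bootstrap=True).
- n_jobs=-1 tells sklearn how many CPU cores to use for training and prediction; -1 means all available cores.
- oob_score=True requests out-of-bag evaluation; the result is available as bag_clf.oob_score_.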
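
A minimal sketch that puts these parameters together is shown below. The `make_moons` dataset and the smaller pasting ensemble are assumptions for illustration; `n_jobs` is kept at 1 here so the sketch behaves the same in restricted environments, and -1 would use all cores as described above.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed toy dataset.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: 500 trees, each trained on 100 instances drawn with replacement.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    oob_score=True,  # evaluate each tree on the instances it never saw
    n_jobs=1,        # -1 would use all available cores
    random_state=42,
)
bag_clf.fit(X_train, y_train)
oob_accuracy = bag_clf.oob_score_
test_accuracy = bag_clf.score(X_test, y_test)

# Pasting: same idea, but sampling without replacement.
# Out-of-bag scoring is only available with bootstrap=True, so it is off here.
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=50,
    max_samples=100,
    bootstrap=False,
    random_state=42,
)
paste_clf.fit(X_train, y_train)
paste_accuracy = paste_clf.score(X_test, y_test)
```

The out-of-bag score gives an estimate of generalization accuracy without setting aside a separate validation set.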

Random forest

- rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, random_state=42)
- Important parameters
  - n_estimators, random_state, bootstrap, and oob_score
- Important attributes
  - .estimators_, .oob_score_, .feature_importances_
- Methods
  - apply, fit, predict, score, and predict_proba
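
The parameters, attributes, and methods listed above can be exercised in one short sketch. The iris dataset and the variable names are assumptions for illustration; `oob_score=True` is set so that `oob_score_` is populated.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

rnd_clf = RandomForestClassifier(
    n_estimators=500,
    max_leaf_nodes=16,
    bootstrap=True,
    oob_score=True,
    random_state=42,
)
rnd_clf.fit(X, y)

n_trees = len(rnd_clf.estimators_)         # the fitted decision trees
oob_accuracy = rnd_clf.oob_score_          # out-of-bag accuracy estimate
importances = dict(zip(iris.feature_names, rnd_clf.feature_importances_))
leaf_indices = rnd_clf.apply(X)            # leaf reached in each tree, per sample
probabilities = rnd_clf.predict_proba(X[:3])
train_accuracy = rnd_clf.score(X, y)
```

`feature_importances_` is normalized to sum to 1, which makes it easy to compare features at a glance.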
