This post was last edited by matlab的旋律 on 2021-12-20 19:27.
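The previous part covered preprocessing of the model's input data. This chapter uses worked examples in which a classifier is trained on the training set and then predicts classes for the test set. It covers some classic machine-learning classifiers: the decision tree (DTREE), the support vector machine (SVM) and AdaBoost. The sklearn library provides ready-made implementations of all of these algorithms, so all the example code is based on sklearn. The decision tree classifier is called as follows: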
- from sklearn.tree import DecisionTreeClassifier
- dtree_model = DecisionTreeClassifier()
- dtree_model.fit(x_train.reshape(x_train.shape[0], -1), y_train)
- y_dtree_predict = dtree_model.predict(x_test.reshape(x_test.shape[0], -1))
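Here y_dtree_predict holds the classes the model predicts for the test set. Common metrics for evaluating a classifier are precision, recall and F1-score, and the confusion matrix is often displayed as an image as well. The code is as follows: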
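- from sklearn.metrics import precision_recall_fscore_support, confusion_matrix
- # with average='macro' the fourth return value (support) is None
- dtree_precision, dtree_recall, dtree_f1, sup = precision_recall_fscore_support(y_test, y_dtree_predict, average='macro')
- dtree_cm = confusion_matrix(y_test, y_dtree_predict, sample_weight=None) # compute the confusion matrix
- # plot_model is not defined in this post; it is presumably the author's own plotting helper module
- plot_model.plot_confusion_matrix(dtree_cm, title='Confusion Matrix of DTREE')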
The resulting confusion matrix is shown in the figure below:
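To make the link between the confusion matrix and the macro-averaged metrics concrete, here is a minimal pure-Python sketch that derives precision, recall and F1 from a confusion matrix by hand. The function name `macro_scores` and the 2x2 matrix are made up for illustration and are not from the original post. The layout assumed is the one sklearn's `confusion_matrix` uses (rows are true classes, columns are predicted classes).

```python
def macro_scores(cm):
    """Return macro-averaged (precision, recall, f1) for a square confusion matrix.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]                                   # correctly predicted as class k
        predicted_k = sum(cm[i][k] for i in range(n))   # column sum: everything predicted as k
        actual_k = sum(cm[k])                           # row sum: everything that truly is k
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    # "macro" means an unweighted mean over the classes
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Illustrative matrix only (not the author's results)
cm = [[8, 2],
      [1, 9]]
precision, recall, f1 = macro_scores(cm)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because the macro average weights every class equally, a class with few samples influences the score as much as a large one.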
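The tunable parameters of DecisionTreeClassifier include criterion: the measure used to evaluate a split; splitter: the strategy used to choose the split at each node; max_depth: the maximum depth of the tree; min_impurity_decrease: the minimum impurity decrease required to split a node; min_samples_split: the minimum number of samples required to split an internal node; min_samples_leaf: the minimum number of samples at a leaf node; max_leaf_nodes: the maximum number of leaf nodes; min_impurity_split: an impurity threshold for stopping tree growth early (deprecated in favor of min_impurity_decrease and removed in scikit-learn 1.0); min_weight_fraction_leaf: the minimum weighted fraction of the total sample weight required at a leaf node; class_weight: the class weights. See the parameter descriptions of the decision tree classifier for details. Choosing the best parameters for a model relies mainly on the practitioner's experience and on enumerating candidate values. For example, the effect of max_depth on the predictive performance of DecisionTreeClassifier can be tested with the following code:
- import numpy as np
- import matplotlib.pyplot as plt
- def test_DecisionTreeClassifier_depth(*data, maxdepth):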
-     '''
-     Test how the predictive performance of DecisionTreeClassifier changes with max_depth
-     :param data: variadic argument. A tuple whose elements must be, in order: training samples, test samples, training labels, test labels
-     :param maxdepth: an integer, the upper bound for the max_depth parameter of DecisionTreeClassifier
-     :return: None
-     '''
-     X_train, X_test, y_train, y_test = data
-     # candidate depths 1 .. maxdepth-1 (np.arange excludes the upper bound)
-     depths = np.arange(1, maxdepth)
-     training_scores = []
-     testing_scores = []
-     for depth in depths:
-         clf = DecisionTreeClassifier(max_depth=depth)
-         # flatten every sample to a 1-D feature vector before fitting and scoring
-         clf.fit(X_train.reshape(X_train.shape[0], -1), y_train)
-         training_scores.append(clf.score(X_train.reshape(X_train.shape[0], -1), y_train))
-         testing_scores.append(clf.score(X_test.reshape(X_test.shape[0], -1), y_test))
-     ## plot
-     fig = plt.figure()
-     plt.plot(depths, training_scores, label="training score", marker='o')
-     plt.plot(depths, testing_scores, label="testing score", marker='*')
-     plt.xlabel("maxdepth")
-     plt.ylabel("score")
-     plt.title("Decision Tree Classification")
-     plt.legend(framealpha=0.5, loc='best')
-     plt.show()
The result is shown in the figure below:
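A score-versus-depth curve of this kind can also be turned into a concrete choice of max_depth in code. The sketch below picks the smallest depth whose test score is within a small tolerance of the best one. The helper name `smallest_depth_within_tolerance`, the tolerance value and the score list are all hypothetical, chosen only to illustrate the idea; they are not the author's measurements.

```python
def smallest_depth_within_tolerance(depths, testing_scores, tol=0.005):
    """Return the smallest depth whose score is at most `tol` below the best score."""
    best = max(testing_scores)
    # walk the candidates from shallow to deep and stop at the first good-enough one
    for depth, score in sorted(zip(depths, testing_scores)):
        if score >= best - tol:
            return depth


# Made-up scores for illustration only
depths = [1, 2, 4, 6, 8, 10, 12, 14]
testing_scores = [0.41, 0.58, 0.74, 0.83, 0.88, 0.91, 0.91, 0.908]
chosen = smallest_depth_within_tolerance(depths, testing_scores)
print("chosen max_depth:", chosen)
```

Preferring the shallowest tree among equally accurate ones keeps the model cheaper to train and evaluate.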
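The figure shows that once max_depth reaches 10, the accuracy on both the training set and the test set peaks and then stays stable. Since, at equal accuracy, the model with the lower computational cost is preferable, max_depth is set to 10. Next, the performance of the different algorithms is compared. The calling code for SVM and AdaBoost is given first:
- ########################################svm###############################################
- from sklearn.svm import SVC
- from sklearn.ensemble import AdaBoostClassifier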
- svm_model = SVC(kernel='linear', probability=True)
- svm_model.fit(x_train.reshape(x_train.shape[0], -1), y_train)
- y_svm_predict = svm_model.predict(x_test.reshape(x_test.shape[0], -1))
- svm_precision, svm_recall, svm_f1, sup = precision_recall_fscore_support(y_test, y_svm_predict, average='macro')
- svm_cm = confusion_matrix(y_test, y_svm_predict, sample_weight=None) # compute the confusion matrix
- plot_model.plot_confusion_matrix(svm_cm, title='Confusion Matrix of SVM')
- ########################################adaboost###############################################
- adaboost_model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, min_samples_split=20, min_samples_leaf=5),
- learning_rate=0.8, n_estimators=100)
- adaboost_model.fit(x_train.reshape(x_train.shape[0], -1), y_train)
- y_adaboost_predict = adaboost_model.predict(x_test.reshape(x_test.shape[0], -1))
- adaboost_precision, adaboost_recall, adaboost_f1, sup = precision_recall_fscore_support(y_test, y_adaboost_predict, average='macro')
- adaboost_cm = confusion_matrix(y_test, y_adaboost_predict, sample_weight=None) # compute the confusion matrix
- plot_model.plot_confusion_matrix(adaboost_cm, title='Confusion Matrix of ADABOOST')
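The parameter values used for the two models above are fixed by hand. As a rough sketch of the enumeration approach to tuning, the helper below tries every combination in a parameter grid and keeps the best-scoring one. The names `enumerate_grid` and `toy_evaluate` are hypothetical, and `toy_evaluate` is only a stand-in: in practice it would fit a model (for example an AdaBoostClassifier built from the given parameters) and return its score on a validation split.

```python
from itertools import product


def enumerate_grid(param_grid, evaluate):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    # Stand-in scoring function (assumption): peaks at learning_rate=0.8, n_estimators=100.
    # A real version would train a model with `params` and return a validation score.
    return -(params["learning_rate"] - 0.8) ** 2 - (params["n_estimators"] - 100) ** 2 / 1e4


grid = {"learning_rate": [0.4, 0.8, 1.2], "n_estimators": [50, 100, 200]}
best_params, best_score = enumerate_grid(grid, toy_evaluate)
print(best_params, best_score)
```

sklearn also ships `GridSearchCV` in `sklearn.model_selection`, which performs this kind of enumeration with cross-validation built in.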
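Of course, SVM and AdaBoost also have many parameters to tune. This can be done in the same way as the test of how max_depth affects the predictive performance of DecisionTreeClassifier, so it is not repeated in detail here. Only the metrics of the three algorithms are compared, using a bar chart. The code is as follows:
- ########################################performance comparison###############################################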
- dtree_score = [dtree_precision, dtree_recall, dtree_f1]
- svm_score = [svm_precision, svm_recall, svm_f1]
- adaboost_score = [adaboost_precision, adaboost_recall, adaboost_f1]
- # regroup by metric so that each bar series matches its legend label
- precision_scores, recall_scores, f1_scores = zip(dtree_score, svm_score, adaboost_score)
- # data
- fig = plt.figure(figsize=(16, 8))
- x = np.arange(3)
- bar_width = 0.2
- str1 = ['dtree', 'svm', 'adaboost']
- # plot: x is the position of the first bar in each group (one group per model)
- plt.bar(x, precision_scores, bar_width, hatch="--", label='precision')
- plt.bar(x+bar_width, recall_scores, bar_width, hatch="/", align="center", label="recall", tick_label=str1)
- plt.bar(x+2*bar_width, f1_scores, bar_width, hatch="-.", label="f1-score")
- # show the figure
- fig.legend(ncol=3, loc='upper center', fontsize=20)
- plt.show()
The result is as follows: