This is a homework assignment from class: a simple data analysis with sklearn. More precisely, a classification task.
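This time I use three supervised learning algorithms: 1. Gaussian naive Bayes, 2. SVC, 3. a random forest classifier. Each is evaluated with three metrics: accuracy, F1-score, and ROC AUC. The code is as follows: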
from sklearn import datasets
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Create the dataset: X is 1000 x 10, Y holds 1000 binary labels
X, Y = datasets.make_classification(n_samples=1000, n_features=10, n_classes=2)

# Use cross-validation: split into 10 shuffled folds.
# sklearn.cross_validation was removed in scikit-learn 0.20, so KFold now
# comes from sklearn.model_selection. The splits are materialized once so
# that all three models are evaluated on the same folds.
kf = KFold(n_splits=10, shuffle=True)
folds = list(kf.split(X))

# Gaussian naive Bayes
print("use gaussian naive bayes")
for train_index, test_index in folds:
    X_train, y_train = X[train_index], Y[train_index]
    X_test, y_test = X[test_index], Y[test_index]
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc = metrics.accuracy_score(y_test, pred)
    print(acc)
    f1 = metrics.f1_score(y_test, pred)
    print(f1)
    auc = metrics.roc_auc_score(y_test, pred)
    print(auc)

# SVC
print("use SVC")
for train_index, test_index in folds:
    X_train, y_train = X[train_index], Y[train_index]
    X_test, y_test = X[test_index], Y[test_index]
    clf = SVC(C=1e-01)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc = metrics.accuracy_score(y_test, pred)
    print(acc)
    f1 = metrics.f1_score(y_test, pred)
    print(f1)
    auc = metrics.roc_auc_score(y_test, pred)
    print(auc)

# Random forest
print("use RandomForestClassifier")
for train_index, test_index in folds:
    X_train, y_train = X[train_index], Y[train_index]
    X_test, y_test = X[test_index], Y[test_index]
    clf = RandomForestClassifier(n_estimators=1000)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc = metrics.accuracy_score(y_test, pred)
    print(acc)
    f1 = metrics.f1_score(y_test, pred)
    print(f1)
    auc = metrics.roc_auc_score(y_test, pred)
    print(auc)

"""
one sample result
(the script prints one value per line; shown here as one fold per row:
accuracy, F1, AUC)

use gaussian naive bayes
0.89 0.888888888888889 0.891025641025641
0.93 0.9306930693069307 0.9305722288915567
0.96 0.9607843137254902 0.9607371794871795
0.93 0.9263157894736843 0.9294871794871794
0.95 0.9484536082474228 0.9503205128205129
0.95 0.9532710280373831 0.9525252525252524
0.92 0.9199999999999999 0.9233239662786029
0.93 0.9292929292929293 0.9310897435897436
0.88 0.8823529411764707 0.8819751103974307
0.94 0.9333333333333332 0.9415584415584416

use SVC
0.94 0.9361702127659574 0.9391025641025641
0.97 0.9696969696969697 0.9701880752300921
0.96 0.9600000000000001 0.9615384615384616
0.96 0.9591836734693877 0.9607371794871794
0.96 0.9591836734693877 0.9607371794871794
0.97 0.9719626168224299 0.9727272727272727
0.94 0.9411764705882353 0.9421918908069049
0.97 0.9690721649484536 0.9703525641025641
0.96 0.9615384615384616 0.9610598153352067
0.96 0.9555555555555557 0.9618506493506493

use RandomForestClassifier
0.99 0.9894736842105264 0.9895833333333333
0.99 0.98989898989899 0.9901960784313725
0.99 0.9902912621359222 0.9903846153846154
0.98 0.9795918367346939 0.9807692307692308
0.99 0.9896907216494846 0.9903846153846153
0.98 0.9814814814814815 0.9818181818181818
0.98 0.9811320754716981 0.9799277398635087
1.0 1.0 1.0
1.0 1.0 1.0
1.0 1.0 1.0
"""
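sklearn is cheap to pick up and quick to get results with. It is also very uniform: every model is trained with the fit method and makes predictions with predict. Behind those two or three lines there may be hundreds of lines of code working for you, which is why using the framework is so efficient.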
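To make the point about the shared fit/predict interface concrete, here is a minimal sketch that runs all three classifiers through one loop. It is not the code from the assignment: the single train/test split, the `random_state` values, the smaller `n_estimators=50`, and the names `models` and `results` are my own assumptions, chosen only to keep the example short and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic binary dataset; random_state is fixed only for repeatability (assumption).
X, y = make_classification(n_samples=500, n_features=10, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Three different algorithms, one shared interface.
models = {
    "GaussianNB": GaussianNB(),
    "SVC": SVC(C=0.1),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)    # training is always fit(...)
    pred = model.predict(X_test)   # prediction is always predict(...)
    results[name] = accuracy_score(y_test, pred)
    print(name, results[name])
```

Because the loop body never mentions a specific algorithm, trying another classifier is a matter of adding one entry to `models`.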
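Judging from the results, the random forest performs best, with its metrics mostly around 0.99, possibly in part because I set the number of trees (n_estimators) fairly high. SVC is slightly better than Gaussian naive Bayes. The three metrics differ very little from one another, so they all appear to reflect how well each model was trained.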
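One likely reason the AUC values track accuracy so closely is that the script passes hard 0/1 predictions to `roc_auc_score`. For binary labels, the ROC curve then has a single interior point, and the area under it equals the mean of the true positive rate and the true negative rate, which is balanced accuracy. The sketch below checks this on a fresh synthetic dataset and also computes AUC from predicted probabilities for comparison. The split, the `random_state` values, and the variable names are assumptions of mine, not part of the original assignment.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary dataset; random_state is fixed only for repeatability (assumption).
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = GaussianNB().fit(X_train, y_train)
pred = clf.predict(X_test)               # hard 0/1 labels, as in the script above
proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_test, pred)
auc_from_scores = roc_auc_score(y_test, proba)
bal_acc = balanced_accuracy_score(y_test, pred)

print("AUC from hard labels:  ", auc_from_labels)
print("balanced accuracy:     ", bal_acc)
print("AUC from probabilities:", auc_from_scores)
```

If the goal is a ranking-based AUC, passing `predict_proba` (or `decision_function` for SVC) scores is the usual choice; the label-based value is still a valid number, but it carries much the same information as accuracy on a balanced dataset.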