scikit-learn is a powerful machine learning library that provides implementations of many common machine learning algorithms.
scikit-learn can be installed with pip:
pip install -U scikit-learn
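The package is fairly large. If pip times out, you can download a .whl file that matches your platform and Python version from PyPI and install it locally with pip.
Once installed, it can be imported in Python: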
import sklearn
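A quick way to confirm that the installation worked is to print the installed version. This is only a small sketch; the version string will differ from machine to machine.

```python
import sklearn

# The version string depends on the local installation.
print(sklearn.__version__)
```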
The official sklearn documentation is detailed and clear; working through the User Guide is the recommended way to learn sklearn.
Dataset Loading
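sklearn builds on numpy's array and vectorized operations, so data can be loaded the same way as in numpy:
import numpy
# loadtxt accepts a file path directly, so no open file handle is left behind
dataSet = numpy.loadtxt('dataSet.txt')
dataSet is a numpy ndarray (not a numpy matrix object).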
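To make the numpy route concrete, here is a minimal sketch that reads whitespace-separated text and splits it into features and labels. The file layout (two feature columns followed by a label column) and the in-memory stand-in for the file are assumptions made for illustration only.

```python
import io
import numpy as np

# Hypothetical file contents: two feature columns followed by a label column.
text = io.StringIO("0.0 0.0 0\n1.0 1.0 1\n2.0 2.0 1\n")
data_set = np.loadtxt(text)

# Split the array into a feature matrix and a label vector.
X = data_set[:, :-1]
y = data_set[:, -1].astype(int)

print(type(data_set).__name__, data_set.shape)
print(X.shape, y)
```

The same slicing works on an array loaded from a real file path.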
Alternatively, data stored in the libsvm format can be loaded:
from sklearn.datasets import load_svmlight_file
X_train, y_train = load_svmlight_file("dataSet.txt")
X_dense = X_train.toarray()  # convert the sparse matrix to a dense array
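For readers who have no libsvm file at hand, the following sketch writes a tiny made-up dataset in that format to an in-memory buffer and reads it back. The data values are invented, and passing n_features and zero_based explicitly is a choice made here to avoid relying on index auto-detection.

```python
import io
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Made-up data: three samples, two features, mostly zeros.
X = np.array([[0.0, 1.5], [2.0, 0.0], [0.0, 0.0]])
y = np.array([0, 1, 0])

# Write the data in libsvm format to an in-memory binary buffer.
buffer = io.BytesIO()
dump_svmlight_file(X, y, buffer, zero_based=True)
print(buffer.getvalue().decode())

# Read it back; the feature matrix comes back as a scipy sparse matrix.
buffer.seek(0)
X_loaded, y_loaded = load_svmlight_file(buffer, n_features=2, zero_based=True)
print(X_loaded.toarray())
print(y_loaded)
```

Only the non-zero entries are stored in the text form, which is why the format suits sparse data.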
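sklearn also ships with a few sample datasets:
from sklearn import datasets
iris = datasets.load_iris()
print(iris.data)
The code above loads the well-known Anderson iris flower dataset; iris.data holds the feature values and iris.target holds the class labels.
For more on loading data, see User Guide - Dataset loading utilities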
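Before training anything, it helps to look at what the loaded object contains. A short sketch that inspects the iris dataset:

```python
from sklearn import datasets

iris = datasets.load_iris()

print(iris.data.shape)          # (150, 4): 150 samples, 4 features
print(iris.target.shape)        # (150,): one label per sample
print(iris.feature_names)
print(list(iris.target_names))  # ['setosa', 'versicolor', 'virginica']
```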
Supervised learning
LinearRegression
Linear regression is the most classic algorithm:
from sklearn import linear_model
train_x = [[0, 0], [1, 1]]
train_y = [0, 1]
test_x = [[0, 0.2]]
regr = linear_model.LinearRegression()
regr.fit(train_x, train_y)
print(regr.predict(test_x))
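To see what the fitted model has learned, the coefficients can be inspected through coef_ and intercept_. The sketch below uses made-up, noise-free data generated from y = 1 + 2*x1 + 3*x2, so the fit should recover those numbers almost exactly.

```python
import numpy as np
from sklearn import linear_model

# Made-up training data generated from y = 1 + 2*x1 + 3*x2 (no noise).
train_x = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
train_y = 1 + 2 * train_x[:, 0] + 3 * train_x[:, 1]

regr = linear_model.LinearRegression()
regr.fit(train_x, train_y)

print(regr.coef_)       # close to [2, 3]
print(regr.intercept_)  # close to 1

prediction = regr.predict([[3, 2]])
print(prediction)       # close to 1 + 2*3 + 3*2 = 13
```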
Logistic regression is a common variant that, despite its name, is used for classification:
from sklearn import linear_model
train_x = [[0, 0], [1, 1]]
train_y = [0, 1]
test_x = [[0, 0.2]]
clf = linear_model.LogisticRegression()
clf.fit(train_x, train_y)
print(clf.predict(test_x))
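Besides hard labels, a logistic regression model can report class probabilities through predict_proba. A minimal sketch on the same toy data, with default hyperparameters:

```python
from sklearn import linear_model

train_x = [[0, 0], [1, 1]]
train_y = [0, 1]

clf = linear_model.LogisticRegression()
clf.fit(train_x, train_y)

labels = clf.predict(train_x)
# One row per sample, one column per class, in the order of clf.classes_.
proba = clf.predict_proba([[0, 0.2]])

print(clf.classes_)
print(labels)
print(proba)
```

With only two training points the probabilities stay fairly close to 0.5, since the default regularization keeps the weights small.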
For more linear models, see User Guide - Linear Model
Support Vector Machine
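SVM is a very effective classification algorithm. sklearn provides three SVM-based classifiers: SVC, NuSVC, and LinearSVC.
SVC and NuSVC are very similar. SVC uses the parameter C (the penalty factor, or cost) to control how closely the model fits the training data; C must be strictly positive. NuSVC instead uses nu, an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors, which takes values in (0, 1].
from sklearn import svm
train_x = [[0, 0], [1, 1]]
train_y = [0, 1]
clf = svm.SVC()
clf.fit(train_x, train_y)
print(clf.predict([[0.9, 0.9]]))  # predict expects a 2D array: one row per sample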
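NuSVC is used the same way as SVC, with nu taking the place of C. The sketch below uses a made-up two-cluster dataset and nu=0.5; both are assumptions chosen for illustration, not recommended settings.

```python
from sklearn import svm

# Made-up data: one cluster near the origin, one near (3, 3).
train_x = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
train_y = [0, 0, 0, 1, 1, 1]

clf = svm.NuSVC(nu=0.5)
clf.fit(train_x, train_y)

pred = clf.predict([[0.5, 0.5], [3.5, 3.5]])
print(pred)
print(len(clf.support_), "support vectors out of", len(train_x))
```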
SVC and NuSVC use the one-against-one strategy for multiclass classification:
from sklearn import svm
train_x = [[0, 0], [1, 1], [2, 2], [3, 3]]
train_y = [0, 1, 2, 3]
clf = svm.SVC(decision_function_shape='ovo')
clf.fit(train_x, train_y)
print(clf.predict([[1.9, 1.9]]))  # predict expects a 2D array
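The effect of the one-against-one strategy is visible in the shape of the decision function: with 4 classes there are 4 * 3 / 2 = 6 pairwise classifiers. The sketch below compares decision_function_shape='ovo' with 'ovr'. Note that SVC trains pairwise classifiers internally in both cases; 'ovr' only changes how the scores are presented.

```python
from sklearn import svm

train_x = [[0, 0], [1, 1], [2, 2], [3, 3]]
train_y = [0, 1, 2, 3]

ovo = svm.SVC(decision_function_shape='ovo').fit(train_x, train_y)
ovr = svm.SVC(decision_function_shape='ovr').fit(train_x, train_y)

ovo_shape = ovo.decision_function([[1.9, 1.9]]).shape
ovr_shape = ovr.decision_function([[1.9, 1.9]]).shape
print(ovo_shape)  # (1, 6): one column per pair of classes
print(ovr_shape)  # (1, 4): one column per class
```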
LinearSVC uses the one-against-rest strategy for multiclass classification:
from sklearn import svm
train_x = [[0, 0], [1, 1], [2, 2], [3, 3]]
train_y = [0, 1, 2, 3]
clf = svm.LinearSVC()
clf.fit(train_x, train_y)
print(clf.predict([[1.9, 1.9]]))  # predict expects a 2D array
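With one-against-rest, the model holds one linear classifier per class, which shows up in the shape of coef_. A minimal sketch; max_iter=10000 is an assumption added here only to reduce the chance of a convergence warning on this tiny dataset.

```python
from sklearn import svm

train_x = [[0, 0], [1, 1], [2, 2], [3, 3]]
train_y = [0, 1, 2, 3]

clf = svm.LinearSVC(max_iter=10000)
clf.fit(train_x, train_y)

coef_shape = clf.coef_.shape
score_shape = clf.decision_function([[1.9, 1.9]]).shape
print(coef_shape)   # (4, 2): one weight vector per class
print(score_shape)  # (1, 4): one score per class
```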
For more on SVM, see the User Guide
K Nearest Neighbors
The k-nearest neighbors algorithm is a very simple classification algorithm. The neighbor search it relies on can be run on its own with NearestNeighbors:
from sklearn.neighbors import NearestNeighbors
x = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
y = [[0, 0], [-1, 2], [3, 1]]
nbrs = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(x)
dist, index = nbrs.kneighbors(y)
print(dist)
print(index)
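dist holds, for each point in the query set y, the distances to its nearest neighbors in x:
[[ 1.41421356  1.41421356  2.23606798]
 [ 2.23606798  3.          3.16227766]
 [ 1.          1.          2.        ]]
index holds the indices of those nearest neighbors in x:
[[0 3 1]
 [3 0 1]
 [4 5 3]]
The number of neighbors is set by the n_neighbors parameter, and the algorithm parameter selects the search algorithm, for example 'kd_tree' or 'ball_tree' (backed by the KDTree and BallTree classes).
For more on the kNN algorithm, see the User Guide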
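The text describes k-nearest neighbors as a classification algorithm, so here is a minimal sketch of the classifier form, KNeighborsClassifier. The labels are made up for illustration: the three points with negative coordinates are class 0 and the other three are class 1.

```python
from sklearn.neighbors import KNeighborsClassifier

x = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
labels = [0, 0, 0, 1, 1, 1]  # made-up labels for illustration

knn = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
knn.fit(x, labels)

# Each query point gets the majority label of its 3 nearest neighbors.
pred = knn.predict([[-1.5, -1], [2, 2]])
print(pred)  # [0 1]
```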
Naive Bayes
Naive Bayes is a classic probabilistic classification algorithm:
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
iris = datasets.load_iris()
gnb = GaussianNB()
gnb.fit(iris.data, iris.target)
y_pred = gnb.predict(iris.data)  # predicted class labels
y_proba = gnb.predict_proba(iris.data)  # per-class probabilities
For more, see the User Guide
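To get a feel for what predict and predict_proba return, the following sketch measures how often GaussianNB reproduces the iris labels. The accuracy is computed on the training data itself, so it is an optimistic figure rather than an estimate of performance on new data.

```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
gnb = GaussianNB()
gnb.fit(iris.data, iris.target)

y_pred = gnb.predict(iris.data)
y_proba = gnb.predict_proba(iris.data)

# Fraction of training samples whose predicted label matches the true label.
accuracy = (y_pred == iris.target).mean()
print(accuracy)
print(y_proba[0])  # one probability per class; each row sums to 1
```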
Decision Tree
sklearn provides decision tree implementations for both classification and regression:
from sklearn import tree
x = [[0, 0], [1, 1]]
y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x, y)
clf.predict([[2, 2]])  # array([1]): the predicted class
clf.predict_proba([[2., 2.]])  # array([[0., 1.]]): the fraction of training samples of each class in the matching leaf
Regression:
from sklearn import tree
x = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = tree.DecisionTreeRegressor()
clf = clf.fit(x, y)
clf.predict([[1, 1]])  # array([0.5])
For more on decision trees, see the User Guide
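A fitted tree can be printed as plain-text rules with export_text, which makes the learned splits easy to read. This is a small sketch on the iris dataset; max_depth=2 and random_state=0 are assumptions chosen to keep the output short and repeatable.

```python
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Render the tree as indented if/else rules using the feature names.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```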
Random Forest
A random forest is an ensemble method that combines multiple decision trees for classification:
from sklearn.ensemble import RandomForestClassifier
train_x = [[0, 0], [1, 1], [2, 2], [3, 3]]
train_y = [0, 1, 2, 3]
test_x = [[0.9, 0.9]]  # predict expects a 2D array: one row per sample
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(train_x, train_y)
print(clf.predict(test_x))
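Because a forest is a collection of trees, it exposes both the individual trees and an aggregated view of which features mattered. A minimal sketch on the iris dataset; random_state=0 is an assumption added for repeatability.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(iris.data, iris.target)

print(len(clf.estimators_))  # 10 fitted decision trees

# Importances are normalized so that they sum to 1 across features.
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(name, round(float(importance), 3))

# Class probabilities are averaged over the trees.
proba = clf.predict_proba(iris.data[:1])
print(proba)
```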
Cross validation
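Cross-validation is an important way to estimate how well a model will predict on unseen data. sklearn provides tools that repeatedly split the dataset into training and validation sets, which helps with model selection and with detecting overfitting:
from sklearn.model_selection import cross_val_score  # the old sklearn.cross_validation module was removed in 0.20
from sklearn import svm
from sklearn import datasets
iris = datasets.load_iris()
clf = svm.SVC()
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores holds one accuracy value per fold (five here), not a per-class confidence.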
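The simplest form of the train/validation split is a single hold-out set, which train_test_split produces. The sketch below shows that next to 5-fold cross-validation for comparison; test_size=0.2 and random_state=0 are assumptions chosen for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split

iris = datasets.load_iris()

# Hold out 20% of the samples for validation.
X_train, X_val, y_train, y_val = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

clf = svm.SVC()
clf.fit(X_train, y_train)
holdout_accuracy = clf.score(X_val, y_val)
print(X_train.shape, X_val.shape)
print(holdout_accuracy)

# 5-fold cross-validation: every sample is used for validation exactly once.
cv_scores = cross_val_score(svm.SVC(), iris.data, iris.target, cv=5)
print(cv_scores.mean(), cv_scores.std())
```

Reporting the mean and standard deviation over folds gives a less noisy picture than a single hold-out number.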
Source: https://www.cnblogs.com/Finley/p/5816097.html