An Introduction to Machine Learning with Python and Scikit-Learn

Loading the Data

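import numpy as np
from urllib.request import urlopen

# Pima Indians diabetes dataset: comma-separated values, no header row
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
raw_data = urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=',')
# features and target (column 8); note that 0:7 keeps only the first 7 of the
# 8 feature columns, which matches the 7-value outputs shown later
X = dataset[:, 0:7]
Y = dataset[:, 8]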

Data Normalization

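Gradient-based methods, which underlie many machine learning algorithms, are sensitive to the scale of the data. Before running an algorithm we should therefore either normalize or standardize the data. Normalization rescales the values of every feature so that each lies between 0 and 1. Standardization preprocesses the data so that every feature has mean 0 and variance 1. The Scikit-Learn library provides functions for both.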
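As a minimal sketch of the two rescaling schemes just described, the example below applies Scikit-Learn's `MinMaxScaler` and `StandardScaler` to a small made-up array. The array `X_demo` and its values are assumptions chosen only for illustration; they are not the diabetes data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# toy data (hypothetical values): 4 samples, 2 features on very different scales
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0],
                   [4.0, 800.0]])

# normalization: rescale each feature (column) into the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X_demo)

# standardization: shift and scale each feature to mean 0 and variance 1
X_standard = StandardScaler().fit_transform(X_demo)

# per-feature minimum and maximum after normalization
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
# per-feature mean and standard deviation after standardization
print(X_standard.mean(axis=0), X_standard.std(axis=0))
```

Both scalers learn their parameters in `fit`, so in a real workflow they would be fitted on the training data only and then applied to new data with `transform`.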

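from sklearn.ensemble import ExtraTreesClassifier

# estimate the importance of each feature with an ensemble of randomized trees
# (tree-based feature importance is discussed under feature selection below)
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)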

[ 0.12963978  0.27368045  0.10923178  0.09019604  0.0845399   0.17462474
  0.13808731]

Feature Selection

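Without a doubt, the most important part of solving a problem is the ability to select, and even create, the right features. These two activities are called feature selection and feature engineering. Feature engineering is a fairly creative process that often relies on intuition and domain expertise, whereas for feature selection many ready-to-use algorithms already exist. Tree-based algorithms, for example, can compute how informative each feature is.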

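Other methods are based on an efficient search over subsets of features to find the best one, meaning the subset on which the resulting model achieves the best quality. Recursive Feature Elimination (RFE) is one of these search algorithms, and the Scikit-Learn library provides it as well.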

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, Y)
# boolean mask of the selected features, then the rank of each feature (1 = selected)
print(rfe.support_)
print(rfe.ranking_)

[ True False False False False  True  True]
[1 2 3 5 4 1 1]

Logistic Regression

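Logistic regression is most often used for binary classification, but it also applies to multiclass classification (the so-called one-vs-rest approach). An advantage of this algorithm is that, for every object, it outputs the probability of belonging to each class.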
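The per-class probabilities mentioned above can be read with `predict_proba`. The following is a minimal sketch on synthetic data generated with `make_classification`; the dataset and the names `X_demo`, `y_demo`, and `clf` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic binary classification data (an assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression()
clf.fit(X_demo, y_demo)

# predict_proba returns one row per sample and one column per class,
# with columns ordered as in clf.classes_
proba = clf.predict_proba(X_demo[:3])
print(clf.classes_)
print(proba)
```

Each row of `proba` sums to 1, and `predict` simply returns the class with the highest probability in that row.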

from sklearn import metrics
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, Y)
# make predictions (on the training data itself)
expected = Y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

             precision    recall  f1-score   support

        0.0       0.79      0.89      0.84       500
        1.0       0.74      0.55      0.63       268

avg / total       0.77      0.77      0.77       768

[[447  53]
 [120 148]]

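The confusion matrix counts how the samples of each true class were classified: entry (i, j) is the number of samples whose true class is i and whose predicted class is j. In the output above, for example, 447 samples of class 0 were predicted as 0 and 53 were predicted as 1. An example follows.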

import numpy as np
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
# Jupyter notebook magic; remove this line when running as a plain script
%matplotlib inline

# import some data to play with (separate names keep the diabetes X and Y intact)
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target

# split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, random_state=0)
# run the classifier, using a model that is too regularized (C too low) to see the impact on the results
classifier = svm.SVC(kernel='linear', C=0.01)
y_pred = classifier.fit(X_train, y_train).predict(X_test)

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(iris.target_names))
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    plt.tight_layout()
    # rows of the matrix are true labels, columns are predicted labels
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

cm = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print('Confusion matrix, without normalization')
print(cm)
plt.figure()
plot_confusion_matrix(cm)

# normalize the confusion matrix by row
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')

plt.show()

Confusion matrix, without normalization
[[13  0  0]
 [ 0 10  6]
 [ 0  0  9]]
Normalized confusion matrix
[[ 1.    0.    0.  ]
 [ 0.    0.62  0.38]
 [ 0.    0.    1.  ]]
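To make the row and column convention concrete, here is a minimal sketch on hand-made labels. The two label lists are hypothetical and exist only so the matrix can be checked by hand: rows are true classes, columns are predicted classes, and dividing the diagonal by the row sums gives the per-class recall.

```python
from sklearn.metrics import confusion_matrix

# hypothetical labels: three samples of class 0, two of class 1, one of class 2
y_true_demo = [0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 1, 1, 2, 2]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# per-class recall: correctly classified samples divided by all samples of that class
recall_demo = cm_demo.diagonal() / cm_demo.sum(axis=1)
print(recall_demo)
```

Row 0 reads "of the three true class-0 samples, two were predicted as 0 and one as 1", which is the same reading used for the normalized matrix above.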

Naive Bayes

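Naive Bayes is also one of the best-known machine learning algorithms. Its main task is to recover the density of the data distribution of the training sample. The method often performs well on multiclass classification problems.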
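For the Gaussian variant used in this section, "recovering the density" amounts to estimating a normal distribution for every feature within every class. The sketch below is only an illustration under that assumption: it fits `GaussianNB` on two made-up Gaussian clusters and prints the class means and class priors the model estimated. The data and the names `X_demo`, `y_demo`, and `nb` are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
# two hypothetical classes drawn from Gaussians centered at 0 and at 5
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(5.0, 1.0, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

nb = GaussianNB().fit(X_demo, y_demo)

# theta_ holds the estimated mean of each feature within each class
print(nb.theta_)
# class_prior_ holds the estimated probability of each class
print(nb.class_prior_)
```

The estimated means should land near the cluster centers, which is the sense in which the model recovers the distribution that generated the training sample.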

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, Y)
# make predictions (on the training data itself)
expected = Y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

             precision    recall  f1-score   support

        0.0       0.80      0.86      0.83       500
        1.0       0.69      0.60      0.64       268

avg / total       0.76      0.77      0.76       768

[[429  71]
 [108 160]]

K-Nearest Neighbors (kNN)

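The kNN (k-nearest neighbors) method is often used as one part of a more complex classification algorithm. For example, we can use its estimate as a feature of an object. Sometimes a simple kNN on well-chosen features performs very well. When its parameters (mainly the distance metric) are set properly, the algorithm often gives good quality on regression problems as well.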
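Since the distance metric is singled out as the main parameter, a rough way to see its effect is to cross-validate the same classifier under several metrics. The sketch below does this on the built-in iris dataset; the dataset, the three metrics, and `n_neighbors=5` are assumptions picked for illustration, not settings taken from this walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris()

scores = {}
for metric in ("euclidean", "manhattan", "chebyshev"):
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    # mean accuracy over 5 cross-validation folds
    scores[metric] = cross_val_score(knn, iris_demo.data, iris_demo.target, cv=5).mean()

for metric, score in scores.items():
    print(metric, round(score, 3))
```

On a dataset whose features have very different scales, the ranking of metrics can change a lot, which is one more reason to rescale the features first.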

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X, Y)
# make predictions (on the training data itself)
expected = Y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

             precision    recall  f1-score   support

        0.0       0.82      0.90      0.86       500
        1.0       0.77      0.63      0.69       268

avg / total       0.80      0.80      0.80       768

[[448  52]
 [ 98 170]]

Decision Trees

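Classification and Regression Trees (CART) are often used for problems in which the objects have categorical features, and they apply to both regression and classification. Decision trees also handle multiclass classification.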

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X, Y)
# make predictions (on the training data itself)
expected = Y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]
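The perfect scores above are measured on the same samples the tree was trained on, so they say little about how the model behaves on new data. A minimal sketch of a held-out evaluation follows. It uses the built-in breast cancer dataset and a 70/30 split as stand-ins; both are assumptions for illustration and not part of this walkthrough.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data_demo = load_breast_cancer()
# hold out 30% of the samples for testing
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    data_demo.data, data_demo.target, test_size=0.3, random_state=0)

tree_demo = DecisionTreeClassifier(random_state=0)
tree_demo.fit(X_train_demo, y_train_demo)

# accuracy on the data the tree has seen versus data it has not seen
train_acc = tree_demo.score(X_train_demo, y_train_demo)
test_acc = tree_demo.score(X_test_demo, y_test_demo)
print(train_acc, test_acc)
```

An unpruned tree can memorize its training set, so the gap between the two numbers is a more useful signal than the training score alone.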

Support Vector Machines (SVM)

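Support vector machines are used mainly for classification problems. Like logistic regression, they can also handle multiclass classification.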

from sklearn import metrics
from sklearn.svm import SVC

model = SVC()
model.fit(X, Y)
# make predictions (on the training data itself)
expected = Y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       500
        1.0       1.00      1.00      1.00       268

avg / total       1.00      1.00      1.00       768

[[500   0]
 [  0 268]]

How to Optimize Algorithm Parameters

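One of the hardest steps in building an effective algorithm is choosing the right parameters. The example below uses a grid search over the regularization parameter alpha of ridge regression.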

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# candidate values for the regularization strength alpha
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, Y)
# best cross-validated score and the alpha that achieved it
print(grid.best_score_)
print(grid.best_estimator_.alpha)

0.282118955686
1.0

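Sometimes it is more efficient to sample a parameter at random from a given range, estimate the quality of the algorithm for each sampled value, and then choose the best one.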

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# sample alpha from a uniform distribution on [0, 1]
param_grid = {'alpha': sp_rand()}
model = Ridge()
research = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
research.fit(X, Y)
# best cross-validated score and the alpha that achieved it
print(research.best_score_)
print(research.best_estimator_.alpha)

0.282118855912
0.996297445279

import re
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
# strip parentheses and spaces from the feature names to get simple column names
columns = [re.sub(r'[() ]', '', col) for col in iris.feature_names]
iris_df = pd.DataFrame(iris.data, columns=columns)
# map each numeric target to its species name
iris_df['species'] = [iris.target_names[i] for i in iris.target]
iris_df.head(10)

	sepallengthcm	sepalwidthcm	petallengthcm	petalwidthcm	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
5	5.4	3.9	1.7	0.4	setosa
6	4.6	3.4	1.4	0.3	setosa
7	5.0	3.4	1.5	0.2	setosa
8	4.4	2.9	1.4	0.2	setosa
9	4.9	3.1	1.5	0.1	setosa