Data Mining - OneR Algorithm - Iris Dataset Analysis - Classification with the OneR Algorithm (Part 5)


zhangyingchengqi · Published 2017-01-30 22:26:59

Continuing from the previous post, this one uses the OneR algorithm to classify the iris dataset.

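# coding: utf-8  
# Classify the iris dataset with the OneR algorithm
# Reference:   http://www.cnblogs.com/htynkn/archive/2012/04/14/2446905.html
# Idea: assign each individual to the class that individuals with the same feature value most often belong to in the existing data.
# OneR is short for "one rule": of the four features, only the one that classifies best is used as the basis for classification.
# Steps:
    #1. Discretize the feature values: OneR works on categorical features, but the original dataset is continuous, so the continuous values must be converted to categories.
        # A simple discretization: pick a threshold, set values below it to 0 and values at or above it to 1. The threshold for a feature is the mean of all its values.
    #2. For every value of every feature, count how often it occurs in each class. Find the class in which it occurs most often, and count its occurrences in the other classes.
    #3. Once every feature value has been counted per class, compute each feature's error as the sum of the errors of its individual values. The feature with the lowest error becomes the single classification rule used from then on.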
    

import numpy as np
from sklearn import datasets
iris=datasets.load_iris()
x=iris.data
y=iris.target
n_samples,n_features=x.shape    # (150, 4): number of rows (samples) and columns (features)

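# Compute the mean of each feature column
attribute_means=x.mean(axis=0)   # axis=0 averages down each column; axis=1 would average across each row
# Result:  array([ 5.84333333,  3.054     ,  3.75866667,  1.19866667]) 
# The result is an array of four values, one mean per feature. Using these means as thresholds binarizes the dataset, turning the continuous features into categorical ones, which completes step 1.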
x_d=np.array( x>=attribute_means, dtype='int')

# Next, split the dataset into a training set and a test set
# Option 1: as in the previous example, split 140:10
#np.random.seed(0)
# permutation: returns a random ordering   https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.permutation.html
#i=np.random.permutation(len(x_d))
# Training set: the first 140 rows after shuffling
#x_train=x_d[i[:-10]]   # first 140 rows
#y_train=y[i[:-10]]   # flower classes of those 140 rows
# Display x_train, y_train
#x_train
#y_train
# Test set
#x_test=x_d[i[-10:]]   # last 10 rows
#y_test=y[i[-10:]]   # flower classes of those 10 rows
# Option 2: use the split function provided by scikit-learn
from sklearn.model_selection import train_test_split   # sklearn.cross_validation was removed in scikit-learn 0.20
# About train_test_split:  http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
x_train,x_test,y_train,y_test=train_test_split( x_d,y,random_state=14)
print("{} training samples".format(x_train.shape[0]))
print("{} testing samples".format(x_test.shape[0]))




from collections import defaultdict   # http://www.pythontab.com/html/2013/pythonjichu_1023/594.html
from operator import itemgetter       # http://www.cnblogs.com/100thMountain/p/4719503.html

# Define a function: walk through every row of the dataset and count, per class, the individuals that have the given feature value
            # Parameters:   dataset, class array, index of the chosen feature, feature value
def train_feature_value( X, y_true, feature_index, value):
    class_counts=defaultdict(int)
    # zip usage:  http://www.cnblogs.com/frydsh/archive/2012/07/10/2585370.html
    for sample,y in zip(X, y_true):
        if sample[feature_index]==value:
            class_counts[y]+=1
    # Sort class_counts in descending order; the first entry is the class in which individuals with this feature value occur most often
    sorted_class_counts=sorted(class_counts.items(), key=itemgetter(1),reverse=True)
    most_frequent_class=sorted_class_counts[0][0]
    # Error of this rule: the number of samples with this value that belong to any other class
    error=sum(  [ class_count for class_value,class_count in class_counts.items() if class_value!=most_frequent_class ])
    return most_frequent_class,error
    
    
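# Define another function:
    # For a given feature, iterate over each of its values and call the function above to get the predicted class and the error for each value; summing those errors gives the feature's total error.
def train(X,y_true,feature_index):
    # Take the column selected by feature_index as an array, then use set() to get its distinct values.
    values=set(X[:,feature_index])
    # Dictionary mapping each feature value to its predicted class
    predictors=dict()
    errors=[]
    for current_value in values:
        most_frequent_class,error=train_feature_value( X,y_true, feature_index, current_value)
        predictors[current_value]=most_frequent_class
        errors.append( error )
    # Finally, compute the total error of this rule
    total_error=sum(errors)
    return predictors,total_error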
    

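# Train a predictor for every feature
all_predictors={variable:train(x_train,y_train,variable) for variable in range(x_train.shape[1])}
errors={variable:error for variable, (mapping,error) in all_predictors.items()}

# Find the feature with the lowest error; it becomes the single classification rule
best_feature,best_error=sorted(errors.items(),key=itemgetter(1))[0]
# Build the model from the best feature and its predictor
model={'feature':best_feature,'predictor':all_predictors[best_feature][0]}
# The model holds two items: the feature used for classification and its predictor
# Now the test set can be classified

# First define a prediction function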
def predict( x_test,model):
    feature=model['feature']
    predictor=model['predictor']
    y_predicted=np.array([predictor[int(sample[feature])] for sample in x_test])
    return y_predicted
    
# Call the function above to predict the test set
y_predicted=predict( x_test,model)
# Predictions from the model:  array([0, 0, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 0, 0, 0,
  #     2, 0, 2, 0, 2, 2, 0, 0, 0, 2, 0, 2, 0, 2, 2])

y_test
# Ground truth: array([0, 0, 0, 1, 2, 1, 0, 1, 0, 1, 2, 0, 2, 2, 0, 1, 0, 2, 2, 1, 0, 0, 0,
 #      1, 0, 2, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 2, 1])

# Compute the accuracy
accuracy=np.mean( y_predicted==y_test)*100
print("The model accuracy is {:.1f}%".format( accuracy))   # 65.8%

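Finally, evaluating on the test set gives an accuracy of 65.8% (25 of 38 samples correct). For a rule this simple, that is a respectable result. Note that the predictions contain only classes 0 and 2: a single binarized feature has two values, so the rule can map to at most two classes and class 1 is never predicted.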
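
To see the three steps (discretize, count per value, pick the feature with the lowest error) in isolation, here is a minimal pure-Python sketch on a made-up six-row dataset. The helper names `binarize` and `one_r` and the toy values are assumptions for illustration, not part of the script above; when two features tie, this sketch simply keeps the one with the lower index.

```python
from collections import Counter, defaultdict

def binarize(rows):
    """Turn each continuous column into 0/1, using the column mean as threshold."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[int(v >= m) for v, m in zip(row, means)] for row in rows]

def one_r(rows, labels):
    """Return (feature index, value -> class mapping, error count) for the best feature."""
    best = None
    for f in range(len(rows[0])):
        # counts[value][label] = number of rows with this value and this label
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[f]][label] += 1
        # Each value predicts its most frequent class
        mapping = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors: rows that do not belong to the predicted class of their value
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (f, mapping, errors)
    return best

# Hypothetical toy data: feature 0 is noise, feature 1 separates the classes.
toy_rows = [[5.0, 1.0], [1.0, 1.2], [5.2, 0.9], [0.8, 4.1], [5.1, 3.9], [1.1, 4.0]]
toy_labels = ["a", "a", "a", "b", "b", "b"]

toy_binary = binarize(toy_rows)
feature, mapping, errors = one_r(toy_binary, toy_labels)
print(feature, mapping, errors)   # 1 {0: 'a', 1: 'b'} 0
```

Feature 0 misclassifies two rows after binarization, while feature 1 misclassifies none, so the sketch selects feature 1 as its single rule.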