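The notes below are based on the book 《Python数据挖掘入门与实践》 (Learning Data Mining with Python).

1.4 Classification

A classification model is trained on a dataset whose classes are known, and the model is then used to classify data whose class is unknown. A spam filter is one example.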

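Discretization: when the features of a dataset are continuous but the algorithm expects categorical feature values, the continuous values have to be converted into categories. This conversion is called discretization.
The simplest discretization algorithm: choose a threshold, set feature values below the threshold to 0 and values at or above it to 1.
One way to compute the threshold: take the mean of all values of that feature.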
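
As a small illustration of mean-threshold discretization, the sketch below uses a made-up 4x2 matrix (the values and the names X_toy, thresholds and X_toy_discrete are assumptions for this example only, not data from the book):

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 continuous features (made-up values).
X_toy = np.array([
    [5.1, 0.2],
    [4.9, 0.4],
    [6.3, 1.8],
    [5.8, 1.6],
])

# One threshold per feature: the mean of that feature's column.
thresholds = X_toy.mean(axis=0)

# Values at or above the threshold become 1, values below it become 0.
X_toy_discrete = (X_toy >= thresholds).astype(int)

print(thresholds)        # per-feature means
print(X_toy_discrete)    # 0/1 matrix with the same shape as X_toy
```

The first two rows fall below both column means and become 0, the last two rows become 1.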

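OneR algorithm

For each value of each feature, count how many times it appears in each class, and take the most frequent class as the prediction for that value.
Error of feature i = sum, over its values j, of the number of samples with value j that do not belong to the most frequent class of value j. Equivalently, error rate of feature i = 1 - (sum over j of the count of the most frequent class for value j) / (total number of samples).
Select the feature with the lowest error as the one and only classification rule.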
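
The counting step can be traced by hand on a tiny dataset. The sketch below is an illustration only: the six samples, the labels "a" and "b", and the helper name feature_error are made up for this example.

```python
from collections import Counter, defaultdict

# Toy dataset: each row is (feature_0, feature_1), already discretized to 0/1.
samples = [(0, 0), (0, 1), (0, 1), (1, 0), (1, 1), (1, 1)]
labels = ["a", "a", "a", "b", "b", "a"]


def feature_error(feature_index):
    """Return (rule, error) for one feature, where rule maps value -> class."""
    counts = defaultdict(Counter)
    for sample, label in zip(samples, labels):
        counts[sample[feature_index]][label] += 1

    rule = {}
    error = 0
    for value, class_counts in counts.items():
        best_class, best_count = class_counts.most_common(1)[0]
        rule[value] = best_class
        # Samples with this value that are NOT in the most frequent class.
        error += sum(class_counts.values()) - best_count
    return rule, error


results = {i: feature_error(i) for i in range(2)}
best_feature = min(results, key=lambda i: results[i][1])

print(results[0])     # rule and error for feature 0
print(results[1])     # rule and error for feature 1
print(best_feature)   # index of the feature with the lowest error
```

Feature 0 misclassifies one sample and feature 1 misclassifies two, so feature 0 becomes the rule, with an error rate of 1/6.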

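Testing the algorithm
A machine learning workflow has two steps: training and testing. One part of the dataset is used to train the model, and another part is used to test how well the model generalizes to unseen data.

Overfitting: the model performs well on the training set but poorly on data it has not seen. To detect it, never test the model on the data it was trained on; split the dataset into two parts, one for training and one for testing.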
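
A hold-out split can be sketched with NumPy alone. This is a minimal illustration on a made-up dataset of 20 samples; the names X_demo, y_demo and the 25% test fraction are assumptions chosen to mirror the default of scikit-learn's train_test_split, not the library's implementation.

```python
import numpy as np

n_samples = 20  # hypothetical dataset size
X_demo = np.arange(n_samples * 2).reshape(n_samples, 2)
y_demo = np.arange(n_samples) % 2

# Shuffle the row indices with a fixed seed so the split is reproducible.
rng = np.random.default_rng(14)
indices = rng.permutation(n_samples)

# Hold out 25% of the rows for testing, train on the rest.
n_test = int(round(n_samples * 0.25))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]

print(X_train_demo.shape, X_test_demo.shape)  # (15, 2) (5, 2)
```

No row appears in both parts, so accuracy measured on the test part reflects data the model has not seen.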

Code implementation

# Load the Iris plant classification dataset bundled with scikit-learn
from sklearn.datasets import load_iris
dataset = load_iris()
print(dataset.DESCR)  # prints a detailed description of the dataset

import numpy as np
from collections import defaultdict
from operator import itemgetter

X = dataset.data    # feature values
Y = dataset.target  # class labels

# Discretize the continuous values: 1 if a value is at or above its feature's mean, else 0
attribute_mean = X.mean(axis=0)
X_d = np.array(X >= attribute_mean, dtype='int')


def train_feature_value(X, y_true, feature_index, value):
    # Count how often samples with this feature value appear in each class
    class_counts = defaultdict(int)
    for sample, y in zip(X, y_true):
        if sample[feature_index] == value:
            class_counts[y] += 1
    # Sort the counts to find the class this feature value most likely belongs to
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples with this value that are not in the most frequent class
    incorrect_predictions = [class_count for class_value, class_count in class_counts.items()
                             if class_value != most_frequent_class]
    error = sum(incorrect_predictions)
    return most_frequent_class, error


def train_on_value(X, y_true, feature_index):
    predictors = {}  # maps each feature value to the class predicted for it
    errors = []
    values = set(X[:, feature_index])
    for v in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature_index, v)
        predictors[v] = most_frequent_class
        errors.append(error)
    # The total error of the feature is the sum of the errors of its values
    total_error = sum(errors)
    return predictors, total_error

# Split the dataset into a training set and a testing set (the test set defaults to 25%).
# The split is random, so each part should roughly follow the distribution of the original set.
# Note: sklearn.cross_validation was removed from scikit-learn; train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split

# random_state fixes the random seed so that the split is reproducible
Xd_train, Xd_test, y_train, y_test = train_test_split(X_d, Y, random_state=14)

all_predictors = {}
errors = {}
for feature_index in range(Xd_train.shape[1]):
    predictor, error = train_on_value(Xd_train, y_train, feature_index)
    all_predictors[feature_index] = predictor
    errors[feature_index] = error

# Build the classification model from the feature with the lowest error
best_feature, best_error = sorted(errors.items(), key=itemgetter(1))[0]
model = {"feature": best_feature, 'predictor': all_predictors[best_feature]}


def predict(X_test, model):
    feature = model["feature"]
    predictor = model["predictor"]
    y_predicted = np.array([predictor[int(sample[feature])] for sample in X_test])
    return y_predicted


# Compare the predicted classes with the actual classes
y_predicted = predict(Xd_test, model)
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {0:.1f}%".format(accuracy))
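
The trained model is only a feature index plus a lookup table from feature value to class. To make that concrete, the sketch below applies a hand-built model to three made-up samples; toy_model, predict_one and the sample rows are hypothetical and are not the model learned from Iris.

```python
# Hypothetical hand-built model: feature index 2 is the rule, and each
# discretized value of that feature maps to a predicted class.
toy_model = {"feature": 2, "predictor": {0: 0, 1: 2}}


def predict_one(sample, model):
    """Look up the predicted class for one discretized sample."""
    value = int(sample[model["feature"]])
    return model["predictor"][value]


# Three made-up discretized samples with four features each.
toy_samples = [
    [1, 0, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
]
toy_predictions = [predict_one(s, toy_model) for s in toy_samples]
print(toy_predictions)  # [0, 2, 2]
```

Only the value at index 2 matters; the other three features are ignored, which is exactly what "one rule" means.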
