Machine Learning: Leaf Classification and Clustering
Contents
- 1 Importing Packages
- 2 Exploring the Data
- 3 Loading the Training and Test Sets
- 3.1 Plotting the Correlation Matrix (to guide feature selection and engineering)
- 3.2 Standardizing the Data
- 4 Is PCA Needed?
- 4.1 KNN without PCA
- 4.2 KNN with PCA
- 4.3 Comparing Before and After PCA
- 5 Comparing Classification Algorithms
- 5.1 KNN
- 5.2 KNN with Grid Search
- 5.3 SVC
- 5.4 Logistic Regression
- 5.5 Voting
- 5.6 Random Forest
- 5.7 Comparing the Classifiers
- 6 Classifying the Leaf Images
- 6.1 Predicting the 99 Classes for the Test Set
- 6.2 Initializing Test_label_dic and Train_label_dic
- 6.3 Building the Dataset of All Images
- 6.4 Writing the Classified Images to Disk
- 6.5 Inspecting the 99 Folders under filtered_imgs (images in folder 0 below)
- 6.6 Inspecting the 99 Folders under filtered_imgs (images in folder 7 below)
- 7 Comparing Clustering Algorithms
- 7.1 Loading the Data
- 7.2 Data Preparation
- 7.2.1 Standardization
- 7.2.2 PCA down to 2 Dimensions for Visualization
- 7.3 scatter_cluster(cluster_num, plt, model_name, X_reduction, y_predict)
- 7.4 KMeans
- 7.5 Birch
- 7.6 MiniBatchKMeans
- 7.7 Gaussian Mixture Clustering
- 7.8 Comparing the Clustering Results
- 8 How Many Clusters Are Appropriate?
- 9 Clustering Results
- 9.1 imagesclassifier2(cluster_num, DImage2, y_predict_KMeans2, root_path)
- 9.2 Reading the First 12 Images of a Cluster
1 Importing Packages
import os
import matplotlib.image as img
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns  # used for the correlation heatmap in section 3.1
import warnings

# Back up the current warnings filters
filters = warnings.filters[:]
# Add a rule that ignores DeprecationWarning
warnings.simplefilter('ignore', DeprecationWarning)
# To restore the original filters later:
# warnings.filters = filters
2 Exploring the Data
save_path = './filtered_imgs'
if os.path.exists(save_path) is False:
    os.makedirs(save_path)
os.listdir()
['.ipynb_checkpoints','data_imgs','data_imgs2','filtered_imgs','images','ML_lesson10_KMeans聚类.ipynb','notebook.tex','render.html','sample_submission.csv','test.csv','train.csv','Untitled.ipynb','Untitled1.ipynb','树叶分类.html','树叶分类实现.ipynb','聚类.ipynb','聚类测试.ipynb']
# Sort the file names numerically by image id
img_path = './images'
img_name_list = os.listdir(img_path)
img_name_list.sort(key=lambda x: int(x.split('.')[0]))
## Read the first 12 images in id order
DImage = []
for img_name in img_name_list[:12]:
    img_full_path = os.path.join(img_path, img_name)
    DImage.append(img.imread(img_full_path))
plt.style.use('ggplot')
## Visualize the leaf images
f = plt.figure(figsize=(8, 6))
for i in range(12):
    plt.subplot(3, 4, i + 1)
    plt.axis("off")
    plt.title("image_ID:{0}".format(img_name_list[i].split('.jpg')[0]))
    plt.imshow(DImage[i], cmap='hot')
plt.show()
3 Loading the Training and Test Sets
Train = pd.read_csv('train.csv')
Train_id = Train['id']
Test = pd.read_csv('test.csv')
Test_id = Test['id']
Test.drop(['id'],inplace = True, axis = 1)
# Summary statistics of the training set
Train.describe()
id | margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | margin9 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | ... | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 | 990.000000 |
mean | 799.595960 | 0.017412 | 0.028539 | 0.031988 | 0.023280 | 0.014264 | 0.038579 | 0.019202 | 0.001083 | 0.007167 | ... | 0.036501 | 0.005024 | 0.015944 | 0.011586 | 0.016108 | 0.014017 | 0.002688 | 0.020291 | 0.008989 | 0.019420 |
std | 452.477568 | 0.019739 | 0.038855 | 0.025847 | 0.028411 | 0.018390 | 0.052030 | 0.017511 | 0.002743 | 0.008933 | ... | 0.063403 | 0.019321 | 0.023214 | 0.025040 | 0.015335 | 0.060151 | 0.011415 | 0.039040 | 0.013791 | 0.022768 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 415.250000 | 0.001953 | 0.001953 | 0.013672 | 0.005859 | 0.001953 | 0.000000 | 0.005859 | 0.000000 | 0.001953 | ... | 0.000000 | 0.000000 | 0.000977 | 0.000000 | 0.004883 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000977 |
50% | 802.500000 | 0.009766 | 0.011719 | 0.025391 | 0.013672 | 0.007812 | 0.015625 | 0.015625 | 0.000000 | 0.005859 | ... | 0.004883 | 0.000000 | 0.005859 | 0.000977 | 0.012695 | 0.000000 | 0.000000 | 0.003906 | 0.002930 | 0.011719 |
75% | 1195.500000 | 0.025391 | 0.041016 | 0.044922 | 0.029297 | 0.017578 | 0.056153 | 0.029297 | 0.000000 | 0.007812 | ... | 0.043701 | 0.000000 | 0.022217 | 0.009766 | 0.021484 | 0.000000 | 0.000000 | 0.023438 | 0.012695 | 0.029297 |
max | 1584.000000 | 0.087891 | 0.205080 | 0.156250 | 0.169920 | 0.111330 | 0.310550 | 0.091797 | 0.031250 | 0.076172 | ... | 0.429690 | 0.202150 | 0.172850 | 0.200200 | 0.106450 | 0.578130 | 0.151370 | 0.375980 | 0.086914 | 0.141600 |
8 rows × 193 columns
Train.head()
id | species | margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Acer_Opalus | 0.007812 | 0.023438 | 0.023438 | 0.003906 | 0.011719 | 0.009766 | 0.027344 | 0.0 | ... | 0.007812 | 0.000000 | 0.002930 | 0.002930 | 0.035156 | 0.0 | 0.0 | 0.004883 | 0.000000 | 0.025391 |
1 | 2 | Pterocarya_Stenoptera | 0.005859 | 0.000000 | 0.031250 | 0.015625 | 0.025391 | 0.001953 | 0.019531 | 0.0 | ... | 0.000977 | 0.000000 | 0.000000 | 0.000977 | 0.023438 | 0.0 | 0.0 | 0.000977 | 0.039062 | 0.022461 |
2 | 3 | Quercus_Hartwissiana | 0.005859 | 0.009766 | 0.019531 | 0.007812 | 0.003906 | 0.005859 | 0.068359 | 0.0 | ... | 0.154300 | 0.000000 | 0.005859 | 0.000977 | 0.007812 | 0.0 | 0.0 | 0.000000 | 0.020508 | 0.002930 |
3 | 5 | Tilia_Tomentosa | 0.000000 | 0.003906 | 0.023438 | 0.005859 | 0.021484 | 0.019531 | 0.023438 | 0.0 | ... | 0.000000 | 0.000977 | 0.000000 | 0.000000 | 0.020508 | 0.0 | 0.0 | 0.017578 | 0.000000 | 0.047852 |
4 | 6 | Quercus_Variabilis | 0.005859 | 0.003906 | 0.048828 | 0.009766 | 0.013672 | 0.015625 | 0.005859 | 0.0 | ... | 0.096680 | 0.000000 | 0.021484 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.031250 |
5 rows × 194 columns
Test.head()
margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | margin9 | margin10 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.019531 | 0.009766 | 0.078125 | 0.011719 | 0.003906 | 0.015625 | 0.005859 | 0.0 | 0.005859 | 0.023438 | ... | 0.006836 | 0.000000 | 0.015625 | 0.000977 | 0.015625 | 0.0 | 0.0 | 0.000000 | 0.003906 | 0.053711 |
1 | 0.007812 | 0.005859 | 0.064453 | 0.009766 | 0.003906 | 0.013672 | 0.007812 | 0.0 | 0.033203 | 0.023438 | ... | 0.000000 | 0.000000 | 0.006836 | 0.001953 | 0.013672 | 0.0 | 0.0 | 0.000977 | 0.037109 | 0.044922 |
2 | 0.000000 | 0.000000 | 0.001953 | 0.021484 | 0.041016 | 0.000000 | 0.023438 | 0.0 | 0.011719 | 0.005859 | ... | 0.128910 | 0.000000 | 0.000977 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.015625 | 0.000000 | 0.000000 |
3 | 0.000000 | 0.000000 | 0.009766 | 0.011719 | 0.017578 | 0.000000 | 0.003906 | 0.0 | 0.003906 | 0.001953 | ... | 0.012695 | 0.015625 | 0.002930 | 0.036133 | 0.013672 | 0.0 | 0.0 | 0.089844 | 0.000000 | 0.008789 |
4 | 0.001953 | 0.000000 | 0.015625 | 0.009766 | 0.039062 | 0.000000 | 0.009766 | 0.0 | 0.005859 | 0.000000 | ... | 0.000000 | 0.042969 | 0.016602 | 0.010742 | 0.041016 | 0.0 | 0.0 | 0.007812 | 0.009766 | 0.007812 |
5 rows × 192 columns
Train['species'].value_counts().head()
Quercus_Agrifolia 10
Quercus_Chrysolepis 10
Alnus_Cordata 10
Viburnum_x_Rhytidophylloides 10
Ginkgo_Biloba 10
Name: species, dtype: int64
print("树叶种类数目为:",len(set(Train['species'])))
树叶种类数目为: 99
## Map each species name to an integer class label, in order of first appearance
map_dic = {}
i = -1
for name in Train['species']:
    if name not in map_dic:
        i += 1
        map_dic[name] = i
[(key, value) for key, value in map_dic.items()][:5]
[('Acer_Opalus', 0),('Pterocarya_Stenoptera', 1),('Quercus_Hartwissiana', 2),('Tilia_Tomentosa', 3),('Quercus_Variabilis', 4)]
len(map_dic)
99
Train['species'].replace(map_dic.keys(), map_dic.values(), inplace=True)
Train.drop(['id'], inplace = True, axis = 1)
Train_ture = Train['species']
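As an aside, sklearn offers an equivalent shortcut for this name-to-integer mapping. A minimal sketch, run on the original species column (i.e. before the replace above); note that LabelEncoder assigns codes alphabetically rather than by order of appearance, so its integers would differ from map_dic:

from sklearn.preprocessing import LabelEncoder

# Hypothetical alternative to the manual mapping loop above
le = LabelEncoder()
species_codes = le.fit_transform(Train['species'])
# le.inverse_transform(species_codes) recovers the species names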
3.1 Plotting the Correlation Matrix (to guide feature selection and engineering)
corr = Train.corr()
f, ax = plt.subplots(figsize=(25, 25))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5)
plt.show()
# Check for missing values; False means there are none
Train.isnull().values.any()
False
X = Train.drop(['species'], axis=1)
y = Train['species']
print(y.head())
print("训练集尺寸:", X.shape)
0 0
1 1
2 2
3 3
4 4
Name: species, dtype: int64
Training set shape: (990, 192)
## Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=42)
print("训练集数据尺寸",X_train.shape)
print("测试集数据尺寸",X_test.shape)
print("训练集目标尺寸",y_train.shape)
print("测试集目标尺寸",y_test.shape)
Training data shape (792, 192)
Test data shape (198, 192)
Training target shape (792,)
Test target shape (198,)
3.2 Standardizing the Data
from sklearn.preprocessing import StandardScaler
# Standardize the features (fit on the training split only)
standerScaler = StandardScaler()
X_train = standerScaler.fit_transform(X_train)
X_test = standerScaler.transform(X_test)
4 Is PCA Needed?
4.1 KNN without PCA
%%time
X_train_shape1 = X_train.shape
X_test_shape1 = X_test.shape
print(X_train.shape[1])
192
Wall time: 0 ns
%%time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(X_train, y_train)
train_predictions = clf.predict(X_test)
notpca_score = accuracy_score(y_test, train_predictions)
print(notpca_score)
0.9696969696969697
Wall time: 134 ms
4.2 KNN with PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
X_train_shape2 = X_train.shape
X_test_shape2 = X_test.shape
print(X_train.shape[1])
68
%%time
clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(X_train, y_train)
train_predictions = clf.predict(X_test)
pca_score = accuracy_score(y_test, train_predictions)
print(pca_score)
0.9646464646464646
Wall time: 34.9 ms
4.3 Comparing Before and After PCA
data_score = pd.DataFrame([[X_train_shape1, X_train_shape2], [X_test_shape1, X_test_shape2], [notpca_score, pca_score]])
# Add row labels
data_score.index = ["X_train shape", "X_test shape", "Accuracy"]
# Add column labels
data_score.columns = ["Before PCA", "After PCA"]
print("KNN")
data_score
Before PCA | After PCA | |
---|---|---|
X_train shape | (792, 192) | (792, 68) |
X_test shape | (198, 192) | (198, 68) |
Accuracy | 0.969697 | 0.964646 |
Discussion: with PCA retaining 95% of the variance, accuracy (0.9646) is slightly lower than without PCA (0.9697). Given the small gap, the much faster computation (34.9 ms vs. 134 ms here) makes the reduction worthwhile.
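To see why n_components=0.95 keeps 68 components, one can plot the cumulative explained variance. A sketch, assuming it is run on the standardized, pre-PCA training matrix (here called X_train_std, a hypothetical copy of X_train taken before the transform above):

# Sketch: cumulative explained variance of the standardized features
pca_full = PCA().fit(X_train_std)  # X_train_std: assumed pre-PCA copy of X_train
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(range(1, len(cum_var) + 1), cum_var)
plt.axhline(0.95, color='gray', linestyle='--')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()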
5 Comparing Classification Algorithms
score_classify_list = []
model_classify_list = []
model_predict_list = []
5.1 KNN
from sklearn.neighbors import KNeighborsClassifier

knn_clf0 = KNeighborsClassifier()
knn_clf0.fit(X_train, y_train)
print("*"*30)
print('KNeighborsClassifier')
y_predict = knn_clf0.predict(X_test)
score = accuracy_score(y_test, y_predict)
print("Accuracy: {:.4%}".format(score))
score_classify_list.append(score)
model_classify_list.append("KNN")
model_predict_list.append(y_predict)
Output:
******************************
KNeighborsClassifier
Accuracy: 97.9798%
5.2 KNN with Grid Search
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'weights': ["uniform"], 'n_neighbors': [i for i in range(2, 4)]},
    {'weights': ["distance"], 'n_neighbors': [i for i in range(2, 4)], 'p': [i for i in range(1, 3)]}
]
knn_clf = KNeighborsClassifier()
gird_search = GridSearchCV(knn_clf, param_grid)
gird_search.fit(X_train, y_train)
print("Best hyperparameters: ", gird_search.best_params_)
knn_clf = gird_search.best_estimator_
print("*"*30)
print('KNeighborsClassifier grid search')
y_predict = knn_clf.predict(X_test)
score = accuracy_score(y_test, y_predict)
print("Accuracy: {:.4%}".format(score))
score_classify_list.append(score)
model_classify_list.append("KNN (grid search)")
model_predict_list.append(y_predict)
Output:
Best hyperparameters: {'n_neighbors': 2, 'p': 1, 'weights': 'distance'}
******************************
KNeighborsClassifier grid search
Accuracy: 98.9899%
5.3 SVC
from sklearn.svm import SVC
svc_clf = SVC(probability=True)
svc_clf.fit(X_train, y_train)
print("*"*30)
print('SVC')
y_predict = svc_clf.predict(X_test)
score = accuracy_score(y_test, y_predict)
print("Accuracy: {:.4%}".format(score))
score_classify_list.append(score)
model_classify_list.append("SVC")
model_predict_list.append(y_predict)
Output:
******************************
SVC
Accuracy: 97.4747%
5.4 Logistic Regression
from sklearn.linear_model import LogisticRegressionCV

lr = LogisticRegressionCV(multi_class="ovr", fit_intercept=True, Cs=np.logspace(-2, 2, 20),
                          cv=2, penalty="l2", solver="lbfgs", tol=0.01)
lr.fit(X_train, y_train)
print("*"*30)
print('Logistic regression')
y_predict = lr.predict(X_test)
score = accuracy_score(y_test, y_predict)
print("Accuracy: {:.4%}".format(score))
score_classify_list.append(score)
model_classify_list.append("Logistic regression")
model_predict_list.append(y_predict)
Output:
******************************
Logistic regression
Accuracy: 98.9899%
5.5 Voting
warnings.simplefilter('ignore', DeprecationWarning)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression()),
    ('svm_clf', SVC(probability=True)),
], voting='hard')
voting_clf.fit(X_train, y_train)
print("*"*30)
print('voting')
y_predict = voting_clf.predict(X_test)
score = accuracy_score(y_test, y_predict)
print("Accuracy: {:.4%}".format(score))
score_classify_list.append(score)
model_classify_list.append("voting")
model_predict_list.append(y_predict)
Output:
******************************
voting
Accuracy: 97.4747%
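Both base estimators here support predict_proba (SVC because probability=True was set), so soft voting, which averages predicted probabilities instead of counting hard votes, is worth trying. A sketch, not part of the original run:

# Hypothetical variant: soft voting over the same two base estimators
voting_soft = VotingClassifier(estimators=[
    ('log_clf', LogisticRegression()),
    ('svm_clf', SVC(probability=True)),
], voting='soft')
voting_soft.fit(X_train, y_train)
print(accuracy_score(y_test, voting_soft.predict(X_test)))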
5.6 Random Forest
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=250, random_state=666, oob_score=True)
rf_clf.fit(X_train, y_train)
print("*"*30)
print('Random forest')
y_predict = rf_clf.predict(X_test)
score = accuracy_score(y_test, y_predict)
print("Accuracy: {:.4%}".format(score))
score_classify_list.append(score)
model_classify_list.append("Random forest")
model_predict_list.append(y_predict)
Output:
******************************
Random forest
Accuracy: 96.9697%
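Since the forest was built with oob_score=True, its out-of-bag accuracy estimate comes for free and needs no held-out data; a one-line sketch to read it:

# Out-of-bag accuracy estimate (available because oob_score=True above)
print(rf_clf.oob_score_)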
5.7 Comparing the Classifiers
data_score = pd.DataFrame(score_classify_list)
models = model_classify_list
features = ["accuracy_score"]
# Add row labels
data_score.index = models
# Add column labels
data_score.columns = features
data_score
Scores of the different classifiers:
accuracy_score | |
---|---|
KNN | 0.979798 |
KNN (grid search) | 0.989899 |
SVC | 0.974747 |
Logistic regression | 0.989899 |
voting | 0.974747 |
Random forest | 0.969697 |
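A caveat on this table: with 198 test samples over 99 classes, each class contributes only about two test examples, so single-split accuracies are noisy. A sketch of a more stable comparison via cross-validation, using the models defined in sections 5.1-5.6:

from sklearn.model_selection import cross_val_score

# Hypothetical check: 5-fold cross-validated accuracy on the training split
for name, model in zip(model_classify_list, [knn_clf0, knn_clf, svc_clf, lr, voting_clf, rf_clf]):
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(name, scores.mean())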
from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType

models_score = [round(i, 4) for i in score_classify_list]

def bar_reversal_axis() -> Bar:
    c = (
        Bar(init_opts=opts.InitOpts(width="800px", height="300px", theme=ThemeType.WONDERLAND))
        .add_xaxis(models)
        .add_yaxis("model", models_score)
        .reversal_axis()
        .set_series_opts(label_opts=opts.LabelOpts(position="right"))
        .set_global_opts(title_opts=opts.TitleOpts(title="Classifiers", subtitle="score"))
    )
    return c

c = bar_reversal_axis()
c.render("classifier_scores.html")
c.render_notebook()
Visualizing the classifier scores:
# model_predict_array has shape (6, 198)
model_predict_array = np.array(model_predict_list)
test_id_list = ["test_" + str(i) for i in range(model_predict_array.shape[1])]

def bar_datazoom_slider() -> Bar:
    c = (
        Bar(init_opts=opts.InitOpts(width="1000px", height="200px", theme=ThemeType.WONDERLAND))
        .add_xaxis(test_id_list[:])
        .add_yaxis(models[0], model_predict_array[0, :].tolist())
        .add_yaxis(models[1], model_predict_array[1, :].tolist())
        .add_yaxis(models[2], model_predict_array[2, :].tolist())
        .add_yaxis(models[3], model_predict_array[3, :].tolist())
        .add_yaxis(models[4], model_predict_array[4, :].tolist())
        .add_yaxis(models[5], model_predict_array[5, :].tolist())
        .set_global_opts(
            title_opts=opts.TitleOpts(title="Classification", subtitle="predicted class"),
            datazoom_opts=[opts.DataZoomOpts(type_="slider")],
        )
    )
    return c

c = bar_datazoom_slider()
c.render("predicted_classes.html")
c.render_notebook()
Visualizing the predictions of the different classifiers:
warnings.simplefilter('ignore', DeprecationWarning)

def plot_learning_curve(model, x_start, x_stop, metrics, X_train_xx, X_test_xx, y_train_xx, y_test_xx, model_name):
    train_score = []
    test_score = []
    for i in range(x_start, x_stop, 100):
        model.fit(X_train_xx[:i], y_train_xx[:i])
        y_train_predict = model.predict(X_train_xx[:i])
        train_score.append(metrics(y_train_xx[:i], y_train_predict))
        y_test_predict = model.predict(X_test_xx)
        test_score.append(metrics(y_test_xx, y_test_predict))
    plt.plot([i for i in range(x_start, x_stop, 100)], np.sqrt(train_score), label="train")
    plt.plot([i for i in range(x_start, x_stop, 100)], np.sqrt(test_score), label="test")
    plt.legend()
    plt.title(model_name)
    plt.xlabel('Training set size')
    plt.ylabel('Accuracy')
    plt.axis([x_start, x_stop, 0., 1.1])

X_train_xx = X_train.copy()
X_test_xx = X_test.copy()
y_train_xx = y_train.copy()
y_test_xx = y_test.copy()
%%time
model_clf_list = [knn_clf0, knn_clf, svc_clf, lr, voting_clf, rf_clf]
figure, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 25), dpi=80)
warnings.simplefilter('ignore', Warning)
for i in range(1, 7):
    plt.subplot(3, 2, i)
    plot_learning_curve(model_clf_list[i-1], 22, len(X_train_xx)+1, accuracy_score,
                        X_train_xx, X_test_xx, y_train_xx, y_test_xx, models[i-1])
plt.show()
Learning curves: all six models end up with accuracy close to 100%. For KNN (grid search) and the voting ensemble the training and test curves stay close together, so they overfit very little. Since KNN (grid search) also scores highest, it is the most satisfactory model.
Wall time: 2min 21s
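For reference, sklearn ships a learning_curve helper that produces the same kind of diagnostic with cross-validation instead of a single split. A sketch, using the grid-searched KNN settings as an example:

from sklearn.model_selection import learning_curve

# Hypothetical alternative to the hand-rolled plot_learning_curve above
train_sizes, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=2, weights='distance', p=1),
    X_train_xx, y_train_xx, train_sizes=np.linspace(0.2, 1.0, 5), cv=5)
plt.plot(train_sizes, train_scores.mean(axis=1), label="train")
plt.plot(train_sizes, test_scores.mean(axis=1), label="cv")
plt.legend()
plt.show()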
6 Classifying the Leaf Images
6.1 Predicting the 99 Classes for the Test Set
# Predict one of the 99 classes for each test-set sample
Test_sta = standerScaler.transform(Test)
Test_pca = pca.transform(Test_sta)
Test_predict = knn_clf.predict(Test_pca)
print("预测结果尺寸为:")
Test_predict.shape
Shape of the predictions:
(594,)
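The predictions are integer labels; to report species names one can invert map_dic from section 3. A sketch:

# Hypothetical: map numeric predictions back to species names
inv_map_dic = {v: k for k, v in map_dic.items()}
Test_species = [inv_map_dic[label] for label in Test_predict]
print(Test_species[:5])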
6.2 Initializing Test_label_dic and Train_label_dic
Test_label_dic = {}
for i in range(99):
    Test_label_dic[i] = np.where(Test_predict == i)[0]

Train_label_dic = {}
for i in range(99):
    Train_label_dic[i] = np.where(Train_ture == i)[0]
6.3 Building the Dataset of All Images
DImage = []
for img_name in img_name_list:
    img_full_path = os.path.join(img_path, img_name)
    DImage.append(img.imread(img_full_path))
train_y = {}
test_y = {}
6.4 Writing the Classified Images to Disk
# import pysnooper
# @pysnooper.snoop(r'C:\Users\linyihua\Desktop\mylog/file.log')
def imagesclassifier(DImage, Train_id, Test_id, Train_label_dic, Test_label_dic, root_path, train_y={}, test_y={}):
    '''DImage is the full image dataset (the 1584 images read in order); Train_id and Test_id are
    the id columns of train.csv and test.csv; Train_label_dic and Test_label_dic map each leaf
    class label to the corresponding image indices.'''
    if os.path.exists(root_path) is False:
        os.makedirs(root_path)
    save_path = root_path
    for i in range(99):
        train_val = Train_id.values[np.array(Train_label_dic[i]).reshape(-1)] - 1
        test_val = Test_id.values[np.array(Test_label_dic[i]).reshape(-1)] - 1
        train_y[i] = train_val
        test_y[i] = test_val
        Train_imgs = np.array(DImage)[train_val]
        Test_imgs = np.array(DImage)[test_val]
        for index, _ in enumerate(Train_imgs):
            img_name = 'train' + str(train_val[index]) + '.jpg'
            save_path = os.path.join(save_path, str(i))
            if os.path.exists(save_path) is False:
                os.makedirs(save_path)
            img.imsave(os.path.join(save_path, img_name), _, cmap='binary')
            save_path = root_path
        for index, _ in enumerate(Test_imgs):
            img_name = 'test' + str(test_val[index]) + '.jpg'
            save_path = os.path.join(save_path, str(i))
            if os.path.exists(save_path) is False:
                os.makedirs(save_path)
            img.imsave(os.path.join(save_path, img_name), _, cmap='binary')
            save_path = root_path
    return train_y, test_y
%%time
train_y, test_y = imagesclassifier(DImage,Train_id,Test_id,Train_label_dic,Test_label_dic,save_path)
Wall time: 1min 16s
6.5 Inspecting the 99 Folders under filtered_imgs (images in folder 0 below)
## Read the first 12 images of folder 0 in order
img_name_list1 = os.listdir('./filtered_imgs/0')
img_path1 = './filtered_imgs/0'
DImage1 = []
for img_name1 in img_name_list1[:12]:
    img_full_path1 = os.path.join(img_path1, img_name1)
    DImage1.append(img.imread(img_full_path1))
## Visualize the leaf images
f = plt.figure(figsize=(15, 6))
for i in range(12):
    plt.subplot(3, 4, i + 1)
    plt.axis("off")
    plt.title("image_ID:{0}".format(img_name_list1[i].split('.jpg')[0]))
    plt.imshow(DImage1[i], cmap='hot')
plt.show()
6.6 Inspecting the 99 Folders under filtered_imgs (images in folder 7 below)
## Read the first 12 images of folder 7 in order
img_name_list2 = os.listdir('./filtered_imgs/7')
img_path2 = './filtered_imgs/7'
DImage2 = []
for img_name2 in img_name_list2[:12]:
    img_full_path2 = os.path.join(img_path2, img_name2)
    DImage2.append(img.imread(img_full_path2))
## Visualize the leaf images
f = plt.figure(figsize=(15, 10))
for i in range(12):
    plt.subplot(3, 4, i + 1)
    plt.axis("off")
    plt.title("image_ID:{0}".format(img_name_list2[i].split('.jpg')[0]))
    plt.imshow(DImage2[i], cmap='hot')
plt.show()
7 Comparing Clustering Algorithms
7.1 Loading the Data
# Load the data
Train2 = pd.read_csv('train.csv')
Test2 = pd.read_csv('test.csv')
print(Train2.shape, Test2.shape)
(990, 194) (594, 193)
7.2 Data Preparation
## Merge the train and test features (dropping species) and sort by id
Train2.drop(['species'],inplace = True, axis = 1)
data = np.concatenate((Train2,Test2), axis=0)
data = pd.DataFrame(data)
columns = Test2.columns
data.columns = columns
data = data.sort_values(by="id", ascending=True)
data.head()
id | margin1 | margin2 | margin3 | margin4 | margin5 | margin6 | margin7 | margin8 | margin9 | ... | texture55 | texture56 | texture57 | texture58 | texture59 | texture60 | texture61 | texture62 | texture63 | texture64 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.007812 | 0.023438 | 0.023438 | 0.003906 | 0.011719 | 0.009766 | 0.027344 | 0.0 | 0.001953 | ... | 0.007812 | 0.000000 | 0.002930 | 0.002930 | 0.035156 | 0.0 | 0.0 | 0.004883 | 0.000000 | 0.025391 |
1 | 2.0 | 0.005859 | 0.000000 | 0.031250 | 0.015625 | 0.025391 | 0.001953 | 0.019531 | 0.0 | 0.000000 | ... | 0.000977 | 0.000000 | 0.000000 | 0.000977 | 0.023438 | 0.0 | 0.0 | 0.000977 | 0.039062 | 0.022461 |
2 | 3.0 | 0.005859 | 0.009766 | 0.019531 | 0.007812 | 0.003906 | 0.005859 | 0.068359 | 0.0 | 0.000000 | ... | 0.154300 | 0.000000 | 0.005859 | 0.000977 | 0.007812 | 0.0 | 0.0 | 0.000000 | 0.020508 | 0.002930 |
990 | 4.0 | 0.019531 | 0.009766 | 0.078125 | 0.011719 | 0.003906 | 0.015625 | 0.005859 | 0.0 | 0.005859 | ... | 0.006836 | 0.000000 | 0.015625 | 0.000977 | 0.015625 | 0.0 | 0.0 | 0.000000 | 0.003906 | 0.053711 |
3 | 5.0 | 0.000000 | 0.003906 | 0.023438 | 0.005859 | 0.021484 | 0.019531 | 0.023438 | 0.0 | 0.013672 | ... | 0.000000 | 0.000977 | 0.000000 | 0.000000 | 0.020508 | 0.0 | 0.0 | 0.017578 | 0.000000 | 0.047852 |
5 rows × 193 columns
X = data.iloc[:, 1:]
y_id = data["id"]
X.shape
(1584, 192)
7.2.1 Standardization
from sklearn.preprocessing import StandardScaler

standerScaler2 = StandardScaler()
X = standerScaler2.fit_transform(X)
7.2.2 PCA down to 2 Dimensions for Visualization
from sklearn.decomposition import PCA

pca2 = PCA(n_components=2)
X_reduction = pca2.fit_transform(X)
y_predict_list = []
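It is worth checking how much information two components actually retain before judging clusters in this projection; a one-line sketch:

# Fraction of the total variance kept by the two components
print(pca2.explained_variance_ratio_.sum())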
7.3 scatter_cluster(cluster_num, plt, model_name, X_reduction, y_predict)
def scatter_cluster(cluster_num, plt, model_name, X_reduction, y_predict):
    colors = ["#64A600", "#A6A600", "#C6A300", "#EA7500", "#AD5A5A", "#A5A552", "#5CADAD", "#8080C0",
              "#EA0000", "#FF359A", "#D200D2", "#9F35FF", "#2828FF", "#0080FF", "#00CACA", "#02DF82"]
    markers = ["o", "^", "s", "p", "x", "+", "d", "*"] * 2
    plt.grid(linestyle="--", alpha=0.5)
    for i in range(cluster_num):
        plt.scatter(X_reduction[y_predict == i, 0], X_reduction[y_predict == i, 1],
                    color=colors[i], marker=markers[i], label=str(i))
    plt.title(model_name + ": number of clusters = " + str(cluster_num))
    plt.legend()
model_names = []
7.4 KMeans
from sklearn.cluster import KMeans

km = KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001,
            precompute_distances='auto', verbose=0, random_state=None, copy_x=True,
            n_jobs=1, algorithm='auto')
# n_clusters: number of clusters
# max_iter: maximum number of iterations per initialization
# n_init: number of random initializations to try
# tol: within-cluster sum-of-squares tolerance to declare convergence
# init='k-means++': spreads the initial centroids far apart
# (note: precompute_distances and n_jobs were removed in newer scikit-learn versions)
km.fit(X_reduction)  # KMeans is unsupervised; no target is needed
y_predict = km.predict(X_reduction)
y_predict_list.append(y_predict)
model_names.append("KMeans")
7.5 Birch
warnings.simplefilter('ignore', FutureWarning)
from sklearn.cluster import Birch

y_predict = Birch(n_clusters=8).fit_predict(X_reduction)
y_predict_list.append(y_predict)
model_names.append("Birch")
7.6 MiniBatchKMeans
from sklearn.cluster import MiniBatchKMeans

y_predict = MiniBatchKMeans(n_clusters=8).fit_predict(X_reduction)
y_predict_list.append(y_predict)
model_names.append("MiniBatchKMeans")
7.7 Gaussian Mixture Clustering
from sklearn.mixture import GaussianMixture

# mixture.GMM was removed from scikit-learn; GaussianMixture is its replacement
y_predict = GaussianMixture(n_components=8).fit_predict(X_reduction)
y_predict_list.append(y_predict)
model_names.append("高斯混合聚类")
7.8 Comparing the Clustering Results
figure, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 12), dpi=80)
cluster_num = 8
for i in range(1, 5):
    plt.subplot(2, 2, i)
    scatter_cluster(cluster_num, plt, model_names[i-1], X_reduction, y_predict_list[i-1])
# Add a grid
plt.grid(linestyle="--", alpha=0.8)
plt.show()
Visualization of the different clusterings: the features are PCA-reduced to two dimensions for plotting. Judged by the principle of minimal within-cluster distance and maximal between-cluster distance, KMeans looks the most satisfactory.
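The eyeball judgment can be backed with a quantitative criterion such as the silhouette score, which is larger when clusters are tight and well separated. A sketch over the four models above:

from sklearn.metrics import silhouette_score

# Compare the four clusterings numerically (higher = tighter, better-separated clusters)
for name, y_p in zip(model_names, y_predict_list):
    print(name, silhouette_score(X_reduction, y_p))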
8 How Many Clusters Are Appropriate?
def KMeans_clusters(clusters_num, X):
    km = KMeans(n_clusters=clusters_num, init='k-means++', n_init=10, max_iter=300, tol=0.0001,
                precompute_distances='auto', verbose=0, random_state=None, copy_x=True,
                n_jobs=1, algorithm='auto')
    km.fit(X)
    y_predict = km.predict(X)
    return y_predict

cluster_num = 4
cluster_name = "KMeans"
figure, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 12), dpi=80)
for i in range(1, 5):
    y_predict = KMeans_clusters(cluster_num * i, X_reduction)
    plt.subplot(2, 2, i)
    scatter_cluster(cluster_num * i, plt, cluster_name, X_reduction, y_predict)
# Add a grid
plt.grid(linestyle="--", alpha=0.8)
plt.show()
Visualizing different numbers of clusters: from the plots, 16 clusters appears to be the most reasonable choice.
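The visual choice of 16 can be cross-checked with the elbow method: plot the KMeans inertia (within-cluster sum of squares) against k and look for the bend. A sketch:

# Hypothetical elbow plot over k = 2..20
inertias = []
ks = range(2, 21)
for k in ks:
    km_k = KMeans(n_clusters=k, init='k-means++', n_init=10).fit(X_reduction)
    inertias.append(km_k.inertia_)
plt.plot(list(ks), inertias, marker='o')
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()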
9 Clustering Results
9.1 imagesclassifier2(cluster_num, DImage2, y_predict_KMeans2, root_path)
# Cluster into 16 groups
cluster_num = 16
y_predict = KMeans_clusters(cluster_num, X)

def imagesclassifier2(cluster_num, DImage, y_predict, root_path):
    if os.path.exists(root_path) is False:
        os.makedirs(root_path)
    save_path = root_path
    for i in range(cluster_num):
        data_val = np.where(y_predict == i)[0]
        data_imgs = np.array(DImage)[data_val]
        for index, _ in enumerate(data_imgs):
            img_name = 'data' + str(data_val[index]) + '.jpg'
            save_path = os.path.join(save_path, str(i))
            if os.path.exists(save_path) is False:
                os.makedirs(save_path)
            img.imsave(os.path.join(save_path, img_name), _, cmap='binary')
            save_path = root_path
imagesclassifier2(cluster_num, DImage, y_predict, r"./data_imgs")
9.2 Reading the First 12 Images of a Cluster
## Read the first 12 images of cluster 5 in order
img_name_list22 = os.listdir('./data_imgs/5')
img_path22 = './data_imgs/5'
DImage22 = []
for img_name22 in img_name_list22[:12]:
    img_full_path22 = os.path.join(img_path22, img_name22)
    DImage22.append(img.imread(img_full_path22))
## Visualize the leaf images
f = plt.figure(figsize=(15, 10))
for i in range(12):
    plt.subplot(3, 4, i + 1)
    plt.axis("off")
    plt.title("image_ID:{0}".format(img_name_list22[i].split('.jpg')[0]))
    plt.imshow(DImage22[i], cmap='hot')
plt.show()