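1.Introduction

Earlier posts walked through the main Sklearn modules one at a time with hands-on code. This post steps back and runs a complete workflow from start to finish. It is mostly code, with the explanations in the comments. Enough talk, let's hit the road! (Hold on tight.)

Data link
Password: a6vy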

2.Data processing

import random


class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"


class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()  # call a method of the same class

    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE  # read an attribute of another class
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE


class ReviewContainer:  # wraps the training set or the test set
    def __init__(self, reviews):
        self.reviews = reviews

    def get_text(self):
        return [x.text for x in self.reviews]  # collect every "text"

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]  # collect every "sentiment"

    def evenly_distribute(self):  # balance the two classes
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))  # keep NEGATIVE
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))  # keep POSITIVE
        positive_shrunk = positive[:len(negative)]  # slice so there are as many positive samples as negative ones
        self.reviews = negative + positive_shrunk  # final samples (NEUTRAL reviews are dropped)
        random.shuffle(self.reviews)  # shuffle in place

# filter() filters a sequence: it drops the elements that fail the condition and returns an iterator, which list() turns into a list.
# It takes two arguments, a function and a sequence. Each element is passed to the function, which returns True or False, and only the elements that return True are kept.
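The filter-then-slice balancing described in the comments can be tried on its own. This is a minimal sketch on made-up labels (plain strings standing in for `Review.sentiment` values); the seeded `random.Random(0)` is an assumption added only to make the demo repeatable and is not part of the original class.

```python
import random

# Hypothetical toy labels standing in for Review.sentiment values.
labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE",
          "NEGATIVE", "NEUTRAL", "POSITIVE"]

# filter() returns a lazy iterator; list() materializes it.
negative = list(filter(lambda s: s == "NEGATIVE", labels))
positive = list(filter(lambda s: s == "POSITIVE", labels))

# Truncate the larger class so both classes end up the same size.
positive_shrunk = positive[:len(negative)]
balanced = negative + positive_shrunk

# Seeded shuffle, used here only so repeated runs give the same order.
random.Random(0).shuffle(balanced)

print(len(negative), len(positive), len(balanced))
```

Note that the slice only balances the data when there are at least as many positive samples as negative ones, which is the situation the original code assumes.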

Next, read the data and wrap each record in the classes defined above:

import json

reviews = []
with open("books_small_10000.json") as f:
    for line in f:
        review = json.loads(line)  # decode one JSON record per line
        reviews.append(Review(review["reviewText"], review["overall"]))

print(reviews[5].text)  # attribute access on a Review object
print(reviews[5].score)
print(reviews[5].sentiment)
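The loader above reads one JSON object per line. The sketch below shows that parsing step without the data file, using an in-memory `io.StringIO` and two made-up records; only the `reviewText` and `overall` keys come from the original code, the record contents are invented for illustration.

```python
import io
import json

# Two made-up records in the one-object-per-line layout the loader expects.
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

# Iterating a file-like object yields one line at a time.
records = [json.loads(line) for line in sample]
pairs = [(r["reviewText"], r["overall"]) for r in records]

print(pairs)
```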

Then split the data into a training set and a test set, and pull the features and labels out of each:

from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state=42)  # split the data
train_container = ReviewContainer(training)  # container for the training set
test_container = ReviewContainer(test)  # container for the test set

train_container.evenly_distribute()  # balance the training set, then shuffle
train_x = train_container.get_text()  # training texts
train_y = train_container.get_sentiment()  # training labels

test_container.evenly_distribute()  # balance the test set, then shuffle
test_x = test_container.get_text()  # test texts
test_y = test_container.get_sentiment()  # test labels

print(test_y.count(Sentiment.POSITIVE))
print(test_y.count(Sentiment.NEGATIVE))
# The two lines below only work after the vectorization step that follows.
# print(train_x_vectors[0])
# print(train_x_vectors[0].toarray())
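To see what `test_size=0.33` and `random_state=42` do in isolation, here is a small sketch that splits a list of 100 integers (a stand-in for the list of `Review` objects, chosen only for illustration) twice with the same seed.

```python
from sklearn.model_selection import train_test_split

items = list(range(100))  # stand-in for the list of Review objects

train, test = train_test_split(items, test_size=0.33, random_state=42)
train_again, test_again = train_test_split(items, test_size=0.33, random_state=42)

print(len(train), len(test))
print(test == test_again)  # the same seed reproduces the same split
```

Fixing `random_state` is what makes the experiment repeatable; without it every run would score the models on a different test set.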

Finally, use TfidfVectorizer to turn the raw text into a tf-idf feature matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)  # fit_transform on the training data
test_x_vectors = vectorizer.transform(test_x)  # transform only on the test data
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
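Why `fit_transform` on the training text but only `transform` on the test text? A minimal sketch with made-up sentences: the vocabulary is learned from the training documents alone, so the test matrix has the same columns and any word the vectorizer never saw is simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration only.
train_docs = ["great book", "terrible book"]
test_docs = ["great unseen word"]

vectorizer = TfidfVectorizer()
train_vecs = vectorizer.fit_transform(train_docs)  # learns the vocabulary and idf weights
test_vecs = vectorizer.transform(test_docs)        # reuses them, learns nothing new

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(train_vecs.shape, test_vecs.shape)
print(test_vecs.nnz)  # number of non-zero entries in the test row
```

Calling `fit_transform` on the test text instead would build a different vocabulary, and the classifier trained on the first matrix could not be applied to the second.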

3.Model building

3.1.Support vector machine

from sklearn.svm import SVC
from sklearn.metrics import f1_score

clf_svm = SVC(kernel="linear")
clf_svm.fit(train_x_vectors, train_y)  # train on the training data
print(clf_svm.score(test_x_vectors, test_y))  # accuracy on the test data
print(clf_svm.predict(test_x_vectors[0]))  # predict the first test sample with the trained model
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
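With `average=None`, `f1_score` returns one F1 value per class, in the order given by `labels`, instead of a single number. The following sketch uses four hand-written labels (not results from the dataset) so the values can be checked by hand.

```python
from sklearn.metrics import f1_score

# Hand-written labels for illustration, not model output from the post.
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]

scores = f1_score(y_true, y_pred, average=None, labels=["POSITIVE", "NEGATIVE"])

# POSITIVE: precision 1/1, recall 1/2, so F1 = 2/3
# NEGATIVE: precision 2/3, recall 2/2, so F1 = 0.8
print(scores)
```

Per-class scores are useful here because accuracy alone can hide a model that does well on one sentiment and poorly on the other.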

3.2.Decision tree

from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
print(clf_dec.score(test_x_vectors, test_y))
print(clf_dec.predict(test_x_vectors[0]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))

3.3.Logistic regression

from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)
print(clf_log.score(test_x_vectors, test_y))
print(clf_log.predict(test_x_vectors[0]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
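All three classifiers share the same `fit` / `score` interface, so the comparison can also be written as a loop. This is a sketch on synthetic data from `make_classification`, which is swapped in here because the review dataset is not bundled with the post; the dictionary keys and the `random_state` values are choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the tf-idf matrices used in the post.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

models = {
    "svm": SVC(kernel="linear"),
    "tree": DecisionTreeClassifier(random_state=0),
    "logreg": LogisticRegression(),
}

# fit() returns the estimator itself, so score() can be chained onto it.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}

for name, accuracy in scores.items():
    print(name, round(accuracy, 3))
```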

4.Grid search for the best result

from sklearn.model_selection import GridSearchCV

parameters = {"kernel": ("linear", "rbf"), "C": (1, 4, 8, 16, 32)}
svc = SVC()
clf = GridSearchCV(svc, parameters, cv=5)  # 5-fold cross-validation
clf.fit(train_x_vectors, train_y)

print(clf.score(test_x_vectors, test_y))
print(f1_score(test_y, clf.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
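After fitting, the search object also records which combination won. The sketch below runs a smaller grid on synthetic data (an assumption, used so the example runs without the review file) and reads the fitted attributes `best_params_`, `best_score_` and `cv_results_`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the tf-idf training matrix.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

# A reduced grid: 2 kernels x 2 values of C = 4 candidates.
search = GridSearchCV(SVC(), {"kernel": ("linear", "rbf"), "C": (1, 4)}, cv=5)
search.fit(X, y)

print(search.best_params_)                 # winning parameter combination
print(search.best_score_)                  # its mean cross-validated accuracy
print(len(search.cv_results_["params"]))   # number of candidates tried
```

By default the search refits the best combination on the whole training set, which is why `clf.score` and `clf.predict` can be called on the search object directly.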

5.Saving the model + loading the model

Save the model:

import pickle

with open("sklearn.pkl", "wb") as f:
    pickle.dump(clf, f)

Load the model:

with open("sklearn.pkl", "rb") as f:
    loaded = pickle.load(f)

Predict with the loaded model:

print(test_x[0])
print(loaded.predict(test_x_vectors[0]))
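The pickle above stores only the classifier, so predicting on new raw text later still needs the fitted vectorizer. One possible way around that, shown here as a sketch and not as the approach used in this post, is to bundle both steps in a scikit-learn `Pipeline` and pickle that single object. The training sentences are made up, and `pickle.dumps` / `pickle.loads` stand in for the file round trip.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training data for illustration only.
texts = ["great book", "loved it", "terrible book", "hated it"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw text and then classifies it.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(texts, labels)

# In-memory round trip; writing to a file with pickle.dump works the same way.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

new_docs = ["great story", "terrible plot"]
print(restored.predict(new_docs))
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.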

Sklearn in Practice: Data Processing + Model Building + Grid Search + Saving (Loading) Models