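Text Classification with Logistic Regression

Text classification with sklearn: logistic regression

This is the first post in a series on text classification. It records how to use logistic regression for a text classification task. The dataset can be downloaded from http://thuctc.thunlp.org/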

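The series covers the following topics:
- 1. Text classification with logistic regression
- 2. Text classification with naive Bayes
- 3. Document dimensionality reduction and feature selection with LDA
- 4. Text classification with SVM
- 5. Text classification with a multilayer perceptron (MLPC)
- 6. Word-level text classification with convolutional neural networks, including parameter tuning
- 7. Sentence-level text classification with convolutional neural networks, including parameter tuning
- 8. Fast and efficient text classification with Facebook fastText
- 9. Text classification with RNNs
- 10. Text classification with LSTMs
- 11. Summary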

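1 Data preprocessing

The training data comes from the open-source text classification dataset published by Tsinghua University. The original dataset is fairly large, so what I provide for download are smaller extracts: thu_data_500 contains 500 articles per class and thu_data_3000 contains 3000 articles per class, across 14 classes in total. The data processing code is as follows: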

import os
import codecs
import re

import jieba
from sklearn.utils import shuffle

category = ['星座', '股票', '房产', '时尚', '体育', '社会', '家居', '游戏', '彩票', '科技', '教育', '时政', '娱乐', '财经']

# Maximum number of documents kept per class
# per_class_max_docs = 1000

def load_data_to_mini(path, to_path, per_class_max_docs=1000):
    """
    Process the Tsinghua corpus into the format fastText expects:
    one document per line, '__label__<class> <space-separated tokens>'.
    :param path: root directory of the corpus, one sub-directory per class
    :param to_path: output file for the extracted subset
    :param per_class_max_docs: maximum number of documents kept per class
    :return: list with one list of labelled documents per class
    """
    # Extracted corpus
    corpus = []
    if not os.path.isdir(path):
        print('path error')
        return corpus
    # Walk every class sub-directory under path
    for files in os.listdir(path):
        curr_path = os.path.join(path, files)
        print(curr_path)
        if os.path.isdir(curr_path):
            docs = []
            for file in os.listdir(curr_path)[:per_class_max_docs]:
                file_path = os.path.join(curr_path, file)
                # Read the file once, strip whitespace, then segment with jieba
                with codecs.open(file_path, 'r', encoding='utf-8') as fd:
                    text = re.sub(r'[ \n\r\t]+', '', fd.read())
                docs.append('__label__' + files + ' ' + ' '.join(jieba.cut(text)))
            corpus.append(docs)
    # Write the subset to a new file, one document per line
    with codecs.open(to_path, 'w', encoding='utf-8') as f:
        for docs in corpus:
            for doc in docs:
                f.write(doc + '\n')
    return corpus

The following call extracts the small dataset:

corpus = load_data_to_mini('/root/git/data/THUCNews', 'thu_data_all', 1000)

/root/git/data/THUCNews/娱乐
/root/git/data/THUCNews/星座
/root/git/data/THUCNews/时尚
/root/git/data/THUCNews/股票
/root/git/data/THUCNews/彩票
/root/git/data/THUCNews/体育
/root/git/data/THUCNews/房产
/root/git/data/THUCNews/社会
/root/git/data/THUCNews/财经
/root/git/data/THUCNews/家居
/root/git/data/THUCNews/游戏
/root/git/data/THUCNews/科技
/root/git/data/THUCNews/教育
/root/git/data/THUCNews/时政

Let's look at the extracted result:

print('corpus size(%d,%d)' %(len(corpus), len(corpus[0])))

corpus size(14,1000)

As shown, the result has 14 classes with 1000 documents each. Next, let's look at what an entry of corpus actually contains:

corpus[0][1]

'__label__股票 世基 投资 ： 紧缩 压力 骤然 增加 沪 指 再失 2800 \u3000 \u3000 余炜 \u3000 \u3000 周二 大盘 在 半年线 处 止跌 后 ， 连续 3 日 展开 反弹 ， 昨日 一度 站上 过 2830 点 ， 但 最终 还是 未能 收复 ， 显示 出 20 日线 和 年线 从技术上 对 市场 的 压力 比 想象 中 更加 大 。 弱势 格局 下 ， 利空 传言 纷至沓来 ， 周末 效应 再次 显现 ， 周五 大盘 给 我们 呈现出 的 是 一幕 疲软 下滑 走势 ， 使得 前三天 的 反弹 基本 化为乌有 ， 2800 点 再 一次 失守 ， 股指 重新 来到 半年线 附近 求 支撑 。 \u3000 \u3000 盘面 热点 比较 稀少 ， 其中 资产重组 、 业绩 增长 和 预增 题材 的 几只 品种 涨势 不错 ， 昨日 提到 过 的 广电 信息 、 银鸽投资 连续 涨停 ， 计划 高送 转 的 精诚 铜业 也 受大单 推动 涨停 。 此外 ， 高铁 概念 逆势 重新 活跃 ， 晋亿 实业 最高 逼近 涨停 ， 带动 晋西 车轴 、 中国 北车 、 天马 股份 等 快速 上攻 ， 其中 北车 和 晋 亿 实业 已经 率先 创出 阶段 新高 。 部分 具备 病菌 概念 的 医药 股 也 表现 较 好 ， 自早 盘起 就 展现 强势 ， 莱茵 生物 涨停 ， 紫鑫 药业 、 海王 生物 、 联环 药业 大涨 7% 左右 。 \u3000 \u3000 从 目前 公开 消息 来看 ， 相信 是 货币政策 面 的 利空 预期 在 对 市场 形成 压力 ， 随着 韩国 央行 的 昨日 加息 ， 新兴 经济体 紧缩 预期 骤然 升温 ， 之前 秘鲁 和 泰国 已经 有过 连续 加息 的 动作 ， 与此同时 ， 虽然 西方 主要 国家 仍 在 维持 宽松 ， 但 随着 发达 经济体 复苏 步伐 的 加快 ， 通胀 也 有 抬头 迹象 。 国内 方面 ， 下周 可能 将 公布 2010 年 全年 和 12 月份 的 经济运行 数据 ， 形势 摆在 这里 ， 投资者 有所 担忧 也 是 情理之中 。 欢迎 发表 评论 \xa0 \xa0 我要 评论'

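As shown, each entry starts with the __label__ class tag, followed by the news body. The body has already been segmented with jieba, and the tokens are separated by spaces.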
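To make the line format concrete, here is a minimal sketch of how one such line can be taken apart again. The helper name parse_labeled_line is hypothetical and not part of the original code; it only assumes the '__label__<class> <tokens>' layout shown above.

```python
def parse_labeled_line(line):
    """Split one '__label__<class> <tokens>' line into (label, text)."""
    # partition splits at the first space only, so spaces inside the text survive
    label, _, text = line.rstrip('\n').partition(' ')
    return label, text


label, text = parse_labeled_line('__label__股票 世基 投资 ： 紧缩 压力\n')
print(label)  # __label__股票
print(text)   # 世基 投资 ： 紧缩 压力
```

Splitting at the first space is enough because the class tag itself never contains a space.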

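Next the data is split into samples and labels. Because the data was read class by class, sampling training and test data from it later would cause problems, so the data also has to be shuffled here. I advise against numpy.random.shuffle() for this: in my experience it is slow and runs out of memory very easily. The shuffle from pandas or sklearn is a better choice, and I use the latter. The splitting code is as follows: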

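def split_data_with_label(corpus):
    """
    Split the data into samples and labels.
    :param corpus: path of a file written by load_data_to_mini, or the nested list it returns
    :return: [input_x, input_y]
    """
    input_x = []
    input_y = []
    tag = []
    # Only call os.path.isfile on strings; a nested list is used as-is
    if isinstance(corpus, str) and os.path.isfile(corpus):
        with codecs.open(corpus, 'r', encoding='utf-8') as f:
            for line in f:
                tag.append(line.rstrip('\n'))
    else:
        for docs in corpus:
            for doc in docs:
                tag.append(doc)
    # Shuffle so that later sampling is not skewed by the class-by-class ordering
    tag = shuffle(tag)
    for doc in tag:
        index = doc.find(' ')
        input_y.append(doc[:index])
        input_x.append(doc[index + 1:])
    # Earlier numpy-based shuffle, kept for reference:
    # datasets = np.column_stack([input_x, input_y])
    # np.random.shuffle(datasets)
    # input_x = []
    # input_y = []
    # for i in datasets:
    #     input_x.append(i[:-1])
    #     input_y.append(i[-1:])
    return [input_x, input_y]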

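The function returns two values. The first, input_x, holds the samples, 14*1000 rows in total. The second, input_y, has the same number of rows as input_x, and each row is the class label of the corresponding news sample in input_x.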

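2 Feature selection

Next comes feature extraction. The basic methods include bag-of-words, tf-idf and n-grams. The experiments here mainly use TF-IDF (combined with n-grams); the code is below: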
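As a reminder of what a TF-IDF weight is, here is a small pure-Python sketch. It assumes the smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the default form in sklearn's TfidfVectorizer, and it leaves out the final L2 normalization of each row. The function name tfidf_weights and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


docs = [['大盘', '反弹'], ['大盘', '下跌'], ['球队', '夺冠']]
for w in tfidf_weights(docs):
    print({term: round(value, 3) for term, value in w.items()})
```

A term that occurs in many documents (here 大盘) gets a lower weight than a term that occurs in only one.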

from time import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# train_test_split lives in sklearn.model_selection; the sklearn.cross_validation
# module used originally was deprecated in version 0.18 and announced for removal in 0.20
from sklearn.model_selection import train_test_split
# make_scorer is exported from sklearn.metrics
from sklearn.metrics import make_scorer
from sklearn import linear_model
from sklearn import metrics

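def feature_extractor(input_x, case='tfidf', max_df=1.0, min_df=0.0):
    """
    Feature extraction.
    :param input_x: list of segmented, space-separated documents
    :param case: feature extraction method (currently unused, TF-IDF is always applied)
    :param max_df: upper document-frequency threshold passed to TfidfVectorizer
    :param min_df: lower document-frequency threshold passed to TfidfVectorizer
    :return: sparse TF-IDF matrix
    """
    # Note: the pattern r'\w' matches a single character, so the unigrams and
    # bigrams are built from characters rather than from the jieba words
    return TfidfVectorizer(token_pattern=r'\w', ngram_range=(1, 2), max_df=max_df, min_df=min_df).fit_transform(input_x)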
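The token_pattern passed to TfidfVectorizer decides what counts as a token. The sketch below compares the single-character pattern used in feature_extractor with a whitespace-based pattern that keeps each jieba word intact. The two toy documents and the second pattern are my own illustration, not something the experiments in this post used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['沪 指 反弹', '大盘 反弹']

# Character level: r'\w' yields one token per word character
char_vec = TfidfVectorizer(token_pattern=r'\w', ngram_range=(1, 2))
char_vec.fit(docs)

# Word level: r'(?u)\S+' keeps every whitespace-separated token intact
word_vec = TfidfVectorizer(token_pattern=r'(?u)\S+', ngram_range=(1, 2))
word_vec.fit(docs)

print(sorted(char_vec.vocabulary_))
print(sorted(word_vec.vocabulary_))
```

With the character pattern the word 反弹 is split into 反 and 弹, while the whitespace pattern keeps it as one feature. Which of the two works better on this corpus would have to be measured.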

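Next the data is split into training and test sets. For now I do not use anything more refined such as cross-validation, and simply split by a fixed ratio with the tool sklearn provides. The code is as follows: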

def split_data_to_train_and_test(corpus, indices=0.2, random_state=10, shuffle=True):
    """
    Split the data into training data and test data.
    :param corpus: [input_x, y]
    :param indices: fraction of the data used for testing
    :param random_state: random seed
    :param shuffle: whether to shuffle the data before splitting
    :return: x_train, x_dev, y_train, y_dev
    """
    input_x, y = corpus
    # Split the dataset
    x_train, x_dev, y_train, y_dev = train_test_split(
        input_x, y, test_size=indices, random_state=random_state, shuffle=shuffle)
    print("Vocabulary Size: {:d}".format(input_x.shape[1]))
    print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev)))
    return x_train, x_dev, y_train, y_dev

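The function returns four values: the training samples, the test samples, the training labels and the true test labels. Next we use logistic regression for classification.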

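Logistic regression is a discriminative model. It takes a linear regression and wraps its output in a sigmoid function, which maps the linear result into the probability interval (0, 1). The sigmoid is smooth and steepest around 0.5, so inputs away from the decision boundary are pushed toward the two ends, 0 and 1.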
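The mapping can be checked numerically with a few lines of Python. This is only a sketch of the sigmoid itself, sigmoid(z) = 1 / (1 + exp(-z)), written in a numerically stable form; it is not how sklearn implements the model internally.

```python
import math


def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, avoid overflow in exp(-z)
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in (-6, -1, 0, 1, 6):
    print(z, round(sigmoid(z), 4))
```

A score of 0 maps to exactly 0.5, and scores only a few units away from 0 already map close to 0 or 1.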

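The main parameters of LogisticRegression() are:
- penalty: the regularization term, L1 or L2; the default is L2
- C: the inverse of the regularization strength; smaller values mean stronger regularization
- fit_intercept: a boolean that says whether the linear model includes the bias term b; it is usually included
- solver: the optimization method, one of {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}. liblinear is a relatively good choice for small datasets, while sag and saga are better for large ones. For multi-class problems liblinear is limited to one-vs-rest (ovr), so the other solvers are the wiser choice there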
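Because C is easy to read the wrong way round, here is a small sketch on synthetic data that shows its direction: a smaller C shrinks the weights more. The data comes from make_classification and has nothing to do with the news corpus; the three C values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem, used only to illustrate the effect of C
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, solver='lbfgs', max_iter=1000).fit(X, y)
    # L2 norm of the learned weight vector
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(C, round(norms[C], 3))
```

The weight norm grows as C grows, which is the opposite of what a "penalty coefficient" would do.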

The focus here is on the experiments rather than the theory. The code below uses logistic regression to classify the documents:

def fit_and_predicted(train_x, train_y, test_x, test_y, penalty='l2', C=1.0, solver='lbfgs'):
    """
    Train and predict.
    :param train_x: training samples
    :param train_y: training labels
    :param test_x: test samples
    :param test_y: true test labels
    :return:
    """
    clf = linear_model.LogisticRegression(penalty=penalty, C=C, solver=solver, n_jobs=-1).fit(train_x, train_y)
    predicted = clf.predict(test_x)
    print(metrics.classification_report(test_y, predicted))
    print('accuracy_score: %0.5f' % (metrics.accuracy_score(test_y, predicted)))

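The function above calls LogisticRegression(). Linear classifiers come in many variants; in later experiments we will use Lasso regression, which adds an L1 penalty, and ridge regression, which adds an L2 penalty.

Now for the actual runs.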

# 1. Load the corpus (the file name must match the to_path written by load_data_to_mini)
corpus = split_data_with_label('thu_data_1000')

2.1 TF-IDF with the default (max_df, min_df) = (1.0, 0.0)

input_x, y = corpus
# 2. Feature selection
input_x = feature_extractor(input_x, 'tfidf')
# 3. Split into training data and test data
train_x, test_x, train_y, test_y = split_data_to_train_and_test([input_x, y])

Vocabulary Size: 942969
Train/Dev split: 11200/2800

# 4. Train and test
t0 = time()
print('\t\tLogistic regression text classification with features selected by max_df,min_df=(1.0,0.0)\t\t')
fit_and_predicted(train_x, train_y, test_x, test_y)
print('time used: %0.4fs' % (time() - t0))

Logistic regression text classification with features selected by max_df,min_df=(1.0,0.0)
               precision    recall  f1-score   support

__label__体育       0.93      0.97      0.95       186
__label__娱乐       0.87      0.88      0.88       233
__label__家居       0.88      0.90      0.89       203
__label__彩票       0.98      0.96      0.97       207
__label__房产       0.90      0.90      0.90       178
__label__教育       0.92      0.91      0.92       208
__label__时尚       0.92      0.92      0.92       197
__label__时政       0.83      0.87      0.85       211
__label__星座       0.94      0.97      0.95       202
__label__游戏       0.96      0.91      0.94       202
__label__社会       0.85      0.92      0.88       210
__label__科技       0.88      0.80      0.84       173
__label__股票       0.87      0.82      0.84       196
__label__财经       0.93      0.90      0.91       194

avg / total        0.90      0.90      0.90      2800

accuracy_score: 0.90321
time used: 44.1572s

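Logistic regression performs well on this dataset: average precision, recall and F1 are all 0.90, and accuracy is 0.90321. Next we turn to tf-idf and look at how its parameters affect the features and the results.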

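2.2 TF-IDF: effect of different max_df values on the results

# 2. Feature selection
# Floats are proportions of documents; the integer 5 is an absolute document count.
# 1.5 is out of the [0, 1] range and newer sklearn versions may reject it.
max_df = [0.2, 0.4, 0.5, 0.8, 1.0, 1.5, 5]
for i in max_df:
    input_x, y = corpus
    input_x = feature_extractor(input_x, 'tfidf', max_df=i)
    # 3. Split into training data and test data
    train_x, test_x, train_y, test_y = split_data_to_train_and_test([input_x, y])
    # 4. Train and test
    t0 = time()
    print('\t Logistic regression text classification with features selected by max_df,min_df=(%.1f,0.0)\t\t\n' % (i))
    fit_and_predicted(train_x, train_y, test_x, test_y)
    print('time used: %0.4fs' % (time() - t0))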

Vocabulary Size: 670795
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(0.2,0.0)
             precision    recall  f1-score   support

_体育_            0.95      0.95      0.95       109
_娱乐_            0.73      0.88      0.80        92
_家居_            0.98      0.88      0.93       109
_彩票_            0.99      0.96      0.97        97
_房产_            0.94      0.94      0.94        97
_教育_            0.94      0.89      0.92       104
_时尚_            0.87      0.87      0.87       110
_时政_            0.86      0.89      0.87        93
_星座_            0.96      0.93      0.95       105
_游戏_            0.97      0.91      0.94       103
_社会_            0.84      0.88      0.86        99
_科技_            0.85      0.86      0.86        93
_股票_            0.82      0.88      0.85        78
_财经_            0.97      0.91      0.94       111

avg / total      0.91      0.90      0.91      1400

accuracy_score: 0.90429
time used: 63.8214s
Vocabulary Size: 671113
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(0.4,0.0)
             precision    recall  f1-score   support

_体育_            0.97      0.94      0.96       109
_娱乐_            0.78      0.89      0.83        92
_家居_            0.94      0.87      0.90       109
_彩票_            0.98      0.97      0.97        97
_房产_            0.93      0.92      0.92        97
_教育_            0.95      0.88      0.92       104
_时尚_            0.90      0.86      0.88       110
_时政_            0.83      0.90      0.87        93
_星座_            0.94      0.95      0.95       105
_游戏_            0.95      0.90      0.93       103
_社会_            0.86      0.90      0.88        99
_科技_            0.88      0.87      0.88        93
_股票_            0.80      0.91      0.85        78
_财经_            0.95      0.89      0.92       111

avg / total      0.91      0.91      0.91      1400

accuracy_score: 0.90500
time used: 62.6203s
Vocabulary Size: 671183
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(0.5,0.0)
             precision    recall  f1-score   support

_体育_            0.97      0.94      0.96       109
_娱乐_            0.79      0.89      0.84        92
_家居_            0.93      0.87      0.90       109
_彩票_            0.98      0.97      0.97        97
_房产_            0.93      0.91      0.92        97
_教育_            0.94      0.88      0.91       104
_时尚_            0.91      0.87      0.89       110
_时政_            0.83      0.89      0.86        93
_星座_            0.94      0.96      0.95       105
_游戏_            0.95      0.89      0.92       103
_社会_            0.85      0.90      0.87        99
_科技_            0.88      0.87      0.88        93
_股票_            0.80      0.91      0.85        78
_财经_            0.94      0.88      0.91       111

avg / total      0.91      0.90      0.90      1400

accuracy_score: 0.90357
time used: 63.2442s
Vocabulary Size: 671261
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(0.8,0.0)
             precision    recall  f1-score   support

_体育_            0.96      0.94      0.95       109
_娱乐_            0.79      0.88      0.83        92
_家居_            0.92      0.89      0.91       109
_彩票_            0.98      0.97      0.97        97
_房产_            0.93      0.91      0.92        97
_教育_            0.95      0.88      0.92       104
_时尚_            0.91      0.87      0.89       110
_时政_            0.83      0.88      0.85        93
_星座_            0.94      0.97      0.96       105
_游戏_            0.95      0.89      0.92       103
_社会_            0.85      0.89      0.87        99
_科技_            0.88      0.85      0.86        93
_股票_            0.77      0.91      0.84        78
_财经_            0.95      0.88      0.92       111

avg / total      0.91      0.90      0.90      1400

accuracy_score: 0.90143
time used: 64.9945s
Vocabulary Size: 671267
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(1.0,0.0)
             precision    recall  f1-score   support

_体育_            0.95      0.94      0.94       109
_娱乐_            0.80      0.88      0.84        92
_家居_            0.95      0.89      0.92       109
_彩票_            0.98      0.95      0.96        97
_房产_            0.93      0.92      0.92        97
_教育_            0.95      0.88      0.92       104
_时尚_            0.91      0.87      0.89       110
_时政_            0.81      0.90      0.85        93
_星座_            0.94      0.98      0.96       105
_游戏_            0.95      0.90      0.93       103
_社会_            0.84      0.89      0.86        99
_科技_            0.89      0.84      0.86        93
_股票_            0.76      0.91      0.83        78
_财经_            0.96      0.87      0.92       111

avg / total      0.91      0.90      0.90      1400

accuracy_score: 0.90214
time used: 67.9015s
Vocabulary Size: 671267
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(1.5,0.0)
             precision    recall  f1-score   support

_体育_            0.95      0.94      0.94       109
_娱乐_            0.80      0.88      0.84        92
_家居_            0.95      0.89      0.92       109
_彩票_            0.98      0.95      0.96        97
_房产_            0.93      0.92      0.92        97
_教育_            0.95      0.88      0.92       104
_时尚_            0.91      0.87      0.89       110
_时政_            0.81      0.90      0.85        93
_星座_            0.94      0.98      0.96       105
_游戏_            0.95      0.90      0.93       103
_社会_            0.84      0.89      0.86        99
_科技_            0.89      0.84      0.86        93
_股票_            0.76      0.91      0.83        78
_财经_            0.96      0.87      0.92       111

avg / total      0.91      0.90      0.90      1400

accuracy_score: 0.90214
time used: 66.4803s
Vocabulary Size: 562057
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(5.0,0.0)
             precision    recall  f1-score   support

_体育_            0.94      0.83      0.88       109
_娱乐_            0.61      0.75      0.67        92
_家居_            0.91      0.55      0.69       109
_彩票_            0.95      0.84      0.89        97
_房产_            0.94      0.81      0.87        97
_教育_            0.92      0.80      0.86       104
_时尚_            0.82      0.72      0.77       110
_时政_            0.77      0.78      0.78        93
_星座_            0.86      0.75      0.80       105
_游戏_            0.96      0.72      0.82       103
_社会_            0.77      0.78      0.77        99
_科技_            0.64      0.81      0.71        93
_股票_            0.33      0.88      0.48        78
_财经_            0.97      0.67      0.79       111

avg / total      0.83      0.76      0.78      1400

accuracy_score: 0.75929
time used: 22.6883s

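The results show that accuracy peaks at max_df=0.4 (0.90500), although every value from 0.2 to 1.0 lands within about 0.4 percentage points of the others. Filtering very frequent terms in this way is why many demos set a TF-IDF threshold. Two of the values behave differently from the rest: 1.5 gives exactly the same result as 1.0, so it filters nothing, and 5 is an integer, which is read as an absolute document count, so every term that occurs in more than 5 documents is dropped and accuracy falls to 0.75929. Next, min_df is tested with max_df fixed at 1.0.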
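How a max_df value is turned into a cut-off can be sketched in a few lines of plain Python. This mimics the rule as I understand it from sklearn's vectorizers (a float is a proportion of the documents, an integer is an absolute count, and terms above the limit are dropped); the function filter_by_max_df and the toy documents are hypothetical.

```python
from collections import Counter


def filter_by_max_df(docs, max_df):
    """Return the sorted terms whose document frequency is within max_df."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Integer: absolute document count; float: proportion of all documents
    limit = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(term for term, count in df.items() if count <= limit)


docs = [['a', 'b'], ['a', 'c'], ['a', 'b', 'd'], ['a', 'e']]
print(filter_by_max_df(docs, 0.5))  # ['b', 'c', 'd', 'e']
print(filter_by_max_df(docs, 1.0))  # ['a', 'b', 'c', 'd', 'e']
print(filter_by_max_df(docs, 1))    # ['c', 'd', 'e']
```

The term 'a' occurs in every document, so a proportion of 0.5 removes it, while the integer 1 also removes 'b', which occurs in two documents.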

2.3 TF-IDF: effect of different min_df values on the results

# 2. Feature selection
min_df = [0., 0.1, 0.2, 0.3, 0.4]
for i in min_df:
    input_x, y = corpus
    input_x = feature_extractor(input_x, 'tfidf', max_df=1.0, min_df=i)
    # 3. Split into training data and test data
    train_x, test_x, train_y, test_y = split_data_to_train_and_test([input_x, y])
    # 4. Train and test
    t0 = time()
    print('\t Logistic regression text classification with features selected by max_df,min_df=(1.0,%.1f)\t\t\n' % (i))
    fit_and_predicted(train_x, train_y, test_x, test_y)
    print('time used: %0.4fs' % (time() - t0))

Vocabulary Size: 671267
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(1.0,0.0)
             precision    recall  f1-score   support

_体育_            0.95      0.94      0.94       109
_娱乐_            0.80      0.88      0.84        92
_家居_            0.95      0.89      0.92       109
_彩票_            0.98      0.95      0.96        97
_房产_            0.93      0.92      0.92        97
_教育_            0.95      0.88      0.92       104
_时尚_            0.91      0.87      0.89       110
_时政_            0.81      0.90      0.85        93
_星座_            0.94      0.98      0.96       105
_游戏_            0.95      0.90      0.93       103
_社会_            0.84      0.89      0.86        99
_科技_            0.89      0.84      0.86        93
_股票_            0.76      0.91      0.83        78
_财经_            0.96      0.87      0.92       111

avg / total      0.91      0.90      0.90      1400

accuracy_score: 0.90214
time used: 67.8985s
Vocabulary Size: 1052
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(1.0,0.1)
             precision    recall  f1-score   support

_体育_            0.93      0.94      0.94       109
_娱乐_            0.75      0.75      0.75        92
_家居_            0.87      0.85      0.86       109
_彩票_            0.97      0.97      0.97        97
_房产_            0.91      0.88      0.89        97
_教育_            0.95      0.88      0.92       104
_时尚_            0.88      0.84      0.86       110
_时政_            0.82      0.89      0.86        93
_星座_            0.93      0.96      0.94       105
_游戏_            0.95      0.88      0.91       103
_社会_            0.84      0.88      0.86        99
_科技_            0.79      0.78      0.79        93
_股票_            0.73      0.90      0.80        78
_财经_            0.94      0.86      0.90       111

avg / total      0.88      0.88      0.88      1400

accuracy_score: 0.87786
time used: 1.8236s
Vocabulary Size: 472
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(1.0,0.2)
             precision    recall  f1-score   support

_体育_            0.89      0.92      0.90       109
_娱乐_            0.71      0.79      0.75        92
_家居_            0.80      0.79      0.79       109
_彩票_            0.97      0.86      0.91        97
_房产_            0.92      0.80      0.86        97
_教育_            0.96      0.84      0.89       104
_时尚_            0.86      0.82      0.84       110
_时政_            0.77      0.84      0.80        93
_星座_            0.85      0.90      0.88       105
_游戏_            0.78      0.66      0.72       103
_社会_            0.77      0.87      0.82        99
_科技_            0.71      0.73      0.72        93
_股票_            0.65      0.83      0.73        78
_财经_            0.92      0.85      0.88       111

avg / total      0.83      0.82      0.82      1400

accuracy_score: 0.82214
time used: 1.4046s
Vocabulary Size: 244
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(1.0,0.3)
             precision    recall  f1-score   support

_体育_            0.87      0.89      0.88       109
_娱乐_            0.73      0.66      0.70        92
_家居_            0.81      0.74      0.78       109
_彩票_            0.90      0.84      0.87        97
_房产_            0.90      0.80      0.85        97
_教育_            0.84      0.78      0.81       104
_时尚_            0.78      0.83      0.81       110
_时政_            0.68      0.78      0.73        93
_星座_            0.83      0.86      0.84       105
_游戏_            0.75      0.61      0.67       103
_社会_            0.70      0.84      0.76        99
_科技_            0.69      0.72      0.71        93
_股票_            0.60      0.78      0.68        78
_财经_            0.87      0.78      0.82       111

avg / total      0.79      0.78      0.78      1400

accuracy_score: 0.78143
time used: 1.1189s
Vocabulary Size: 154
Train/Dev split: 5600/1400
Logistic regression text classification with features selected by max_df,min_df=(1.0,0.4)
             precision    recall  f1-score   support

_体育_            0.82      0.83      0.82       109
_娱乐_            0.66      0.63      0.64        92
_家居_            0.75      0.60      0.66       109
_彩票_            0.91      0.85      0.88        97
_房产_            0.82      0.75      0.78        97
_教育_            0.77      0.72      0.74       104
_时尚_            0.70      0.75      0.73       110
_时政_            0.65      0.78      0.71        93
_星座_            0.85      0.86      0.85       105
_游戏_            0.63      0.57      0.60       103
_社会_            0.70      0.76      0.73        99
_科技_            0.61      0.67      0.64        93
_股票_            0.54      0.67      0.60        78
_财经_            0.78      0.72      0.75       111

avg / total      0.73      0.73      0.73      1400

accuracy_score: 0.72643
time used: 1.0229s

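The experiments above suggest that min_df=0.0 is a reasonable choice: accuracy drops steadily as min_df grows. So we keep the default.

Admittedly, searching for parameters by varying one at a time feels flawed, but it will do for now. In the following experiments we use max_df=0.5 and min_df=0.0.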
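One way around the one-variable-at-a-time concern is to search both thresholds jointly, with the vectorizer and the classifier wrapped in a single Pipeline so that every combination is cross-validated together. The sketch below shows the idea on a tiny made-up corpus; the texts, labels, candidate values and the whitespace token pattern are all assumptions for illustration, not the setup used in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus: four stock documents and four sports documents
texts = ['股票 大盘 反弹', '股票 涨停 大盘', '大盘 下跌 股票', '基金 股票 上涨',
         '球队 比赛 进球', '比赛 球员 夺冠', '球队 夺冠 比赛', '联赛 球队 进球']
labels = ['股票'] * 4 + ['体育'] * 4

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(token_pattern=r'(?u)\S+')),
    ('clf', LogisticRegression(max_iter=1000)),
])
# Step name, double underscore, parameter name addresses a parameter inside the pipeline
param_grid = {'tfidf__max_df': [0.5, 1.0], 'tfidf__min_df': [1, 2]}

search = GridSearchCV(pipe, param_grid, cv=2, scoring='accuracy')
search.fit(texts, labels)
print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refitted on the training part of every fold, so the document frequencies never see the held-out documents.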

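3 Tuning logistic regression

3.1 Cross-validation

Before searching for the best parameters, let's look at another class sklearn provides, LogisticRegressionCV(), which has standard k-fold cross-validation built in and uses it to choose the regularization strength.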

def fit_and_predicted_use_CV(train_x, train_y, test_x, test_y, penalty='l2', Cs=10, solver='lbfgs', cv=10):
    """
    Train and predict.
    :param train_x: training samples
    :param train_y: training labels
    :param test_x: test samples
    :param test_y: true test labels
    :return:
    """
    # LogisticRegressionCV has no C argument; it takes Cs, either a list of
    # candidate values or an int giving the size of a logarithmic grid
    clf = linear_model.LogisticRegressionCV(penalty=penalty, Cs=Cs, solver=solver, n_jobs=-1, cv=cv).fit(train_x, train_y)
    predicted = clf.predict(test_x)
    print(metrics.classification_report(test_y, predicted))
    print('accuracy_score: %0.5f' % (metrics.accuracy_score(test_y, predicted)))

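input_x, y = corpus
# max_df=0.5 as chosen at the end of section 2 (the original reused the stale loop variable i)
max_df = 0.5
input_x = feature_extractor(input_x, 'tfidf', max_df=max_df)
# 3. Split into training data and test data
train_x, test_x, train_y, test_y = split_data_to_train_and_test([input_x, y])
# 4. Train and test
t0 = time()
print('\t Logistic regression text classification with features selected by max_df,min_df=(%.1f,0.0)\t\t\n' % (max_df))
fit_and_predicted_use_CV(train_x, train_y, test_x, test_y)
print('time used: %0.4fs' % (time() - t0))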

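In my runs (the output is not reproduced here), the cross-validated model tended to do somewhat better than the model trained on a single fixed split.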
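To see what LogisticRegressionCV actually selects, here is a small sketch on synthetic data. The candidate list and the make_classification data are assumptions for illustration only; the selected value is read back from the C_ attribute after fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic binary problem, unrelated to the news corpus
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

candidates = [0.01, 0.1, 1.0, 10.0]
clf = LogisticRegressionCV(Cs=candidates, cv=5, max_iter=1000).fit(X, y)

# C_ holds the value picked by 5-fold cross-validation
best_C = float(np.ravel(clf.C_)[0])
print(best_C)
print(round(clf.score(X, y), 3))
```

The model is refitted on the full training data with the selected value, so clf can be used for prediction directly.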

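3.2 Searching for the best logistic regression parameters

Now we use the grid search that sklearn provides to look for the best parameters. Grid search is really a brute-force search: it tries every combination of the candidate values.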

import numpy as np
# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed)
from sklearn.model_selection import GridSearchCV

def train_and_predicted_with_graid(corpus, param_grid, cv=5):
    input_x, y = corpus
    # Macro-averaged metrics; currently unused, the search below scores accuracy only
    scoring = ['precision_macro', 'recall_macro', 'f1_macro']
    clf = linear_model.LogisticRegression(n_jobs=-1)
    grid = GridSearchCV(clf, param_grid, cv=cv, scoring='accuracy')
    # fit returns the fitted GridSearchCV object itself
    scores = grid.fit(input_x, y)
    print('parameters:')
    best_parameters = grid.best_estimator_.get_params()
    for param_name in sorted(best_parameters):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    return scores

C = [0.1, 0.2, 0.5, 0.8, 1.5, 3, 5]
fit_intercept = [True, False]
penalty = ['l1', 'l2']
# Full solver list for reference; it is overridden with saga only on the next line.
# Note that newton-cg, lbfgs and sag do not support the l1 penalty.
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
solver = ['saga']
param_grid = dict(C=C, fit_intercept=fit_intercept, penalty=penalty, solver=solver)
input_x, y = corpus
input_x = feature_extractor(input_x, 'tfidf', max_df=0.5, min_df=0.0)
scores = train_and_predicted_with_graid([input_x, y], cv=5, param_grid=param_grid)

parameters:
	C: 5
	class_weight: None
	dual: False
	fit_intercept: True
	intercept_scaling: 1
	max_iter: 100
	multi_class: 'ovr'
	n_jobs: -1
	penalty: 'l2'
	random_state: None
	solver: 'saga'
	tol: 0.0001
	verbose: 0
	warm_start: False
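To make the cost of this brute-force search concrete, the combinations can be enumerated with the standard library alone. The grid below repeats the candidate values used above; the enumeration itself is only a sketch of what a grid search iterates over, not sklearn's implementation.

```python
from itertools import product

grid = {
    'C': [0.1, 0.2, 0.5, 0.8, 1.5, 3, 5],
    'fit_intercept': [True, False],
    'penalty': ['l1', 'l2'],
    'solver': ['saga'],
}

names = sorted(grid)
# Cartesian product of all candidate lists, one dict per combination
combos = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

cv = 5
print(len(combos))       # 28
print(len(combos) * cv)  # 140
```

With 5-fold cross-validation that is 140 model fits, plus one final refit on the whole data with the best combination, which is why the search is slow on a vocabulary of this size.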

4 Other linear classifiers

4.1 Simple linear regression

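def fit_and_predicted_with_linerCV(train_x, train_y, test_x, test_y, alpha=1.0, cv=10):
    """
    Train and predict.
    :param train_x: training samples
    :param train_y: training labels
    :param test_x: test samples
    :param test_y: true test labels
    :return:
    """
    # This listing still uses LogisticRegressionCV. The original passed the
    # module-level grid lists penalty, C and solver, which fails, so the
    # estimator defaults are used here. alpha is unused.
    clf = linear_model.LogisticRegressionCV(n_jobs=-1, cv=cv).fit(train_x, train_y)
    predicted = clf.predict(test_x)
    print(metrics.classification_report(test_y, predicted))
    print('accuracy_score: %0.5f' % (metrics.accuracy_score(test_y, predicted)))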

input_x, y = corpus
# max_df=0.5 as chosen at the end of section 2 (the original reused the stale loop variable i)
input_x = feature_extractor(input_x, 'tfidf', max_df=0.5)
# 3. Split into training data and test data
train_x, test_x, train_y, test_y = split_data_to_train_and_test([input_x, y])
# 4. Train and test
t0 = time()
print('\t Text classification with linear regression\t\t\n')
fit_and_predicted_with_linerCV(train_x, train_y, test_x, test_y)
print('time used: %0.4fs' % (time() - t0))

{'fit_time': array([ 0.69856882,  0.6891861 ,  0.68457079,  0.68122745,  0.68401599]),
 'score_time': array([ 0.24055672,  0.25055385,  0.24642444,  0.24583435,  0.25062966]),
 'test_f1_macro': array([ 0.93190598,  0.93358814,  0.92900074,  0.93620104,  0.93139325]),
 'test_precision_macro': array([ 0.93411186,  0.93509947,  0.93082131,  0.93790787,  0.93312355]),
 'test_recall_macro': array([ 0.93178571,  0.93357143,  0.92892857,  0.93607143,  0.93142857]),
 'train_f1_macro': array([ 0.95534592,  0.95516529,  0.95665886,  0.95573948,  0.95629695]),
 'train_precision_macro': array([ 0.95629235,  0.95618146,  0.95767379,  0.9566414 ,  0.95725075]),
 'train_recall_macro': array([ 0.95526786,  0.95508929,  0.95660714,  0.95571429,  0.95625   ])}

4.2 Lasso with the L1 norm

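def fit_and_predicted_with_LassoCV(train_x, train_y, test_x, test_y):
    """
    Train and predict.
    :param train_x: training samples
    :param train_y: training labels
    :param test_x: test samples
    :param test_y: true test labels
    :return:
    """
    # The original called linear_model.LassoCV(), which is a regressor and cannot
    # be fitted on string class labels. As a substitution, an L1-penalized
    # logistic regression is used here as the closest classifier.
    clf = linear_model.LogisticRegression(penalty='l1', solver='saga', n_jobs=-1).fit(train_x, train_y)
    predicted = clf.predict(test_x)
    print(metrics.classification_report(test_y, predicted))
    print('accuracy_score: %0.5f' % (metrics.accuracy_score(test_y, predicted)))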

input_x, y = corpus
# max_df=0.5 as chosen at the end of section 2 (the original reused the stale loop variable i)
input_x = feature_extractor(input_x, 'tfidf', max_df=0.5)
# 3. Split into training data and test data
train_x, test_x, train_y, test_y = split_data_to_train_and_test([input_x, y])
# 4. Train and test
t0 = time()
print('\t Text classification with linear regression\t\t\n')
fit_and_predicted_with_LassoCV(train_x, train_y, test_x, test_y)
print('time used: %0.4fs' % (time() - t0))


4.3 Ridge regression with the L2 norm

def fit_and_predicted_with_RidgeCV(train_x, train_y, test_x, test_y):
    """
    Train and predict.
    :param train_x: training samples
    :param train_y: training labels
    :param test_x: test samples
    :param test_y: true test labels
    :return:
    """
    # RidgeClassifierCV is the classification variant of ridge regression
    clf = linear_model.RidgeClassifierCV().fit(train_x, train_y)
    predicted = clf.predict(test_x)
    print(metrics.classification_report(test_y, predicted))
    print('accuracy_score: %0.5f' % (metrics.accuracy_score(test_y, predicted)))

input_x, y = corpus
# max_df=0.5 as chosen at the end of section 2 (the original reused the stale loop variable i)
input_x = feature_extractor(input_x, 'tfidf', max_df=0.5)
# 3. Split into training data and test data
train_x, test_x, train_y, test_y = split_data_to_train_and_test([input_x, y])
# 4. Train and test
t0 = time()
print('\t Text classification with linear regression\t\t\n')
fit_and_predicted_with_RidgeCV(train_x, train_y, test_x, test_y)
print('time used: %0.4fs' % (time() - t0))


4.4 Linear regression with an elastic net penalty

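def fit_and_predicted_with_ElasticNetCV(train_x, train_y, test_x, test_y):
    """
    Train and predict.
    :param train_x: training samples
    :param train_y: training labels
    :param test_x: test samples
    :param test_y: true test labels
    :return:
    """
    # The original called linear_model.MultiTaskElasticNetCV(), a multi-output
    # regressor that cannot be fitted on string class labels. As a substitution,
    # an elastic-net-penalized logistic regression is used here; l1_ratio=0.5 is
    # an arbitrary, untuned mix of L1 and L2.
    clf = linear_model.LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5, n_jobs=-1).fit(train_x, train_y)
    predicted = clf.predict(test_x)
    print(metrics.classification_report(test_y, predicted))
    print('accuracy_score: %0.5f' % (metrics.accuracy_score(test_y, predicted)))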

input_x, y = corpus
# max_df=0.5 as chosen at the end of section 2 (the original reused the stale loop variable i)
input_x = feature_extractor(input_x, 'tfidf', max_df=0.5)
# 3. Split into training data and test data
train_x, test_x, train_y, test_y = split_data_to_train_and_test([input_x, y])
# 4. Train and test
t0 = time()
print('\t Text classification with linear regression\t\t\n')
fit_and_predicted_with_ElasticNetCV(train_x, train_y, test_x, test_y)
print('time used: %0.4fs' % (time() - t0))


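Linear models have many variants. Being simple, efficient and highly interpretable, they are widely used in machine learning. I won't go further here; readers can look into them on their own.

5 Summary

This post records a text classification task done with sklearn and logistic regression. In the feature selection part, validating the TF-IDF parameters gave a relatively good max_df of around 0.4 to 0.5.

After choosing the features, we applied cross-validation to the dataset and found that it can improve the model. I recommend using cross-validation when splitting datasets later on.

Taking logistic regression as the example, I ran a parameter search for the linear classifier and then trained the logistic regression text classifier with the best parameters. Its accuracy can reach above 91% (the single-split runs logged above reach about 90%).

Finally, we ran experiments with other representative linear classifiers, without tuning them. Of these, the L1 penalty produces sparse weights, while elastic net combines the strengths and weaknesses of L1 and L2; geometrically its constraint region lies between the two, and it is reported in the literature to outperform both L1 and L2.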
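The sparsity of L1 can be checked with a quick count of zero weights. The sketch below uses synthetic data from make_classification and an arbitrary C=0.1 with the liblinear solver; it illustrates the general effect and says nothing about the news corpus.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 50 features, of which only 5 carry information about the class
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

zero_counts = {}
for penalty in ('l1', 'l2'):
    clf = LogisticRegression(penalty=penalty, C=0.1, solver='liblinear').fit(X, y)
    # Number of weights that are exactly zero
    zero_counts[penalty] = int(np.sum(clf.coef_ == 0))
    print(penalty, zero_counts[penalty])
```

The L1 model sets many of the uninformative weights to exactly zero, whereas the L2 model only shrinks them.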