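ML Classifier Module

This post continues with the example from the previous one. The task is spam classification, a supervised learning problem.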

1. Random Forest + KFold

import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    # Percentage of non-space characters that are punctuation
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    # Strip punctuation, lowercase, tokenize, drop stopwords, then stem
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
# Recent scikit-learn versions reject mixed str/int column names, so make them all strings
X_features.columns = X_features.columns.astype(str)
X_features.head()

Next, build the model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestClassifier(n_jobs=-1)  # build trees in parallel on all cores
kfold = KFold(n_splits=10)
cross_val_score(rf, X_features, data["label"], cv=kfold, scoring="accuracy", n_jobs=-1)
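To make it clearer what cross_val_score does with a KFold splitter, here is a minimal sketch that runs the folds by hand. It uses synthetic data from make_classification rather than the SMS data, and the helper name manual_kfold_accuracy is hypothetical, introduced only for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score


def manual_kfold_accuracy(model, X, y, n_splits=5):
    """Train on k-1 folds, score on the held-out fold, repeat for every fold."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        fold_model = clone(model)  # fresh, unfitted copy for each fold
        fold_model.fit(X[train_idx], y[train_idx])
        scores.append(fold_model.score(X[test_idx], y[test_idx]))  # accuracy
    return np.array(scores)


# Synthetic stand-in data (assumption: not the SMS spam set)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=20, random_state=0)

manual_scores = manual_kfold_accuracy(rf_demo, X_demo, y_demo)
library_scores = cross_val_score(rf_demo, X_demo, y_demo,
                                 cv=KFold(n_splits=5), scoring="accuracy")
print("manual :", manual_scores.mean())
print("library:", library_scores.mean())
```

Each fold yields one accuracy value, so the usual summary is the mean (and spread) of the returned array.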

2. Holdout Test Set Evaluation

from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

xTrain, xTest, yTrain, yTest = train_test_split(X_features, data.label, test_size=0.2)

rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf_model = rf.fit(xTrain, yTrain)

# find out the most important features with respect to the model
sorted(zip(rf_model.feature_importances_, xTrain.columns), reverse=True)[:5]

y_pred = rf_model.predict(xTest)
precision, recall, fscore, support = score(yTest, y_pred, pos_label="spam", average="binary")
print('precision: {} / recall: {} / accuracy: {}'.format(precision, recall, (y_pred==yTest).sum()/len(y_pred)))
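As a reminder of what the printed numbers mean, the sketch below computes precision, recall and accuracy by hand on a few made-up labels (invented for illustration, not taken from the SMS data) and compares them with precision_recall_fscore_support. Precision is TP / (TP + FP) and recall is TP / (TP + FN), with "spam" as the positive class.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, invented for illustration
y_true = np.array(["spam", "ham", "spam", "ham", "spam", "ham", "ham", "spam"])
y_hat = np.array(["spam", "ham", "ham", "ham", "spam", "spam", "ham", "ham"])

tp = np.sum((y_hat == "spam") & (y_true == "spam"))  # spam correctly flagged
fp = np.sum((y_hat == "spam") & (y_true == "ham"))   # ham wrongly flagged
fn = np.sum((y_hat == "ham") & (y_true == "spam"))   # spam that slipped through

manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_accuracy = np.mean(y_hat == y_true)

sk_precision, sk_recall, _, _ = precision_recall_fscore_support(
    y_true, y_hat, pos_label="spam", average="binary")

print(manual_precision, sk_precision)
print(manual_recall, sk_recall)
print(manual_accuracy)
```

For a spam filter, precision answers "of the messages flagged as spam, how many really were spam", while recall answers "of the real spam, how much was caught".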

3. Grid Search + Model Evaluation

First, a simple hand-rolled grid search.

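# Reuses RandomForestClassifier, train_test_split and score imported in section 2
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)

def train_RF(n_est, depth):
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    prec, recall, fscore, sup = score(y_test, y_pred, pos_label="spam", average="binary")
    print("Est:{}/Depth:{}\nprecision:{}/recall:{}/accur:{}".format(n_est, depth, prec, recall, (y_pred==y_test).sum()/len(y_pred)))

# Try every combination of the two hyperparameters
for n_est in [10, 20, 50]:
    for depth in range(10, 40, 10):
        train_RF(n_est, depth)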

Alternatively, use scikit-learn's built-in GridSearchCV.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300], 'max_depth': [30, 60, 90, None]}
gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs_fit = gs.fit(X_features, data["label"])  # X_features: the feature matrix built in section 1
pd.DataFrame(gs_fit.cv_results_).sort_values("mean_test_score", ascending=False)[:5]
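Besides sorting cv_results_, a fitted GridSearchCV also exposes the winning combination directly. The sketch below shows this on a small synthetic dataset (an assumption for illustration; the grid values are deliberately tiny so that it runs quickly).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the SMS spam set)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])  # 2 x 2 combinations
best_model = search.best_estimator_  # refit on the full data (refit=True is the default)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best combination
```

Every combination is trained once per fold, so the cost grows as the product of the grid sizes times the cv value.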

Running GridSearchCV on my machine raised a MemoryError. Other blogs cover the fix, which is to increase virtual memory; I won't go into the details here.
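One plausible contributor to the memory pressure is the .toarray() call, which turns the sparse TF-IDF matrix into a dense one. A possible alternative, not used in the original post, is to keep everything sparse with scipy.sparse.hstack, since RandomForestClassifier accepts sparse input. The sketch below uses a handful of invented messages purely for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy messages (assumption: stand-ins for the SMS data)
texts = ["free prize call now", "are we meeting today",
         "win cash now!!!", "see you at lunch"]
labels = ["spam", "ham", "spam", "ham"]

tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(texts)  # stays a scipy sparse matrix

# Hand-made numeric feature: message length without spaces
body_len = np.array([[len(t) - t.count(" ")] for t in texts])

# Stack the numeric column next to the TF-IDF columns without densifying
X_sparse = sparse.hstack([sparse.csr_matrix(body_len), X_text]).tocsr()

rf_sparse = RandomForestClassifier(n_estimators=10, random_state=0)
rf_sparse.fit(X_sparse, labels)
preds = rf_sparse.predict(X_sparse)
print(X_sparse.shape, sparse.issparse(X_sparse))
```

The trade-off is that the features no longer live in a DataFrame, so column names have to be tracked separately, for example through tfidf.vocabulary_.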

4. Gradient Boost

Definition: an ensemble learning method that iteratively combines weak learners into a strong learner by focusing on the mistakes of prior iterations. Based on decision trees.

Differences from RF:

RF:

Bagging, so training can be done in parallel.
Unweighted voting for final prediction.
Easier to tune, harder to overfit.

Gradient Boosting:

Boosting, so training must be done iteratively.
Weighted voting for final prediction.
Harder to tune, easier to overfit.
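The contrast between the two lists can be seen in scikit-learn itself. In the sketch below (synthetic data, for illustration only), the forest's predicted probabilities are the plain average of its independent trees, while the boosted model exposes its predictions stage by stage through staged_predict, because each tree builds on the ones before it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in data (assumption: not the SMS spam set)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Bagging: trees are independent and averaged with equal weight
rf_demo = RandomForestClassifier(n_estimators=15, random_state=0).fit(X_demo, y_demo)
tree_mean = np.mean([tree.predict_proba(X_demo) for tree in rf_demo.estimators_], axis=0)
forest_proba = rf_demo.predict_proba(X_demo)
print("forest == mean of trees:", np.allclose(tree_mean, forest_proba))

# Boosting: trees are added one after another, each correcting the current ensemble
gb_demo = GradientBoostingClassifier(n_estimators=15, random_state=0).fit(X_demo, y_demo)
stage_acc = [np.mean(stage_pred == y_demo)
             for stage_pred in gb_demo.staged_predict(X_demo)]
print("training accuracy after first and last stage:", stage_acc[0], stage_acc[-1])
```

Watching the staged accuracy on a held-out set, rather than the training set used here, is one common way to spot when boosting starts to overfit.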

Tradeoffs of GB:

Pros:

powerful
accepts various types of inputs
can be used for classification or regression
outputs feature importance

Cons:

longer to train
more likely to overfit
more difficult to properly tune

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

gb = GradientBoostingClassifier()
param = {
    "n_estimators": [100, 150],
    "max_depth": [7, 11, 15],
    "learning_rate": [0.1]
}
gs = GridSearchCV(gb, param, cv=5, n_jobs=-1)
cv_fit = gs.fit(X_features, data.label)  # X_features: the feature matrix built in section 1
pd.DataFrame(cv_fit.cv_results_).sort_values("mean_test_score", ascending=False)[:5]

5. Pipeline Summary

read in raw text
clean text and tokenize
feature engineering
fit simple model
tune hyperparameters and evaluate model
final model selection

Vectorizers should be fit on the training set and only be used to transform the test set.
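A small sketch of that rule, using a few invented messages: the vocabulary and IDF weights are learned from the training text only, and the test text is merely transformed, so words never seen in training are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy messages (assumption: stand-ins for the SMS data)
train_docs = ["free prize call now", "see you at lunch", "call me later"]
test_docs = ["free lunch tomorrow", "prize draw now"]

vect = TfidfVectorizer()
vect.fit(train_docs)                        # vocabulary and IDF come from training text only
train_matrix = vect.transform(train_docs)
test_matrix = vect.transform(test_docs)     # "tomorrow" and "draw" are unseen and ignored

vocab = sorted(vect.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
```

Both matrices end up with the same columns, which is what lets a model trained on one predict on the other. Fitting on the full dataset instead would leak test-set word statistics into training.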

Process:

split data into training and test sets -> fit vectorizers on the training set and use them to transform the test set -> fit the best RF and GB models on the training set and predict on the test set -> evaluate the two models' results to select the best one

The complete code:
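One way to make this process harder to get wrong is to bundle the vectorizer and the model with scikit-learn's make_pipeline, so that fit only ever sees the training text. This is an optional alternative rather than what the code in this post does, and the messages below are invented for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy messages (assumption: stand-ins for the SMS data)
texts = ["free prize call now", "win cash today", "claim your free gift",
         "urgent prize waiting", "cash reward call now", "free entry win big",
         "see you at lunch", "are we meeting today", "call me later",
         "running late sorry", "thanks for dinner", "home in ten minutes"]
labels = ["spam"] * 6 + ["ham"] * 6

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

pipe = make_pipeline(TfidfVectorizer(),
                     RandomForestClassifier(n_estimators=20, random_state=0))
pipe.fit(X_tr, y_tr)              # the vectorizer is fit on X_tr only
test_preds = pipe.predict(X_te)   # X_te is transformed, then classified
print(list(zip(X_te, test_preds)))
```

The same pipeline object can be passed to cross_val_score or GridSearchCV, in which case the vectorizer is refit inside every fold.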

import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    # Percentage of non-space characters that are punctuation
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    # Strip punctuation, lowercase, tokenize, drop stopwords, then stem
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)

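# Fit the vectorizer on the training text only, then transform both sets
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), pd.DataFrame(tfidf_test.toarray())], axis=1)

# Recent scikit-learn versions reject mixed str/int column names, so make them all strings
X_train_vect.columns = X_train_vect.columns.astype(str)
X_test_vect.columns = X_test_vect.columns.astype(str)

X_train_vect.head()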

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

# RF model
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

# GB model
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))
