ML Classifier Module

Continuing with the example from the previous post. The problem here is spam classification, a supervised learning task.

1. Random Forest + KFold

import nltk
import pandas as pd
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

# Percentage of punctuation characters, ignoring spaces
def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

# Lowercase, strip punctuation, tokenize, drop stopwords, and stem
def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

# Combine the engineered features with the TF-IDF matrix
X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()

Next, build the model and score it with 10-fold cross-validation.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestClassifier(n_jobs=-1)  # build the trees in parallel
kfold = KFold(n_splits=10)
cross_val_score(rf, X_features, data["label"], cv=kfold, scoring="accuracy", n_jobs=-1)
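
cross_val_score returns an array with one accuracy per fold; a small follow-up to summarize it (the variable name scores is my own, not from the original post):

scores = cross_val_score(rf, X_features, data["label"], cv=kfold, scoring="accuracy", n_jobs=-1)
print(scores)                        # one accuracy per fold
print(scores.mean(), scores.std())  # average and spread across the 10 folds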

2. Holdout Test Set Evaluation

from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X_features, data.label, test_size=0.2)

rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf_model = rf.fit(X_train, y_train)

# find the most important features with respect to the model
sorted(zip(rf_model.feature_importances_, X_train.columns), reverse=True)[:5]

y_pred = rf_model.predict(X_test)
precision, recall, fscore, support = score(y_test, y_pred, pos_label="spam", average="binary")
print('precision: {} / recall: {} / accuracy: {}'.format(precision, recall, (y_pred == y_test).sum()/len(y_pred)))

3. Grid Search + Model Evaluation

First, a simple grid search implemented by hand.

def train_RF(n_est, depth):
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    prec, recall, fscore, sup = score(y_test, y_pred, pos_label="spam", average="binary")
    print("Est: {} / Depth: {}\nprecision: {} / recall: {} / accuracy: {}".format(
        n_est, depth, prec, recall, (y_pred == y_test).sum()/len(y_pred)))

for n_est in [10, 20, 50]:
    for depth in range(10, 40, 10):
        train_RF(n_est, depth)

Then call scikit-learn's built-in GridSearchCV.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300], 'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs_fit = gs.fit(X_features, data["label"])  # X_features: the combined feature matrix from section 1
pd.DataFrame(gs_fit.cv_results_).sort_values("mean_test_score", ascending=False)[:5]

Running GridSearchCV on my machine raised a MemoryError. Other blog posts cover the fix in detail; the usual advice is to increase virtual memory, so the exact steps are not repeated here.
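
A likely contributor is the X_tfidf.toarray() call in section 1, which turns the sparse TF-IDF matrix dense. As an alternative to enlarging virtual memory, here is a minimal sketch (my own, not the original post's fix) that keeps the features sparse with scipy.sparse.hstack, reusing rf and param from above:

from scipy.sparse import csr_matrix, hstack

# Stack the two engineered columns next to the sparse TF-IDF matrix
# without calling toarray(); memory stays proportional to the non-zeros.
X_sparse = hstack([csr_matrix(data[['body_len', 'punct%']].values), X_tfidf])

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs_fit = gs.fit(X_sparse, data["label"])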

4. Gradient Boosting

Definition: an ensemble learning method that takes an iterative approach to combining weak learners into a strong learner by focusing on the mistakes of prior iterations; it is decision-tree based.
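
To make "focusing on the mistakes of prior iterations" concrete, here is a toy sketch of boosting for regression with squared loss (my own illustration, not code from the original post): each new tree is fit to the residuals of the ensemble so far.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_estimators=50, learning_rate=0.1):
    y = np.asarray(y, dtype=float)
    pred = np.full(len(y), y.mean())    # start from a constant prediction
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred            # the mistakes of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # nudge predictions toward the targets
        trees.append(tree)
    return trees, pred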

Differences from RF:

RF:

  1. Bagging, so training can be done in parallel.
  2. Unweighted voting for final prediction.
  3. Easier to tune, harder to overfit.

Gradient Boosting:

  1. Boosting, so training must be done iteratively.
  2. Weighted voting for final prediction.
  3. Harder to tune, easier to overfit.

Tradeoffs of GB:

Pros:

  1. powerful
  2. accepts various types of inputs
  3. can be used for classification or regression
  4. outputs feature importance

Cons:

  1. longer to train
  2. more likely to overfit
  3. more difficult to properly tune

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

gb = GradientBoostingClassifier()
param = {"n_estimators": [100, 150], "max_depth": [7, 11, 15], "learning_rate": [0.1]}

gs = GridSearchCV(gb, param, cv=5, n_jobs=-1)
cv_fit = gs.fit(X_features, data.label)  # X_features: the combined feature matrix from section 1
pd.DataFrame(cv_fit.cv_results_).sort_values("mean_test_score", ascending=False)[:5]

5. Pipeline Summary

  • read in raw text
  • clean text and tokenize
  • feature engineering
  • fit simple model
  • tune hyperparameters and evaluate model
  • final model selection

Vectorizers should be fit on the training set and only be used to transform the test set.

Process:

split data into training and test set -> train vectorizers on the training set and use them to transform the test set -> fit the best RF and GB models on the training set and predict on the test set -> evaluate the results of the two models to select the best one
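
This flow maps directly onto sklearn's Pipeline, which refits the vectorizer on whatever data it is trained on, so the train/test separation holds automatically. A minimal sketch on the body text alone (it drops the two engineered features for brevity, so it is an illustration rather than a drop-in replacement; clean_text and the splits are the ones from the full code below):

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer=clean_text)),       # fit on training data only
    ('rf', RandomForestClassifier(n_estimators=150, n_jobs=-1)),
])
pipe.fit(X_train['body_text'], y_train)
print(pipe.score(X_test['body_text'], y_test))             # accuracy on the holdout set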

The full code:

import nltk
import pandas as pd
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)

# Fit the vectorizer on the training set only, then transform both sets
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])
tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True),
                          pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True),
                         pd.DataFrame(tfidf_test.toarray())], axis=1)
X_train_vect.head()

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

# RF model
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3),
    round((y_pred == y_test).sum()/len(y_pred), 3)))

# GB model
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3),
    round((y_pred == y_test).sum()/len(y_pred), 3)))
