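A Beginner's Exploration of NLP (Part 2)
ML Classifier Module
This post continues with the example from the previous one. The task is spam classification, a supervised learning problem.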
1. Random Forest + KFold
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    # Percentage of non-space characters that are punctuation
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    # Strip punctuation, lowercase, tokenize, drop stopwords, then stem
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids an invalid-escape warning
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
# Recent scikit-learn versions reject mixed str/int column names, so make them all strings
X_features.columns = X_features.columns.astype(str)
X_features.head()
Next, build the model.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestClassifier(n_jobs=-1)  # build the trees in parallel
kfold = KFold(n_splits=10)
cross_val_score(rf, X_features, data["label"], cv=kfold, scoring="accuracy", n_jobs=-1)
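To make clearer what `KFold(n_splits=10)` hands to `cross_val_score`, here is a minimal pure-Python sketch of unshuffled k-fold splitting. The helper name `kfold_indices` is hypothetical and is not part of scikit-learn; it only mimics the default behavior of contiguous folds, where the first `n_samples % n_splits` folds get one extra sample.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample,
        # so fold sizes differ by at most one.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in exactly one test fold, so every message is scored once by a model that never saw it. Because the folds are contiguous, a file sorted by label would give unbalanced folds; passing `shuffle=True` to `KFold` is one way to guard against that.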
2. Holdout Test Set Evaluation
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

xTrain, xTest, yTrain, yTest = train_test_split(X_features, data.label, test_size=0.2)

rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf_model = rf.fit(xTrain, yTrain)

# Find the most important features with respect to the model.
# Sorting on the importance only avoids comparing str and int column names on ties.
sorted(zip(rf_model.feature_importances_, xTrain.columns), key=lambda pair: pair[0], reverse=True)[:5]

y_pred = rf_model.predict(xTest)
precision, recall, fscore, support = score(yTest, y_pred, pos_label="spam", average="binary")

print('precision: {} / recall: {} / accuracy: {}'.format(precision, recall, (y_pred==yTest).sum()/len(y_pred)))
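The three numbers printed above can be reproduced by hand, which helps when reading them. The sketch below is a pure-Python illustration with made-up labels; `binary_metrics` is a hypothetical helper, not a scikit-learn function, and it treats "spam" as the positive class in the same way as `pos_label="spam"`.

```python
def binary_metrics(y_true, y_pred, pos_label="spam"):
    """Return (precision, recall, accuracy) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    correct = sum(1 for t, p in pairs if t == p)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, accuracy


# Made-up labels for illustration only.
y_true = ["spam", "ham", "ham", "spam", "ham", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "ham", "spam", "spam", "ham"]

precision, recall, accuracy = binary_metrics(y_true, y_pred)
print("precision: {:.3f} / recall: {:.3f} / accuracy: {:.3f}".format(precision, recall, accuracy))
```

Precision answers "of the messages flagged as spam, how many really were spam", while recall answers "of the real spam, how much was caught". On an imbalanced dataset accuracy alone can look good even when recall is poor.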
3. Grid Search + Model Evaluation
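A simple grid search implemented by hand.
# X_train, X_test, y_train and y_test are not defined earlier in this post;
# this assumes the same kind of 80/20 split as in section 2.
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)

def train_RF(n_est, depth):
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    prec, recall, fscore, sup = score(y_test, y_pred, pos_label="spam", average="binary")
    print("Est:{}/Depth:{}\nprecision:{}/recall:{}/accur:{}".format(n_est, depth, prec, recall, (y_pred==y_test).sum()/len(y_pred)))

for n_est in [10, 20, 50]:
    for depth in range(10, 40, 10):
        train_RF(n_est, depth)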
Using scikit-learn's built-in method.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300], 'max_depth': [30, 60, 90, None]}
gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
# The original passed X_tfidf_feat, which is not defined in this post;
# X_features built in section 1 is assumed to be the intended matrix.
gs_fit = gs.fit(X_features, data["label"])
pd.DataFrame(gs_fit.cv_results_).sort_values("mean_test_score", ascending=False)[:5]
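To get a feel for how much work this grid implies, the sketch below expands the same `param` dictionary into its individual candidates with `itertools.product`. The helper `expand_grid` is hypothetical and only illustrates the enumeration; it does not fit any model.

```python
from itertools import product


def expand_grid(param_grid):
    """Return one dict per combination of the listed parameter values."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


param = {'n_estimators': [10, 150, 300], 'max_depth': [30, 60, 90, None]}
cv = 5

candidates = expand_grid(param)
total_fits = len(candidates) * cv  # one fit per candidate per fold

print("candidates:", len(candidates))
print("cross-validation fits:", total_fits)
for candidate in candidates[:3]:
    print(candidate)
```

With 3 values for `n_estimators` and 4 for `max_depth` there are 12 candidates, hence 60 fits at `cv=5`, plus one final refit on the full data with the default `refit=True`. Adding another value to either list grows that count multiplicatively, which is worth keeping in mind before widening the grid.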
Running GridSearchCV on my machine raised a MemoryError. Other blogs cover the fix and suggest increasing virtual memory; the details are not repeated here.
4. Gradient Boost
Definition: An ensemble learning method that iteratively combines weak learners into a strong learner by focusing on the mistakes of prior iterations. It is based on decision trees.
Differences from RF:
RF:
- Bagging, so training can be done in parallel.
- Unweighted voting for final prediction.
- Easier to tune, harder to overfit.
Gradient Boosting:
- Boosting, so training must be done iteratively.
- Weighted voting for final prediction.
- Harder to tune, easier to overfit.
Tradeoffs of GB:
Pros:
- powerful
- accepts various types of inputs
- can be used for classification or regression
- outputs feature importance
Cons:
- longer to train
- more likely to overfit
- more difficult to properly tune
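The phrase "focusing on mistakes of prior iterations" can be made concrete with a toy example. The sketch below is a deliberately simplified assumption: it does regression with squared loss on a single feature, using one-split "stumps" as weak learners, so each new learner is fit to the residuals of the current ensemble. `GradientBoostingClassifier` uses a classification loss and full trees, so this illustrates the idea only and is not how scikit-learn implements it. All function names are hypothetical.

```python
def fit_stump(xs, targets):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep at least one point on each side
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=20, learning_rate=0.1):
    """Fit stumps sequentially, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]  # the current mistakes
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Each learner contributes only a fraction (learning_rate) of its correction.
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]

model = boost(xs, ys)
mse_before = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_after = mse(ys, [predict(model, x) for x in xs])
print("MSE with mean only:", mse_before)
print("MSE after boosting:", mse_after)
```

Because each stump depends on the residuals left by all earlier ones, the loop cannot be parallelized across learners the way bagging can, which matches the "must be done iteratively" point above. A smaller `learning_rate` needs more estimators to reach the same fit, which is one reason the two are usually tuned together.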
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

gb = GradientBoostingClassifier()
param = {
    "n_estimators": [100, 150],
    "max_depth": [7, 11, 15],
    "learning_rate": [0.1]
}
gs = GridSearchCV(gb, param, cv=5, n_jobs=-1)
# The original passed X_tfidf_feat, which is not defined in this post;
# X_features built in section 1 is assumed to be the intended matrix.
cv_fit = gs.fit(X_features, data.label)
pd.DataFrame(cv_fit.cv_results_).sort_values("mean_test_score", ascending=False)[:5]
5. Pipeline Summary
- read in raw text
- clean text and tokenize
- feature engineering
- fit simple model
- tune hyperparameters and evaluate model
- final model selection
Vectorizers should be fit on the training set and only be used to transform the test set.
Process:
split data into training and test sets -> fit vectorizers on the training set and use them to transform the test set -> fit the best RF and GB models on the training set and predict on the test set -> evaluate the results of the two models to select the best one
The complete code:
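The rule that a vectorizer is fit on the training set only can be seen with a toy bag-of-words vectorizer. The class below is a hypothetical stand-in written for illustration, not scikit-learn's `CountVectorizer` or `TfidfVectorizer`; it only mirrors the `fit` / `transform` split, and the sample messages are made up.

```python
class ToyCountVectorizer:
    """Minimal bag-of-words vectorizer with a fit/transform split."""

    def fit(self, documents):
        # The vocabulary is learned from the documents passed here and nowhere else.
        tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
        self.vocabulary_ = {tok: idx for idx, tok in enumerate(tokens)}
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for tok in doc.lower().split():
                idx = self.vocabulary_.get(tok)
                if idx is not None:  # words unseen during fit are dropped
                    row[idx] += 1
            rows.append(row)
        return rows


train_docs = ["free prize now", "see you at lunch"]
test_docs = ["free lunch tomorrow"]

vect = ToyCountVectorizer().fit(train_docs)  # fit on training data only
train_matrix = vect.transform(train_docs)
test_matrix = vect.transform(test_docs)      # reuse the fitted vocabulary

print(sorted(vect.vocabulary_))
print(test_matrix)
```

The test row has the same number of columns as the training rows, and "tomorrow" is ignored because it was never seen during `fit`. Fitting on the full dataset instead would let information about the test messages leak into the features, making the evaluation look better than it would on genuinely new data.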
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    # Percentage of non-space characters that are punctuation
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    # Strip punctuation, lowercase, tokenize, drop stopwords, then stem
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids an invalid-escape warning
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)

# Fit the vectorizer on the training set only, then transform both sets
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), pd.DataFrame(tfidf_test.toarray())], axis=1)

# Recent scikit-learn versions reject mixed str/int column names, so make them all strings
X_train_vect.columns = X_train_vect.columns.astype(str)
X_test_vect.columns = X_test_vect.columns.astype(str)

X_train_vect.head()

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

# RF model
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

# GB model
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))