ML Classifier Module


1. Random Forest + KFold

import nltk
import pandas as pd
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    # Percentage of non-space characters that are punctuation
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    # Strip punctuation, lowercase, tokenize, drop stopwords, then stem
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
# Recent scikit-learn versions reject mixed str/int column names, so make them all strings
X_features.columns = X_features.columns.astype(str)


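from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestClassifier(n_jobs=-1)  # build the trees in parallel on all cores
kfold = KFold(n_splits=10)
# One score per fold (accuracy by default for classifiers)
cross_val_score(rf, X_features, data['label'], cv=kfold)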
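To see what 10-fold cross-validation returns without loading the SMS data, here is a minimal sketch on synthetic data. The names `X_demo`, `y_demo`, `rf_demo` and `kfold_demo` are made up for this illustration, and `make_classification` only stands in for the real feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_features / data['label'] (an assumption, not the SMS data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=0)

# Each of the 10 folds is held out once while the model trains on the other 9
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=kfold_demo, scoring="accuracy")
print(scores.round(3), "mean:", scores.mean().round(3))
```

The mean of the per-fold scores is usually a steadier estimate than a single train/test split, at the cost of fitting the model ten times.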

2. Holdout Test Set Evaluation

from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

xTrain, xTest, yTrain, yTest = train_test_split(X_features, data.label, test_size=0.2)

rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf_model = rf.fit(xTrain, yTrain)

# find the five most important features according to the model
sorted(zip(rf_model.feature_importances_, xTrain.columns), reverse=True)[:5]

y_pred = rf_model.predict(xTest)
precision, recall, fscore, support = score(yTest, y_pred, pos_label="spam", average="binary")

print('precision: {} / recall: {} / accuracy: {}'.format(precision, recall, (y_pred==yTest).sum()/len(y_pred)))
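As a sanity check on what these metrics mean, the sketch below recomputes precision and recall by hand on six made-up labels and compares them with scikit-learn's values. The lists `y_true` and `y_hat` are invented for illustration only.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, invented for this example
y_true = ["spam", "ham", "spam", "ham", "ham", "spam"]
y_hat = ["spam", "ham", "ham", "ham", "spam", "spam"]

precision, recall, fscore, support = precision_recall_fscore_support(
    y_true, y_hat, pos_label="spam", average="binary")

pairs = list(zip(y_true, y_hat))
tp = sum(t == "spam" and p == "spam" for t, p in pairs)  # spam correctly flagged
fp = sum(t == "ham" and p == "spam" for t, p in pairs)   # ham wrongly flagged as spam
fn = sum(t == "spam" and p == "ham" for t, p in pairs)   # spam that slipped through

manual_precision = tp / (tp + fp)  # of everything flagged as spam, how much really was spam
manual_recall = tp / (tp + fn)     # of all real spam, how much was caught
accuracy = sum(t == p for t, p in pairs) / len(pairs)

print(precision, manual_precision, recall, manual_recall, accuracy)
```

Note that with `average="binary"` the fourth return value (`support`) is `None`.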

3. Grid Search + Model Evaluation


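# Assumes the same 80/20 split as in section 2, under the variable names used below
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)

def train_RF(n_est, depth):
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    prec, recall, fscore, sup = score(y_test, y_pred, pos_label="spam", average="binary")
    print("Est:{}/Depth:{}\nprecision:{}/recall:{}/accur:{}".format(n_est, depth, prec, recall, (y_pred==y_test).sum()/len(y_pred)))

# Manual grid search: try every combination of tree count and depth
for n_est in [10,20,50]:
    for depth in range(10,40,10):
        train_RF(n_est, depth)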


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param = {'n_estimators':[10,150,300],'max_depth':[30,60,90,None]}
gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
# feature matrix assumed to be X_features from section 1
gs_fit = gs.fit(X_features, data["label"])
pd.DataFrame(gs_fit.cv_results_).sort_values("mean_test_score", ascending=False)[:5]

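Running GridSearchCV on my local machine raised a MemoryError. Other blogs explain how to fix it; the usual advice is to increase virtual memory, so I won't repeat the steps here.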
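One thing that may also lower memory use is keeping the TF-IDF matrix sparse rather than calling `.toarray()`, since scikit-learn's tree ensembles accept SciPy sparse input. This is only a sketch on four made-up messages, and I have not verified that it resolves the error above. The names `texts`, `labels` and `X_sparse` are made up for this illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy messages, invented for this example
texts = ["free prize call now", "see you at lunch", "win cash now", "are you home yet"]
labels = ["spam", "ham", "spam", "ham"]

# Hand-made numeric feature (message length without spaces), one column
body_len = np.array([[len(t) - t.count(" ")] for t in texts])

tfidf = TfidfVectorizer().fit_transform(texts)  # stays a SciPy sparse matrix

# Stack the numeric column next to the TF-IDF columns without densifying
X_sparse = hstack([csr_matrix(body_len), tfidf]).tocsr()

rf_sparse = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_sparse, labels)
print(X_sparse.shape, "stored non-zeros:", X_sparse.nnz)
```

The trade-off is that the features are no longer a DataFrame, so column names have to be tracked separately (for example via the vectorizer's vocabulary).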

4. Gradient Boost

Definition: an ensemble learning method that iteratively combines weak learners into a strong learner, with each iteration focusing on the mistakes of the previous ones. It is based on decision trees.

Random Forest:

  1. Bagging, so training can be done in parallel.
  2. Unweighted voting for final prediction.
  3. Easier to tune, harder to overfit.

Gradient Boosting:

  1. Boosting, so training must be done iteratively.
  2. Weighted voting for final prediction.
  3. Harder to tune, easier to overfit.
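The contrast between the two lists can be sketched on synthetic data: the random forest builds its trees independently (hence `n_jobs`), while gradient boosting adds trees one after another, which `staged_predict` makes visible. The data and the names `X_tr`, `X_te`, `staged_acc` are assumptions for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, standing in for the SMS features
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Bagging: trees do not depend on each other, so they can be built in parallel
rf_demo = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0).fit(X_tr, y_tr)

# Boosting: each new tree is fit to the errors of the ensemble built so far,
# so there is no n_jobs option here and the trees are built one by one
gb_demo = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Test accuracy after each boosting stage (1 tree, 2 trees, ..., 50 trees)
staged_acc = [accuracy_score(y_te, y_stage) for y_stage in gb_demo.staged_predict(X_te)]

print("RF accuracy:", accuracy_score(y_te, rf_demo.predict(X_te)))
print("GB accuracy after 1 / 10 / 50 trees:", staged_acc[0], staged_acc[9], staged_acc[-1])
```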

Tradeoffs of GB:

Pros:

  1. powerful
  2. accepts various types of inputs
  3. can be used for classification or regression
  4. outputs feature importance

Cons:

  1. longer to train
  2. more likely to overfit
  3. more difficult to properly tune

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

gb = GradientBoostingClassifier()
param = {"n_estimators":[100, 150],"max_depth":[7, 11, 15],"learning_rate":[0.1]}
gs = GridSearchCV(gb, param, cv = 5, n_jobs=-1)
# feature matrix assumed to be X_features from section 1
cv_fit = gs.fit(X_features, data.label)
pd.DataFrame(cv_fit.cv_results_).sort_values("mean_test_score", ascending=False)[:5]

5. Pipeline Summary

  • read in raw text
  • clean text and tokenize
  • feature engineering
  • fit simple model
  • tune hyperparameters and evaluate model
  • final model selection

Vectorizers should be fit on the training set and only be used to transform the test set.
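A small sketch of why this matters: a vectorizer fit only on the training texts keeps a fixed vocabulary, so words that appear only in the test set are dropped at transform time instead of leaking into the features. The example sentences are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts, invented for this example
train_texts = ["free prize now", "lunch at noon"]
test_texts = ["free lunch tomorrow"]

# Learn vocabulary and IDF weights from the training texts only
vect = TfidfVectorizer().fit(train_texts)

X_train_demo = vect.transform(train_texts)
X_test_demo = vect.transform(test_texts)  # reuses the training vocabulary

print(sorted(vect.vocabulary_))                  # words learned from training
print(X_train_demo.shape, X_test_demo.shape)     # same number of columns
print("tomorrow" in vect.vocabulary_)            # unseen test word is ignored
```

Because both matrices share the same columns, a model trained on `X_train_demo` can predict on `X_test_demo` directly.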


Split data into training and test sets -> fit the vectorizer on the training set and use it to transform the test set -> fit the best RF and GB models on the training set and predict on the test set -> evaluate the results of the two models to select the best one


import nltk
import pandas as pd
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    # Percentage of non-space characters that are punctuation
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    # Strip punctuation, lowercase, tokenize, drop stopwords, then stem
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)

# Fit the vectorizer on the training text only, then transform both sets with it
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), pd.DataFrame(tfidf_test.toarray())], axis=1)
# Recent scikit-learn versions reject mixed str/int column names, so make them all strings
X_train_vect.columns = X_train_vect.columns.astype(str)
X_test_vect.columns = X_test_vect.columns.astype(str)

X_train_vect.head()
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

# RF model
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

# GB model
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

