Contents

  • Preface
  • 1. Background Knowledge
    • 1-1. Competition Description
  • 2. Model Overview
    • 2-1. Single Model 1 (TF-IDF + Naive Bayes)
    • 2-2. Single Model 2 (TF-IDF + Ridge Regression)
  • Summary

Preface

Chasing the most cutting-edge techniques is every NLPer's basic duty!


1. Background Knowledge

Jigsaw Rate Severity of Toxic Comments (original competition page).

1-1. Competition Description

# requirement: the test set is a collection of comments; the competition asks you to rank them by toxicity severity, so more toxic comments should receive higher scores.
# train: for the training data, taking single model 1 as an example, we reuse the training set of the earlier jigsaw-toxic-comment-classification-challenge,
# train: split it into toxic and non-toxic comments, clean the text, and train a classifier on the labelled data.
# validation: the validation set is cleaned and vectorized in the same way, and the trained model then predicts a toxicity probability for each comment.
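The evaluation implied above is pairwise: validation_data.csv contains (less_toxic, more_toxic) comment pairs, and a model is judged by how often it gives the higher score to the more toxic one. A minimal sketch of that metric, with scores invented purely for illustration:

```python
import numpy as np

# Hypothetical toxicity scores from some model for four validation pairs
less_toxic_scores = np.array([0.10, 0.40, 0.35, 0.80])
more_toxic_scores = np.array([0.90, 0.30, 0.70, 0.95])

# Pairwise ranking accuracy: fraction of pairs where the "more toxic"
# comment received the higher score
val_acc = (less_toxic_scores < more_toxic_scores).mean()
print(val_acc)  # 3 of 4 pairs ranked correctly -> 0.75
```

Only the relative order of the scores matters, which is why a classifier's predicted probability can serve directly as the ranking score.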

2. Model Overview

TF-IDF + Naive Bayes.

2-1. Single Model 1 (TF-IDF + Naive Bayes)

import re

import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# Read the training set of the earlier Toxic Comment Classification Challenge
df = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv")

# Binary label: 1 if any of the six toxicity columns is set
df['y'] = (df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].sum(axis=1) > 0).astype(int)

# Keep only the text and the label, renaming the text column
df = df[['comment_text', 'y']].rename(columns={'comment_text': 'text'})

# The classes are imbalanced, so undersample the non-toxic comments
min_len = (df['y'] == 1).sum()
df_y0_undersample = df[df['y'] == 0].sample(n=min_len, random_state=201)
df = pd.concat([df[df['y'] == 1], df_y0_undersample])

# SnowballStemmer: stemming; WordNetLemmatizer: lemmatization
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# Load the English stop-word list
all_stopwords = set(stopwords.words('english'))

def clean(comment):
    """Clean a comment:
    1. Replace every non-letter character with a space.
    2. Lowercase everything.
    3. Split the sentence into words.
    4. Drop stop words and stem the remaining words.
    5. Lemmatize and join the processed words back together.
    """
    comment = re.sub('[^a-zA-Z]', ' ', comment)
    comment = comment.lower()
    comment = comment.split()
    comment = [stemmer.stem(word) for word in comment if word not in all_stopwords]
    comment = [lemmatizer.lemmatize(word) for word in comment]
    return ' '.join(comment)

# Clean every comment
df['text'] = df['text'].apply(clean)

# Vectorize the cleaned text with TF-IDF
vec = TfidfVectorizer()
X = vec.fit_transform(df['text'])

# Train a multinomial naive Bayes classifier on the labelled data
model = MultinomialNB()
model.fit(X, df['y'])

# Read the competition's validation set
df_val = pd.read_csv("../input/jigsaw-toxic-severity-rating/validation_data.csv")

# Clean and transform the less-toxic and more-toxic comments with the trained vectorizer
X_less_toxic = vec.transform(df_val['less_toxic'].apply(clean))
X_more_toxic = vec.transform(df_val['more_toxic'].apply(clean))

# Use the validation set to judge the model.
# predict_proba returns an n x k array: row i, column j is the predicted
# probability that sample i belongs to class j. For binary classification
# there are two columns: column 0 is P(y=0) and column 1 is P(y=1).
p1 = model.predict_proba(X_less_toxic)
p2 = model.predict_proba(X_more_toxic)
(p1[:, 1] < p2[:, 1]).mean()
# 0.6675634382888269
# i.e. a pairwise validation accuracy of about 0.67

# Read and score the comments to be submitted
df_sub = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")
X_test = vec.transform(df_sub['text'])
p3 = model.predict_proba(X_test)

# The predicted probability of being toxic is the submission score
df_sub['score'] = p3[:, 1]
# How many distinct scores are there?
# df_sub['score'].nunique()

# Submit
df_sub[['comment_id', 'score']].to_csv("submission.csv", index=False)
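The core of the pipeline above, stripped of the cleaning and file I/O, fits in a few lines. Here is a self-contained sketch on a tiny invented corpus (the comments and labels are made up for the example, not taken from the competition data):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus: 1 = toxic, 0 = non-toxic
train = pd.DataFrame({
    "text": ["you are an idiot", "idiot idiot shut up",
             "have a nice day", "thanks for the help"],
    "y": [1, 1, 0, 0],
})

# TF-IDF features over the raw text
vec = TfidfVectorizer()
X = vec.fit_transform(train["text"])

# Multinomial naive Bayes on the labelled data
model = MultinomialNB()
model.fit(X, train["y"])

# predict_proba: column 0 is P(y=0), column 1 is P(y=1);
# column 1 serves directly as a toxicity score for ranking
probs = model.predict_proba(vec.transform(["what an idiot", "nice day today"]))
print(probs[:, 1])  # the first comment scores higher than the second
```

Because the competition only scores the ranking, no probability calibration is needed; any monotonic score works.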

TF-IDF + Ridge Regression.

2-2. Single Model 2 (TF-IDF + Ridge Regression)

%%time
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

def ridge_cv(vec, X, y, X_test, folds, stratified):
    # Note: df_val is read from the enclosing scope below.
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=12)
    val_scores = []
    rmse_scores = []
    X_less_toxics = []
    X_more_toxics = []
    preds = []
    for fold, (train_index, val_index) in enumerate(kf.split(X, stratified)):
        X_train, y_train = X[train_index], y[train_index]
        X_val, y_val = X[val_index], y[val_index]
        model = Ridge()
        model.fit(X_train, y_train)
        rmse_score = mean_squared_error(model.predict(X_val), y_val, squared=False)
        rmse_scores.append(rmse_score)
        X_less_toxic = vec.transform(df_val['less_toxic'])
        X_more_toxic = vec.transform(df_val['more_toxic'])
        p1 = model.predict(X_less_toxic)
        p2 = model.predict(X_more_toxic)
        X_less_toxics.append(p1)
        X_more_toxics.append(p2)
        # Validation accuracy on the (less_toxic, more_toxic) pairs
        val_acc = (p1 < p2).mean()
        val_scores.append(val_acc)
        pred = model.predict(X_test)
        preds.append(pred)
        print(f"FOLD:{fold}, rmse_fold:{rmse_score:.5f}, val_acc:{val_acc:.5f}")
    mean_val_acc = np.mean(val_scores)
    mean_rmse_score = np.mean(rmse_scores)
    # Average the per-fold predictions (out-of-fold ensemble)
    p1 = np.mean(np.vstack(X_less_toxics), axis=0)
    p2 = np.mean(np.vstack(X_more_toxics), axis=0)
    val_acc = (p1 < p2).mean()
    print(f"OOF: val_acc:{val_acc:.5f}, mean val_acc:{mean_val_acc:.5f}, mean rmse_score:{mean_rmse_score:.5f}")
    preds = np.mean(np.vstack(preds), axis=0)
    return p1, p2, preds

# Per-label weights for building the regression target
toxic = 1.0
severe_toxic = 2.0
obscene = 1.0
threat = 1.0
insult = 1.0
identity_hate = 2.0

def create_train(df):
    df['y'] = df[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].max(axis=1)
    df['y'] = df["y"] + df['severe_toxic'] * severe_toxic
    df['y'] = df["y"] + df['obscene'] * obscene
    df['y'] = df["y"] + df['threat'] * threat
    df['y'] = df["y"] + df['insult'] * insult
    df['y'] = df["y"] + df['identity_hate'] * identity_hate
    df = df[['comment_text', 'y', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].rename(columns={'comment_text': 'text'})
    # Undersample non-toxic comments from the Toxic Comment Classification Challenge
    min_len = (df['y'] >= 1).sum()
    df_y0_undersample = df[df['y'] == 0].sample(n=int(min_len * 1.5), random_state=201)
    df = pd.concat([df[df['y'] >= 1], df_y0_undersample])
    return df

df_val = pd.read_csv("../input/jigsaw-toxic-severity-rating/validation_data.csv")
df_test = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")

# Dataset 1: Toxic Comment Classification Challenge
jc_train_df = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv")
print(f"jc_train_df:{jc_train_df.shape}")
jc_train_df = create_train(jc_train_df)
df = jc_train_df
print(df['y'].value_counts())

FOLDS = 7
vec = TfidfVectorizer(analyzer='char_wb', max_df=0.5, min_df=3, ngram_range=(4, 6))
X = vec.fit_transform(df['text'])
y = df["y"].values
X_test = vec.transform(df_test['text'])
stratified = np.around(y)
jc_p1, jc_p2, jc_preds = ridge_cv(vec, X, y, X_test, FOLDS, stratified)

# Dataset 2: Jigsaw Unintended Bias in Toxicity Classification
juc_train_df = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
print(f"juc_train_df:{juc_train_df.shape}")
juc_train_df = juc_train_df.query("toxicity_annotator_count > 5")
print(f"juc_train_df:{juc_train_df.shape}")
juc_train_df['y'] = juc_train_df[['severe_toxicity', 'obscene', 'sexual_explicit', 'identity_attack', 'insult', 'threat']].sum(axis=1)
juc_train_df['y'] = juc_train_df.apply(lambda row: row["target"] if row["target"] <= 0.5 else row["y"], axis=1)
juc_train_df = juc_train_df[['comment_text', 'y']].rename(columns={'comment_text': 'text'})
min_len = (juc_train_df['y'] > 0.5).sum()
df_y0_undersample = juc_train_df[juc_train_df['y'] <= 0.5].sample(n=int(min_len * 1.5), random_state=201)
juc_train_df = pd.concat([juc_train_df[juc_train_df['y'] > 0.5], df_y0_undersample])
df = juc_train_df
print(df['y'].value_counts())

FOLDS = 7
vec = TfidfVectorizer(analyzer='char_wb', max_df=0.5, min_df=3, ngram_range=(4, 6))
X = vec.fit_transform(df['text'])
y = df["y"].values
X_test = vec.transform(df_test['text'])
stratified = (np.around(y, decimals=1) * 10).astype(int)
juc_p1, juc_p2, juc_preds = ridge_cv(vec, X, y, X_test, FOLDS, stratified)

# Dataset 3: Ruddit
rud_df = pd.read_csv("../input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv")
print(f"rud_df:{rud_df.shape}")
rud_df['y'] = rud_df['offensiveness_score'].map(lambda x: 0.0 if x <= 0 else x)
rud_df = rud_df[['txt', 'y']].rename(columns={'txt': 'text'})
print(rud_df['y'].value_counts())

FOLDS = 7
df = rud_df
vec = TfidfVectorizer(analyzer='char_wb', max_df=0.5, min_df=3, ngram_range=(4, 6))
X = vec.fit_transform(df['text'])
y = df["y"].values
X_test = vec.transform(df_test['text'])
stratified = (np.around(y, decimals=1) * 10).astype(int)
rud_p1, rud_p2, rud_preds = ridge_cv(vec, X, y, X_test, FOLDS, stratified)

# Normalize each dataset's predictions by its own maximum, then sum to ensemble
jc_max = max(jc_p1.max(), jc_p2.max())
juc_max = max(juc_p1.max(), juc_p2.max())
rud_max = max(rud_p1.max(), rud_p2.max())
p1 = jc_p1 / jc_max + juc_p1 / juc_max + rud_p1 / rud_max
p2 = jc_p2 / jc_max + juc_p2 / juc_max + rud_p2 / rud_max
val_acc = (p1 < p2).mean()
print(f"Ensemble: val_acc:{val_acc:.5f}")
preds2 = jc_preds / jc_max + juc_preds / juc_max + rud_preds / rud_max
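The final step above sums predictions from three Ridge models trained on targets with different scales, so each is first divided by its own maximum; otherwise the model with the largest target range would dominate the combined score. A toy sketch of that max-normalisation, with prediction arrays invented for the example:

```python
import numpy as np

# Invented predictions from three models trained on differently scaled targets
jc_preds = np.array([1.0, 4.0, 2.0])    # weighted-label target, roughly 0-8
juc_preds = np.array([0.1, 0.5, 0.2])   # probability-like target, roughly 0-1
rud_preds = np.array([0.2, 0.9, 0.1])   # offensiveness score, roughly 0-1

# Dividing by each model's maximum puts all scores on a comparable 0-1 scale
ensemble = (jc_preds / jc_preds.max()
            + juc_preds / juc_preds.max()
            + rud_preds / rud_preds.max())
print(ensemble)

# The competition scores rankings, not absolute values, so any per-model
# monotonic rescaling is acceptable
print(ensemble.argsort())  # ascending toxicity order: [0 2 1]
```

An alternative would be rank-averaging via scipy.stats.rankdata (imported but unused in the notebook above), which makes the per-model scales irrelevant entirely.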



Summary
