
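Case introduction

This case study comes from a research initiative launched by Google that uses machine learning to identify toxic comments in online conversations. Here, a "toxic comment" is anything rude, disrespectful, or otherwise likely to make someone leave a discussion. The case builds a classification model that identifies toxic comments while reducing unintended bias. For example, if a particular name frequently appears in toxic comments, some models may wrongly classify harmless comments that contain the same name as toxic.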

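Data description

In the dataset, the text of each comment is in the comment_text column. Every comment in the training set has a toxicity label (target), and the model is expected to predict target for the test set. The remaining toxicity and identity attributes are fractional values: each gives the proportion of annotators who judged that the attribute applies to the comment. For evaluation, test-set examples with target >= 0.5 are treated as the positive class (toxic).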
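The thresholding rule can be sketched as follows. This is a minimal illustration on a made-up toy DataFrame; the column name `is_toxic` is hypothetical, and the sketch assumes the competition's convention that a score of exactly 0.5 counts as toxic.

```python
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
toy = pd.DataFrame({
    "comment_text": ["thanks for sharing", "you are an idiot", "borderline remark"],
    "target": [0.0, 0.9, 0.5],
})

# Assumption: comments with target >= 0.5 form the positive (toxic) class.
toy["is_toxic"] = toy["target"] >= 0.5
print(toy[["target", "is_toxic"]])
```

The same comparison applied to train['target'] would give a binary label suitable for a classifier.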

Loading packages

import gc
import os
import warnings
import operator
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
from wordcloud import WordCloud, STOPWORDS
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import nltk
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim
from keras.preprocessing.text import Tokenizer
pyLDAvis.enable_notebook()
np.random.seed(2018)
warnings.filterwarnings('ignore')

Loading the data

JIGSAW_PATH = "../input/jigsaw-unintended-bias-in-toxicity-classification/"
train = pd.read_csv(os.path.join(JIGSAW_PATH,'train.csv'), index_col='id')
test = pd.read_csv(os.path.join(JIGSAW_PATH,'test.csv'), index_col='id')
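When the full file is too large for comfortable exploration, it may help to read only the columns a given step needs. The sketch below uses an in-memory CSV with made-up rows in place of the real train.csv; `usecols` and `dtype` are standard `pd.read_csv` parameters, while the particular column selection is only an assumption for illustration.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for train.csv (rows are made up).
csv_text = io.StringIO(
    "id,target,comment_text,funny\n"
    "1,0.0,hello there,0\n"
    "2,0.8,you fool,1\n"
)

# Read only the needed columns and store the score in a compact dtype.
subset = pd.read_csv(
    csv_text,
    index_col="id",
    usecols=["id", "target", "comment_text"],
    dtype={"target": "float32"},
)
print(subset.shape)
```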

Display the first 5 rows of train and test.

train.head(), test.head()

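Data exploration

The comment text is stored in the comment_text column. In addition, train contains columns that flag whether particular sensitive topics are mentioned in a comment. The topics fall into 5 categories: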

race or ethnicity: asian, black, jewish, latino, other_race_or_ethnicity, white
gender: female, male, transgender, other_gender
sexual orientation: bisexual, heterosexual, homosexual_gay_or_lesbian, other_sexual_orientation
religion: atheist, buddhist, christian, hindu, muslim, other_religion
disability: intellectual_or_learning_disability, other_disability, physical_disability, psychiatric_or_mental_illness
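The grouping above can be turned into a small lookup for per-category summaries. The sketch below is an illustration only: the toy values are made up, the helper name `mention_share` and the 0.5 threshold are hypothetical, and it assumes that a missing value means the comment was not annotated for identity.

```python
import numpy as np
import pandas as pd

# Identity columns grouped by category, as listed above (two groups shown).
identity_groups = {
    "gender": ["female", "male", "transgender", "other_gender"],
    "religion": ["atheist", "buddhist", "christian", "hindu", "muslim", "other_religion"],
}

# Toy frame: NaN is assumed to mean "not annotated for identity".
toy = pd.DataFrame({
    "female": [0.0, 1.0, np.nan, 0.5],
    "male": [0.0, 0.0, np.nan, 0.5],
    "muslim": [np.nan, 0.2, np.nan, 0.0],
})

def mention_share(df, columns, threshold=0.5):
    """Share of annotated rows where any listed column reaches the threshold."""
    present = [c for c in columns if c in df.columns]
    annotated = df[present].dropna(how="all")
    if annotated.empty:
        return float("nan")
    return float((annotated >= threshold).any(axis=1).mean())

for group, columns in identity_groups.items():
    print(group, mention_share(toy, columns))
```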

We also have several fields that identify a comment:

created_date
publication_id
parent_id
article_id

Several user-feedback fields associated with a comment:

rating
funny
wow
sad
likes
disagree
sexual_explicit

The dataset also has two annotation variables:

identity_annotator_count
toxicity_annotator_count

Target feature

Let's check the distribution of target values in the training set.

plt.figure(figsize=(12,6))
plt.title("Distribution of target in the train set")
sns.distplot(train['target'],kde=True,hist=False, bins=120, label='target')
plt.legend(); plt.show()
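A density plot can be complemented by simple counts per score band. The sketch below uses made-up scores in place of train['target']; the band edges are an arbitrary choice for illustration.

```python
import pandas as pd

# Toy target scores (made up); in the notebook this would be train['target'].
target = pd.Series([0.0, 0.0, 0.1, 0.2, 0.4, 0.5, 0.7, 1.0])

# Count comments per score band; include_lowest keeps 0.0 in the first band.
bands = pd.cut(target, bins=[0.0, 0.25, 0.5, 0.75, 1.0], include_lowest=True)
band_counts = bands.value_counts().sort_index()
print(band_counts)
```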

Let's plot the additional toxicity features in the same way to compare their distributions.

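def plot_features_distribution(features, title):
    # Overlay the kernel density estimate of each feature on a single figure.
    plt.figure(figsize=(12,6))
    plt.title(title)
    for feature in features:
        # Skip missing values so that only annotated comments contribute.
        sns.distplot(train.loc[~train[feature].isnull(),feature],kde=True,hist=False, bins=120, label=feature)
    plt.xlabel('')
    plt.legend()
    plt.show()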

features = ['severe_toxicity', 'obscene','identity_attack','insult','threat']
plot_features_distribution(features, "Distribution of additional toxicity features in the train set")

Sensitive topics

Now let's check the value distributions of the sensitive-topic features.

features = ['asian', 'black', 'jewish', 'latino', 'other_race_or_ethnicity', 'white']
plot_features_distribution(features, "Distribution of race and ethnicity features values in the train set")

features = ['female', 'male', 'transgender', 'other_gender']
plot_features_distribution(features, "Distribution of gender features values in the train set")

features = ['atheist','buddhist',  'christian', 'hindu', 'muslim', 'other_religion']
plot_features_distribution(features, "Distribution of religion features values in the train set")

features = ['intellectual_or_learning_disability', 'other_disability', 'physical_disability', 'psychiatric_or_mental_illness']
plot_features_distribution(features, "Distribution of disability features values in the train set")

Feedback information

Let's look at the value distributions of the feedback fields.

def plot_count(feature, title, size=1):
    # Bar chart of the 20 most frequent values, annotated with the percentage of all comments.
    f, ax = plt.subplots(1,1, figsize=(4*size,4))
    total = float(len(train))
    # Pass the column by keyword (x=), which recent seaborn versions require.
    g = sns.countplot(x=train[feature], order = train[feature].value_counts().index[:20], palette='Set3', ax=ax)
    g.set_title("Number and percentage of {}".format(title))
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,height + 3,'{:1.2f}%'.format(100*height/total),ha="center")
    plt.show()

plot_count('rating','rating')

plot_count('funny','funny votes given',3)

plot_count('wow','wow votes given',3)

plot_count('sad','sad votes given',3)

plot_count('likes','likes given',3)

plot_count('disagree','disagree given',3)
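The percentages annotated on the bars can also be obtained as a table. A minimal sketch, assuming made-up vote counts in place of a real column such as train['funny']:

```python
import pandas as pd

# Toy vote counts (made up); in the notebook this would be train['funny'].
funny = pd.Series([0, 0, 0, 1, 0, 2, 1, 0, 0, 0])

# Percentage of comments for each of the (up to) 20 most common values.
percentages = (funny.value_counts(normalize=True).head(20) * 100).round(2)
print(percentages)
```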

features = ['sexual_explicit']
plot_features_distribution(features, "Distribution of sexual explicit values in the train set")

Comment word clouds

Let's look at word clouds of the 50 most frequent words used in the comments.

stopwords = set(STOPWORDS)

def show_wordcloud(data, title = None):
    # Join the comments into one string; str(data) would only give the truncated printout of the Series.
    text = ' '.join(data.astype(str))
    wordcloud = WordCloud(background_color='white',stopwords=stopwords,max_words=50,max_font_size=40, scale=5,random_state=1).generate(text)
    fig = plt.figure(1, figsize=(10,10))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)
    plt.imshow(wordcloud)
    plt.show()

Let's look at the prevalent words in the training set.

show_wordcloud(train['comment_text'].sample(20000), title = 'Prevalent words in comments - train data')
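A word cloud sizes words by frequency, so the underlying ranking can be inspected directly. The sketch below is not the wordcloud library's own tokenizer: the helper `top_words`, the regular expression, and the short stop-word set are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word set; the notebook uses a much larger list.
stop_words = {"the", "a", "is", "and", "to", "of", "this"}

def top_words(texts, n=50):
    """Return the n most common non-stop-word tokens across the texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", str(text).lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

comments = [
    "This is a great article",
    "Great point and a great read",
    "The article is long",
]
print(top_words(comments, n=3))
```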

Let's look at word frequencies for comments with an insult score below 0.25 and above 0.75.

show_wordcloud(train.loc[train['insult'] < 0.25]['comment_text'].sample(20000), title = 'Prevalent comments with insult score < 0.25')

show_wordcloud(train.loc[train['insult'] > 0.75]['comment_text'].sample(20000), title = 'Prevalent comments with insult score > 0.75')

Word clouds for the threat score, obscene score, and target (toxicity) score can be produced in the same way.
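Selecting the comments for the other scores could be wrapped in a small helper. This is a sketch under stated assumptions: `sample_comments` and its `below`/`above` parameters are hypothetical names, the toy data is made up, and the sample size is capped so that a subset smaller than 20000 rows does not raise an error.

```python
import pandas as pd

def sample_comments(df, column, below=None, above=None, n=20000, seed=1):
    """Sample up to n comments whose score in `column` passes the given bounds."""
    mask = pd.Series(True, index=df.index)
    if below is not None:
        mask &= df[column] < below
    if above is not None:
        mask &= df[column] > above
    subset = df.loc[mask, "comment_text"]
    # Cap the sample size so that small subsets do not raise an error.
    return subset.sample(min(n, len(subset)), random_state=seed)

# Toy data (made up) standing in for the training set.
toy = pd.DataFrame({
    "comment_text": ["nice post", "I will find you", "well said", "watch your back"],
    "threat": [0.0, 0.9, 0.1, 0.8],
})
high_threat = sample_comments(toy, "threat", above=0.75)
low_threat = sample_comments(toy, "threat", below=0.25)
print(sorted(high_threat), sorted(low_threat))
```

The returned Series could then be passed to show_wordcloud together with a matching title.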


Toxic comment identification: data visualization and word-frequency clouds