Kaggle: Quora Insincere Questions Classification

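Problem description:

An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on and keep its platform a place where users can feel safe sharing their knowledge with the world.

Quora is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions: those founded upon false premises, or those that intend to make a statement rather than look for helpful answers.

In this competition, Kagglers will develop models that identify and flag insincere questions. To date, Quora has employed both machine learning and manual review to address this problem. With your help, they can develop more scalable methods to detect toxic and misleading content.

Here is your chance to combat online trolls at scale. Help Quora uphold its policy of "Be Nice, Be Respectful" and continue to be a place for sharing and growing the world's knowledge.

Important Note:
Please note that this is run as a Kernels Only Competition, which requires all submissions to be made via Kernel output. Please read the Kernels FAQ and the data page carefully to fully understand how this is designed.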

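Data Description

In this competition you will be predicting whether a question asked on Quora is sincere or not.

An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:

Has a non-neutral tone
- Has an exaggerated tone to underscore a point about a group of people
- Is rhetorical and meant to imply a statement about a group of people
Is disparaging or inflammatory
- Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
- Makes disparaging attacks/insults against a specific person or group of people
- Is based on an outlandish premise about a group of people
- Disparages a characteristic that is not fixable and not measurable
Isn't grounded in reality
- Is based on false information, or contains absurd assumptions
Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

The training data includes the question that was asked, and whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.

Note that the distribution of questions in the dataset should not be taken to be representative of the distribution of questions asked on Quora. This is, in part, because of a combination of sampling procedures and sanitization measures that have been applied to the final dataset.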

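Data fields

qid - unique question identifier
question_text - Quora question text
target - a question labeled "insincere" has a value of 1, otherwise 0

This is a Kernels-only competition. The files in this Data section are downloadable for reference in Stage 1. Stage 2 files will only be available in Kernels and cannot be downloaded.

What will be available in the second stage of the competition?

In the second stage of the competition, we will re-run your selected Kernels. The following files will be swapped with new data:

test.csv - This will be swapped with the complete public and private test dataset. This file has ~56k rows in stage 1 and ~376k rows in stage 2. The public leaderboard data remains the same for both versions. The file name will be the same (both test.csv) to ensure that your code will run.
sample_submission.csv - Similar to test.csv, this will change from ~56k rows in stage 1 to ~376k rows in stage 2. The file name will remain the same.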

Embeddings

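External data sources are not allowed for this competition. We are, though, providing a number of word embeddings along with the dataset that can be used in the models. These are as follows: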

GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
glove.840B.300d - https://nlp.stanford.edu/projects/glove/
paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html

import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")
print("Train shape : ",train_df.shape)
print("Test shape : ",test_df.shape)

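The next steps are as follows:

Split the training dataset into train and val samples. Cross-validation is a time-consuming process, so let us do a simple train/val split.
Fill in the missing values in the text column with '_na_'
Tokenize the text column and convert it into sequences of integer word indices
Pad the sequences as needed: if the number of words in the text is greater than 'maxlen', truncate it to 'maxlen'; if it is smaller than 'maxlen', add zeros for the remaining positions.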

## split to train and val
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=2018)

## some config values
embed_size = 300 # how big is each word vector
max_features = 50000 # how many unique words to use (i.e. num rows in embedding matrix)
maxlen = 100 # max number of words in a question to use

## fill up the missing values
train_X = train_df["question_text"].fillna("_na_").values
val_X = val_df["question_text"].fillna("_na_").values
test_X = test_df["question_text"].fillna("_na_").values

## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

## Pad the sentences
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

## Get the target values
train_y = train_df['target'].values
val_y = val_df['target'].values
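
To make the tokenize-and-pad step concrete, here is a minimal pure-Python sketch of what happens to a single question. It assumes the Keras defaults (padding and truncation both at the front of the sequence, out-of-vocabulary words dropped when no OOV token is set); the names pad_pre, to_sequence and toy_word_index are hypothetical and exist only for this illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items."""
    seq = list(seq)
    if len(seq) >= maxlen:
        return seq[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq


def to_sequence(text, word_index):
    """Map known words to their indices and drop unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


# Hypothetical toy vocabulary; index 0 is reserved for padding.
toy_word_index = {"how": 1, "do": 2, "i": 3, "learn": 4, "python": 5}

short = to_sequence("How do I learn Python", toy_word_index)
padded_short = pad_pre(short, maxlen=8)              # zeros are added in front
padded_long = pad_pre(list(range(1, 11)), maxlen=8)  # the first two tokens are cut

print(padded_short)
print(padded_long)
```

Because zeros go in front, the real words always sit at the end of the fixed-length window, which is where a recurrent layer reads them last.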

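Without Pretrained Embeddings:

Now that we are done with all the necessary preprocessing steps, we can first train a Bidirectional GRU model. We will not use any pre-trained word embeddings for this model; the embeddings will be learnt from scratch. Please check out the model summary for the details of the layers used.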

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
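
As a rough cross-check of the model summary, the trainable parameter counts can be worked out by hand. The sketch below assumes that the cuDNN-backed GRU keeps two bias vectors per gate (input and recurrent); the helper names are hypothetical, and the numbers are worth comparing against what model.summary() actually prints.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary row."""
    return vocab_size * embed_dim


def cudnn_gru_params(input_dim, units):
    """Three gates, each with input weights, recurrent weights and two biases."""
    return 3 * (units * input_dim + units * units + 2 * units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


max_features, embed_size, gru_units = 50000, 300, 64

counts = {
    "embedding": embedding_params(max_features, embed_size),
    # Bidirectional wraps two independent GRUs (forward and backward).
    "bidirectional_gru": 2 * cudnn_gru_params(embed_size, gru_units),
    # GlobalMaxPool1D has no weights; it passes on 2 * gru_units features.
    "dense_16": dense_params(2 * gru_units, 16),
    "dense_1": dense_params(16, 1),
}
total = sum(counts.values())

for name, n in counts.items():
    print(name, n)
print("total", total)
```

Almost all of the parameters sit in the embedding layer, which is one reason why swapping in pre-trained vectors can matter so much for this model.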

Train the model using the train sample and monitor the metric on the validation sample. This is just a sample model that runs for 2 epochs. Changing the epochs, batch_size and model parameters might give us a better model.

## Train the model
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

Now let us get the validation sample predictions and find the best threshold for the F1 score.

pred_noemb_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_noemb_val_y>thresh).astype(int))))
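
The loop above prints one F1 score per threshold and leaves picking the best one to the reader. As a minimal sketch of automating that choice, the pure-Python version below computes F1 by hand and returns the first threshold with the highest score. The labels and probabilities are made-up toy values, and f1_score_binary and best_threshold are hypothetical helpers, not part of scikit-learn.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, thresholds):
    """Return (threshold, f1) for the first threshold with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        preds = [int(p > t) for p in probs]
        score = f1_score_binary(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation labels and predicted probabilities (illustrative only).
toy_y = [0, 0, 0, 1, 0, 1, 1, 0]
toy_probs = [0.05, 0.20, 0.45, 0.30, 0.10, 0.80, 0.55, 0.35]

# Same grid as the loop above: 0.10, 0.11, ..., 0.50.
grid = [round(0.1 + 0.01 * k, 2) for k in range(41)]
best_t, best_f1 = best_threshold(toy_y, toy_probs, grid)
print(best_t, best_f1)
```

A threshold tuned on the validation sample is itself fitted to that sample, so the F1 score it reports there is likely to be somewhat optimistic.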

Now let us get the test set predictions as well and keep them for later.

pred_noemb_test_y = model.predict([test_X], batch_size=1024, verbose=1)

Now that our model building is done, it is a good idea to free up some memory before we move on to the next step.

del model, inp, x
import gc; gc.collect()
time.sleep(10)

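So we now have a baseline GRU model without pre-trained embeddings. Next, let us use the provided embeddings and rebuild the model to see how the performance changes.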

!ls ../input/embeddings/

We have four different types of embeddings.

GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
glove.840B.300d - https://nlp.stanford.edu/projects/glove/
paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html

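A very good explanation of the different types of embeddings is given in this kernel. Please refer to it for more details.

GloVe Embeddings:
In this section, let us use the GloVe embeddings and rebuild the GRU model.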

EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE))

# list() is needed because np.stack expects a sequence, not a dict view
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
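
The embedding-matrix construction used for this model can be easier to follow on toy data. The following pure-Python sketch mirrors its logic with a hypothetical three-word, three-dimensional embedding file: rows start as random draws matching the mean and standard deviation of the known vectors, and are then overwritten wherever a pre-trained vector exists. All toy_* names and values are made up for illustration.

```python
import io
import random
import statistics

# Hypothetical miniature embedding file in the "word v1 v2 v3" text layout.
toy_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\nsat 0.7 0.8 0.9\n")


def parse_line(line):
    """Split one line into the word and its list of float components."""
    word, *values = line.rstrip().split(" ")
    return word, [float(v) for v in values]


toy_embeddings = dict(parse_line(line) for line in toy_file)

# Mean and (population) std over all components, like all_embs.mean()/std().
all_values = [v for vec in toy_embeddings.values() for v in vec]
emb_mean = statistics.mean(all_values)
emb_std = statistics.pstdev(all_values)

toy_word_index = {"the": 1, "dog": 2, "cat": 3, "sat": 4}
toy_max_features = 4  # rows 0..3 only, so "sat" (index 4) is skipped
dim = 3

rng = random.Random(0)
matrix = [[rng.gauss(emb_mean, emb_std) for _ in range(dim)]
          for _ in range(toy_max_features)]

found = []
for word, i in toy_word_index.items():
    if i >= toy_max_features:
        continue  # word is outside the vocabulary cap
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = vector  # overwrite the random row with the known vector
        found.append(word)

print(found)
```

Here "dog" keeps its random row because the toy file has no vector for it, which is what happens to out-of-vocabulary words in the real embedding matrix as well.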

model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

pred_glove_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_glove_val_y>thresh).astype(int))))

The results seem to be better than those of the model without pre-trained embeddings.

pred_glove_test_y = model.predict([test_X], batch_size=1024, verbose=1)

del word_index, embeddings_index, all_embs, embedding_matrix, model, inp, x
import gc; gc.collect()
time.sleep(10)

Wiki News FastText Embeddings:

Now let us use the FastText embeddings trained on the Wiki News corpus in place of the GloVe embeddings and rebuild the model.

EMBEDDING_FILE = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE) if len(o)>100)

# list() is needed because np.stack expects a sequence, not a dict view
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
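
The `len(o)>100` filter in the loading code for this section presumably exists to skip the first line of the .vec file, which in the fastText text format holds the vocabulary size and the vector dimension rather than a word vector. A more explicit alternative, sketched here on a hypothetical two-word file, is to read that header separately; load_vec is an illustrative helper, not a fastText API.

```python
import io

# Hypothetical miniature .vec file: a "<vocab size> <dimension>" header line,
# followed by one "word v1 v2 v3" line per word.
toy_vec = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")


def load_vec(handle):
    """Read the header, then every well-formed vector line after it."""
    header = handle.readline().split()
    n_words, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        word, *values = line.rstrip().split(" ")
        if len(values) != dim:
            continue  # skip malformed lines instead of failing later
        vectors[word] = [float(v) for v in values]
    return n_words, dim, vectors


n_words, dim, vectors = load_vec(toy_vec)
print(n_words, dim, sorted(vectors))
```

Checking the component count against the declared dimension also guards against stray short lines, which a fixed character-length filter only catches by accident.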

model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

pred_fasttext_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_fasttext_val_y>thresh).astype(int))))

pred_fasttext_test_y = model.predict([test_X], batch_size=1024, verbose=1)

del word_index, embeddings_index, all_embs, embedding_matrix, model, inp, x
import gc; gc.collect()
time.sleep(10)

Paragram Embeddings:

In this section, we can use the Paragram embeddings to build the model and make predictions.

EMBEDDING_FILE = '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE, encoding="utf8", errors='ignore') if len(o)>100)

# list() is needed because np.stack expects a sequence, not a dict view
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

pred_paragram_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_paragram_val_y>thresh).astype(int))))

pred_paragram_test_y = model.predict([test_X], batch_size=1024, verbose=1)

del word_index, embeddings_index, all_embs, embedding_matrix, model, inp, x
import gc; gc.collect()
time.sleep(10)

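Observations:

Overall, the pre-trained embeddings seem to give better results than the model without pre-trained embeddings.
The performance of the different pre-trained embeddings is almost the same.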

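Final Blend:

Though the results of the models with different pre-trained embeddings are similar, there is a good chance that they capture different types of information from the data. So let us blend these three models by averaging their predictions.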

pred_val_y = 0.33*pred_glove_val_y + 0.33*pred_fasttext_val_y + 0.34*pred_paragram_val_y
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_val_y>thresh).astype(int))))
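
The blend is just a weighted average of the per-question probabilities. The small sketch below spells that out for any number of models and checks that the weights sum to one; the blend helper and the toy probabilities are hypothetical and only illustrate the arithmetic.

```python
def blend(predictions, weights):
    """Weighted average of several equally long lists of probabilities."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    n = len(predictions[0])
    return [sum(w * p[k] for w, p in zip(weights, predictions))
            for k in range(n)]


# Toy probabilities for three questions from three models (illustrative only).
toy_glove = [0.2, 0.9, 0.4]
toy_fasttext = [0.1, 0.7, 0.5]
toy_paragram = [0.3, 0.8, 0.3]

blended = blend([toy_glove, toy_fasttext, toy_paragram], [0.33, 0.33, 0.34])
print([round(p, 3) for p in blended])
```

Keeping the weights summing to one means the blended values stay on the same 0 to 1 scale, so the threshold search can be reused on them unchanged.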

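The result seems to be better than that of the individual pre-trained models, so let us create a submission file using this model blend.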

pred_test_y = 0.33*pred_glove_test_y + 0.33*pred_fasttext_test_y + 0.34*pred_paragram_test_y
pred_test_y = (pred_test_y>0.35).astype(int)
out_df = pd.DataFrame({"qid":test_df["qid"].values})
out_df['prediction'] = pred_test_y
out_df.to_csv("submission.csv", index=False)
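
As a minimal sketch of what the submission step produces, the snippet below thresholds a few made-up probabilities and writes them in the two-column qid/prediction layout using only the standard library. The ids, probabilities and the make_submission_rows helper are hypothetical; the comparison is strict, so a probability exactly equal to the threshold maps to 0.

```python
import csv
import io


def make_submission_rows(qids, probs, threshold):
    """Pair each question id with a 0/1 label from a strict > comparison."""
    return [(qid, int(p > threshold)) for qid, p in zip(qids, probs)]


toy_qids = ["q1", "q2", "q3"]   # hypothetical ids
toy_probs = [0.10, 0.36, 0.35]  # hypothetical blended probabilities
rows = make_submission_rows(toy_qids, toy_probs, threshold=0.35)

# Write to an in-memory buffer here; a real run would write submission.csv.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["qid", "prediction"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```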