Contents

  • Competition Background
  • Solution
    • 1. Data Analysis
    • 2. Model Building
      • 2.1 Feature Engineering + Tree Model
      • 2.2 Word Embeddings + LSTM

The code contains detailed explanatory comments; if anything is unclear, please read the code.

Competition link (dataset download):

https://www.kaggle.com/competitions/quora-question-pairs/data

Competition Background

Quora is a place to gain and share knowledge about anything. It is a platform to ask questions and connect with people who contribute unique insights and high-quality answers, which empowers people to learn from each other and to better understand the world.
Over 100 million people visit Quora every month, so it is no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience for active seekers and writers, and offer more value to both groups in the long term.
In this competition, Kagglers are challenged to tackle this natural language processing problem by applying advanced techniques to classify whether pairs of questions are duplicates. Doing so makes it easier to find high-quality answers, resulting in an improved experience for Quora writers, seekers, and readers.

Solution

1. Data Analysis

The code contains detailed explanatory comments.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

INPUT_PATH = '/home/lyz/work/kaggle/kaggle-quora-question-pairs'
df = pd.read_csv(INPUT_PATH + "/train.csv").fillna("")
df.head()


The dataset has the following fields:
id — row identifier, not important
qid1 / qid2 — question IDs
question1 and question2 — the actual question texts
is_duplicate — whether the two questions are duplicates

df.info()


Apart from question1 and question2, which are string-typed question texts,
all other fields are of type int.

df.shape   # (404290, 6)
# Distribution of the target: count of duplicate vs. non-duplicate pairs
df.groupby("is_duplicate")['id'].count().plot.bar()

# String length of question1 and question2
df['q1len'] = df['question1'].str.len()
df['q2len'] = df['question2'].str.len()

# Word count of question1 and question2
df['q1_n_words'] = df['question1'].apply(lambda row: len(row.split(" ")))
df['q2_n_words'] = df['question2'].apply(lambda row: len(row.split(" ")))

# Split on spaces, lowercase all words, and deduplicate.
# Returns the number of words shared by Q1 and Q2 divided by their combined vocabulary size.
def normalized_word_share(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))
    return 1.0 * len(w1 & w2) / (len(w1) + len(w2))

df['word_share'] = df.apply(normalized_word_share, axis=1)
df.head()


Plot the new feature against the target to check how well it separates the classes.

plt.figure(figsize=(12, 8))
plt.subplot(1, 2, 1)
sns.violinplot(x='is_duplicate', y='word_share', data=df[0:50000])
plt.subplot(1, 2, 2)
sns.histplot(df[df['is_duplicate'] == 1.0]['word_share'][0:10000], color='green', kde=True)
sns.histplot(df[df['is_duplicate'] == 0.0]['word_share'][0:10000], color='red', kde=True)


From the first plot we can see that
non-duplicate pairs have word_share values concentrated around 0–0.2,
while duplicate pairs tend toward 0.2–0.4,
so word_share is an effective feature. A quick numeric check of this claim is shown below.
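As a sanity check, one could compare the word_share distribution per class directly. A minimal sketch, assuming df and the word_share column built in the code above:

df.groupby('is_duplicate')['word_share'].describe()
# Expect the mean for is_duplicate == 1 to sit noticeably above the mean for is_duplicate == 0;
# the exact values depend on the sample used.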

2. Model Building

2.1 Feature Engineering + Tree Model

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from nltk.corpus import stopwords

INPUT_PATH = '/home/lyz/work/kaggle/kaggle-quora-question-pairs/'
df_train = pd.read_csv(INPUT_PATH + 'train.csv', nrows=5000)
df_test  = pd.read_csv(INPUT_PATH + 'test.csv', nrows=5000)

TFIDF

A small note: since the weight is 1 / (count + eps), the more frequently a word occurs, the lower its weight. Rare words are therefore treated as more informative, which is the same intuition as inverse document frequency. A quick numeric check follows the code below.

# Compute a weight for each word.
# If a word occurs fewer than min_count times, its weight is 0;
# otherwise its weight is 1 / (count + eps).
def get_weight(count, eps=10000, min_count=2):
    return 0 if count < min_count else 1 / (count + eps)

# Concatenate question1 and question2 into one series
train_qs = pd.Series(df_train['question1'].tolist() + df_train['question2'].tolist()).astype(str)

# Lowercase everything and split into words
words = (" ".join(train_qs)).lower().split()

# Count occurrences of each word
counts = Counter(words)

# Map each word to its weight
weights = {word: get_weight(count) for word, count in counts.items()}
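To make the weighting concrete, here is a quick numeric check using the get_weight function defined above (the counts are illustrative only, not taken from the data):

print(get_weight(1))      # 0         -- below min_count, ignored
print(get_weight(10))     # ~9.99e-05 -- rare word, relatively high weight
print(get_weight(50000))  # ~1.67e-05 -- frequent word, much lower weight

So frequent, usually uninformative words contribute little when the weights of shared words are summed, mirroring the IDF idea.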

Load stop words

stops = set(stopwords.words("english"))

def word_shares(row):
    # Case 1: question1 contains only stop words
    q1_list = str(row['question1']).lower().split()
    q1 = set(q1_list)
    q1words = q1.difference(stops)
    if len(q1words) == 0:
        return '0:0:0:0:0:0:0:0'

    # Case 2: question2 contains only stop words
    q2_list = str(row['question2']).lower().split()
    q2 = set(q2_list)
    q2words = q2.difference(stops)
    if len(q2words) == 0:
        return '0:0:0:0:0:0:0:0'

    # Ratio of position-wise matching words to the length of the longer question
    words_hamming = sum(1 for i in zip(q1_list, q2_list) if i[0] == i[1]) / max(len(q1_list), len(q2_list))

    # Stop words appearing in Q1 and Q2
    q1stops = q1.intersection(stops)
    q2stops = q2.intersection(stops)

    # Deduplicated 2-grams of each question
    q1_2gram = set([i for i in zip(q1_list, q1_list[1:])])
    q2_2gram = set([i for i in zip(q2_list, q2_list[1:])])

    # 2-grams shared by the two questions
    shared_2gram = q1_2gram.intersection(q2_2gram)

    # Words shared by Q1 and Q2
    shared_words = q1words.intersection(q2words)

    # Weights of the shared words
    shared_weights = [weights.get(w, 0) for w in shared_words]

    # Weights of the words in Q1 and Q2
    q1_weights = [weights.get(w, 0) for w in q1words]
    q2_weights = [weights.get(w, 0) for w in q2words]

    # Combined weight list of both questions
    total_weights = q1_weights + q2_weights

    # Weight of shared words / total weight
    R1 = np.sum(shared_weights) / np.sum(total_weights)

    # Number of shared words / (total word count - number of shared words)
    R2 = len(shared_words) / (len(q1words) + len(q2words) - len(shared_words))

    # Stop-word ratio of Q1
    R31 = len(q1stops) / len(q1words)
    # Stop-word ratio of Q2
    R32 = len(q2stops) / len(q2words)

    # Cosine similarity of the weighted word sets
    Rcosine_denominator = (np.sqrt(np.dot(q1_weights, q1_weights)) * np.sqrt(np.dot(q2_weights, q2_weights)))
    Rcosine = np.dot(shared_weights, shared_weights) / Rcosine_denominator

    # Share of 2-grams that the two questions have in common
    if len(q1_2gram) + len(q2_2gram) == 0:
        R2gram = 0
    else:
        R2gram = len(shared_2gram) / (len(q1_2gram) + len(q2_2gram))

    # Return the new features as a colon-separated string
    return '{}:{}:{}:{}:{}:{}:{}:{}'.format(R1, R2, len(shared_words), R31, R32, R2gram, Rcosine, words_hamming)
# Concatenate the training and test sets and apply the feature-construction function above
df = pd.concat([df_train, df_test])
df['word_shares'] = df.apply(word_shares, axis=1)

train_test = pd.DataFrame()

# Split the feature string into separate columns
train_test['word_match']       = df['word_shares'].apply(lambda x: float(x.split(':')[0]))
train_test['word_match_2root'] = np.sqrt(train_test['word_match'])
train_test['tfidf_word_match'] = df['word_shares'].apply(lambda x: float(x.split(':')[1]))
train_test['shared_count']     = df['word_shares'].apply(lambda x: float(x.split(':')[2]))
train_test['stops1_ratio']     = df['word_shares'].apply(lambda x: float(x.split(':')[3]))
train_test['stops2_ratio']     = df['word_shares'].apply(lambda x: float(x.split(':')[4]))
train_test['shared_2gram']     = df['word_shares'].apply(lambda x: float(x.split(':')[5]))
train_test['cosine']           = df['word_shares'].apply(lambda x: float(x.split(':')[6]))
train_test['words_hamming']    = df['word_shares'].apply(lambda x: float(x.split(':')[7]))

# Q1 stop-word ratio minus Q2 stop-word ratio
train_test['diff_stops_r'] = train_test['stops1_ratio'] - train_test['stops2_ratio']

train_test['len_q1'] = df['question1'].apply(lambda x: len(str(x)))
train_test['len_q2'] = df['question2'].apply(lambda x: len(str(x)))
# Length of Q1 minus length of Q2
train_test['diff_len'] = train_test['len_q1'] - train_test['len_q2']

train_test['caps_count_q1'] = df['question1'].apply(lambda x: sum(1 for i in str(x) if i.isupper()))
train_test['caps_count_q2'] = df['question2'].apply(lambda x: sum(1 for i in str(x) if i.isupper()))
# Difference in the number of uppercase characters between Q1 and Q2
train_test['diff_caps'] = train_test['caps_count_q1'] - train_test['caps_count_q2']

train_test['len_char_q1'] = df['question1'].apply(lambda x: len(str(x).replace(' ', '')))
train_test['len_char_q2'] = df['question2'].apply(lambda x: len(str(x).replace(' ', '')))
# Difference in character count between Q1 and Q2
train_test['diff_len_char'] = train_test['len_char_q1'] - train_test['len_char_q2']

train_test['len_word_q1'] = df['question1'].apply(lambda x: len(str(x).split()))
train_test['len_word_q2'] = df['question2'].apply(lambda x: len(str(x).split()))
# Difference in word count between Q1 and Q2
train_test['diff_len_word'] = train_test['len_word_q1'] - train_test['len_word_q2']

# Ratio of character count to word count (average word length)
train_test['avg_world_len1'] = train_test['len_char_q1'] / train_test['len_word_q1']
train_test['avg_world_len2'] = train_test['len_char_q2'] / train_test['len_word_q2']
# Difference in average word length between Q1 and Q2
train_test['diff_avg_word'] = train_test['avg_world_len1'] - train_test['avg_world_len2']

# Whether Q1 and Q2 are exactly the same
train_test['exactly_same'] = (df['question1'] == df['question2']).astype(int)
# Whether the (question1, question2) pair is duplicated in the data
train_test['duplicated'] = df.duplicated(subset=['question1', 'question2']).astype(int)

# Flag whether `word` appears in Q1, in Q2, and in both
def add_word_count(x, df, word):
    x['q1_' + word] = df['question1'].apply(lambda x: (word in str(x).lower()) * 1)
    x['q2_' + word] = df['question2'].apply(lambda x: (word in str(x).lower()) * 1)
    x[word + '_both'] = x['q1_' + word] * x['q2_' + word]

# how, what, which, who, where, when, why
add_word_count(train_test, df, 'how')
add_word_count(train_test, df, 'what')
add_word_count(train_test, df, 'which')
add_word_count(train_test, df, 'who')
add_word_count(train_test, df, 'where')
add_word_count(train_test, df, 'when')
add_word_count(train_test, df, 'why')

XGBoost model training
Train on only 5,000 rows first to get a quick look at the results.

params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'eta': 0.1,
    'max_depth': 5,
}

cv_results = xgb.cv(
    params,
    xgb.DMatrix(train_test.iloc[:df_train.shape[0]], df_train['is_duplicate'].values),
    num_boost_round=100,
    seed=42,
    nfold=5,
    early_stopping_rounds=10
)
cv_results

The cv_results dataframe reports: train-logloss-mean, train-logloss-std, test-logloss-mean, test-logloss-std.

The logloss keeps decreasing, so the next step is to train on the full dataset and check the result; a sketch of that full run follows.
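A minimal sketch of what that full run could look like, assuming train_test and df_train/df_test were rebuilt from the complete train.csv and test.csv rather than nrows=5000 (the num_boost_round of 500 is an assumption, not a tuned value):

x_train = train_test.iloc[:df_train.shape[0]]
x_test  = train_test.iloc[df_train.shape[0]:]
y_train = df_train['is_duplicate'].values

dtrain = xgb.DMatrix(x_train, label=y_train)
dtest  = xgb.DMatrix(x_test)

bst = xgb.train(params, dtrain, num_boost_round=500)
sub = pd.DataFrame({'test_id': df_test['test_id'], 'is_duplicate': bst.predict(dtest)})
sub.to_csv('xgb_submission.csv', index=False)

Early stopping against a held-out split could also be used to pick num_boost_round, as xgb.cv did above.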

2.2 Word Embeddings + LSTM

Import packages

import os
import re
import csv
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# import codecs
from string import punctuation
from collections import defaultdict
# from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, Embedding, Dropout, Activation, LSTM, Lambda
from keras.layers.merge import concatenate
from keras.models import Model
from keras.layers.normalization import BatchNormalization
from keras.callbacks import EarlyStopping, ModelCheckpoint
# from keras.layers.convolutional import Conv1D
from keras.layers.pooling import GlobalAveragePooling1D
import keras.backend as K

Define constants and model parameters

Data_Dir = '../input/quora-question-pairs/'              # data path
Word_Vec_Dir = '../input/glove-840b-300d/'               # pre-trained 300-dimensional GloVe vectors
Embedding_File = Word_Vec_Dir + 'glove.840B.300d.txt'    # embedding file path
Train_Data_File = Data_Dir + 'train.csv'                 # training set path
Test_Data_File = Data_Dir + 'test.csv'                   # test set path
Max_Sequence_Length = 60      # maximum sentence length
Max_Num_Words = 200000        # maximum vocabulary size
Embedding_Dim = 300           # dimension of the embedding vectors
Validation_Split_Ratio = 0.1  # validation split ratio

Num_Lstm = np.random.randint(175, 275)            # number of LSTM units
Num_Dense = np.random.randint(100, 150)           # number of units in the dense layer
Rate_Drop_Lstm = 0.15 + np.random.rand() * 0.25   # LSTM dropout rate
Rate_Drop_Dense = 0.15 + np.random.rand() * 0.25  # dense-layer dropout rate

Lstm_Struc = 'lstm_{:d}_{:d}_{:.2f}_{:.2f}'.format(Num_Lstm, Num_Dense, Rate_Drop_Lstm, Rate_Drop_Dense)
print(Lstm_Struc)

act_f = 'relu'    # activation function
re_weight = True  # whether to re-weight classes during training

Load embedding weights

print('Create word embedding dictionary')
embeddings_index = {}
f = open(Embedding_File, encoding='utf-8')
for line in f:
    values = line.split()
    word = ''.join(values[:-300])
    coefs = np.asarray(values[-300:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found {} word vectors of glove.'.format(len(embeddings_index)))

# Process text in dataset
print('Processing text dataset')

def text_to_wordlist(text, remove_stopwords=False, stem_words=False):
    # Lowercase and split on whitespace
    text = text.lower().split()

    # Optionally remove stop words
    if remove_stopwords:
        stop_words = set(stopwords.words("english"))
        text = [w for w in text if not w in stop_words]

    # Re-join into a single string
    text = " ".join(text)

    # Clean special characters and expand common contractions
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    # text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)

    # Optionally reduce words to their stems with the Snowball stemmer
    if stem_words:
        text = text.split()
        stemmer = SnowballStemmer('english')
        stemmed_words = [stemmer.stem(word) for word in text]
        text = " ".join(stemmed_words)

    return text

An example of the Snowball stemming algorithm (SnowballStemmer):
https://www.zhiu.cn/57067.html
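For reference, a minimal sketch of what the stemmer does; these are standard nltk calls, and the stems shown in the comments are what SnowballStemmer('english') typically produces:

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')
print(stemmer.stem('running'))    # 'run'
print(stemmer.stem('questions'))  # 'question'
print(stemmer.stem('easily'))     # 'easili'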

Load and process the data

# load data and process with text_to_wordlist
train_texts_1 = []
train_texts_2 = []
train_labels = []

df_train = pd.read_csv(Train_Data_File, encoding='utf-8')
df_train = df_train.fillna('empty')
train_q1 = df_train.question1.values
train_q2 = df_train.question2.values
train_labels = df_train.is_duplicate.values

# Apply text_to_wordlist to every training question
for text in train_q1:
    train_texts_1.append(text_to_wordlist(text, remove_stopwords=False, stem_words=False))
for text in train_q2:
    train_texts_2.append(text_to_wordlist(text, remove_stopwords=False, stem_words=False))
print('{} texts are found in train.csv'.format(len(train_texts_1)))

# Apply text_to_wordlist to every test question
test_texts_1 = []
test_texts_2 = []
df_test = pd.read_csv(Test_Data_File, encoding='utf-8')
df_test = df_test.fillna('empty')
test_q1 = df_test.question1.values
test_q2 = df_test.question2.values
test_ids = df_test.test_id.values

for text in test_q1:
    test_texts_1.append(text_to_wordlist(text, remove_stopwords=False, stem_words=False))
for text in test_q2:
    test_texts_2.append(text_to_wordlist(text, remove_stopwords=False, stem_words=False))
print('{} texts are found in test.csv'.format(len(test_texts_1)))
# Keras' built-in tokenizer
tokenizer = Tokenizer(num_words=Max_Num_Words)
# Fit the tokenizer on all questions
tokenizer.fit_on_texts(train_texts_1 + train_texts_2 + test_texts_1 + test_texts_2)

# Convert the question texts to integer sequences
train_sequences_1 = tokenizer.texts_to_sequences(train_texts_1)
train_sequences_2 = tokenizer.texts_to_sequences(train_texts_2)
test_sequences_1 = tokenizer.texts_to_sequences(test_texts_1)
test_sequences_2 = tokenizer.texts_to_sequences(test_texts_2)

# Check how many tokens the fitted tokenizer knows
word_index = tokenizer.word_index
print('{} unique tokens are found'.format(len(word_index)))

# The model needs fixed-size input, so pad every sequence to the same length
train_data_1 = pad_sequences(train_sequences_1, maxlen=Max_Sequence_Length)
train_data_2 = pad_sequences(train_sequences_2, maxlen=Max_Sequence_Length)
test_data_1 = pad_sequences(test_sequences_1, maxlen=Max_Sequence_Length)
test_data_2 = pad_sequences(test_sequences_2, maxlen=Max_Sequence_Length)

print('Shape of train data tensor:', train_data_1.shape)
print('Shape of train labels tensor:', train_labels.shape)
print('Shape of test data tensor:', test_data_2.shape)
print('Shape of test ids tensor:', test_ids.shape)

# Stack the question columns of df_train and df_test into one dataframe
questions = pd.concat([df_train[['question1', 'question2']],
                       df_test[['question1', 'question2']]], axis=0).reset_index(drop='index')

# Dictionary keyed by question text; the value is the set of questions it has been paired with
q_dict = defaultdict(set)
for i in range(questions.shape[0]):
    q_dict[questions.question1[i]].add(questions.question2[i])
    q_dict[questions.question2[i]].add(questions.question1[i])

# Number of distinct questions paired with Q1
def q1_freq(row):
    return len(q_dict[row['question1']])

# Number of distinct questions paired with Q2
def q2_freq(row):
    return len(q_dict[row['question2']])

# Number of questions paired with both Q1 and Q2 (set intersection)
def q1_q2_intersect(row):
    return len(set(q_dict[row['question1']]).intersection(set(q_dict[row['question2']])))

df_train['q1_q2_intersect'] = df_train.apply(q1_q2_intersect, axis=1, raw=True)
df_train['q1_freq'] = df_train.apply(q1_freq, axis=1, raw=True)
df_train['q2_freq'] = df_train.apply(q2_freq, axis=1, raw=True)

df_test['q1_q2_intersect'] = df_test.apply(q1_q2_intersect, axis=1, raw=True)
df_test['q1_freq'] = df_test.apply(q1_freq, axis=1, raw=True)
df_test['q2_freq'] = df_test.apply(q2_freq, axis=1, raw=True)

leaks = df_train[['q1_q2_intersect', 'q1_freq', 'q2_freq']]
test_leaks = df_test[['q1_q2_intersect', 'q1_freq', 'q2_freq']]

# Standardize the three leak features
ss = StandardScaler()
ss.fit(np.vstack((leaks, test_leaks)))
leaks = ss.transform(leaks)
test_leaks = ss.transform(test_leaks)

num_words = min(Max_Num_Words, len(word_index)) + 1

# Build the embedding matrix: row i holds the GloVe vector of the word with index i
embedding_matrix = np.zeros((num_words, Embedding_Dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print('Null word embeddings: {}'.format(np.sum(np.sum(embedding_matrix, axis=1) == 0)))

# Split into training and validation sets
perm = np.random.permutation(len(train_data_1))
idx_train = perm[:int(len(train_data_1) * (1 - Validation_Split_Ratio))]
idx_val = perm[int(len(train_data_1) * (1 - Validation_Split_Ratio)):]

# Each pair is used twice, as (q1, q2) and as (q2, q1), so the model sees both orders
data_1_train = np.vstack((train_data_1[idx_train], train_data_2[idx_train]))
data_2_train = np.vstack((train_data_2[idx_train], train_data_1[idx_train]))
leaks_train = np.vstack((leaks[idx_train], leaks[idx_train]))
labels_train = np.concatenate((train_labels[idx_train], train_labels[idx_train]))

data_1_val = np.vstack((train_data_1[idx_val], train_data_2[idx_val]))
data_2_val = np.vstack((train_data_2[idx_val], train_data_1[idx_val]))
leaks_val = np.vstack((leaks[idx_val], leaks[idx_val]))
labels_val = np.concatenate((train_labels[idx_val], train_labels[idx_val]))

The difference between np.vstack and np.hstack:
https://blog.csdn.net/nanhuaibeian/article/details/100597342
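In short, np.vstack stacks arrays along the row axis and np.hstack along the column axis; a minimal sketch:

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.vstack((a, b)).shape)  # (4, 2) -- rows concatenated, as used for data_1_train above
print(np.hstack((a, b)).shape)  # (2, 4) -- columns concatenated
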
Define the model

# Embedding layer, frozen and initialized with the GloVe weights
emb_layer = Embedding(
    input_dim=num_words,
    output_dim=Embedding_Dim,
    weights=[embedding_matrix],
    input_length=Max_Sequence_Length,
    trainable=False
)

# Shared LSTM layer
lstm_layer = LSTM(Num_Lstm, dropout=Rate_Drop_Lstm, recurrent_dropout=Rate_Drop_Lstm)

seq1 = Input(shape=(Max_Sequence_Length,), dtype='int32')
seq2 = Input(shape=(Max_Sequence_Length,), dtype='int32')

# Run inputs through embedding
emb1 = emb_layer(seq1)
emb2 = emb_layer(seq2)

# Run through LSTM layers
lstm_a = lstm_layer(emb1)
lstm_b = lstm_layer(emb2)

# Dense branch for the leak ("magic") features
magic_input = Input(shape=(leaks.shape[1],))
magic_dense = Dense(int(Num_Dense / 2), activation=act_f)(magic_input)

# Merge the two LSTM outputs with the dense branch
merged = concatenate([lstm_a, lstm_b, magic_dense])
merged = BatchNormalization()(merged)  # batch normalization
# Dropout to reduce overfitting
merged = Dropout(Rate_Drop_Dense)(merged)

merged = Dense(Num_Dense, activation=act_f)(merged)
merged = BatchNormalization()(merged)
merged = Dropout(Rate_Drop_Dense)(merged)

# Sigmoid activation for binary classification (softmax would be used for multi-class)
preds = Dense(1, activation='sigmoid')(merged)

if re_weight:
    class_weight = {0: 1.309033281, 1: 0.471544715}
else:
    class_weight = None

# Build the model
model = Model(inputs=[seq1, seq2, magic_input], outputs=preds)
# Compile with the nadam optimizer and accuracy as the metric
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['acc'])

# Stop early if val_loss stops improving; a relatively large patience helps here
early_stopping = EarlyStopping(monitor='val_loss', patience=10)
bst_model_path = Lstm_Struc + '.h5'
# Checkpoint that keeps only the best weights
model_checkpoint = ModelCheckpoint(bst_model_path, save_best_only=True, save_weights_only=True)

# Per-sample validation weights mirroring class_weight (assumed here; needed by the
# validation_data tuple passed to model.fit below)
weight_val = np.ones(len(labels_val))
if re_weight:
    weight_val *= 0.471544715
    weight_val[labels_val == 0] = 1.309033281

# Train the model
hist = model.fit([data_1_train, data_2_train, leaks_train], labels_train,
                 validation_data=([data_1_val, data_2_val, leaks_val], labels_val, weight_val),
                 epochs=200, batch_size=2048, shuffle=True,
                 class_weight=class_weight, callbacks=[early_stopping, model_checkpoint])

# Save the best weights
model.save_weights(bst_model_path)
bst_val_score = min(hist.history['val_loss'])

# Make the submission: average predictions over both question orders
print('Making the submission')
preds = model.predict([test_data_1, test_data_2, test_leaks], batch_size=8192, verbose=1)
preds += model.predict([test_data_2, test_data_1, test_leaks], batch_size=8192, verbose=1)
preds /= 2

submission = pd.DataFrame({'test_id': test_ids, 'is_duplicate': preds.ravel()})
submission.to_csv('{:.4f}_'.format(bst_val_score) + Lstm_Struc + '_with_GloVe_Embedding.csv', index=False)

About the BatchNormalization layer:
https://blog.csdn.net/weixin_44791964/article/details/114998793
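Roughly, BatchNormalization normalizes each feature over the current batch and then rescales it with learned parameters (gamma, beta). A minimal numpy sketch of the batch-statistics formula; the gamma, beta, and eps values here are placeholders:

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-3):
    # Normalize each column (feature) over the batch, then scale and shift
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
print(batch_norm(x))  # each column now has zero mean and roughly unit variance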
