Hands-On Transformers (Kaggle Google QUEST Q&A Labeling)

This is a 3-part series where we go through Transformers, BERT, and a hands-on Kaggle challenge — Google QUEST Q&A Labeling — to see Transformers in action (top 4.4% on the leaderboard). In this part (3/3) we will be looking at a hands-on project from Google on Kaggle. Since this is an NLP challenge, I’ve used transformers in this project. I have not covered transformers in much detail in this part, but if you wish, you can check out part 1/3 of this series where I’ve discussed transformers in detail.


Bird's eye view of the blog:

To make the reading easy, I’ve divided the blog into different sub-topics:


  • Problem statement and evaluation metrics.
  • About the data.
  • Exploratory Data Analysis (EDA).
  • Modeling (includes data preprocessing).
  • Post-modeling analysis.

Problem statement and evaluation metrics:

Computers are really good at answering questions with single, verifiable answers. But, humans are often still better at answering questions about opinions, recommendations, or personal experiences.


Humans are better at addressing subjective questions that require a deeper, multidimensional understanding of context. Questions can take many forms — some have multi-sentence elaborations, others may be simple curiosity or a fully developed problem. They can have multiple intents, or seek advice and opinions. Some may be helpful and others interesting. Some are simply right or wrong.


Unfortunately, it’s hard to build better subjective question-answering algorithms because of a lack of data and predictive models. That’s why the CrowdSource team at Google Research, a group dedicated to advancing NLP and other types of ML science via crowdsourcing, has collected data on a number of these quality scoring aspects.


In this competition, we’re challenged to use this new dataset to build predictive algorithms for different subjective aspects of question-answering. The question-answer pairs were gathered from nearly 70 different websites, in a “common-sense” fashion. The raters received minimal guidance and training and relied largely on their subjective interpretation of the prompts. As such, each prompt was crafted in the most intuitive fashion so that raters could simply use their common-sense to complete the task.


Demonstrating that these subjective labels can be predicted reliably can shine new light on this research area. Results from this competition will inform the way future intelligent Q&A systems are built, hopefully contributing to them becoming more human-like.


Evaluation metric: Submissions are evaluated on the mean column-wise Spearman’s correlation coefficient. The Spearman’s rank correlation is computed for each target column, and the mean of these values is calculated for the submission score.

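To make the metric concrete, here is a minimal sketch (the function name and array shapes are my own assumptions) of how the mean column-wise Spearman's rank correlation can be computed with scipy:

import numpy as np
from scipy.stats import spearmanr

def mean_columnwise_spearman(y_true, y_pred):
    # y_true, y_pred: arrays of shape (n_rows, 30), one column per target label.
    scores = [spearmanr(y_true[:, i], y_pred[:, i]).correlation
              for i in range(y_true.shape[1])]
    return np.mean(scores)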

About the data:

The data for this competition includes questions and answers from various StackExchange properties. Our task is to predict the target values of 30 labels for each question-answer pair. The list of 30 target labels is the same as the column names in the sample_submission.csv file. Target labels with the prefix question_ relate to the question_title and/or question_body features in the data. Target labels with the prefix answer_ relate to the answer feature. Each row contains a single question and a single answer to that question, along with additional features. The training data contains rows with some duplicated questions (but with different answers). The test data does not contain any duplicated questions. Target labels can have continuous values in the range [0,1], therefore predictions must also be in that range. The files provided are:


  • train.csv — the training data (target labels are the last 30 columns)
  • test.csv — the test set (you must predict 30 labels for each test set row)
  • sample_submission.csv — a sample submission file in the correct format; column names are the 30 target labels

You can check out the dataset using this link.


Exploratory Data Analysis (EDA)

Check out the notebook with in-depth EDA + data scraping (Kaggle link).


The training data contains 6079 listings and each listing has 41 columns. Out of these 41 columns, the first 11 columns/features are used as the input and the last 30 columns/features are the target predictions. Let’s take a look at the input and target labels:


Image by Author

The output features are all of float type, with values between 0 and 1.


Let's explore the input features one by one.


qa_id

Question answer ID represents the id of a particular data point in the given dataset. Each data point has a unique qa_id. This feature is not to be used for training and will be used later while submitting the output to Kaggle.


https://anime.stackexchange.com/questions/56789/if-naruto-loses-the-ability-he-used-on-kakashi-and-guy-after-kaguyas-seal-what

question_title

This is a string data type feature that holds the title of the question asked. For the analysis of question_title, I’ll be plotting a histogram of the number of words in this feature.

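The plot itself is not reproduced here; a minimal sketch of how it can be generated (assuming train.csv has been downloaded from the competition page) looks like this:

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv('train.csv')
title_word_counts = train['question_title'].str.split().str.len()

title_word_counts.hist(bins=30)
plt.xlabel('number of words in question_title')
plt.ylabel('count')
plt.show()

print(title_word_counts.describe())  # min, max, and quartiles used in the analysis below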

From the analysis, it is evident that:
  • Most of the question_title features have a word length of around 9.
  • The minimum question_title length is 2 and the maximum is 28.
  • 50% of question_title values have lengths between 6 and 11.
  • 25% of question_title values have lengths between 2 and 6.
  • 25% of question_title values have lengths between 11 and 28.


question_body

This is again a string data type feature that holds the detailed text of the question asked. For the analysis of question_body, I’ll be plotting a histogram of the number of words in this feature.


From the analysis, it is evident that:
  • Most of the question_body features have a word length of around 93.
  • The minimum question_body length is 1 and the maximum is 4666.
  • 50% of question_body values have lengths between 55 and 165.
  • 25% of question_body values have lengths between 1 and 55.
  • 25% of question_body values have lengths between 165 and 4666.


The distribution looks like a power-law distribution; it can be converted to a more Gaussian-like distribution using a log transform and then used as an engineered feature.

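A short sketch of that engineered feature (the new column names are my own; np.log1p keeps the transform well-behaved for very short bodies):

import numpy as np

train['question_body_word_count'] = train['question_body'].str.split().str.len()
train['question_body_log_word_count'] = np.log1p(train['question_body_word_count'])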

question_user_name

This is a string data type feature that denotes the name of the user who asked the question. For the analysis of question_user_name, I’ll be plotting a histogram of the number of words in this feature.


I did not find this feature of much use, therefore I won’t be using it for modeling.


question_user_page

This is a string data type feature that holds the URL to the profile page of the user who asked the question.


On the profile page, I noticed 4 useful features that could be used and should possibly contribute to good predictions. The features are:
  • Reputation: Denotes the reputation of the user.
  • gold_score: The number of gold medals awarded.
  • silver_score: The number of silver medals awarded.
  • bronze_score: The number of bronze medals awarded.


answer

This is again a string data type feature that holds the detailed text of the answer to the question. For the analysis of answer, I’ll be plotting a histogram of the number of words in this feature.


From the analysis, it is evident that:
  • Most of the answer features have a word length of around 143.
  • The minimum answer length is 2 and the maximum is 8158.
  • 50% of answer values have lengths between 48 and 170.
  • 25% of answer values have lengths between 2 and 48.
  • 25% of answer values have lengths between 170 and 8158.


This distribution also looks like a power-law distribution; it can likewise be converted to a more Gaussian-like distribution using a log transform and then used as an engineered feature.


answer_user_name

This is a string data type feature that denotes the name of the user who answered the question.


I did not find this feature of much use, therefore I won’t be using it for modeling.


answer_user_page

This is a string data type feature similar to the feature “question_user_page” that holds the URL to the profile page of the user who answered the question.


I also used the URL in this feature to scrape the external data from the user’s profile page, similar to what I did for the feature ‘question_user_page’.


url

This feature holds the URL of the question and answers page on StackExchange or StackOverflow. Below I’ve printed the first 10 url data-points from train.csv


One thing to notice is that this feature lands us on the question-answer page, and that page usually contains a lot more data like comments, upvotes, other answers, etc., which can be used for generating more features if the model does not perform well due to the limited data in train.csv. Let’s see what data is present and what additional data can be scraped from the question-answer page.


webpage source

In the snapshot attached above, Post 1 and Post 2 contain the answers, upvotes, and comments for the question asked, in decreasing order of upvotes. The post with a green tick is the one containing the answer provided in the train.csv file.


Each question may have more than one answer. We can scrape these answers and use them as additional data.


webpage source

The snapshot above defines the anatomy of a post. We can scrape useful features like upvotes and comments and use them as additional data.


Below is the code for scraping the data from the URL page.

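The original scraping gist is not reproduced here, so below is only a hedged sketch of how such scraping could be done with requests and BeautifulSoup. The CSS class names ('answer', 'js-vote-count', 'comment-copy') are my assumptions about the StackExchange page markup and may need adjusting:

import requests
from bs4 import BeautifulSoup

def scrape_qa_page(url):
    # Fetch the question-answer page and pull out each answer post.
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    posts = soup.find_all('div', class_='answer')
    scraped = []
    for post in posts[:4]:  # the provided answer plus up to 3 alternatives
        votes = post.find(class_='js-vote-count')
        comment = post.find(class_='comment-copy')
        scraped.append({
            'upvotes': votes.get_text(strip=True) if votes else '0',
            'answer_text': post.get_text(separator=' ', strip=True),
            'top_comment': comment.get_text(strip=True) if comment else '',
        })
    return scraped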

There are 8 new features that I’ve scraped:
  • upvotes: The number of upvotes on the provided answer.
  • comments_0: Comments on the provided answer.
  • answer_1: Most voted answer apart from the one provided.
  • comment_1: Top comment to answer_1.
  • answer_2: Second most voted answer.
  • comment_2: Top comment to answer_2.
  • answer_3: Third most voted answer.
  • comment_3: Top comment to answer_3.


category

This is a categorical feature that indicates the category of each question-answer pair. Below I’ve printed the first 10 category data-points from train.csv


Below is the code for plotting a pie chart of category.

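A minimal sketch of such a pie chart (train is the DataFrame loaded from train.csv earlier; the column name category comes from the dataset):

import matplotlib.pyplot as plt

category_counts = train['category'].value_counts()
category_counts.plot.pie(autopct='%1.1f%%', figsize=(6, 6))
plt.ylabel('')
plt.title('Proportion of each category')
plt.show()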

The chart tells us that most of the points belong to the category TECHNOLOGY and the fewest belong to LIFE_ARTS (709 out of 6079).


host

This feature holds the host or domain of the question and answers page on StackExchange or StackOverflow. Below I’ve printed the first 10 host data-points from train.csv


Below is the code for plotting a bar graph of unique hosts.

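A minimal sketch of that bar graph (again using the host column of the training DataFrame):

import matplotlib.pyplot as plt

host_counts = train['host'].value_counts()
host_counts.plot.bar(figsize=(16, 4))
plt.xlabel('host')
plt.ylabel('number of data points')
plt.show()
print('number of unique hosts:', train['host'].nunique())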

It turns out there are just 63 different subdomains present in the training data. Most of the data points are from stackoverflow.com, whereas the fewest are from meta.math.stackexchange.com.


Target values

Let’s analyze the target values that we need to predict. But first, for the sake of a better interpretation, please check out the full dataset on kaggle using this link.


Below is the code block displaying the statistical description of the target values. These are only the first 6 features out of all the 30 features. The values of all the features are of type float and are between 0 and 1.


Notice the second code block which displays the unique values present in the dataset. There are just 25 unique values between 0 and 1. This could be useful later while fine-tuning the code.

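A minimal sketch of both checks, assuming the first 11 columns are inputs and the last 30 are targets as described above:

import numpy as np

target_cols = train.columns[11:]              # the last 30 columns are the targets
print(train[target_cols].describe())          # statistical description of the targets
print(np.unique(train[target_cols].values))   # only 25 unique values between 0 and 1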

Finally, let’s check the distribution of the target features and their correlation.


Histograms of the target features.
Heatmap of correlation between target features.

Modeling

Image by Author

Now that we know our data better through EDA, let’s begin with modeling. Below are the subtopics that we’ll go through in this section:


  • Overview of the architecture: Quick rundown of the ensemble architecture and its different components.
  • Base learners: Overview of the base learners used in the ensemble.
  • Preparing the data: Data cleaning and preparation for modeling.
  • Ensembling: Creating models for training and predicting; pipelining the data preparation, model training, and model prediction steps.
  • Getting the scores from Kaggle: Submitting the predicted target values for test data on Kaggle and generating a leaderboard score to see how well the ensemble did.

I tried various deep neural network architectures with GRU, Conv1D, and Dense layers and with different features for the competition, but an ensemble of 8 transformers (as shown above) seems to work the best. In this part, we will be focusing on the final architecture of the ensemble; for the other baseline models that I experimented with, you can check out my github repo.


Overview of the architecture:


Remember, our task was: for a given question_title, question_body, and answer, predict 30 target labels. Out of these 30 target labels, the first 21 are related to question_title and question_body and have no connection to the answer, whereas the last 9 target labels are related to the answer; but out of these 9, some also take question_title and question_body into the picture. E.g., features like answer_relevance and answer_satisfaction can only be rated by looking at both the question and the answer.


With some experimentation, I found that the base-learner (BERT_base) performs exceptionally well in predicting the first 21 target features (related to questions only) but does not perform that well in predicting the last 9 target features. Taking note of this, I constructed 3 dedicated base-learners and 2 different datasets to train them.


  1. The first base-learner was dedicated to predicting the question-related features (first 21) only. The dataset used for training this model consisted of features question_title and question_body only.
  2. The second base-learner was dedicated to predicting the answer-related features (last 9) only. The dataset used for training this model consisted of features question_title, question_body, and answer.
  3. The third base-learner was dedicated to predicting all the 30 features. The dataset used for training this model again consisted of features question_title, question_body, and answer.

To make the architecture even more robust, I used 3 different types of base learners — BERT, RoBERTa, and XLNet. We will be going through these different transformer models later in this blog.


In the ensemble diagram above, we can see —


  • The 2 datasets consisting of [question_title + question_body] and [question_title + question_body + answer] being used separately to train different base learners.
  • Then we can see the 3 different base learners (BERT, RoBERTa, and XLNet) dedicated to predicting the question-related features only (first 21) colored in blue, using the dataset [question_title + question_body].
  • Next, we can see the 3 different base learners (BERT, RoBERTa, and XLNet) dedicated to predicting the answer-related features only (last 9) colored in green, using the dataset [question_title + question_body + answer].
  • Finally, we can see the 2 different base learners (BERT, and RoBERTa) dedicated to predicting all the 30 features colored in red, using the dataset [question_title + question_body + answer].

In the next step, the predicted data from the models dedicated to predicting the question-related features only (denoted as bert_pred_q, roberta_pred_q, xlnet_pred_q) and the predicted data from the models dedicated to predicting the answer-related features only (denoted as bert_pred_a, roberta_pred_a, xlnet_pred_a) are collected and concatenated column-wise, which yields predictions covering all 30 features. These concatenated predictions are denoted as xlnet_concat, roberta_concat, and bert_concat.


Similarly, the predicted data from models dedicated to predicting all the 30 features (denoted as bert_qa, roberta_qa) is collected. Notice that I’ve not used the XLNet model here for predicting all the 30 features because the scores were not up to the mark.


Finally, after collecting all the different predicted data — [xlnet_concat, roberta_concat, bert_concat, bert_qa, and roberta_qa], the final value is calculated by taking the average of all the different predicted values.


Base learners


Now we will take a look at the 3 different transformer models that were used as base learners.


  1. bert_base_uncased:


BERT was proposed by Google AI in late 2018 and since then it has become state-of-the-art for a wide spectrum of NLP tasks. It uses an architecture derived from transformers, pre-trained over a lot of unlabeled text data, to learn a language representation that can be fine-tuned for specific machine learning tasks. BERT outperformed the NLP state-of-the-art on several challenging tasks. This performance of BERT can be ascribed to the transformer’s encoder architecture, unconventional training objectives like the Masked Language Model (MLM) and Next Sentence Prediction (NSP), and the humongous amount of text data (all of Wikipedia and a book corpus) that it is trained on. BERT comes in different sizes, but for this challenge, I’ve used bert_base_uncased.


Image by Author

The architecture of bert_base_uncased consists of 12 encoder cells with 12 attention heads in each encoder cell. It takes an input of up to 512 tokens and by default returns 2 values: the pooled output corresponding to the first input token [CLS], which has a dimension of 768, and the sequence output corresponding to all 512 input tokens, which has a dimension of (512, 768). Apart from these, we can also access the hidden states returned by each of the 12 encoder cells by passing output_hidden_states=True as one of the parameters. BERT accepts several kinds of input; for this challenge, the inputs I’ll be using are of 3 types:


  • input_ids: The token embeddings are numerical representations of the words in the input sentence. BERT also uses sub-word tokenization to first break down larger or complex words into simpler pieces and then convert them into tokens. For example, in the diagram above, notice how the word ‘playing’ was broken into ‘play’ and ‘##ing’ before generating the token embeddings. This tweak in tokenization works wonders as it utilizes the sub-word context of a complex word instead of just treating it like a new word.

  • token_type_ids: The segment embeddings are used to help BERT distinguish between the different sentences in a single input. The elements of this embedding vector are all the same for words from the same sentence, and the value changes if the sentence is different.

    Let’s consider an example: suppose we want to pass the two sentences “I have a pen” and “The pen is red” to BERT. The tokenizer will first tokenize these sentences as: [‘[CLS]’, ‘I’, ‘have’, ‘a’, ‘pen’, ‘[SEP]’, ‘the’, ‘pen’, ‘is’, ‘red’, ‘[SEP]’] and the segment embeddings for these will look like: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]. Notice how all the elements corresponding to the words in the first sentence are 0 whereas all the elements corresponding to the words in the second sentence are 1.

  • attention_mask: The mask tokens that help BERT understand which input tokens are relevant and which are just there for padding.

    Since BERT takes a 512-token input, suppose we have an input of only 10 words. To make the tokenized words compatible with the input size, we will add padding of size 512 - 10 = 502 at the end. Along with the padding, we will generate a mask of size 512 in which the indices corresponding to the relevant words have 1s and the indices corresponding to padding have 0s. (A short tokenizer sketch follows this list.)
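Here is a small sketch that produces the three inputs for the example sentences above, using the Hugging Face transformers tokenizer (the exact padding argument names can vary slightly between library versions):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer.encode_plus(
    'I have a pen',       # sentence A (e.g. the question)
    'The pen is red',     # sentence B (e.g. the answer)
    max_length=512,
    padding='max_length',
    truncation=True,
)
print(encoded['input_ids'][:13])       # token ids
print(encoded['token_type_ids'][:13])  # segment ids: 0 for sentence A, 1 for sentence B
print(encoded['attention_mask'][:13])  # 1 for real tokens, 0 for padding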

2. XLNet_base_cased:


XLNet was proposed by the Google AI Brain team and researchers at CMU in mid-2019. Its architecture is larger than BERT’s and it uses an improved methodology for training. It is trained on larger data and shows better performance than BERT on many language tasks. The conceptual difference between BERT and XLNet is that BERT predicts the masked words all at once, given the rest of the sentence as context, whereas XLNet learns to predict the words in an arbitrary order but in an autoregressive manner (not necessarily left-to-right).


The prediction scheme for a traditional language model. Shaded words are provided as input to the model while unshaded words are masked out.
An example of how a permutation language model would predict tokens for a certain permutation. Shaded words are provided as input to the model while unshaded words are masked out.

This helps the model learn bidirectional relationships and therefore better handle dependencies and relations between words. In addition to the training methodology, XLNet uses a Transformer-XL based architecture and 2 main key ideas: relative positional embeddings and the recurrence mechanism, which showed good performance even in the absence of permutation-based training. XLNet was trained with over 130 GB of textual data on 512 TPU chips running for 2.5 days, both of which are much larger than for BERT.


For XLNet, I’ll be using only input_ids and attention_mask as input.


3. RoBERTa_base:


RoBERTa was proposed by Facebook in mid-2019. It is a robustly optimized method for pretraining natural language processing (NLP) systems that improves on BERT’s self-supervised method. RoBERTa builds on BERT’s language masking strategy, wherein the system learns to predict intentionally hidden sections of text within otherwise unannotated language examples. RoBERTa modifies key hyperparameters in BERT, including removing BERT’s Next Sentence Prediction (NSP) objective and training with much larger mini-batches and learning rates. This allows RoBERTa to improve on the masked language modeling objective compared with BERT and leads to better downstream task performance. RoBERTa was also trained on more data than BERT and for a longer amount of time. The dataset used was drawn from existing unannotated NLP data sets as well as CC-News, a novel set drawn from public news articles.


For RoBERTa_base, I’ll be using only input_ids and attention_mask as input.


Finally here is the comparison of BERT, XLNet, and RoBERTa:


source link

Preparing the data


Now that we have gained some idea about the architecture let’s see how to prepare the data for the base learners.


  • As a preprocessing step, I have just handled the HTML syntax present in the features. I used html.unescape() to extract the text from HTML DOM elements. In the code snippet below, the function get_data() reads the train and test data and applies this preprocessing to the features question_title, question_body, and answer.

  • The next step was to create input_ids, attention_masks, and token_type_ids from the input sentence. In the code snippet below, the function get_tokenizer() collects the pre-trained tokenizer for the different base learners. The second function, fix_length(), goes through the generated question tokens and answer tokens and makes sure that the maximum number of tokens is 512. The steps for fixing the number of tokens are as follows (see the sketch after this list):
    - If the input sentence has more than 512 tokens, the sentence is trimmed down to 512.
    - To trim the number of tokens, 256 tokens from the beginning and 256 tokens from the end are kept and the remaining tokens are dropped.
    - For example, suppose an answer has 700 tokens. To trim this down to 512, 256 tokens from the beginning and 256 tokens from the end are taken and concatenated to make 512 tokens. The remaining 700 - (256 + 256) = 188 tokens in the middle of the answer are dropped.
    - The logic makes sense because in a long text, the beginning usually describes what the text is all about and the end describes the conclusion.
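A minimal sketch of that head-and-tail truncation (my own simplified version; the notebook's fix_length() may differ in how it splits the budget between question and answer):

def fix_length(tokens, max_len=512):
    # Keep the first and last max_len/2 tokens, dropping the middle of long inputs.
    if len(tokens) > max_len:
        half = max_len // 2
        tokens = tokens[:half] + tokens[-half:]
    return tokens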

Next is the code block for generating the input_ids, attention_masks, and token_type_ids. I’ve used a condition that checks if the function needs to return the generated data for base learners relying on the dataset [question_title + question_body] or the dataset [question_title + question_body + answer].


Finally, here is the function that makes use of the function initialized above and generates input_ids, attention_masks, and token_type_ids for each of the instances in the provided data.


To make the model training easy, I also created a class that generates train and cross-validation data based on the fold while using KFold CV, with the help of the functions specified above.


Ensembling


After data preprocessing, let's create the model architecture starting with base learners.


The code below takes the model name as input, collects the pre-trained model and its configuration information according to the input name, and creates the base learner model. Notice that output_hidden_states=True is passed after adding the config data.


The next code block is to create the ensemble architecture. The function accepts 2 parameters: name, which expects the name of the model that we want to train, and model_type, which expects the type of model we want to train. The model name can be bert-base-uncased, roberta-base, or xlnet-base-cased, whereas the model type can be questions, answers, or question_answers. The function create_model() takes the model_name and model_type and generates a model that can be trained on the specified data accordingly.

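The original gist is not shown here, so the following is only a rough sketch of what a create_model() along these lines could look like, using Keras and the Hugging Face library. Only the BERT variant is shown, and the pooling and output-head choices are my assumptions, not necessarily the author's exact layers:

import tensorflow as tf
from transformers import TFBertModel

def create_model(model_name='bert-base-uncased', model_type='question_answers'):
    # Number of outputs per model type, as described above.
    n_outputs = {'questions': 21, 'answers': 9, 'question_answers': 30}[model_type]

    input_ids = tf.keras.Input(shape=(512,), dtype=tf.int32, name='input_ids')
    attention_mask = tf.keras.Input(shape=(512,), dtype=tf.int32, name='attention_mask')

    backbone = TFBertModel.from_pretrained(model_name)
    sequence_output = backbone(input_ids, attention_mask=attention_mask)[0]

    pooled = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
    outputs = tf.keras.layers.Dense(n_outputs, activation='sigmoid')(pooled)
    return tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)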

Now let's create a function for calculating the evaluation metric Spearman’s correlation coefficient.


Now we need a function that can collect the base learner model and the data matching that base learner, and train the model. I’ve used K-Fold cross-validation with 5 folds for training.


Now, once we have trained the models and generated the predicted values, we need a function for calculating the weighted average. Here’s the code for that. (The weights in the weighted average are all 1s.)

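As a minimal sketch, assuming the five prediction arrays from the diagram are available as numpy arrays of shape (n_rows, 30):

import numpy as np

all_preds = [xlnet_concat, roberta_concat, bert_concat, bert_qa, roberta_qa]
final_pred = np.average(all_preds, axis=0, weights=[1, 1, 1, 1, 1])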

Before bringing everything together, there is one more function that I used for processing the final predicted values. Remember, in the EDA section there was an analysis of the target values where we noticed that the targets take only 25 unique float values between 0 and 1. To make use of that information, I calculated 61 (a hyperparameter) uniformly distributed percentile values and mapped them to the 25 unique values. This created 61 bins uniformly spaced between the lower and upper range of the target values. Then, to process the predicted data, I used those bins to collect the predicted values and put them in the right place/order. This trick helped improve the final leaderboard score to some extent.

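My interpretation of this trick, as a hedged sketch (the exact mapping used in the notebook may differ): snap every prediction to the nearest of 61 percentile values computed from the predictions themselves.

import numpy as np

n_bins = 61  # hyperparameter from the text
cut_points = np.percentile(final_pred, np.linspace(0, 100, n_bins))
nearest = np.abs(final_pred[..., None] - cut_points).argmin(axis=-1)
final_pred_processed = cut_points[nearest]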

Finally, to bring the data preprocessing, model training, and post-processing together, I created the get_predictions() function that:
  • Collects the data.
  • Creates the 8 base learners.
  • Prepares the data for the base learners.
  • Trains the base learners and collects the predicted values from them.
  • Calculates the weighted average of the predicted values.
  • Processes the weighted average prediction.
  • Converts the final predicted values into the dataframe format requested by Kaggle for submission and returns it.


Getting the scores from Kaggle


Once the code compiles and runs successfully, it generates an output file that can be submitted to Kaggle for score calculation. The ranking of the code on the leaderboard is generated using this score. The ensemble model got a public score of 0.43658, which puts it in the top 4.4% on the leaderboard.


Post-modeling analysis

Check out the notebook with the complete post-modeling analysis (Kaggle link).


It’s time for some post-modeling analysis!


In this section, we will go through an analysis of the train data to figure out what parts of the data the model is doing well on and what parts it is not. The main idea behind this step is to understand the capability of the trained model, and it works like a charm if applied properly for fine-tuning the model and data. We won’t get into the fine-tuning part in this section, though; we will just be performing some basic EDA on the train data using the predicted target values for the train data. I’ll be covering the data feature by feature. Here are the top features we’ll be performing the analysis on:


  • question_title, question_body, and answer.
  • Word lengths of question_title, question_body, and answer.
  • Host
  • Category

First, we will have to divide the data into a spectrum of good data and bad data. Good data will be the data points on which the model achieves a good score, and bad data will be the data points on which the model achieves a bad score. For scoring, we will compare the actual target values of the train data with the model’s predicted target values on the train data. I used mean squared error (MSE) as the metric for scoring since it focuses on how close the actual and predicted values are. Remember, the higher the MSE-score, the worse the data point. Calculating the MSE-score is pretty simple. Here’s the code:


# Generating the MSE-score for each data point in train data.
import numpy as np
from sklearn.metrics import mean_squared_error

train_score = [mean_squared_error(i, j) for i, j in zip(y_pred, y_true)]

# Sorting the losses from minimum to maximum, index-wise.
train_score_args = np.argsort(train_score)

question_title, question_body, and answer


Starting with the first set of features, which are all text type features, I’ll be plotting word clouds using them. The plan is to segment out these features from 5 data-points that have the least scores and from another 5 data-points that have the most scores.


function for plotting the word-clouds
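The plotting helper itself is not reproduced above, so here is a small sketch of what such a function could look like using the wordcloud package (the function name and styling are my own):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_wordcloud(texts, title=''):
    # texts: an iterable of strings (e.g. the 5 best- or worst-scoring answers).
    wc = WordCloud(width=800, height=400, background_color='white')
    wc.generate(' '.join(texts))
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(title)
    plt.show()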

Let’s run the code and check what the results look like.


Word lengths of question_title, question_body, and answer


The next analysis is on the word lengths of question_title, question_body, and answer. For that, I’ll be picking 30 data-points that have the lowest MSE-scores and 30 data-points that have the highest MSE-scores for each of the 3 features question_title, question_body, and answer. Next, I’ll be calculating the word lengths of these 30 data-points for all the 3 features and plot them to see the trend.

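A small sketch of how these subsets can be selected, assuming train is the training DataFrame and train_score_args comes from the MSE code shown earlier:

best_30 = train.iloc[train_score_args[:30]]    # 30 lowest MSE-scores
worst_30 = train.iloc[train_score_args[-30:]]  # 30 highest MSE-scores

best_30_answer_lengths = best_30['answer'].str.split().str.len()
worst_30_answer_lengths = worst_30['answer'].str.split().str.len()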

If we look at the number of words in question_title, question_body, and answer, we can observe that the data that generates a high loss has a high number of words, which means those questions and answers are rather thorough. So, the model does a good job when the questions and answers are concise.


host


The next analysis is on the feature host. For this feature, I’ll be picking 100 data-points that have the lowest MSE-scores and 100 data-points that have the highest MSE-scores and select the values in the feature host. Then I’ll be plotting a histogram of this categorical feature to see the distributions.


We can see that there are a lot of data points from the domains english, biology, scifi, and physics that contribute to a lower loss value, whereas there are a lot of data points from drupal, programmers, and tex that contribute to a higher loss.


Let’s also take a look at word-clouds of the unique host values that contribute to a low score and a high score. This analysis is again done using the top and bottom 100 data-points.


Category


The final analysis is on the feature category. For this feature, I’ll be picking 100 data-points that have the lowest MSE-scores and 100 data-points that have the highest MSE-scores and select the values in the feature category. Then I’ll be plotting a pie-chart of this categorical feature to see the proportions.


We can notice that data points with TECHNOLOGY as the category make up 50% of the data that the model could not predict well, whereas categories like LIFE_ARTS, SCIENCE, and CULTURE contribute much less to bad predictions. For the good predictions, all 5 categories contribute almost the same since there is no major difference in the proportions; still, we could say that the data points with STACKOVERFLOW as the category contribute the least.


With this, we have come to the end of this blog and the 3-part series. Hope the read was pleasant. You can check the complete notebook on Kaggle using this link and leave an upvote if you found my work useful. I would like to thank all the creators for creating the awesome content I referred to while writing this blog.


Reference links:


  • Applied AI Course: https://www.appliedaicourse.com/
  • https://www.kaggle.com/c/google-quest-challenge/notebooks
  • http://jalammar.github.io/illustrated-transformer/
  • https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04
  • https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8

Final note


Thank you for reading the blog. I hope it was useful for some of you aspiring to do projects or learn some new concepts in NLP.


In part 1/3 we covered how Transformers became state-of-the-art in various modern natural language processing tasks and their working.


In part 2/3 we went through BERT (Bidirectional Encoder Representations from Transformers).


Kaggle in-depth EDA notebook link: https://www.kaggle.com/sarthakvajpayee/top-4-4-in-depth-eda-feature-scraping?scriptVersionId=40263047


Kaggle modeling notebook link: https://www.kaggle.com/sarthakvajpayee/top-4-4-bert-roberta-xlnet


Kaggle post-modeling notebook link: https://www.kaggle.com/sarthakvajpayee/top-4-4-post-modeling-analysis?scriptVersionId=40262842


Find me on LinkedIn: www.linkedin.com/in/sarthak-vajpayee


Find this project on Github: https://github.com/SarthakV7/Kaggle_google_quest_challenge


Peace! ☮


Translated from: https://towardsdatascience.com/hands-on-transformers-kaggle-google-quest-q-a-labeling-affd3dad7bcb
