
‘Attention Is All You Need’


New deep learning models are introduced at an increasing rate, and sometimes it’s hard to keep track of all the novelties.


In this article we will talk about Transformers, a type of neural network architecture that has been gaining popularity. An attached notebook provides a text classification example.

In this post, we will address the following questions related to Transformers:

Table of Contents:

  • Why do we need the Transformer?
  • The Transformer and its architecture in detail.
  • Text classification with the Transformer.
  • Useful papers for working with Transformers.


I - Why do we need the Transformer?

Transformers were developed to solve the problem of sequence transduction, of which neural machine translation is the best-known example. Sequence transduction means any task that transforms an input sequence into an output sequence. This includes speech recognition, text-to-speech transformation, etc.

For models to perform sequence transduction, it is necessary to have some sort of memory.


The limitations of long-term dependencies:

The Transformer is an architecture for transforming one sequence into another with the help of two parts (an Encoder and a Decoder), but it differs from existing sequence-to-sequence models because it does not use any recurrent networks (GRU, LSTM, etc.).

The Transformer architecture was introduced in the paper “Attention Is All You Need”, and as the title indicates, it relies on the attention mechanism (we will write a detailed article about it later).

Let’s consider a language model that predicts the next word based on the previous ones!


Sentence: “Bitcoin is the best cryptocurrency”

Here we don’t need any additional context: it is fairly obvious that the word following “Bitcoin is the best” will be “cryptocurrency”.

In this case RNNs can solve the problem and predict the answer using the recent information.


But in other cases we need more context. For example, let’s say that you are trying to predict the last word of the text: I grew up in Tunisia … I speak fluent ... The recent words suggest that the next word is probably a language, but if we want to narrow down which language, we need the context of Tunisia, which is further back in the text.

RNNs become very ineffective when the gap between the relevant information and the point where it is needed becomes very large. This is because the information is passed along at each step, and the longer the chain is, the more likely it is that the information gets lost along the way.
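One way to get a feel for this loss along the chain is a toy, one-dimensional recurrence. The sketch below is only an illustration under assumed values (the weight 0.9 and the constant input 0.5 are hypothetical, not taken from any real model): it measures how strongly the final hidden state still depends on the initial one as the chain grows.

```python
import math


def gradient_through_chain(num_steps, weight=0.9, step_input=0.5):
    """Return d h_T / d h_0 for the toy recurrence h_t = tanh(weight * h_{t-1} + step_input)."""
    hidden = 0.0
    grad = 1.0
    for _ in range(num_steps):
        hidden = math.tanh(weight * hidden + step_input)
        # Chain rule: every step multiplies the gradient by weight * (1 - h_t^2).
        grad *= weight * (1.0 - hidden ** 2)
    return grad


# The influence of the first state shrinks as the chain gets longer.
for steps in (1, 5, 20, 50):
    print(steps, gradient_through_chain(steps))
```

Every factor in the product is smaller than 1 here, so the influence of early inputs decays quickly with distance, which is the behaviour described above.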


I recommend a nice article that discusses in depth the difference between seq2seq and the Transformer.

II - The Transformer and its architecture in detail:

An image is worth a thousand words, so we will start with that!

The first thing that we can see is that it has a sequence-to-sequence encoder-decoder architecture.

Much of the literature on Transformers available on the Internet uses this very architecture to explain Transformers.


But this is not the one used in OpenAI’s GPT model (or the GPT-2 model, which was just a larger version of its predecessor).

GPT is a 12-layer decoder-only Transformer with 117M parameters.


Unlike Seq2Seq, the Transformer has a stack of 6 encoders and 6 decoders.

Each encoder contains two sub-layers: a multi-head self-attention layer and a fully connected feed-forward network.

Each decoder contains three sub-layers: a masked multi-head self-attention layer, an additional layer that performs multi-head attention over the encoder outputs, and a fully connected feed-forward network.
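The attention sub-layers listed above are all built on scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, as defined in the “Attention Is All You Need” paper. Here is a minimal pure-Python sketch of a single attention head. It is a simplification: the learned query/key/value projections and the split into multiple heads are omitted, and the three toy token vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[dim] for w, v in zip(weights, values))
            for dim in range(len(values[0]))
        ])
    return outputs


# Self-attention: queries, keys and values all come from the same (toy) tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
print(attended)
```

In self-attention the queries, keys and values come from the same sequence. In the decoder's second attention layer, the queries come from the decoder while the keys and values come from the encoder outputs.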


Each sub-layer in the encoder and decoder is wrapped in a residual connection followed by layer normalization.
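In other words, the output of each sub-layer is LayerNorm(x + Sublayer(x)). The sketch below shows that wrapping on a single vector. It is a simplified illustration: the learned gain and bias of layer normalization are left out, and `toy_feed_forward` is a hypothetical stand-in for a real sub-layer.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]


def residual_sublayer(vector, sublayer):
    """Apply LayerNorm(x + Sublayer(x))."""
    added = [x + y for x, y in zip(vector, sublayer(vector))]
    return layer_norm(added)


def toy_feed_forward(vector):
    # Hypothetical stand-in for a real sub-layer: scale, then apply ReLU.
    return [max(0.0, 2.0 * x) for x in vector]


normalized = residual_sublayer([1.0, -2.0, 3.0, 0.5], toy_feed_forward)
print(normalized)
```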


All input and output tokens fed to the encoder and decoder are converted to vectors using learned embeddings.

Positional encodings are then added to these input embeddings.


The Transformer architecture does not contain any recurrence or convolution and hence has no notion of word order.

All the words of the input sequence are fed to the network with no special order or position, as they all flow simultaneously through the encoder and decoder stack.

To understand the meaning of a sentence, it is essential to understand the position and the order of the words.
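This is the job of the positional encoding. As an illustration, here is a small sketch of the sinusoidal scheme used in the “Attention Is All You Need” paper, where even dimensions use a sine and odd dimensions use a cosine of pos / 10000^(2i/d_model). The function names and the toy embedding values are assumptions made for this example.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encodings: one vector of size d_model per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the positional encoding to each token embedding."""
    encodings = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, enc)] for emb, enc in zip(embeddings, encodings)]


# The same (toy) embedding at two positions ends up with two different vectors.
same_word_twice = [[0.2, 0.4, 0.6, 0.8], [0.2, 0.4, 0.6, 0.8]]
print(add_positions(same_word_twice))
```

Because the encoding differs at every position, two occurrences of the same word no longer look identical to the network.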


III - Text classification using the Transformer (PyTorch implementation):

It is very simple to use the ClassificationModel from simpletransformers:

    ClassificationModel('Architecture', 'model shortcut name', use_cuda=True, num_labels=4)

Architecture: Bert, Roberta, Xlnet, Xlm…
Model shortcut names for Roberta: roberta-base, roberta-large…
More details here.

We create a model that classifies text into 4 classes: ['art', 'politics', 'health', 'tourism'].

We apply this model in our previous project, and we integrate it in our Flask application here. (You can buy it to help us create better content and help the community.)

Here you will find a commented notebook:


  • Setup environment & configuration
!pip install --upgrade transformers
!pip install simpletransformers
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
  • Importing Libraries
import psutil
import humanize
import os
import GPUtil as GPU
import numpy as np
import pandas as pd
from google.colab import files
from tqdm import tqdm
import warnings
warnings.simplefilter('ignore')
import gc
from scipy.special import softmax
from simpletransformers.classification import ClassificationModel
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
import sklearn
from sklearn.metrics import log_loss
from sklearn.metrics import *
from sklearn.model_selection import *
import re
import random
import torch
pd.options.display.max_colwidth = 200

# Use the same seed everywhere so that our model is reproducible
def seed_all(seed_value):
    random.seed(seed_value)  # Python
    np.random.seed(seed_value)  # NumPy (CPU)
    torch.manual_seed(seed_value)  # PyTorch (CPU)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)  # PyTorch (GPU)
        torch.backends.cudnn.deterministic = True  # needed
        torch.backends.cudnn.benchmark = False

seed_all(2)
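To see why seeding matters, here is a tiny standard-library illustration of the same idea that seed_all applies to Python, NumPy and PyTorch. The helper `sample_with_seed` is a hypothetical name used only for this example.

```python
import random


def sample_with_seed(seed_value, count=3):
    """Seed Python's generator, then draw a few pseudo-random numbers."""
    random.seed(seed_value)
    return [random.random() for _ in range(count)]


first_run = sample_with_seed(2)
second_run = sample_with_seed(2)
other_seed = sample_with_seed(3)

print(first_run == second_run)  # same seed, same numbers
print(first_run == other_seed)  # different seed, different numbers
```

The same principle is what makes weight initialization and data shuffling repeatable from one run to the next once every library involved has been seeded.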
  • Reading Data
import pandas as pd

# We assume that our data is a CSV file with 2 columns: text and label
# Use the pandas function read_csv to read the file (pass the path to your CSV file)
train = pd.read_csv()
feat_cols = "text"
  • Verify the topic classes in the data
label_cols = ['art', 'politics', 'health', 'tourism']
train.head()

# Get the numerical ids of the label column
# (categories are given explicitly so that id i corresponds to label_cols[i];
# this assumes the label column contains exactly these four strings)
train['label'] = pd.Categorical(train['label'], categories=label_cols)
Y = train['label'].cat.codes
train['label'] = Y
# Print initial shape
print(Y.shape)

from keras.utils import to_categorical
# One-hot encode the indexes
Y = to_categorical(Y)
# Check the new shape of the variable
print(Y.shape)
# Print the first 5 rows
print(Y[0:5])
for i in range(len(label_cols)):
    train[label_cols[i]] = Y[:, i]
  • Train the model
%%time
# Using K-fold cross-validation is important to test our model
err = []
fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1997)
for train_index, test_index in fold.split(train, train['label']):
    train1_trn, train1_val = train.iloc[train_index], train.iloc[test_index]
    model = ClassificationModel('roberta', 'roberta-base', use_cuda=True, num_labels=4, args={
        'train_batch_size': 16,
        'reprocess_input_data': True,
        'overwrite_output_dir': True,
        'fp16': False,
        'do_lower_case': False,
        'num_train_epochs': 4,
        'max_seq_length': 128,
        'regression': False,
        'manual_seed': 1997,
        'learning_rate': 2e-5,
        'weight_decay': 0,
        'save_eval_checkpoints': True,
        'save_model_every_epoch': False,
        'silent': True})
    model.train_model(train1_trn)
    # eval_model returns (result, model_outputs, wrong_predictions); keep the raw outputs
    raw_outputs_val = model.eval_model(train1_val)[1]
    # Convert the raw outputs to probabilities
    raw_outputs_vals = softmax(raw_outputs_val, axis=1)
    print(f"Log_Loss: {log_loss(train1_val['label'], raw_outputs_vals)}")
    err.append(log_loss(train1_val['label'], raw_outputs_vals))

Output:

Log_Loss: 0.35637871529928816
CPU times: user 11min 2s, sys: 4min 21s, total: 15min 23s
Wall time: 16min 7s
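The training loop above relies on stratified 5-fold cross-validation: every fold keeps roughly the same proportion of each class as the full dataset. The sketch below shows the idea in plain Python. It is not scikit-learn's actual algorithm (for one thing, it does not shuffle), and the label counts are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Split sample indices into n_splits folds, preserving class proportions."""
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in indices_by_class.values():
        # Deal the samples of each class to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


# Hypothetical, imbalanced label column.
labels = ['art'] * 10 + ['politics'] * 5 + ['health'] * 5
folds = stratified_folds(labels, n_splits=5)

for number, validation_indices in enumerate(folds, start=1):
    held_out = set(validation_indices)
    training_indices = [i for i in range(len(labels)) if i not in held_out]
    print(number, len(training_indices), len(validation_indices))
```

Each fold serves once as the validation set while the remaining folds are used for training, which is why the loop above trains the model five times.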


Log Loss:

print("Mean LogLoss: ", np.mean(err))

Output:

Mean LogLoss: 0.34930175561484067
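To make this number easier to interpret, here is a minimal sketch of what multi-class log loss measures: the average negative log-probability that the model assigns to the true class. It is a simplified stand-in for sklearn's log_loss, not its implementation, and the clipping constant `eps` is an assumption of this example.

```python
import math


def multiclass_log_loss(true_labels, probabilities, eps=1e-15):
    """Average of -log(p_true), where p_true is the probability given to the true class."""
    total = 0.0
    for label, row in zip(true_labels, probabilities):
        # Clip to avoid log(0) when the model is confidently wrong.
        p_true = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p_true)
    return total / len(true_labels)


# Baseline: a model that always predicts 25% for each of the 4 classes.
uniform_baseline = multiclass_log_loss([0], [[0.25, 0.25, 0.25, 0.25]])
print(uniform_baseline)
```

Lower is better: a perfect model scores close to 0, while uniform guessing over 4 classes scores ln 4, so the mean log loss reported above is well below that baseline.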



Output :


array([[9.9822301e-01, 3.4856689e-04, 3.8243082e-04, 1.0458552e-03], [9.9695909e-01, 1.1522240e-03, 5.9563853e-04, 1.2927916e-03], [9.9910539e-01, 2.3084633e-04, 2.5905663e-04, 4.0465154e-04], ..., [3.6545596e-04, 2.8826005e-04, 4.3145564e-04, 9.9891484e-01], [4.0789684e-03, 9.9224585e-01, 1.2752400e-03, 2.3997365e-03], [3.7382307e-04, 3.4797701e-04, 3.6257200e-04, 9.9891579e-01]], dtype=float32)


  • Test our model
pred = model.predict(['i want to travel to thailand'])[1]
preds = softmax(pred, axis=1)
preds

output :


array([[6.0461409e-04, 3.6119239e-04, 3.3729596e-04, 9.9869716e-01]], dtype=float32)


We create a function that finds the maximum probability and detects the topic. For example, if we have 0.6 politics, 0.1 art, 0.15 health and 0.15 tourism, then topic = politics.

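def estm(raw_outputs_vals):
    # Turn each row of probabilities into a one-hot row marking the most probable class
    # (note: the input array is modified in place)
    for i in range(len(raw_outputs_vals)):
        for j in range(4):
            if max(raw_outputs_vals[i]) == raw_outputs_vals[i][j]:
                raw_outputs_vals[i][j] = 1
            else:
                raw_outputs_vals[i][j] = 0
    return raw_outputs_vals

estm(preds)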

output :


array([[0., 0., 0., 1.]], dtype=float32)


Our labels are ['art', 'politics', 'health', 'tourism'], so that's correct ;)
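If you prefer to get the topic name directly rather than a one-hot row, a small alternative is to take the index of the largest probability and look it up in the label list. This is only a sketch: `predict_topics` is a hypothetical helper, and the probabilities are the ones printed for the Thailand sentence above.

```python
LABELS = ['art', 'politics', 'health', 'tourism']


def predict_topics(probabilities, labels=LABELS):
    """Return the most probable label name for each row of probabilities."""
    topics = []
    for row in probabilities:
        # Index of the largest probability in this row (first one wins on ties).
        best_index = max(range(len(row)), key=row.__getitem__)
        topics.append(labels[best_index])
    return topics


preds = [[6.0461409e-04, 3.6119239e-04, 3.3729596e-04, 9.9869716e-01]]
print(predict_topics(preds))  # ['tourism']
```

Unlike estm, this version leaves the probabilities untouched, so they can still be inspected afterwards.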


I hope you find it useful and helpful!

Download the source code from our GitHub.

IV - Useful papers for working with Transformers:

Here is a list of recommended papers for going deeper into Transformers (mainly the BERT model):

  • Cross-Linguistic Syntactic Evaluation of Word Prediction Models
  • Emerging Cross-lingual Structure in Pretrained Language Models
  • Finding Universal Grammatical Relations in Multilingual BERT
  • On the Cross-lingual Transferability of Monolingual Representations
  • How multilingual is Multilingual BERT?
  • Is Multilingual BERT Fluent in Language Generation?
  • Are All Languages Created Equal in Multilingual BERT?
  • What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models
  • A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT
  • Cross-Lingual Ability of Multilingual BERT: An Empirical Study
  • Multilingual is not enough: BERT for Finnish

Download all files from our GitHub repo.


V - Summary:

Transformers represent the next frontier in NLP.

In less than a couple of years since its introduction, this new architectural trend has surpassed the feats of RNN-based architectures.

This exciting pace of invention is perhaps the best part of being early to a new field like deep learning!


If you have any suggestions or questions, please contact the NeuroData Team.










Authors :


Yassine Hamdaoui


Code credit goes to Med Klai Helmi, Data Scientist and Zindi Mentor.



