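Task description

  • Topic: paper classification (a data-modeling task): build a model on existing data and use it to assign categories to new papers;
  • Content: classify papers by category using the paper title (the walkthrough below concatenates the title with the abstract);
  • Outcome: learn the basic methods of text classification, TF-IDF, and related techniques;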

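Data processing steps

Every paper in the original arXiv data has categories, and these categories are filled in by the authors. In this task we use the paper title and abstract to:

  • process the paper titles and abstracts;
  • process the paper categories;
  • build a text classification model;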

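Text classification approaches

  • Approach 1: TF-IDF + machine learning classifier

Extract features from the text directly with TF-IDF and classify them with a classifier; candidates include SVM, LR, and XGBoost.

  • Approach 2: FastText

FastText is an entry-level word-vector method; with the FastText tool provided by Facebook, a classifier can be built quickly.

  • Approach 3: Word2Vec + deep learning classifier

Word2Vec is a more advanced word-vector method, combined here with a deep learning classifier. The network architecture can be TextCNN, TextRNN, or BiLSTM.

  • Approach 4: BERT word vectors

BERT is the high-end word-vector option, with strong modeling and learning capacity.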
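For approach 2, fastText's supervised mode reads plain-text training files with one example per line, where every label is prefixed with `__label__`. The sketch below only prepares such lines from (labels, text) pairs. The helper name `to_fasttext_line` and the sample records are assumptions made for illustration and are not part of the original walkthrough.

```python
def to_fasttext_line(labels, text):
    """Format one example as '__label__a __label__b some text'."""
    # fastText expects one example per line, so collapse newlines and repeated spaces
    clean_text = " ".join(text.split())
    prefix = " ".join("__label__" + label for label in labels)
    return prefix + " " + clean_text


# Hypothetical sample records: (main categories, title text)
records = [
    (["astro-ph"], "Remnant evolution after a carbon-oxygen\nwhite dwarf merger"),
    (["solv-int", "nlin"], "Pfaff tau-functions"),
]

lines = [to_fasttext_line(labels, text) for labels, text in records]
for line in lines:
    print(line)
```

Lines of this form can be written to a text file and handed to the fastText trainer.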

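Code implementation and walkthrough

To make it easier to get started with text classification, we walk through approach 1 and approach 2. Note that the second implementation shown later trains a Keras embedding network rather than using the FastText tool. First, read the required fields: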

# Import the required packages
import seaborn as sns  # plotting
from bs4 import BeautifulSoup  # scraping arXiv data
import re  # regular expressions for matching string patterns
import requests  # network access: sending HTTP requests
import json  # reading the data, which is stored as JSON (one record per line)
import pandas as pd  # data processing and analysis
import numpy as np
import matplotlib.pyplot as plt  # plotting

def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi', 'report-no', 'categories', 'license', 'abstract', 'versions', 'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file.
    path: path to the file
    columns: columns to keep
    count: number of lines to read (None reads the whole file)
    '''
    data = []
    with open(path, 'r') as f:
        for idx, line in enumerate(f):
            if idx == count:
                break
            d = json.loads(line)
            d = {col: d[col] for col in columns}
            data.append(d)
    data = pd.DataFrame(data)
    return data

data = readArxivFile('./data/arxiv-metadata-oai-2019.json', ['id', 'title', 'categories', 'abstract'], 200000)
data.head()
          id                                              title categories                                         abstract
0  0704.0297  Remnant evolution after a carbon-oxygen white ...   astro-ph  We systematically explore the evolution of t...
1  0704.0342  Cofibrations in the Category of Frolicher Spac...    math.AT  Cofibrations are defined in the category of ...
2  0704.0360  Torsional oscillations of longitudinally inhom...   astro-ph  We explore the effect of an inhomogeneous ma...
3  0704.0525  On the Energy-Momentum Problem in Static Einst...      gr-qc  This paper has been removed by arXiv adminis...
4  0704.0535  The Formation of Globular Cluster Systems in M...   astro-ph  The most massive elliptical galaxies show a ...

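To simplify the processing, we concatenate the title and the abstract and classify the combined text.

  • Processing flow for English text

    1. Remove punctuation; how to do this depends on how the English text is written. In standard writing, e.g. "i like apple, what about you", a punctuation mark is followed by a space; inside a word such as don't, simply delete the mark
    2. Convert to lowercase
    3. Trim the whitespace on both sides
    4. Tokenize on spaces: split(' ') turns the string into a list of single-word strings
    5. Remove stop words
    6. Stemming
    7. Lemmatization
    8. (Optional) join the strings in the list back into one string with ' '.join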
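The flow above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the function name `preprocess` and the tiny stop-word set are hypothetical, every non-alphanumeric character other than an apostrophe is turned into a space (so hyphenated terms are split), `split()` is used instead of `split(' ')` so that repeated spaces do not produce empty tokens, and steps 6 and 7 are left out because they normally rely on an NLP library.

```python
import re

# A tiny illustrative stop-word list (assumption); real projects use a fuller list
STOP_WORDS = {"a", "an", "the", "of", "is", "in", "and", "we", "by"}


def preprocess(text, stop_words=STOP_WORDS):
    """Apply steps 1-5 and 8 of the flow above to one string."""
    # Step 1: delete apostrophes (don't -> dont), turn other punctuation into spaces
    text = text.replace("'", "").replace("\u2019", "")
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Step 2: lowercase
    text = text.lower()
    # Step 3: trim whitespace on both sides
    text = text.strip()
    # Step 4: split on whitespace into a list of words
    words = text.split()
    # Step 5: remove stop words
    words = [w for w in words if w not in stop_words]
    # Steps 6-7 (stemming, lemmatization) are skipped in this sketch
    # Step 8: join the words back into one string
    return " ".join(words)


print(preprocess("  We don't explore the evolution of the merger, in detail.\n"))
# prints: dont explore evolution merger detail
```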
# Step by step
data['text'] = data['title'] + data['abstract']
data.text[0] # the text contains many \n newline characters that need to be replaced
'Remnant evolution after a carbon-oxygen white dwarf merger  We systematically explore the evolution of the merger of two carbon-oxygen\n(CO) white dwarfs. The dynamical evolution of a 0.9 Msun + 0.6 Msun CO white\ndwarf merger is followed by a three-dimensional SPH simulation. We use an\nelaborate prescription in which artificial viscosity is essentially absent,\nunless a shock is detected, and a much larger number of SPH particles than\nearlier calculations. Based on this simulation, we suggest that the central\nregion of the merger remnant can, once it has reached quasi-static equilibrium,\nbe approximated as a differentially rotating CO star, which consists of a\nslowly rotating cold core and a rapidly rotating hot envelope surrounded by a\ncentrifugally supported disc. We construct a model of the CO remnant that\nmimics the results of the SPH simulation using a one-dimensional hydrodynamic\nstellar evolution code and then follow its secular evolution. The stellar\nevolution models indicate that the growth of the cold core is controlled by\nneutrino cooling at the interface between the core and the hot envelope, and\nthat carbon ignition in the envelope can be avoided despite high effective\naccretion rates. This result suggests that the assumption of forced accretion\nof cold matter that was adopted in previous studies of the evolution of double\nCO white dwarf merger remnants may not be appropriate. Our results imply that\nat least some products of double CO white dwarfs merger may be considered good\ncandidates for the progenitors of Type Ia supernovae. In this case, the\ncharacteristic time delay between the initial dynamical merger and the eventual\nexplosion would be ~10^5 yr. (Abridged).\n'
data['text'] = data['text'].apply(lambda x: x.replace('\n',' ')) # for every value x (a string) in the text column, replace the newline \n with a space ' '
data.text[0]
'Remnant evolution after a carbon-oxygen white dwarf merger  We systematically explore the evolution of the merger of two carbon-oxygen (CO) white dwarfs. The dynamical evolution of a 0.9 Msun + 0.6 Msun CO white dwarf merger is followed by a three-dimensional SPH simulation. We use an elaborate prescription in which artificial viscosity is essentially absent, unless a shock is detected, and a much larger number of SPH particles than earlier calculations. Based on this simulation, we suggest that the central region of the merger remnant can, once it has reached quasi-static equilibrium, be approximated as a differentially rotating CO star, which consists of a slowly rotating cold core and a rapidly rotating hot envelope surrounded by a centrifugally supported disc. We construct a model of the CO remnant that mimics the results of the SPH simulation using a one-dimensional hydrodynamic stellar evolution code and then follow its secular evolution. The stellar evolution models indicate that the growth of the cold core is controlled by neutrino cooling at the interface between the core and the hot envelope, and that carbon ignition in the envelope can be avoided despite high effective accretion rates. This result suggests that the assumption of forced accretion of cold matter that was adopted in previous studies of the evolution of double CO white dwarf merger remnants may not be appropriate. Our results imply that at least some products of double CO white dwarfs merger may be considered good candidates for the progenitors of Type Ia supernovae. In this case, the characteristic time delay between the initial dynamical merger and the eventual explosion would be ~10^5 yr. (Abridged). '
data['text'] = data['text'].apply(lambda x: x.lower()) # convert to lowercase
data = data.drop(['abstract', 'title'], axis=1)
data.text[0]
'remnant evolution after a carbon-oxygen white dwarf merger  we systematically explore the evolution of the merger of two carbon-oxygen (co) white dwarfs. the dynamical evolution of a 0.9 msun + 0.6 msun co white dwarf merger is followed by a three-dimensional sph simulation. we use an elaborate prescription in which artificial viscosity is essentially absent, unless a shock is detected, and a much larger number of sph particles than earlier calculations. based on this simulation, we suggest that the central region of the merger remnant can, once it has reached quasi-static equilibrium, be approximated as a differentially rotating co star, which consists of a slowly rotating cold core and a rapidly rotating hot envelope surrounded by a centrifugally supported disc. we construct a model of the co remnant that mimics the results of the sph simulation using a one-dimensional hydrodynamic stellar evolution code and then follow its secular evolution. the stellar evolution models indicate that the growth of the cold core is controlled by neutrino cooling at the interface between the core and the hot envelope, and that carbon ignition in the envelope can be avoided despite high effective accretion rates. this result suggests that the assumption of forced accretion of cold matter that was adopted in previous studies of the evolution of double co white dwarf merger remnants may not be appropriate. our results imply that at least some products of double co white dwarfs merger may be considered good candidates for the progenitors of type ia supernovae. in this case, the characteristic time delay between the initial dynamical merger and the eventual explosion would be ~10^5 yr. (abridged). '
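# All steps together (run this instead of the step-by-step cells above)
data['text'] = data['title'] + data['abstract'] # build the text column first
data['text'] = data['text'].apply(lambda x: x.replace('\n',' ')) # then replace the newlines
data['text'] = data['text'].apply(lambda x: x.lower()) # convert to lowercase
data = data.drop(['abstract', 'title'], axis=1)
data.head()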
          id categories                                               text
0  0704.0297   astro-ph  remnant evolution after a carbon-oxygen white ...
1  0704.0342    math.AT  cofibrations in the category of frolicher spac...
2  0704.0360   astro-ph  torsional oscillations of longitudinally inhom...
3  0704.0525      gr-qc  on the energy-momentum problem in static einst...
4  0704.0535   astro-ph  the formation of globular cluster systems in m...

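Because a paper may belong to several categories, the category field also needs processing:

# All steps together
# Multiple categories, including sub-categories
data['categories'] = data['categories'].apply(lambda x : x.split(' '))
# Main categories only, without sub-categories
data['categories_big'] = data['categories'].apply(lambda x : [xx.split('.')[0] for xx in x]) # goal: extract the main-category string of every category and collect them in the outer list
# Step by step (an alternative to the two lines above; apply it to the raw string column)
data.categories = data.categories.apply(lambda x:str(x).split(" "))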
data.categories.head(10)
0    [astro-ph]
1     [math.AT]
2    [astro-ph]
3       [gr-qc]
4    [astro-ph]
5     [nucl-ex]
6    [quant-ph]
7     [math.DG]
8      [hep-ex]
9    [astro-ph]
Name: categories, dtype: object
# Condition 1:
# categories holds a single category in the form main.sub --> split('.') gives [main, sub] --> [0] picks the main category
data.categories[7][0] # [7] returns a list, one element of this column, i.e. the argument x of the lambda
'math.DG'
data.categories[7][0].split(".")[0]
'math'
# Condition 2:
# categories holds a single category with no main/sub distinction
data.categories[9][0].split('.') # with no '.' in the string, split returns a one-element list --> [0] is still needed to get the str before it goes into the list comprehension
['astro-ph']
data.categories[9][0].split('.')[0]
'astro-ph'
data.categories[170616]
['solv-int', 'adap-org', 'hep-th', 'nlin.AO', 'nlin.SI']
# Condition 3:
# categories holds several categories
[xx.split(".")[0] for xx in data.categories[170616]] # xx is each category str in the iterable; it may or may not have a sub-category
['solv-int', 'adap-org', 'hep-th', 'nlin', 'nlin']
data.categories.apply(lambda x : [xx.split(".")[0] for xx in x])
# e.g. x: ['solv-int', 'adap-org', 'hep-th', 'nlin.AO', 'nlin.SI']
# xx: 'solv-int', 'nlin.SI', etc. --> split on '.' --> ['nlin', 'SI'] --> keep only the main category [0] (a str) --> collect it in the outer list comprehension
0                                       [astro-ph]
1                                           [math]
2                                       [astro-ph]
3                                          [gr-qc]
4                                       [astro-ph]
                        ...
170613                                  [quant-ph]
170614                            [solv-int, nlin]
170615                            [solv-int, nlin]
170616    [solv-int, adap-org, hep-th, nlin, nlin]
170617                            [solv-int, nlin]
Name: categories, Length: 170618, dtype: object
data['categories_big'] = data.categories.apply(lambda x : [xx.split(".")[0] for xx in x]) # use bracket assignment: attribute assignment does not create a new DataFrame column
data.categories_big.tail(10)
170608                              [quant-ph, cs]
170609                                  [quant-ph]
170610                                  [quant-ph]
170611                                  [quant-ph]
170612                                  [quant-ph]
170613                                  [quant-ph]
170614                            [solv-int, nlin]
170615                            [solv-int, nlin]
170616    [solv-int, adap-org, hep-th, nlin, nlin]
170617                            [solv-int, nlin]
Name: categories_big, dtype: object
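The three conditions above reduce to one rule: split the category string on spaces, then keep the part before the first dot. The sketch below applies that rule to a single raw string. The function name `main_categories` is hypothetical, and unlike the pandas code above it also drops repeated main categories (an assumption added here, since `nlin.AO` and `nlin.SI` both map to `nlin`).

```python
def main_categories(categories):
    """Map 'nlin.AO nlin.SI hep-th' to its main categories ['nlin', 'hep-th']."""
    mains = []
    for cat in categories.split(" "):
        main = cat.split(".")[0]  # 'math.DG' -> 'math'; 'astro-ph' stays unchanged
        if main not in mains:     # drop repeats such as nlin.AO / nlin.SI -> nlin
            mains.append(main)
    return mains


print(main_categories("math.DG"))
print(main_categories("astro-ph"))
print(main_categories("solv-int adap-org hep-th nlin.AO nlin.SI"))
```

Dropping the repeats is optional for the encoding step that follows, because a binary multi-label encoding sets the same cell whether a label appears once or twice.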

Next, encode the categories. Because a paper can have several categories, a multi-label encoding is needed:

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
data_label = mlb.fit_transform(data['categories_big'])

MultiLabelBinarizer

Transform between iterable of iterables and a multilabel format

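Input: must be an iterable of iterables, i.e. a list of tuples, a list of sets, or a list of lists. A flat list of labels does not work as intended, because each string would be read as an iterable of characters.

The classes parameter can be passed at initialization:

- it specifies the desired order: an ordering for the class labels
- array-like of shape [n_classes] (optional). All entries should be unique (cannot contain duplicate classes).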

Although a list of sets or tuples is a very intuitive format for multilabel
data, it is unwieldy to process. This transformer converts between this
intuitive format and the supported multilabel format: a (samples x classes)
binary matrix indicating the presence of a class label.

Attributes

classes_ : array of labels

A copy of the `classes` parameter where provided,
or otherwise, the sorted set of classes found when fitting.
MultiLabelBinarizer().fit_transform([(1, 2), (3,)]) # the input here is a list of tuples; a fresh instance is used so that the fitted mlb is not overwritten
array([[1, 1, 0],
       [0, 0, 1]])
mlb.classes_ # all unique main categories across the papers
array(['acc-phys', 'adap-org', 'alg-geom', 'astro-ph', 'chao-dyn',
       'chem-ph', 'cmp-lg', 'comp-gas', 'cond-mat', 'cs', 'dg-ga', 'econ',
       'eess', 'funct-an', 'gr-qc', 'hep-ex', 'hep-lat', 'hep-ph',
       'hep-th', 'math', 'math-ph', 'mtrl-th', 'nlin', 'nucl-ex',
       'nucl-th', 'patt-sol', 'physics', 'q-alg', 'q-bio', 'q-fin',
       'quant-ph', 'solv-int', 'stat', 'supr-con'], dtype=object)
len(mlb.classes_) # y becomes a 34-column vector
34
data_label[:5,:] # the first five rows produced by MultiLabelBinarizer, to compare with the original y values
array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
np.where(data_label[0]==1)
(array([3]),)
np.where(data_label[0]==1)[0][0]
3
[np.where(l == 1)[0][0] for l in data_label[:5]] # index of the first 1 in each row of the 2-D array
[3, 19, 3, 14, 3]
mlb.classes_[19]
'math'
data.categories_big.head() # the y of the second sample is indeed math
0    [astro-ph]
1        [math]
2    [astro-ph]
3       [gr-qc]
4    [astro-ph]
Name: categories_big, dtype: object
print(data_label,data_label.shape) # 170618 rows; after 0/1 encoding y is a 34-column vector
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 1 0 0]
 [0 1 0 ... 1 0 0]
 [0 0 0 ... 1 0 0]] (170618, 34)
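To make the encoding concrete, here is a pure-Python sketch of what a multi-label binarizer computes: a sorted class list, one 0/1 row per sample, and the reverse mapping from a row back to labels. It is an illustration of the idea rather than scikit-learn's implementation; the function names and the three sample papers are assumptions.

```python
def fit_classes(label_lists):
    """Return the sorted list of unique labels."""
    return sorted({label for labels in label_lists for label in labels})


def multi_hot(label_lists, classes):
    """Encode each list of labels as a 0/1 row with one column per class."""
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1  # duplicates simply set the same cell again
        rows.append(row)
    return rows


def decode(row, classes):
    """Recover the labels of one encoded row."""
    return [classes[i] for i, flag in enumerate(row) if flag == 1]


# Hypothetical sample: three papers with their main categories
papers = [["astro-ph"], ["math"], ["solv-int", "adap-org", "hep-th", "nlin", "nlin"]]
classes = fit_classes(papers)
encoded = multi_hot(papers, classes)
print(classes)
print(encoded)
print(decode(encoded[2], classes))
```

The third row has four cells set to 1 even though the input list has five entries, because the repeated `nlin` maps to a single column.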

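Comparison with preprocessing.OneHotEncoder (one-hot encoding) and pd.get_dummies (dummy variables)

  • Neither is suitable for the multi-label case, i.e. when the categories of one variable are not mutually exclusive and a sample can take several of them at once, as with paper categories;
  • Multi-label values can only be represented as lists, [category 1, category 2], so the column is an array of lists of strings;
  • OneHotEncoder and get_dummies: the input must be array-like of integers or strings; it cannot be array-like of lists;
  • For each sample, only one of the k categories of the variable can be 1.

Another way to reproduce MultiLabelBinarizer: convert the labels with a document-term matrix (DTM)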

data["categories_string"] = data.categories_big.apply(lambda x:" ".join(x))
data["categories_string"].tail(10) # all categories of each paper are now in a single string
170608                           quant-ph cs
170609                              quant-ph
170610                              quant-ph
170611                              quant-ph
170612                              quant-ph
170613                              quant-ph
170614                         solv-int nlin
170615                         solv-int nlin
170616    solv-int adap-org hep-th nlin nlin
170617                         solv-int nlin
Name: categories_string, dtype: object
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r'(?u)\b[\w-]+\b',binary = True)
vec = vectorizer.fit_transform(data['categories_string'])
vec.toarray()
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 1, 0, 0],
       [0, 1, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0]])
print(vectorizer.get_feature_names(),len(vectorizer.get_feature_names())) # also 34 categories; on scikit-learn 1.0 and later use get_feature_names_out()
['acc-phys', 'adap-org', 'alg-geom', 'astro-ph', 'chao-dyn', 'chem-ph', 'cmp-lg', 'comp-gas', 'cond-mat', 'cs', 'dg-ga', 'econ', 'eess', 'funct-an', 'gr-qc', 'hep-ex', 'hep-lat', 'hep-ph', 'hep-th', 'math', 'math-ph', 'mtrl-th', 'nlin', 'nucl-ex', 'nucl-th', 'patt-sol', 'physics', 'q-alg', 'q-bio', 'q-fin', 'quant-ph', 'solv-int', 'stat', 'supr-con'] 34
np.sum(vec.toarray() != data_label) # number of mismatching cells; 0 means the two binarizers (DTM and MultiLabelBinarizer) agree. The original ~vec.toarray() == data_label applies ~ before ==, so it is 0 regardless
0

Approach 1

Approach 1 extracts features with TF-IDF, keeping at most 4000 words:

data['text'].iloc[:] # select the text column; the trailing iloc[:] makes no difference
0         remnant evolution after a carbon-oxygen white ...
1         cofibrations in the category of frolicher spac...
2         torsional oscillations of longitudinally inhom...
3         on the energy-momentum problem in static einst...
4         the formation of globular cluster systems in m...
                                ...
170613    enhancement of magneto-optic effects via large...
170614    explicit and exact solutions to a kolmogorov-p...
170615    linear r-matrix algebra for a hierarchy of one...
170616    pfaff tau-functions  consider the evolution $$...
170617    the general solution of the complex monge-amp\...
Name: text, Length: 170618, dtype: object
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=4000)
data_tfidf = vectorizer.fit_transform(data['text'].iloc[:])
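To show what the vectorizer computes, the sketch below implements TF-IDF in plain Python. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which matches the documented defaults of `TfidfVectorizer`, and it keeps the `max_features` terms with the highest corpus counts. Whitespace tokenisation, the alphabetical tie-breaking, and the name `tfidf` are simplifying assumptions, so the numbers will not match scikit-learn exactly on real text.

```python
import math
from collections import Counter


def tfidf(docs, max_features=None):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    # Total count of each term in the corpus, used to pick the top max_features terms
    totals = Counter(term for tokens in tokenized for term in tokens)
    # Sort by descending count, breaking ties alphabetically to stay deterministic
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    vocab = sorted(ranked[:max_features])
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


docs = ["white dwarf merger", "white dwarf star", "category of spaces"]
vocab, rows = tfidf(docs, max_features=4)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

In the first document the rarer term `merger` receives a larger weight than `white` or `dwarf`, which is the behaviour TF-IDF is meant to provide.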

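Since this is a multi-label classification problem, scikit-learn's multi-output wrapper can be used:

# Split into training and validation sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_tfidf, data_label,test_size = 0.2,random_state = 1)
# Build the multi-label classification model
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
clf = MultiOutputClassifier(MultinomialNB()).fit(x_train, y_train)
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(x_test))) # y_true, y_predict
# after multi-hot encoding, the multi-label y is a 34-column vector
# every column gets its own precision/recall/f1 score
# predict() reports one fixed operating point per label, so the precision/recall trade-off is not visible here (it would reappear if the decision threshold were varied)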
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         0
           3       0.91      0.85      0.88      3625
           4       0.00      0.00      0.00         4
           5       0.00      0.00      0.00         0
           6       0.00      0.00      0.00         1
           7       0.00      0.00      0.00         0
           8       0.77      0.76      0.77      3801
           9       0.84      0.89      0.86     10715
          10       0.00      0.00      0.00         0
          11       0.00      0.00      0.00       186
          12       0.44      0.41      0.42      1621
          13       0.00      0.00      0.00         1
          14       0.75      0.59      0.66      1096
          15       0.61      0.80      0.69      1078
          16       0.90      0.19      0.32       242
          17       0.53      0.67      0.59      1451
          18       0.71      0.54      0.62      1400
          19       0.88      0.84      0.86     10243
          20       0.40      0.09      0.15       934
          21       0.00      0.00      0.00         1
          22       0.87      0.03      0.06       414
          23       0.48      0.65      0.55       517
          24       0.37      0.33      0.35       539
          25       0.00      0.00      0.00         1
          26       0.60      0.42      0.49      3891
          27       0.00      0.00      0.00         0
          28       0.82      0.08      0.15       676
          29       0.86      0.12      0.21       297
          30       0.80      0.40      0.53      1714
          31       0.00      0.00      0.00         4
          32       0.56      0.65      0.60      3398
          33       0.00      0.00      0.00         0

   micro avg       0.76      0.70      0.72     47851
   macro avg       0.39      0.27      0.29     47851
weighted avg       0.75      0.70      0.71     47851
 samples avg       0.74      0.76      0.72     47851
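The gap between the micro and macro averages in the report comes from how they are computed: micro averaging pools the true/false positives of all labels, while macro averaging takes the unweighted mean of the per-label scores, so rare labels that are never predicted pull it down. The sketch below reproduces that effect on a toy example. The function name and the four-sample data are assumptions, and undefined ratios are counted as 0.

```python
def precision_recall(y_true, y_pred):
    """Return (micro_p, micro_r, macro_p, macro_r) for 0/1 label matrices."""
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t == 1 and p == 1:
                tp[j] += 1
            elif t == 0 and p == 1:
                fp[j] += 1
            elif t == 1 and p == 0:
                fn[j] += 1

    def ratio(num, den):
        return num / den if den else 0.0  # undefined ratios are reported as 0

    micro_p = ratio(sum(tp), sum(tp) + sum(fp))
    micro_r = ratio(sum(tp), sum(tp) + sum(fn))
    macro_p = sum(ratio(tp[j], tp[j] + fp[j]) for j in range(n_labels)) / n_labels
    macro_r = sum(ratio(tp[j], tp[j] + fn[j]) for j in range(n_labels)) / n_labels
    return micro_p, micro_r, macro_p, macro_r


# Hypothetical data: label 0 is frequent and predicted well, label 1 is rare and never predicted
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 0]]
print(precision_recall(y_true, y_pred))
# prints: (1.0, 0.75, 0.5, 0.5)
```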

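Approach 2

Approach 2 uses a deep learning model: the words are embedded and the network is trained on the embeddings. First encode the dataset and truncate the sequences: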

data['text'].iloc[:100000]
0        remnant evolution after a carbon-oxygen white ...
1        cofibrations in the category of frolicher spac...
2        torsional oscillations of longitudinally inhom...
3        on the energy-momentum problem in static einst...
4        the formation of globular cluster systems in m...
                               ...
99995    near optimal jointly private packing algorithm...
99996    of commutators and jacobians  i discuss the pr...
99997    on two conjectures about the sum of element or...
99998    dynamic predictions of kidney graft survival i...
99999    frequency domain model of $f$-mode dynamic tid...
Name: text, Length: 100000, dtype: object
# Parameters
max_features = 500
max_len = 150
embed_size = 100
batch_size = 128
epochs = 5

from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence

tokens = Tokenizer(num_words = max_features)
tokens.fit_on_texts(list(data['text'].iloc[:100000]))

y_train = data_label[:100000]
x_sub_train = tokens.texts_to_sequences(data['text'].iloc[:100000])
x_sub_train = sequence.pad_sequences(x_sub_train, maxlen=max_len)
Using TensorFlow backend.
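The encoding step above can be pictured with a small pure-Python sketch. It follows the behaviour of the Keras defaults as I understand them: words are indexed by descending frequency starting at 1, only words with an index below `num_words` are kept, shorter sequences are padded with 0 on the left, and longer ones keep their last `maxlen` tokens. Whitespace tokenisation and alphabetical tie-breaking are simplifying assumptions, and the function names merely mirror the Keras ones for readability.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked)}  # 0 is reserved for padding


def texts_to_sequences(texts, word_index, num_words):
    """Replace words by indices, dropping words whose index is >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_sequences(seqs, maxlen):
    """Left-pad with 0 and keep only the last maxlen tokens of longer sequences."""
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in seqs]


# Hypothetical mini-corpus
texts = ["the merger of the white dwarf", "white dwarf"]
word_index = fit_word_index(texts)
seqs = texts_to_sequences(texts, word_index, num_words=4)
padded = pad_sequences(seqs, maxlen=3)
print(word_index)
print(seqs)
print(padded)
```

With `max_features = 500` this means that at most 499 distinct words reach the embedding layer; all other words are dropped from the sequences.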

Define the model and train it:

# Bidirectional GRU + Conv1D model
# Keras layers:
from keras.layers import Dense,Input,LSTM,Bidirectional,Activation,Conv1D,GRU
from keras.layers import Dropout,Embedding,GlobalMaxPooling1D, MaxPooling1D, Add, Flatten
from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate, SpatialDropout1D
# Keras callback functions:
from keras.callbacks import Callback
from keras.callbacks import EarlyStopping,ModelCheckpoint
from keras import initializers, regularizers, constraints, optimizers, layers, callbacks
from keras.models import Model
from keras.optimizers import Adam

sequence_input = Input(shape=(max_len, ))
x = Embedding(max_features, embed_size, trainable=True)(sequence_input)
x = SpatialDropout1D(0.2)(x)
x = Bidirectional(GRU(128, return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)
x = Conv1D(64, kernel_size = 3, padding = "valid", kernel_initializer = "glorot_uniform")(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
x = concatenate([avg_pool, max_pool])
preds = Dense(34, activation="sigmoid")(x) # one sigmoid output per category

model = Model(sequence_input, preds)
model.compile(loss='binary_crossentropy',optimizer=Adam(lr=1e-3),metrics=['accuracy']) # newer Keras releases name this argument learning_rate
model.fit(x_sub_train, y_train, batch_size=batch_size, validation_split=0.2,epochs=epochs)
/root/miniconda3/envs/myconda/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:424: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Train on 80000 samples, validate on 20000 samples
Epoch 1/5
80000/80000 [==============================] - 225s 3ms/step - loss: 0.1064 - accuracy: 0.9665 - val_loss: 0.0734 - val_accuracy: 0.9743
Epoch 2/5
80000/80000 [==============================] - 221s 3ms/step - loss: 0.0724 - accuracy: 0.9741 - val_loss: 0.0689 - val_accuracy: 0.9749
Epoch 3/5
80000/80000 [==============================] - 221s 3ms/step - loss: 0.0677 - accuracy: 0.9755 - val_loss: 0.0659 - val_accuracy: 0.9760
Epoch 4/5
80000/80000 [==============================] - 222s 3ms/step - loss: 0.0664 - accuracy: 0.9761 - val_loss: 0.0637 - val_accuracy: 0.9772
Epoch 5/5
80000/80000 [==============================] - 221s 3ms/step - loss: 0.0630 - accuracy: 0.9770 - val_loss: 0.0629 - val_accuracy: 0.9773
<keras.callbacks.callbacks.History at 0x7f5e203d4e10>
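The accuracy values in the training log should be read with care. With 34 sigmoid outputs and a binary cross-entropy loss, the accuracy metric in this kind of setup is typically computed per output cell, and most cells are 0 for every paper. The sketch below, on hypothetical data where each paper carries exactly one of 34 labels, shows that a model predicting no label at all already reaches about 0.97 on that metric while getting no paper fully right. Both function names are assumptions.

```python
def elementwise_accuracy(y_true, y_pred):
    """Share of individual 0/1 cells predicted correctly."""
    cells = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    return sum(t == p for t, p in cells) / len(cells)


def exact_match_accuracy(y_true, y_pred):
    """Share of samples whose whole label vector is predicted correctly."""
    return sum(tr == pr for tr, pr in zip(y_true, y_pred)) / len(y_true)


n_labels = 34
# Hypothetical targets: 10 papers, each carrying exactly one of 34 labels
y_true = [[1 if j == i % n_labels else 0 for j in range(n_labels)] for i in range(10)]
# A useless model that never predicts any label
y_pred = [[0] * n_labels for _ in range(10)]

print(round(elementwise_accuracy(y_true, y_pred), 4))
print(exact_match_accuracy(y_true, y_pred))
```

Per-label precision, recall, and F1, as reported for approach 1, give a more informative comparison between the two approaches.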
