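Study Notes: Paper Count Statistics

Task description
Analysis
- 1. Getting the data: scrape the dataset with a Python crawler
- 2. Restricting the scrape with regular expressions
- 3. Checking whether the data already exists
- 4. Setting up and using JSON files
  - 4.1. Reading a JSON string
  - 4.2. Reading a JSON text file
  - 4.3. Writing a dictionary to a JSON file
Hands-on project
- 1. Importing libraries
- 2. Reading the data
  - The enumerate function
- 3. Getting an overview of the loaded data
- 4. Inspecting the fields and selecting those relevant to the task (2019 paper counts)
- 5. Data preprocessing
  - 5.1 Category information
    - 5.1.1 Using list comprehensions
    - 5.1.2 Understanding the list comprehension
  - 5.2 Preprocessing the time feature
    - 5.2.1 Handling time series with pandas (short version)
      - 1. Inspect the type of the time data and extract it
      - 2. Finding the categories of a given group when they are not known in advance
      - 3. Data visualization
        - 3.1 Counting by top-level group name
        - 3.2 Visualizing the top-level groups
        - 3.3 Counting the subcategories within a group (2019 paper counts for each subfield of computer science)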

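Task description

Task topic: paper count statistics, i.e. counting the number of papers in each subfield of computer science over the whole of 2019.

Task content: understanding the competition problem, then using Pandas to read the data and compute statistics.

Task outcome: learning the basic operations of Pandas.

Analysis

The data we need is the number of papers in each subfield of computer science for the whole of 2019. That raises two questions: where is the data, and how do we get it?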

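1. Getting the data: scrape the dataset with a Python crawler

This requires:

import requests  # HTTP client: sends network requests and fetches the content behind a URL
from bs4 import BeautifulSoup  # parses the HTML fetched from arXiv

The crawler downloads the specified content from a web page.
Reference article (scraping specified content from a web page with Python)

Scraping without any restriction makes the later data processing heavier and muddies the analysis. How should we restrict what is scraped, and which library do we need?

2. Restricting the scrape with regular expressions

import re  # regular expressions: matches patterns in strings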

3. Checking whether the data already exists

Once we have the data, running the script a second time should not scrape it again: repeating the download wastes both memory and time. The fix is to check whether the file already exists. If it does, read it directly; if it does not, run the crawler to fetch it.

Either os.path.isfile() or pathlib.Path.is_file() can be used to check whether a file exists.

A provisional example, which I may revise later.
Note that it does not match this project (it comes from a fashion image dataset):

import os
import numpy as np
from PIL import Image  # needed for Image.open below

train_path = './fashion_image_label/fashion_train_jpg_60000/'
train_txt = './fashion_image_label/fashion_train_jpg_60000.txt'
x_train_savepath = './fashion_image_label/fashion_x_train.npy'
y_train_savepath = './fashion_image_label/fashion_y_train.npy'
# The original snippet never defines the four test paths used below;
# the values here are assumptions, named by analogy with the train paths.
test_path = './fashion_image_label/fashion_test_jpg_10000/'
test_txt = './fashion_image_label/fashion_test_jpg_10000.txt'
x_test_savepath = './fashion_image_label/fashion_x_test.npy'
y_test_savepath = './fashion_image_label/fashion_y_test.npy'

def generateds(path, txt):
    with open(txt, 'r') as f:
        contents = f.readlines()  # read line by line
    x, y_ = [], []
    for content in contents:
        value = content.split()  # split on whitespace: [file name, label]
        img_path = path + value[0]
        img = Image.open(img_path)
        img = np.array(img.convert('L'))  # convert to 8-bit grayscale
        img = img / 255.
        x.append(img)
        y_.append(value[1])
        print('loading : ' + content)
    x = np.array(x)
    y_ = np.array(y_)
    y_ = y_.astype(np.int64)
    return x, y_

if os.path.exists(x_train_savepath) and os.path.exists(y_train_savepath) and os.path.exists(x_test_savepath) and os.path.exists(y_test_savepath):
    print('-------------Load Datasets-----------------')
    x_train_save = np.load(x_train_savepath)
    y_train = np.load(y_train_savepath)
    x_test_save = np.load(x_test_savepath)
    y_test = np.load(y_test_savepath)
    x_train = np.reshape(x_train_save, (len(x_train_save), 28, 28))
    x_test = np.reshape(x_test_save, (len(x_test_save), 28, 28))
else:
    print('-------------Generate Datasets-----------------')
    x_train, y_train = generateds(train_path, train_txt)
    x_test, y_test = generateds(test_path, test_txt)
    print('-------------Save Datasets-----------------')
    x_train_save = np.reshape(x_train, (len(x_train), -1))
    x_test_save = np.reshape(x_test, (len(x_test), -1))
    np.save(x_train_savepath, x_train_save)
    np.save(y_train_savepath, y_train)
    np.save(x_test_savepath, x_test_save)
    np.save(y_test_savepath, y_test)

Reference articles (three ways to check whether a file exists in Python)
(how to check whether a file exists)
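4. Setting up and using JSON files

Once the data has been scraped, which format should it be saved in? Different file types call for different reading methods.

Here we save the file in .json format. Some background on this format:

Reference articles (the JSON file format explained)
(reading and writing JSON files)

Examples:

4.1. Reading a JSON string

import json  # our data is in JSON format

# named text rather than str, so the built-in str type is not shadowed
text = '''[{"name":"kingsan","age":'23'},{"name":"xiaolan","age":"23"}]
'''
print(type(text))
data = json.loads(text)
print(data)
print(type(data))

Running this raises an error:

JSONDecodeError: Expecting value: line 2 column 15 (char 34)

Fix:
In JSON, string values must be enclosed in double quotes, not single quotes, so '23' above has to be written "23".

After the fix:

import json

text = '''[{"name":"kingsan","age":"23"},{"name":"xiaolan","age":"23"}]
'''
print(type(text))
data = json.loads(text)
print(data)
print(type(data))

<class 'str'>
[{'name': 'kingsan', 'age': '23'}, {'name': 'xiaolan', 'age': '23'}]
<class 'list'>

Reference article (fixing json.decoder.JSONDecodeError: Expecting value)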
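When the input may be malformed, the error can be caught instead of stopping the script. The sketch below assumes nothing beyond the standard library; `try_parse` is a hypothetical helper name, and the two strings are shortened versions of the example above.

```python
import json

bad = '[{"name": "kingsan", "age": \'23\'}]'  # single quotes around 23: not valid JSON
good = '[{"name": "kingsan", "age": "23"}]'


def try_parse(text):
    """Return the parsed value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # lineno, colno and msg describe where and why parsing failed
        print(f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}")
        return None


parsed_bad = try_parse(bad)
parsed_good = try_parse(good)
print(parsed_good)
```

The reported line and column depend on how the string is laid out, so they will differ from one input to another.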

4.2. Reading a JSON text file

The general form is:

import json

with open('data.json', 'r') as file:
    text = file.read()
    data = json.loads(text)
    print(data)

A concrete example:

import json

data = [{'name': 'kingsan',
         'age': '23'
}]
with open('data.json', 'w') as file:
    file.write(json.dumps(data))

with open('data.json', "r") as f:
    for idx, col in enumerate(f):
        print(idx)   # the line index
        print(col)   # the content of that line

4.3. Writing a dictionary to a JSON file

(reference)

import json

data = {"grade1": {'name': 'kingsan', 'age': '23'},
        "grade2": {"name": "xiaoliu", "age": "24"},
        "grade3": {"name": "xiaowang", "age": "22"}}

with open('data.json', 'w') as file:   # write the file
    file.write(json.dumps(data))

with open('data.json', "r") as f:     # read the file
    for idx, line in enumerate(f):
        # pass
        print(idx)
        print(line)
        print(type(line))

0
{"grade1": {"name": "kingsan", "age": "23"}, "grade2": {"name": "xiaoliu", "age": "24"}, "grade3": {"name": "xiaowang", "age": "22"}}
<class 'str'>
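The examples above go through strings with `json.dumps` and `json.loads`. The standard library also offers `json.dump` and `json.load`, which work on the file object directly. A minimal round-trip sketch, writing to a temporary directory so that no real `data.json` is touched:

```python
import json
import tempfile
from pathlib import Path

grades = {
    "grade1": {"name": "kingsan", "age": "23"},
    "grade2": {"name": "xiaoliu", "age": "24"},
    "grade3": {"name": "xiaowang", "age": "22"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.json"
    with path.open("w", encoding="utf-8") as f:
        json.dump(grades, f, ensure_ascii=False, indent=2)  # write the dict to the file
    with path.open("r", encoding="utf-8") as f:
        restored = json.load(f)                             # parse the whole file at once

print(restored["grade2"]["name"])
```

With `indent=2` the document spans several lines, so reading it line by line with `enumerate(f)` would no longer give one complete JSON value per line; `json.load` avoids that concern by parsing the whole file.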

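Our goal, then, is to learn the basic operations of pandas and to use it for some data statistics and analysis.

Next comes the actual hands-on project:

Hands-on project

1. Importing libraries

# libraries for scraping
from bs4 import BeautifulSoup  # parses the HTML fetched from arXiv
import requests  # HTTP client: sends network requests and fetches the content behind a URL
# library for restricting the data format
import re  # regular expressions: matches patterns in strings
# library for saving and reading data
import json  # our data is in JSON format
# data analysis
import pandas as pd  # data processing and analysis
# data visualization
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # plotting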
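2. Reading the data

# read the data
data = []
# advantages of the with statement: 1. the file handle is closed automatically;
# 2. the file is closed even if an exception is raised while reading
with open("arxiv-metadata-oai-2019.json", 'r') as f:
    for idx, line in enumerate(f):
        # read only the first 100 lines; reading all the data needs about 8 GB of memory
        if idx >= 100:
            break
        data.append(json.loads(line))

data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis with pandas
data.shape  # show the size of the data

Note: the file used here, arxiv-metadata-oai-2019.json, must sit in the same directory as the code file. If it does not, there are two options:
(1) copy or move it there by hand
(2) point to it with a path variable

data = []  # when reading again, re-initialize data first
path = r"F:\Python_Tensorflow_codes\006group_learning\team-learning-data-mining-master\AcademicTrends\arxiv-metadata-oai-2019.json"
with open(path, 'r') as f:
    for idx, line in enumerate(f):
        # read only the first 100 lines; reading all the data needs about 8 GB of memory
        if idx >= 100:
            break
        data.append(json.loads(line))

data = pd.DataFrame(data)  # convert the list to a DataFrame for analysis with pandas
data.shape  # show the size of the data

The enumerate function

How enumerate is used here (link)
Reading and writing JSON files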
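The reading loop relies on `enumerate` yielding an index together with each line of the file. A small sketch of the same loop on an in-memory stand-in for the file; the three records are made up, and `limit` plays the role of the 100-line cutoff:

```python
import io
import json

# Three made-up records, one JSON object per line, as in the arXiv metadata file.
fake_file = io.StringIO(
    '{"id": "a1", "categories": "cs.AI"}\n'
    '{"id": "a2", "categories": "math.AT"}\n'
    '{"id": "a3", "categories": "cs.CV cs.LG"}\n'
)

limit = 2
rows = []
for idx, line in enumerate(fake_file):  # idx counts from 0, line is one str
    if idx >= limit:                    # stop once `limit` lines have been read
        break
    rows.append(json.loads(line))       # each line is parsed into a dict

print(len(rows), rows[-1]["id"])
```

Because the file is consumed line by line, only the lines before the cutoff are ever parsed.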

3. Getting an overview of the loaded data

data.head()  # show the first five rows

4. Inspecting the fields and selecting those relevant to the task (2019 paper counts)

def readArxivFile(path, columns=['id', 'submitter', 'authors', 'title', 'comments', 'journal-ref', 'doi',
                                 'report-no', 'categories', 'license', 'abstract', 'versions',
                                 'update_date', 'authors_parsed'], count=None):
    '''
    Read the arXiv metadata file.
    path: file path
    columns: columns to keep
    count: number of lines to read
    '''
    data = []
    with open(path, 'r') as f:
        for idx, line in enumerate(f):
            if idx == count:
                break
            d = json.loads(line)  # line is a str; json.loads() parses it into a dict
            d = {col: d[col] for col in columns}
            data.append(d)
    data = pd.DataFrame(data)
    return data

data = readArxivFile('arxiv-metadata-oai-2019.json', ['id', 'categories', 'update_date'])  # keep only the fields relevant to the task
print(data.head())

            d = json.loads(line)  # line is a str; json.loads() parses it into a dict
            d = {col: d[col] for col in columns}
            data.append(d)

These three lines are worth remembering:

Step 1: convert line, the string yielded by enumerate(f), into d, a dictionary,
so that the fields can be looked up and extracted by key.
Step 2: use a dict comprehension to pick out the wanted fields one by one.
Step 3: append each extracted record to the data list.
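The three steps can be tried on a single made-up line. The field values below are invented for illustration; the `.get` variant is an addition that is not in the original function and only shows one way to tolerate a missing key.

```python
import json

line = ('{"id": "0704.0297", "submitter": "someone", '
        '"categories": "astro-ph", "update_date": "2019-08-19"}')
columns = ["id", "categories", "update_date"]

record = json.loads(line)                         # step 1: str -> dict
selected = {col: record[col] for col in columns}  # step 2: keep only the wanted keys
rows = []
rows.append(selected)                             # step 3: collect the record

# d[col] raises KeyError for a missing field; dict.get returns None instead.
tolerant = {col: record.get(col) for col in ["id", "doi"]}

print(selected)
print(tolerant)
```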

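5. Data preprocessing

5.1 Category information

data["categories"].describe()

count      1796911
unique       62055
top       astro-ph
freq         86914
Name: categories, dtype: object

Here we want the set of distinct category labels that occur in the data:

unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])
len(unique_categories)
unique_categories

5.1.1 Using list comprehensions

The expression above is a list comprehension.

Two questions about it: why use a list comprehension, and in what order does it execute?

Our goal is to find the distinct categories, and the comprehension does so by iterating over every row.

5.1.2 Understanding the list comprehension:

a = [x.split(' ') for x in data["categories"]]
for l in a:
    for i in l:
        unique_categories = set([i])
print(unique_categories)

When I wrote it this way, the result was different from that of the list comprehension. Every iteration rebinds unique_categories to a new one-element set, so only the last label survives:

{'nlin.SI'}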
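The difference can be reproduced on a few made-up rows. The sketch keeps the overwriting loop, then shows one way to accumulate into a single set with `set.update`, and compares it with the comprehension.

```python
categories = ["astro-ph", "solv-int nlin.SI", "math.DG math.AG", "solv-int nlin.SI"]

# Version from the notes: the set is rebuilt on every iteration,
# so only the last label is left at the end.
for row in categories:
    for label in row.split(" "):
        overwritten = set([label])

# Accumulating version: create the set once, then add to it.
accumulated = set()
for row in categories:
    accumulated.update(row.split(" "))

# The list comprehension from section 5.1, applied to the same rows.
from_comprehension = set([i for l in [x.split(" ") for x in categories] for i in l])

print(overwritten)
print(accumulated == from_comprehension)
```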

First, our goal is to extract the category labels (and keep each one only once).
Look at the data:

data["categories"]

0                                         astro-ph
1                                          math.AT
2                                         astro-ph
3                                            gr-qc
4                                         astro-ph
                            ...
170613                                    quant-ph
170614                            solv-int nlin.SI
170615                            solv-int nlin.SI
170616    solv-int adap-org hep-th nlin.AO nlin.SI
170617                            solv-int nlin.SI
Name: categories, Length: 170618, dtype: object

From the above: the first column is the index and the second column is the category information.

print(type(data["categories"]))

<class 'pandas.core.series.Series'>

Use unique() to take out the distinct values; the output of pd.unique() is a numpy.ndarray (an array).

data_unique_cate = data["categories"].unique()   # equivalently: pd.unique(data["categories"])
print(type(data_unique_cate))
print(data_unique_cate)

# take values out of this array one at a time and build a set from them
list(data_unique_cate)[:10]

['astro-ph','math.AT','gr-qc','nucl-ex','quant-ph','math.DG','hep-ex','cond-mat.str-el cond-mat.mes-hall','math.CA','math.DG math.AG']

At this point we see that some elements of data_unique_cate (an array of strings) are several labels stuck together. For example:

 'cond-mat.str-el cond-mat.mes-hall'

That is, one element is made up of two sub-elements, and these sub-elements may still repeat across elements, because unique() only compares the whole strings in data["categories"].

Looking again at:

 'cond-mat.str-el cond-mat.mes-hall'

the two sub-elements are separated by a space, so we can separate them with the split() method.

Note:
.split() is a string method; its result is a list.
Reference (the Python split() method)
First we need to pull out these elements, which are strings whose sub-elements are separated by spaces.
So we proceed as follows:

for i in range(len(data_unique_cate[:10])):
    x = data_unique_cate[i]
    print(x)

astro-ph
math.AT
gr-qc
nucl-ex
quant-ph
math.DG
hep-ex
cond-mat.str-el cond-mat.mes-hall
math.CA
math.DG math.AG

If we split x directly inside the for i loop, we get:

for i in range(len(data_unique_cate[:10])):
    x = data_unique_cate[i]
    # print(x)
    x_sp = x.split(" ")
    print(x_sp)

['astro-ph']
['math.AT']
['gr-qc']
['nucl-ex']
['quant-ph']
['math.DG']
['hep-ex']
['cond-mat.str-el', 'cond-mat.mes-hall']
['math.CA']
['math.DG', 'math.AG']

This is still not what we want: each result is now a list of strings, so the sub-elements are separated but remain wrapped in one list per row instead of standing on their own.

Another attempt:

for i in range(len(data_unique_cate[:10])):
    x = data_unique_cate[i]
    # print(x)
    x_sp = x[i].split(" ")
    print(x_sp)

['a']
['a']
['-']
['l']
['t']
['D']
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-58-a96f3620fe05> in <module>
     30     x = data_unique_cate[i]
     31 #     print(x)
---> 32     x_sp = x[i].split(" ")
     33     print(x_sp)
     34 #     for l in x.split(" "):

IndexError: string index out of range

This is even further off: x[i] is the i-th character of the string x, so only a single character is split each time, and as soon as i reaches the length of x an IndexError is raised.

Summary: first, every element must be handled on its own; second, the whole element must be taken out and then split on spaces.

In practice:

for i in range(len(data_unique_cate[:10])):
    x = data_unique_cate[i]
    # print(x)
    # x_sp = x[i].split(" ")  # wrong
    # print(x_sp)             # wrong
    for l in x.split(" "):    # x is one element
        element = l           # element is a sub-element of x after splitting on spaces
        print(element)

astro-ph
math.AT
gr-qc
nucl-ex
quant-ph
math.DG
hep-ex
cond-mat.str-el
cond-mat.mes-hall
math.CA
math.DG
math.AG

This takes the elements out one by one and splits each into its sub-elements.

Next, collect the new element values into a list and convert the list to a set:

list_name = []
for i in range(len(data_unique_cate[:10])):
    x = data_unique_cate[i]
    # print(x)
    # x_sp = x[i].split(" ")  # wrong
    # print(x_sp)             # wrong
    for l in x.split(" "):    # x is one element
        element = l           # element is a sub-element of x after splitting on spaces
        # print(element)
        list_name.append(element)
set(list_name)

{'astro-ph','cond-mat.mes-hall','cond-mat.str-el','gr-qc','hep-ex','math.AG','math.AT','math.CA','math.DG','nucl-ex','quant-ph'}

The above works on the first 10 values only.

Now do the same for all the data:

list_name = []
for i in range(len(data_unique_cate)):
    x = data_unique_cate[i]
    # print(x)
    # x_sp = x[i].split(" ")  # wrong
    # print(x_sp)             # wrong
    for l in x.split(" "):    # x is one element
        element = l           # element is a sub-element of x after splitting on spaces
        # print(element)
        list_name.append(element)
# print(list_name)   # printing the whole list directly is slow and uses a lot of memory
set(list_name)   # the output of set() may look sorted, but that ordering is an illusion

Result: too long, omitted

Note:
set() converts data into a set. It can look as if it "sorted" the elements (hence the quotation marks), but it did not: a set is unordered.

s1 = {7,2,2,1,6, 9,3}
s2 = {4, 2, 1, 7, 2,52,1,4,7,1,14,7,21,33,4, 37}
print('s1', s1)
print('s2', s2)

s1 {1, 2, 3, 6, 7, 9}
s2 {1, 2, 33, 4, 37, 7, 14, 52, 21}
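Since the display order of a set carries no guarantee, a sorted result has to be requested explicitly. A short sketch using `sorted`, which returns a new list and leaves the set as it is:

```python
s2 = {4, 2, 1, 7, 2, 52, 1, 4, 7, 1, 14, 7, 21, 33, 4, 37}
ordered = sorted(s2)             # a list in ascending order

labels = {"math.DG", "astro-ph", "gr-qc"}
ordered_labels = sorted(labels)  # strings are sorted lexicographically

print(ordered)
print(ordered_labels)
```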

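With that in mind, look at the list comprehension again (the most compact form, and usually faster than the explicit loops):

unique_categories = set([i for l in [x.split(' ') for x in data["categories"]] for i in l])

How it is put together:

for x in data["categories"] iterates over every row (x is the category string of one row);
x.split(' ') then splits each string, and the results are collected inside [ ] to form a list of lists;
for l in <that list> then iterates over the inner lists;
finally for i in l takes out each sub-element,
which is collected inside [ ] and converted to a set.

Note: the set() in the original program is what removes the duplicates, which is why it can iterate over data["categories"] directly instead of over the unique() result.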
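For comparison, the same set can probably be obtained with pandas string methods alone. This is an alternative sketch rather than the approach used in these notes; it assumes a pandas version that provides `Series.explode` (0.25 or later) and uses four made-up rows.

```python
import pandas as pd

categories = pd.Series([
    "astro-ph",
    "cond-mat.str-el cond-mat.mes-hall",
    "math.DG math.AG",
    "math.DG",
])

# The comprehension from the notes.
via_comprehension = set([i for l in [x.split(" ") for x in categories] for i in l])

# str.split turns each row into a list, explode gives every list item its own row,
# and unique drops the repeats.
via_pandas = set(categories.str.split(" ").explode().unique())

print(sorted(via_pandas))
```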

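5.2 Preprocessing the time feature

The task asks us to analyze papers from 2019 onward, so we first preprocess the time feature to obtain the papers of every category from 2019 onward:

(This shows that the work always revolves around what the task requires.)

data["year"] = pd.to_datetime(data["update_date"]).dt.year  # convert update_date from a str such as 2019-02-20 to datetime and extract the year
del data["update_date"]  # drop the update_date column; it has served its purpose
data = data[data["year"] >= 2019]  # keep only the rows whose year is 2019 or later
# data.groupby(['categories','year'])  # group by categories, then by year (the result is not assigned, so data stays unchanged)
data.reset_index(drop=True, inplace=True)  # renumber the index
data  # show the result

Let's use this code as a review:

5.2.1 Handling time series with pandas (short version)

For more detail, see the reference article:
Python: Pandas time series data processing

The task: we want to count the papers from 2019. We already know the paper categories, so now we need to pick out the rows whose year is 2019.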
1. First inspect the type of the time data and extract it:

print(type(data["update_date"]))
print("-"*10)
print(data["update_date"])
print("*"*10)
print(type(data["update_date"][0]))

<class 'pandas.core.series.Series'>
----------
0         2019-08-19
1         2019-08-19
2         2019-08-19
3         2019-10-21
4         2019-08-19
             ...
170613    2019-08-17
170614    2019-08-15
170615    2019-08-17
170616    2019-08-17
170617    2019-08-21
Name: update_date, Length: 170618, dtype: object
**********
<class 'str'>

Observations:

1. The time field is update_date
2. The time data is stored in data["update_date"], as a pd.Series
3. Its elements are currently of type str, so they are not time data yet, only strings that look like dates
4. Each string contains the year, month and day

Based on these observations, we proceed as follows:

First we need to convert the date strings into datetime values
Reference articles:

pandas: converting strings to a time type, object to datetime64[ns]

Pandas: converting string dates in a DataFrame to datetime (these two articles cover the conversion, adding one day to a date, and extracting year, month and day)

   Here we use the pd.to_datetime() method

data_time = pd.to_datetime(data["update_date"])   # alternatively: data["year"] = pd.to_datetime(data["update_date"].values)
data_time

0        2019-08-19
1        2019-08-19
2        2019-08-19
3        2019-10-21
4        2019-08-19
            ...
170613   2019-08-17
170614   2019-08-15
170615   2019-08-17
170616   2019-08-17
170617   2019-08-21
Name: year, Length: 170618, dtype: datetime64[ns]

The values are now datetimes, so the next step is to extract the year from them

Use pandas.Series.dt.year (dt stands for datetime, which makes it easy to remember)

data["year"] = data_time.dt.year
print(data["year"])

0         2019
1         2019
2         2019
3         2019
4         2019
          ...
170613    2019
170614    2019
170615    2019
170616    2019
170617    2019
Name: year, Length: 170618, dtype: int64
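The two steps above, `pd.to_datetime` followed by `.dt.year`, can be tried on a tiny frame. The three rows are made up so that the later filter on 2019 has something to remove.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "update_date": ["2019-08-19", "2018-12-31", "2020-01-02"],  # plain strings
})

as_dates = pd.to_datetime(df["update_date"])  # str -> datetime64
df["year"] = as_dates.dt.year                 # integer year of each date

only_2019 = df[df["year"] == 2019].reset_index(drop=True)

print(df["year"].tolist())
print(only_2019)
```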

To avoid keeping two conflicting time fields, we delete the original one

print(data.head())  # the data before deletion, for comparison
del data["update_date"]  # modifies the DataFrame in place
print(data.head())  # the data after deletion

          id categories update_date  year
0  0704.0297   astro-ph  2019-08-19  2019
1  0704.0342    math.AT  2019-08-19  2019
2  0704.0360   astro-ph  2019-08-19  2019
3  0704.0525      gr-qc  2019-10-21  2019
4  0704.0535   astro-ph  2019-08-19  2019
          id categories  year
0  0704.0297   astro-ph  2019
1  0704.0342    math.AT  2019
2  0704.0360   astro-ph  2019
3  0704.0525      gr-qc  2019
4  0704.0535   astro-ph  2019

Note that del data["update_date"] modifies the original data. Running the statement a second time raises an error, because the column has already been deleted and can no longer be found.
The error is:

KeyError: 'update_date'

Next we filter, keeping only the 2019 data

data = data[data["year"] == 2019]  # keep only the rows whose year is 2019
print(data.head())

          id categories  year
0  0704.0297   astro-ph  2019
1  0704.0342    math.AT  2019
2  0704.0360   astro-ph  2019
3  0704.0525      gr-qc  2019
4  0704.0535   astro-ph  2019

This selects the rows whose year is 2019 and assigns the result back to data, which is a DataFrame

It does no harm to repeat the goal: count the 2019 papers

data.groupby(['categories','year'])  # groups by categories and year; this returns a GroupBy object and neither sorts nor modifies data
data.reset_index(drop=True, inplace=True)  # renumber the index
data  # show the result

                      id                                categories  year
0              0704.0297                                  astro-ph  2019
1              0704.0342                                   math.AT  2019
2              0704.0360                                  astro-ph  2019
3              0704.0525                                     gr-qc  2019
4              0704.0535                                  astro-ph  2019
...                  ...                                       ...   ...
170613  quant-ph/9904032                                  quant-ph  2019
170614  solv-int/9511005                          solv-int nlin.SI  2019
170615  solv-int/9809008                          solv-int nlin.SI  2019
170616  solv-int/9909010  solv-int adap-org hep-th nlin.AO nlin.SI  2019
170617  solv-int/9909014                          solv-int nlin.SI  2019

170618 rows × 3 columns
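One detail about the `groupby` call in this step: on its own it leaves the DataFrame untouched, so it cannot be what orders the rows. The sketch below checks that on made-up data, then shows `sort_values` as one way to actually sort and `size()` as one way to get counts out of the grouping.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["p1", "p2", "p3"],
    "categories": ["math.AT", "astro-ph", "astro-ph"],
    "year": [2019, 2019, 2019],
})
before = df.copy()

grouped = df.groupby(["categories", "year"])  # a GroupBy object; df is not changed
unchanged = df.equals(before)

counts = grouped.size()                       # number of rows in each group
sorted_df = df.sort_values(["categories", "id"]).reset_index(drop=True)

print(unchanged)
print(counts)
print(sorted_df)
```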

2. Finding the categories of a given group when they are not known in advance

We now have the category data of all the 2019 papers. Next we select all the papers in computer science.
That means finding the category codes that belong to computer science and matching against them.

So how do we find the categories?
2.1 Scrape all the categories:

# scrape all the categories
website_url = requests.get('https://arxiv.org/category_taxonomy').text  # get the text of the web page
soup = BeautifulSoup(website_url, 'lxml')  # parse the page; the lxml parser is used for speed
root = soup.find('div', {'id': 'category_taxonomy_list'})  # find the element that contains the taxonomy
tags = root.find_all(["h2", "h3", "h4", "p"], recursive=True)  # read the tags

# initialize the str and list variables
level_1_name = ""
level_2_name = ""
level_2_code = ""
level_1_names = []
level_2_codes = []
level_2_names = []
level_3_codes = []
level_3_names = []
level_3_notes = []

# walk through the tags
for t in tags:
    if t.name == "h2":
        level_1_name = t.text
        level_2_code = t.text
        level_2_name = t.text
    elif t.name == "h3":
        raw = t.text
        level_2_code = re.sub(r"(.*)\((.*)\)", r"\2", raw)  # pattern: (.*)\((.*)\); replacement: "\2"; input string: raw
        level_2_name = re.sub(r"(.*)\((.*)\)", r"\1", raw)
    elif t.name == "h4":
        raw = t.text
        level_3_code = re.sub(r"(.*) \((.*)\)", r"\1", raw)
        level_3_name = re.sub(r"(.*) \((.*)\)", r"\2", raw)
    elif t.name == "p":
        notes = t.text
        level_1_names.append(level_1_name)
        level_2_names.append(level_2_name)
        level_2_codes.append(level_2_code)
        level_3_names.append(level_3_name)
        level_3_codes.append(level_3_code)
        level_3_notes.append(notes)

# build a DataFrame from the information above
df_taxonomy = pd.DataFrame({
    'group_name': level_1_names,
    'archive_name': level_2_names,
    'archive_id': level_2_codes,
    'category_name': level_3_names,
    'categories': level_3_codes,
    'category_description': level_3_notes
})

# group by "group_name" and "archive_name" (the result is not assigned, so df_taxonomy stays unchanged)
df_taxonomy.groupby(["group_name", "archive_name"])
df_taxonomy

     group_name        archive_name      archive_id        category_name                                    categories  category_description
0    Computer Science  Computer Science  Computer Science  Artificial Intelligence                          cs.AI       Covers all areas of AI except Vision, Robotics...
1    Computer Science  Computer Science  Computer Science  Hardware Architecture                            cs.AR       Covers systems organization and hardware archi...
2    Computer Science  Computer Science  Computer Science  Computational Complexity                         cs.CC       Covers models of computation, complexity class...
3    Computer Science  Computer Science  Computer Science  Computational Engineering, Finance, and Science  cs.CE       Covers applications of computer science to the...
4    Computer Science  Computer Science  Computer Science  Computational Geometry                           cs.CG       Roughly includes material in ACM Subject Class...
...  ...               ...               ...               ...                                              ...         ...
150  Statistics        Statistics        Statistics        Computation                                      stat.CO     Algorithms, Simulation, Visualization
151  Statistics        Statistics        Statistics        Methodology                                      stat.ME     Design, Surveys, Model Selection, Multiple Tes...
152  Statistics        Statistics        Statistics        Machine Learning                                 stat.ML     Covers machine learning papers (supervised, un...
153  Statistics        Statistics        Statistics        Other Statistics                                 stat.OT     Work in statistics that does not fit into the ...
154  Statistics        Statistics        Statistics        Statistics Theory                                stat.TH     stat.TH is an alias for math.ST. Asymptotics, ...

155 rows × 6 columns

I have not really understood this scraping part yet, so I will leave it for now and add to it after my exams.
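As a partial aid for the scraping code above, the two regular expressions can be tried on sample heading texts without any network access. The two strings below are assumptions about what the `<h3>` and `<h4>` headings look like ("name (code)" and "code (name)" respectively), inferred from which group the code assigns to which variable; they are not copied from the live page.

```python
import re

h3_text = "Astrophysics (astro-ph)"          # assumed shape of an <h3>: name (code)
h4_text = "cs.AI (Artificial Intelligence)"  # assumed shape of an <h4>: code (name)

# (.*)\((.*)\) captures the text before "(" as group 1 and the text inside
# the parentheses as group 2; re.sub replaces the whole match with one group.
archive_code = re.sub(r"(.*)\((.*)\)", r"\2", h3_text)
archive_name = re.sub(r"(.*)\((.*)\)", r"\1", h3_text)

# The <h4> pattern has a literal space before "(", so group 1 has no trailing space.
category_code = re.sub(r"(.*) \((.*)\)", r"\1", h4_text)
category_name = re.sub(r"(.*) \((.*)\)", r"\2", h4_text)

print(repr(archive_name), repr(archive_code))
print(repr(category_code), repr(category_name))
```

On this sample, `archive_name` keeps a trailing space because the first pattern does not consume the space before the parenthesis; calling `.strip()` on it would be one way to remove it.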

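3. Data visualization

We obtained the taxonomy by scraping. Now we filter by category, match against the categories in the .json file, and count

_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()
_df

I can read this code, but I would not have written it this way myself. What follows is my own understanding of it

First, the goal: we want to count papers by category

The input is the data scraped earlier, so go back and look at it (155 rows × 6 columns)

What we need to do now:
count the rows that share the same group name,
then count the rows that share both the group name and the subfield,
then count the 2019 papers in each subfield of computer science and visualize the result (a pie chart showing the shares)

One at a time: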

3.1 Counting by top-level group name

df_taxonomy.groupby("group_name").count()

                                            archive_name  archive_id  category_name  categories  category_description
group_name
Computer Science                            40            40          40             40          40
Economics                                   3             3           3              3           3
Electrical Engineering and Systems Science  4             4           4              4           4
Mathematics                                 32            32          32             32          32
Physics                                     51            51          51             51          51
Quantitative Biology                        10            10          10             10          10
Quantitative Finance                        9             9           9              9           9
Statistics                                  6             6           6              6           6

This only counts the categories inside each top-level group of the scraped taxonomy:
155 rows in 8 groups.
Clearly it has not yet been combined with the paper data.

So how do we combine the two tables, that is, join them on the column they have in common?

Use the merge function

data.merge(df_taxonomy, on="categories", how="left")

This simply joins the two tables. Next we drop the rows with a duplicate (id, group_name) pair, so that each paper is counted at most once per group

data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id", "group_name"])

Then group by group_name and count the ids (duplicates have already been removed at this point)
This uses the aggregation function agg({"column to aggregate": "method to apply"})

_df = data.merge(df_taxonomy, on="categories", how="left").drop_duplicates(["id","group_name"]).groupby("group_name").agg({"id":"count"}).sort_values(by="id",ascending=False).reset_index()
_df

   group_name                                  id
0  Physics                                     79985
1  Mathematics                                 51567
2  Computer Science                            40067
3  Statistics                                  4054
4  Electrical Engineering and Systems Science  3297
5  Quantitative Biology                        1994
6  Quantitative Finance                        826
7  Economics                                   576

The top-level counts are now done
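The chained expression can be followed step by step on made-up data. The four papers and the three-row taxonomy below are invented; the last paper has two labels in one string, to show what an exact-match join on `categories` does with such a row in this toy setting.

```python
import pandas as pd

papers = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "categories": ["cs.AI", "cs.CV", "math.AT", "cs.AI cs.CV"],
})
taxonomy = pd.DataFrame({
    "categories": ["cs.AI", "cs.CV", "math.AT"],
    "group_name": ["Computer Science", "Computer Science", "Mathematics"],
})

# how="left" keeps every paper; rows without a match get NaN in group_name.
merged = papers.merge(taxonomy, on="categories", how="left")

counts = (
    merged.drop_duplicates(["id", "group_name"])  # one row per (paper, group)
    .groupby("group_name")                        # NaN groups are dropped by default
    .agg({"id": "count"})                         # count the ids in each group
    .sort_values(by="id", ascending=False)
    .reset_index()
)

unmatched = merged[merged["group_name"].isna()]["id"].tolist()

print(counts)
print(unmatched)
```

In this sketch the multi-label row `"cs.AI cs.CV"` finds no partner in the taxonomy and does not contribute to any group count, which may be worth keeping in mind when reading the totals.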

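3.2 Visualizing the top-level groups

fig = plt.figure(figsize=(15,12))
explode = (0, 0, 0, 0.2, 0.3, 0.3, 0.2, 0.1)
plt.pie(_df["id"],  labels=_df["group_name"], autopct='%1.2f%%', startangle=160, explode=explode)
plt.tight_layout()
plt.show()

This part can simply be written out directly; for how the functions are used, see (https://www.jb51.net/article/100389.htm)
Meaning of the parameters: drawing pie charts with matplotlib in Python

A short summary of how to draw the chart and what the parameters mean:

1. Import the library: import matplotlib.pyplot as plt
2. Decide whether Chinese font display is needed: plt.rcParams['font.sans-serif'] = ['SimHei']  # set the font so that Chinese text is displayed
3. Write plt.pie(values, labels=the matching labels, colors=a list of colors, explode=[offset of wedge 1, offset of wedge 2, ...], autopct="format of the percentage shown on each wedge"). The arguments have to be prepared beforehand.
4. plt.legend(names, loc="position") adds a legend that shows the labels
5. plt.show()

If Chinese text fails to display, see (fixing garbled Chinese text in Matplotlib plots under Python 3)

The same approach carries over to drawing other kinds of plots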
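The numbered steps can be put together in a small self-contained sketch. It uses only the three largest counts from the table in section 3.1, selects the non-interactive "Agg" backend so that it runs without a display, and saves the figure to an in-memory buffer instead of calling `plt.show()`; those choices are assumptions made for the sake of a runnable example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt

names = ["Physics", "Mathematics", "Computer Science"]
values = [79985, 51567, 40067]  # three largest group counts from section 3.1
explode = (0, 0, 0.1)           # pull the third wedge slightly out of the pie

fig, ax = plt.subplots(figsize=(6, 6))
ax.pie(values, labels=names, autopct="%1.2f%%", startangle=160, explode=explode)
ax.legend(loc="upper right")    # legend entries come from the wedge labels
fig.tight_layout()

buf = io.BytesIO()
fig.savefig(buf, format="png")  # in a notebook, plt.show() would be used instead
plt.close(fig)
```

`explode` needs one entry per wedge, which is why the tuple in the original code has eight values for the eight groups.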

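3.3 Counting the subcategories within a group (2019 paper counts for each subfield of computer science)

First we need to select the computer science papers. That raises some questions:
Q1: Select from where? The data has to contain both the scraped taxonomy and the paper records, i.e. the two tables combined with pandas.
A1: Select from the merged data.

Q2: How do we select the computer science rows, i.e. how do we query for one group with pandas?
A2: First look (by eye) at which column identifies computer science, then filter on that value (the query() method does this).

Q3: How do we count?
A3: Group by top-level group and subcategory, then count.

In practice:

group_name = "Computer Science"
cats = data.merge(df_taxonomy, on="categories").query("group_name == @group_name")
cats.groupby(["group_name", "category_name"]).count().reset_index()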

    group_name        category_name                                    id    categories  year  archive_name  archive_id  category_description
0   Computer Science  Artificial Intelligence                          558   558         558   558           558         558
1   Computer Science  Computation and Language                         2153  2153        2153  2153          2153        2153
2   Computer Science  Computational Complexity                         131   131         131   131           131         131
3   Computer Science  Computational Engineering, Finance, and Science  108   108         108   108           108         108
4   Computer Science  Computational Geometry                           199   199         199   199           199         199
5   Computer Science  Computer Science and Game Theory                 281   281         281   281           281         281
6   Computer Science  Computer Vision and Pattern Recognition          5559  5559        5559  5559          5559        5559
7   Computer Science  Computers and Society                            346   346         346   346           346         346
8   Computer Science  Cryptography and Security                        1067  1067        1067  1067          1067        1067
9   Computer Science  Data Structures and Algorithms                   711   711         711   711           711         711
10  Computer Science  Databases                                        282   282         282   282           282         282
11  Computer Science  Digital Libraries                                125   125         125   125           125         125
12  Computer Science  Discrete Mathematics                             84    84          84    84            84          84
13  Computer Science  Distributed, Parallel, and Cluster Computing     715   715         715   715           715         715
14  Computer Science  Emerging Technologies                            101   101         101   101           101         101
15  Computer Science  Formal Languages and Automata Theory             152   152         152   152           152         152
16  Computer Science  General Literature                               5     5           5     5             5           5
17  Computer Science  Graphics                                         116   116         116   116           116         116
18  Computer Science  Hardware Architecture                            95    95          95    95            95          95
19  Computer Science  Human-Computer Interaction                       420   420         420   420           420         420
20  Computer Science  Information Retrieval                            245   245         245   245           245         245
21  Computer Science  Logic in Computer Science                        470   470         470   470           470         470
22  Computer Science  Machine Learning                                 177   177         177   177           177         177
23  Computer Science  Mathematical Software                            27    27          27    27            27          27
24  Computer Science  Multiagent Systems                               85    85          85    85            85          85
25  Computer Science  Multimedia                                       76    76          76    76            76          76
26  Computer Science  Networking and Internet Architecture             864   864         864   864           864         864
27  Computer Science  Neural and Evolutionary Computing                235   235         235   235           235         235
28  Computer Science  Numerical Analysis                               40    40          40    40            40          40
29  Computer Science  Operating Systems                                36    36          36    36            36          36
30  Computer Science  Other Computer Science                           67    67          67    67            67          67
31  Computer Science  Performance                                      45    45          45    45            45          45
32  Computer Science  Programming Languages                            268   268         268   268           268         268
33  Computer Science  Robotics                                         917   917         917   917           917         917
34  Computer Science  Social and Information Networks                  202   202         202   202           202         202
35  Computer Science  Software Engineering                             659   659         659   659           659         659
36  Computer Science  Sound                                            7     7           7     7             7           7
37  Computer Science  Symbolic Computation                             44    44          44    44            44          44
38  Computer Science  Systems and Control                              415   415         415   415

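This result is hard to read: every count column repeats the same number, and nothing has been consolidated.

Since we already know we are working on computer science only, all we need from this table is category_name, id, and the year 2019

The table needs to be reshaped

I could not work out how.

The reference solution uses the pivot() method:
On this method, see: the Python pandas library, notes on using pivot, which solved my Excel reshaping problem

So the question is:
what problem does pivot() actually solve?
Understanding the pandas pivot_table in one article

Pandas pivot_table explained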
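As a tentative answer to that question: `pivot()` reshapes a long table into a wide one, turning the values of one column into the row index and the values of another into column headers. The sketch below applies it after `query` and `groupby` on a made-up merged table; the four rows and the choice of `year` as the column axis are assumptions for illustration, not necessarily the reference solution.

```python
import pandas as pd

# A made-up stand-in for the merged paper/taxonomy table.
merged = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4"],
    "year": [2019, 2019, 2019, 2019],
    "group_name": ["Computer Science", "Computer Science",
                   "Computer Science", "Mathematics"],
    "category_name": ["Machine Learning", "Robotics",
                      "Robotics", "Algebraic Topology"],
})

group_name = "Computer Science"
cats = merged.query("group_name == @group_name")  # @ refers to the Python variable

# Long form: one row per (year, category_name) with its count.
counts = cats.groupby(["year", "category_name"]).count().reset_index()

# Wide form: one row per category, one column per year, ids counted in the cells.
table = counts.pivot(index="category_name", columns="year", values="id")

print(table)
```

The result keeps a single count per subfield and year, without the repeated count columns of the plain `groupby(...).count()` output.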
