A Dataset of 1.7 Million arXiv Articles Available on Kaggle

“arXiv is a free distribution service and an open-access archive for 1.7 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics”, as stated by its editors. arXiv is a gold mine of knowledge. The deeper you dig, the more valuable information you find. It also makes it easier to follow trends in science.

If you work in data science, you have probably read articles on arXiv. If you haven’t yet, you should. Since data science is still an evolving field, new papers leading to new enhancements are published every day. This makes platforms like arXiv even more valuable.

arXiv has made its entire corpus available as a dataset on Kaggle. The dataset contains relevant features such as article titles, authors, categories, content (both abstract and full text) and citations of the 1.7 million scholarly articles available on arXiv.
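As a minimal sketch of how one might start exploring the metadata, the snippet below streams records one at a time instead of loading everything into memory. It assumes the metadata file (`arxiv-metadata-oai-snapshot.json`, the name that appears in the Kaggle link at the end of this article) stores one JSON object per line, and that records carry `id`, `title` and `categories` fields. Both the layout and the field names are assumptions to verify against the actual download, and the sample records are made up for illustration.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Two made-up records standing in for the real file (hypothetical values).
SAMPLE = io.StringIO(
    '{"id": "0001.0001", "title": "A toy paper", "categories": "cs.LG stat.ML"}\n'
    '{"id": "0001.0002", "title": "Another toy paper", "categories": "math.CO"}\n'
)

records = list(iter_records(SAMPLE))
titles = [rec["title"] for rec in records]
print(titles)
```

With the real file, the same function can be fed an open file handle, for example `with open("arxiv-metadata-oai-snapshot.json", encoding="utf-8") as f:` followed by a loop over `iter_records(f)`, so that only one record is held in memory at a time.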

This dataset is an excellent resource for machine learning and deep learning applications. Some possible applications are:

Natural language processing (NLP) and understanding (NLU) use cases
Text generation with deep learning using the content of articles
Predictive analytics such as category prediction of articles
Trend analysis of topics in different scientific fields
Paper recommender engine
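To make the trend analysis idea concrete, here is a small sketch that counts how many papers carry a given category in each year. It assumes each record has a space-separated `categories` string and an `update_date` string in `YYYY-MM-DD` form. These field names and formats are assumptions, and the records below are toy data rather than real arXiv entries. Note that an update date is not necessarily the original submission date, so a real analysis may want a different date field.

```python
from collections import Counter
from typing import Iterable


def papers_per_year(records: Iterable[dict], category: str) -> Counter:
    """Count records tagged with `category`, grouped by year of `update_date`."""
    counts: Counter = Counter()
    for rec in records:
        # A paper can list several categories, separated by spaces (assumed format).
        if category in rec["categories"].split():
            year = int(rec["update_date"][:4])
            counts[year] += 1
    return counts


# Toy records (hypothetical values) to show the shape of the input.
TOY = [
    {"categories": "cs.LG stat.ML", "update_date": "2019-05-01"},
    {"categories": "cs.LG", "update_date": "2020-01-15"},
    {"categories": "math.CO", "update_date": "2020-03-02"},
    {"categories": "cs.CV cs.LG", "update_date": "2020-07-30"},
]

trend = papers_per_year(TOY, "cs.LG")
print(sorted(trend.items()))  # [(2019, 1), (2020, 2)]
```

The same per-year counts, computed for several categories, are enough to plot how interest in each field changes over time.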

Deep learning models are data hungry. With the advancements in computing and processing, models can absorb more data than ever. Such a big dataset of scientific text is highly valuable raw material for NLP, NLU and text generation. We may even see a model that writes scholarly articles on some topics. OpenAI’s new text generator, GPT-3, pushes us to think beyond the current limits. Thus, I don’t think a deep learning model that writes about science is too far off.

Eleonora Presani, arXiv executive director, said that “by offering the dataset on Kaggle we go beyond what humans can learn by reading all these articles and we make the data and information behind arXiv available to the public in a machine-readable format”. I definitely agree with her on the learning opportunities. Having all of these articles as a dataset allows us to go beyond learning by reading. A ton of valuable insights can be discovered in this gold mine of articles through data analysis and machine learning. For instance, some not-so-obvious connections between different technologies can come to light.

Converting the entire arXiv corpus into a well-structured and organized dataset has the potential to accelerate scientific discoveries. Science grows and advances by building on itself. There is no need to reinvent the wheel when we can focus on improving it. By analyzing this arXiv dataset, we can obtain a concise summary of what science has been up to and shed light on what we need to focus on going forward.

There is so much you can do with this dataset. I highly encourage you to at least take a look at it. You don’t have to build a machine learning product; the dataset is also a helpful resource for practicing data analysis and processing skills.

Thank you for reading. Please let me know if you have any feedback.

https://www.kaggle.com/Cornell-University/arxiv?select=arxiv-metadata-oai-snapshot.json

https://blogs.cornell.edu/arxiv/2020/08/05/leveraging-machine-learning-to-fuel-new-discoveries-with-the-arxiv-dataset/

Translated from: https://towardsdatascience.com/a-dataset-of-1-7-million-arxiv-articles-available-on-kaggle-8a11075cac32
