30 Largest TensorFlow Datasets for Machine Learning

Created by researchers at Google Brain, TensorFlow is one of the largest open-source libraries for machine learning and data science. It’s an end-to-end platform for both complete beginners and experienced data scientists. The TensorFlow ecosystem includes tools, pre-trained models, machine learning guides, and a collection of open datasets. To help you find the training data you need, this article briefly introduces some of the largest TensorFlow datasets for machine learning. We’ve divided the list into image, video, audio, and text datasets.

TensorFlow Image Datasets

1. CelebA – One of the largest publicly available face image datasets, the CelebFaces Attributes Dataset (CelebA) contains over 200,000 images of celebrities. Each image includes 5 facial landmarks and 40 binary attribute annotations.

2. Downsampled Imagenet – This dataset was built for density estimation and generative modeling tasks. It includes just over 1.3 million images of objects, scenes, vehicles, people, and more. The images are available in two resolutions: 32 x 32 and 64 x 64.

3. Lsun – Lsun is a large-scale image dataset created to help train models for scene understanding. The dataset contains over 9 million images divided into scene categories, such as bedroom, classroom, and dining room.

4. Bigearthnet – Bigearthnet is another large-scale dataset, containing satellite images from Sentinel-2. Each image covers a 1.2 km x 1.2 km patch of the ground. Every image is annotated with one or more of 43 class labels, and the label distribution is imbalanced.

5. Places 365 – As the name suggests, Places 365 covers 365 categories of places or scenes, with over 1.8 million images in total. Some of the categories include office, pier, and cottage. Places 365 is one of the largest datasets available for scene recognition tasks.

6. Quickdraw Bitmap – The Quickdraw dataset is a collection of drawings made by players of the game Quick, Draw!. It contains 50 million drawings that span 345 categories. This version of the Quickdraw dataset provides the drawings as 28 x 28 grayscale images.

7. SVHN Cropped – From Stanford University, Street View House Numbers (SVHN) is a TensorFlow dataset built to train digit recognition algorithms. It contains 600,000 examples of real-world image data which have been cropped to 32 x 32 pixels.

8. VGGFace2 – One of the largest face image datasets, VGGFace2 contains images downloaded from Google Image Search. The faces vary in age, pose, and ethnicity. On average, there are 362 images of each subject.

9. COCO – Made by collaborators from Google, FAIR, Caltech, and more, COCO is one of the largest labeled image datasets in the world. It was built for object detection, segmentation, and image captioning tasks.

The dataset contains 330,000 images, 200,000 of which are labeled. Within the images are 1.5 million object instances across 80 categories.

10. Open Images Challenge 2019 – Containing around 9 million images, this dataset is one of the largest labeled image datasets available online. The images contain image-level labels, object bounding boxes, and object segmentation masks, as well as visual relationships.

11. Open Images V4 – This dataset is another iteration of the Open Images dataset mentioned above. V4 contains 14.6 million bounding boxes for 600 different object classes. The bounding boxes have been manually drawn by human annotators.

12. AFLW2K3D – This dataset contains 2000 facial images all annotated with 3D facial landmarks. It was created to evaluate 3D facial landmark detection models.
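The image datasets above are distributed through the `tensorflow_datasets` (TFDS) package. As a minimal sketch, assuming TFDS is installed and that the identifiers in the `TFDS_NAMES` table below still match the current TFDS catalog, loading one of them could look like this. The table and the helpers `catalog_name` and `load_split` are illustrative names for this example, not part of TFDS itself.

```python
# Assumed mapping from the names used in this article to TFDS catalog
# identifiers. Verify each identifier against the TFDS catalog before use.
TFDS_NAMES = {
    "CelebA": "celeb_a",
    "Quickdraw Bitmap": "quickdraw_bitmap",
    "SVHN Cropped": "svhn_cropped",
    "COCO": "coco",
}


def catalog_name(article_name):
    """Return the assumed TFDS identifier for a dataset named in the list."""
    try:
        return TFDS_NAMES[article_name]
    except KeyError:
        raise ValueError(f"No TFDS identifier recorded for {article_name!r}")


def load_split(article_name, split="train"):
    """Load one split as a tf.data.Dataset, together with its metadata.

    The import is deferred so that the lookup helper above still works on a
    machine without tensorflow_datasets installed.
    """
    import tensorflow_datasets as tfds

    # with_info=True makes tfds.load return a (dataset, info) tuple.
    return tfds.load(catalog_name(article_name), split=split, with_info=True)
```

Calling `load_split("SVHN Cropped")` downloads and prepares the data on first use, which can take a long time for the larger entries in this list. Some datasets in the catalog also require a manual download step before they can be loaded.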

TensorFlow Video Datasets

13. UCF101 – From the University of Central Florida, UCF101 is a video dataset built to train action recognition models. The dataset has 13,320 videos that span 101 action categories.

14. BAIR Robot Pushing – From Berkeley Artificial Intelligence Research, BAIR Robot Pushing contains 44,000 example videos of robot pushing motions.

15. Moving MNIST – This dataset is a variant of the MNIST benchmark dataset. Moving MNIST contains 10,000 videos.

Each video shows 2 handwritten digits moving around within a 64 x 64 frame.

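16. EMNIST – Extended MNIST (EMNIST) contains handwritten digits and letters derived from NIST Special Database 19, converted into the same 28 x 28 pixel format as the original MNIST dataset. Although it is listed alongside Moving MNIST, EMNIST is a still-image dataset.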
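To get a feel for what "10,000 videos in a 64 x 64 frame" means in practice, the sketch below estimates the shape and uncompressed size of Moving MNIST when held as one dense array. The video count and frame size come from the list above. The 20 frames per video and the single 8-bit grayscale channel are assumptions that are not stated in this article, so check them against the dataset's documentation.

```python
# Figures taken from the list above.
NUM_VIDEOS = 10_000
FRAME_HEIGHT = 64
FRAME_WIDTH = 64

# Assumptions not stated in the article: 20 frames per video and one
# unsigned 8-bit grayscale channel per pixel.
FRAMES_PER_VIDEO = 20
CHANNELS = 1
BYTES_PER_VALUE = 1


def raw_size_bytes(num_videos, frames, height, width, channels, bytes_per_value):
    """Uncompressed size of a video dataset stored as one dense array."""
    return num_videos * frames * height * width * channels * bytes_per_value


# Shape in (videos, frames, height, width, channels) order.
dataset_shape = (NUM_VIDEOS, FRAMES_PER_VIDEO, FRAME_HEIGHT, FRAME_WIDTH, CHANNELS)
total_bytes = raw_size_bytes(*dataset_shape, BYTES_PER_VALUE)

print(dataset_shape)
print(f"{total_bytes / 2**20:.0f} MiB")
```

Under these assumptions the whole dataset fits comfortably in memory, which is not true of the larger video datasets in this section.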

TensorFlow Audio Datasets

17. CREMA-D – Created for emotion recognition tasks, CREMA-D consists of vocal emotional expressions. This dataset contains 7,442 audio clips voiced by 91 actors of varying age, ethnicity, and gender.

18. Librispeech – Librispeech is an audio dataset that contains roughly 1,000 hours of English speech derived from audiobooks from the LibriVox project. It has been used to train both acoustic models and language models.

19. Libritts – This dataset contains around 585 hours of English speech, prepared with the assistance of Google Brain team members. Libritts was originally designed for Text-to-speech (TTS) research, but can be used for a variety of voice recognition tasks.

20. TED-LIUM – TED-LIUM is a dataset that consists of over 110 hours of English TED Talks. All talks have been transcribed.

21. VoxCeleb – A large audio dataset built for speaker identification tasks, VoxCeleb contains over 150,000 audio samples from 1,251 speakers.
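Hours of audio translate directly into sample counts and storage. The following sketch works through that arithmetic for the roughly 1,000 hours in Librispeech. The 16 kHz sample rate and 16-bit mono PCM encoding are assumptions for this example, not figures from the article. The datasets are normally distributed in compressed form, so the real download is smaller.

```python
SECONDS_PER_HOUR = 3_600

# Figure taken from the list above.
LIBRISPEECH_HOURS = 1_000

# Assumptions for this estimate: 16 kHz mono audio stored as 16-bit PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2


def pcm_sample_count(hours, sample_rate_hz):
    """Number of audio samples in `hours` of single-channel audio."""
    return hours * SECONDS_PER_HOUR * sample_rate_hz


def pcm_size_bytes(hours, sample_rate_hz, bytes_per_sample):
    """Uncompressed PCM size in bytes for single-channel audio."""
    return pcm_sample_count(hours, sample_rate_hz) * bytes_per_sample


samples = pcm_sample_count(LIBRISPEECH_HOURS, SAMPLE_RATE_HZ)
size_bytes = pcm_size_bytes(LIBRISPEECH_HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)

print(f"{samples:,} samples")
print(f"{size_bytes / 1e9:.1f} GB uncompressed")
```

The same two functions can be reused for the other audio datasets by swapping in their hour counts and sample rates.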

TensorFlow Text Datasets

22. C4 (Colossal Clean Crawled Corpus) – C4 is a cleaned version of Common Crawl’s web crawl corpus. Common Crawl is an open repository of web page data. It’s available in over 40 languages and spans seven years of data.

23. Civil Comments – This dataset is an archive of over 1.8 million examples of public comments from 50 English-language news sites.

24. IRC Disentanglement – This TensorFlow dataset includes just over 77,000 comments from the Ubuntu IRC Channel. The metadata for each sample includes the message ID and timestamps.

25. Lm1b – Known as the One Billion Word Benchmark, this dataset contains roughly 1 billion words. It was originally made to measure progress in statistical language modeling.

26. SNLI – The Stanford Natural Language Inference Dataset is a corpus of 570,000 human-written sentence pairs. All of the pairs have been manually labeled for balanced classification.

27. e-SNLI – This dataset is an extension of SNLI mentioned above. It contains the original dataset’s 570,000 sentence pairs, each classified as entailment, contradiction, or neutral, and adds human-written natural language explanations for the labels.

28. MultiNLI – Modeled after the SNLI dataset, MultiNLI includes 433,000 sentence pairs all annotated with entailment information.

29. Wiki40b – This large-scale dataset includes text from Wikipedia articles in 40 different languages. The data has been cleaned and non-content sections, as well as structured objects, have been removed.

30. Yelp Polarity Reviews – This dataset contains 598,000 highly polar Yelp reviews. They have been extracted from the data included in the Yelp Dataset Challenge 2015.
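The natural language inference datasets above (SNLI, e-SNLI, and MultiNLI) all share the same basic record: a pair of sentences plus one of three labels. The sketch below shows one way to represent such a record and to check how balanced a set of labels is. The `SentencePair` class, the `label_shares` helper, and the three toy sentences are made up for illustration and are not taken from any of the datasets.

```python
from collections import Counter
from dataclasses import dataclass

# The three classes named in the list above.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class SentencePair:
    """One inference example: does the premise support the hypothesis?"""
    premise: str
    hypothesis: str
    label: str


def label_shares(pairs):
    """Return the fraction of examples carrying each label."""
    counts = Counter(pair.label for pair in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute shares for an empty collection")
    return {label: counts[label] / total for label in LABELS}


# Invented toy examples, one per class.
toy_pairs = [
    SentencePair("A man is playing a guitar on stage.",
                 "A person is making music.", "entailment"),
    SentencePair("A man is playing a guitar on stage.",
                 "The stage is empty.", "contradiction"),
    SentencePair("A man is playing a guitar on stage.",
                 "The man is a famous musician.", "neutral"),
]

shares = label_shares(toy_pairs)
print(shares)
```

A dataset labeled for balanced classification should give shares close to one third each when the same check is run over its full training split.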

While the datasets above are some of the largest and most widely used TensorFlow datasets for machine learning, the TensorFlow library is vast and continuously expanding. Please visit the TensorFlow website for more information about how the platform can help you build your own models.
