Columbia University Public Figures Face Database (PubFig)

Original text:

Introduction

The PubFig database is a large, real-world face dataset consisting of 58,797 images of 200 people collected from the internet. Unlike most other existing face datasets, these images are taken in completely uncontrolled situations with non-cooperative subjects. Thus, there is large variation in pose, lighting, expression, scene, camera, imaging conditions and parameters, etc. The PubFig dataset is similar in spirit to the Labeled Faces in the Wild (LFW) dataset created at UMass-Amherst, although there are some significant differences between the two:

- LFW contains 13,233 images of 5,749 people, and is thus much broader than PubFig. However, it's also smaller and much shallower (many fewer images per person on average).
- LFW is derived from the Names and Faces in the News work of T. Berg, et al. These images were originally collected using news sources online. For many people, there are often several images taken at the same event, with the person wearing similar clothing and in the same environment. Our paper at ICCV 2009 showed that this can often be exploited by algorithms to give unrealistic boosts in performance.
- Of course, the PubFig dataset no doubt has biases of its own, and we welcome any attempts to categorize these.
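
To make the breadth-versus-depth comparison concrete, the per-person averages can be worked out from the counts quoted above. This is only illustrative arithmetic; the variable and function names are ours and are not part of either dataset's tooling.

```python
# Image and identity counts as quoted in the text above.
datasets = {
    "PubFig": {"images": 58_797, "people": 200},
    "LFW": {"images": 13_233, "people": 5_749},
}


def images_per_person(images: int, people: int) -> float:
    """Average number of images per identity."""
    return images / people


averages = {name: images_per_person(**counts) for name, counts in datasets.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.1f} images per person on average")
```

PubFig averages roughly 294 images per person, against about 2.3 for LFW, which is what "shallower" means here.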

We have created a face verification benchmark on this dataset that tests the abilities of algorithms to classify a pair of images as being of the same person or not. Importantly, these two people should never have been seen by the algorithm during training. In the future, we hope to create recognition benchmarks as well.
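
The requirement that test identities are never seen during training can be sketched as follows. This is a minimal illustration under our own assumptions, not the benchmark's actual protocol: the function names, the toy `images_by_person` mapping, and the split fraction are all hypothetical.

```python
import itertools
import random


def split_people(people, test_fraction=0.5, seed=0):
    """Partition identities into disjoint (train, test) sets."""
    shuffled = sorted(people)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return set(shuffled[n_test:]), set(shuffled[:n_test])


def make_pairs(images_by_person, people):
    """Build (image_a, image_b, same_person) pairs from the given identities only."""
    labelled = [(img, p) for p in sorted(people) for img in images_by_person[p]]
    return [
        (img_a, img_b, person_a == person_b)
        for (img_a, person_a), (img_b, person_b) in itertools.combinations(labelled, 2)
    ]


# Hypothetical toy data: person name -> image file names.
images_by_person = {
    "person_a": ["a1.jpg", "a2.jpg"],
    "person_b": ["b1.jpg", "b2.jpg"],
    "person_c": ["c1.jpg", "c2.jpg"],
    "person_d": ["d1.jpg", "d2.jpg"],
}

train_people, test_people = split_people(images_by_person)
train_pairs = make_pairs(images_by_person, train_people)
test_pairs = make_pairs(images_by_person, test_people)
```

Because pairs are built separately from each identity set, no person in `test_pairs` can appear in `train_pairs`, which is the property the benchmark description asks for.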

Citation

The database is made available only for non-commercial use. If you use this dataset, please cite the following paper:

"Attribute and Simile Classifiers for Face Verification,"

Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar,

International Conference on Computer Vision (ICCV), 2009.

News

December 23, 2010: Updated PubFig to v1.2. The changes are as follows:
- We added md5 checksums for all images in the datafiles on the download page.
September 10, 2010: Updated PubFig to v1.1. The major changes are as follows:
- We recomputed attribute values using updated classifiers, expanding to 73 attributes.
- Attribute values now exist for the development set as well as the evaluation set (previously only the evaluation set had attribute values).
- We updated the face rectangles for faces to be much tighter around the face, as opposed to the rather loose boundaries given before.
- We removed 679 bad images, including non-jpegs, images with non-standard colorspaces, corrupted images, and images with very poor alignment.
- We generated a new cross-validation set, taking into account these deleted images. We ran our algorithm with our new attribute classifiers on this set, obtaining a new curve.
- We removed the verification subsets by pose, lighting, and expression, as they were not being used. Instead, we created a single datafile which contains the manual labels for these parameters.
- Some of the datafile formats have changed slightly, to be more consistent with the others.
- We added the python script used to generate the output ROC curves.
- We updated this website to be cleaner and easier to read.
December 21, 2009: Added face locations to the dataset.
December 2, 2009: Created website and publicly released v1.0 of the dataset.
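
The md5 checksums added in v1.2 make it possible to detect corrupted or altered downloads. The sketch below shows one way to check a local file against an expected digest using Python's standard `hashlib` module. The function names are ours, and how the expected checksum is read from the datafiles is left out, since the datafile layout is not described here.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """True if the file exists and its md5 equals the expected hex string."""
    path = Path(path)
    return path.is_file() and md5_of_file(path) == expected_md5.strip().lower()
```

An image for which `matches_checksum` returns `False` is either missing or differs from the version the checksum was computed on, and is a candidate for re-downloading or exclusion.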

Related Projects

Attribute and Simile Classifiers for Face Verification (Columbia)
FaceTracer: A Search Engine for Large Collections of Images with Faces (Columbia)
Labeled Faces in the Wild (UMass-Amherst)
Names and Faces (SUNY-Stonybrook)
