Technorati Tags: website architecture

Original post: http://slj.me/2009/05/efficient-storage-of-billions-of-photos/

Facebook's photo sharing feature is extremely popular. To date, Facebook users have uploaded 15 billion photos; since four images of different sizes are stored for each photo, that comes to more than 60 billion images and over 1.5PB of storage. About 220 million new photos (roughly 25TB) are added every week, and at peak Facebook serves 550,000 images per second. These numbers make managing the data a huge challenge. This article, written by Facebook engineers, describes how they manage these photos.

The old NFS photo infrastructure

The old photo infrastructure consisted of the following tiers:

  • The upload tier receives user-uploaded photos and saves them to the NFS storage tier.
  • The photo serving tier receives HTTP requests and serves photos from the NFS storage tier.
  • The NFS storage tier is built on top of commercial storage appliances.

Because each photo is stored as its own file, the sheer number of photos produces an enormous amount of metadata, far exceeding what the NFS storage tier can cache, so every photo upload or read request involves multiple I/O operations. This metadata overhead became the bottleneck of the whole photo infrastructure, and is one of the reasons Facebook relies heavily on CDNs to serve photos. Two optimizations were deployed to mitigate the problem:

  • Cachr: a caching server tier that caches Facebook's smaller profile photos.
  • NFS file handle cache: deployed on the photo serving tier to reduce the metadata overhead of the NFS storage tier.

The new Haystack photo infrastructure

The new photo infrastructure merges the serving tier and the storage tier into a single physical tier. It is built around an HTTP-based photo server that stores photos in an object store called Haystack, eliminating unnecessary metadata overhead from photo read operations. In the new architecture, I/O operations touch only actual photo data (rather than filesystem metadata). Haystack can be broken down into the following functional layers:

  • HTTP server
  • Photo Store
  • Haystack object store
  • Filesystem
  • Storage

Storage

Haystack is deployed on commodity storage blades. A typical 2U blade contains:

  • 2 x quad-core CPUs
  • 16GB – 32GB of memory
  • a hardware RAID controller with 256MB – 512MB of NVRAM cache
  • 12+ 1TB SATA drives

Each blade provides roughly 10TB of usable storage, configured as a hardware RAID-6 partition. RAID-6 delivers good performance and redundancy while keeping cost low. Its poor write performance is partially offset by the controller's NVRAM write-back cache, and the on-disk caches are disabled to guarantee data consistency in the event of a crash or power loss.

Filesystem

The Haystack object stores are built on a single filesystem created on top of the 10TB volume. Each file in the filesystem is mapped to its physical location on the volume through a block map; the filesystem currently used is XFS.

Haystack Object Store

Haystack is a simple log-structured (append-only) object store containing needles that represent the stored objects. A Haystack consists of two files: the haystack store file containing the needles, and an index file:

Figure: layout of the haystack store file

Figure: layout of the needle and index file

Haystack Write Operation

A Haystack write synchronously appends needles to the haystack store file; once the needles have been committed, the corresponding index records are written to the index file. To limit the damage caused by hardware failures, the index file is also periodically flushed to the underlying storage.

Haystack Read Operation

The parameters passed to a Haystack read include the needle offset, key, alternate key, cookie, and data size. Haystack then reads the whole needle from the file based on the data size.

Haystack Delete Operation

Deletion is simple: a "deleted" flag is set on the needle in the haystack store. The space used by deleted needles and their index records is not reclaimed.

Photo Store Server

The Photo Store server accepts HTTP requests and translates them into the corresponding Haystack operations. To minimize I/O, the server keeps an in-memory index of all photo offsets in the haystack store files; at startup, it reads the index files into this cache. With hundreds of millions of photos per node, the index must be kept small enough to fit in the server's physical memory.

When a user uploads a photo, it is assigned a unique 64-bit ID and is then scaled into four different sizes. Each scaled image shares the same random cookie and 64-bit key, and the logical size (large, medium, small, thumbnail) is stored in the alternate key. The upload server then calls the Photo Store server to store all four images, together with this metadata, in the Haystack.

The in-memory index keeps the following information for each photo:

Haystack uses Google's open source sparse hash data structure to keep the in-memory index as small as possible.

Photo Store Write/Modify Operation

A write operation writes the photo data to the Haystack store and updates the in-memory index. If the index already contains records with the same keys, this is a modification of existing photos.

Photo Store Read Operation

The parameters passed in include the haystack ID and the photo's key, size, and cookie. The server looks the photo up in the in-memory index and then reads the actual data from the Haystack.

Photo Store Delete Operation

After the Haystack delete operation is invoked, the in-memory index is updated by setting the photo's offset to zero, indicating that the photo has been deleted.

Compaction

Compaction copies needles into a new Haystack, skipping the data of deleted photos along the way, and rebuilds the in-memory index.

HTTP Server

The HTTP framework used is the simple evhttp server. Multiple threads are used, and each thread can serve one HTTP request at a time.

Summary

Haystack is an HTTP-based object store containing needles that map to the stored data. The architecture eliminates filesystem metadata overhead and keeps the entire index in memory, allowing photos to be stored and read with a minimal number of I/O operations.

======================== Excerpt separator ========================

English original:

Original source: http://www.facebook.com/FacebookEngineering#/note.php?note_id=76191543919&ref=mf

A note posted by Engineering @ Facebook

Needle in a haystack: efficient storage of billions of photos

The Photos application is one of Facebook’s most popular features. To date, users have uploaded over 15 billion photos, which makes Facebook the biggest photo sharing website. For each uploaded photo, Facebook generates and stores four images of different sizes, which translates to a total of 60 billion images and 1.5PB of storage. The current growth rate is 220 million new photos per week, which translates to 25TB of additional storage consumed weekly. At the peak there are 550,000 images served per second. These numbers pose a significant challenge for the Facebook photo storage infrastructure.

NFS photo infrastructure

The old photo infrastructure consisted of several tiers:

  • Upload tier receives users’ photo uploads, scales the original images and saves them on the NFS storage tier.
  • Photo serving tier receives HTTP requests for photo images and serves them from the NFS storage tier.
  • NFS storage tier built on top of commercial storage appliances.

Since each image is stored in its own file, there is an enormous amount of metadata generated on the storage tier due to the namespace directories and file inodes. The amount of metadata far exceeds the caching abilities of the NFS storage tier, resulting in multiple I/O operations per photo upload or read request. The whole photo serving infrastructure is bottlenecked on the high metadata overhead of the NFS storage tier, which is one of the reasons why Facebook relies heavily on CDNs to serve photos. Two additional optimizations were deployed in order to mitigate this problem to some degree:

  • Cachr: a caching server tier caching smaller Facebook “profile” images.
  • NFS file handle cache: deployed on the photo serving tier to eliminate some of the NFS storage tier metadata overhead.

Haystack Photo Infrastructure

The new photo infrastructure merges the photo serving tier and storage tier into one physical tier. It implements an HTTP-based photo server which stores photos in a generic object store called Haystack. The main requirement for the new tier was to eliminate any unnecessary metadata overhead for photo read operations, so that each read I/O operation was only reading actual photo data (instead of filesystem metadata). Haystack can be broken down into these functional layers:

  • HTTP server
  • Photo Store
  • Haystack Object Store
  • Filesystem
  • Storage

In the following sections we look closely at each of the functional layers from the bottom up.

Storage

Haystack is deployed on top of commodity storage blades. The typical hardware configuration of a 2U storage blade is –

  • 2 x quad-core CPUs
  • 16GB – 32GB memory
  • hardware raid controller with 256MB – 512MB of NVRAM cache
  • 12+ 1TB SATA drives

Each storage blade provides around 10TB of usable space, configured as a RAID-6 partition managed by the hardware RAID controller. RAID-6 provides adequate redundancy and excellent read performance while keeping the storage cost down. The poor write performance is partially mitigated by the RAID controller NVRAM write-back cache. Since the reads are mostly random, the NVRAM cache is fully reserved for writes. The disk caches are disabled in order to guarantee data consistency in the event of a crash or a power loss.

Filesystem

Haystack object stores are implemented on top of files stored in a single filesystem created on top of the 10TB volume.

Photo read requests result in read() system calls at known offsets in these files, but in order to execute the reads, the filesystem must first locate the data on the actual physical volume. Each file in the filesystem is represented by a structure called an inode which contains a block map that maps the logical file offset to the physical block offset on the physical volume. For large files, the block map can be quite large depending on the type of the filesystem in use.

Block based filesystems maintain mappings for each logical block, and for large files, this information will not typically fit into the cached inode and is stored in indirect address blocks instead, which must be traversed in order to read the data for a file. There can be several layers of indirection, so a single read could result in several I/Os depending on whether or not the indirect address blocks are cached.

Extent based filesystems maintain mappings only for contiguous ranges of blocks (extents). A block map for a contiguous large file could consist of only one extent which would fit in the inode itself. However, if the file is severely fragmented and its blocks are not contiguous on the underlying volume, its block map can grow large as well. With extent based filesystems, fragmentation can be mitigated by aggressively allocating a large chunk of space whenever growing the physical file.

Currently, the filesystem of choice is XFS, an extent based filesystem providing efficient file preallocation.
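
As a rough illustration of that preallocation idea (not Facebook's actual code), the sketch below grows a store file in large contiguous chunks on a Linux host using posix_fallocate; the chunk size and the Linux assumption are illustrative.

    import os

    CHUNK = 1 * 1024**3  # grow the store file 1 GiB at a time (illustrative value)

    def grow_store_file(fd: int, current_size: int) -> int:
        # Aggressively reserve one large contiguous chunk so an extent-based
        # filesystem such as XFS can describe the file with very few extents.
        os.posix_fallocate(fd, current_size, CHUNK)
        return current_size + CHUNK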

Haystack Object Store

Haystack is a simple log structured (append-only) object store containing needles representing the stored objects. A Haystack consists of two files – the actual haystack store file containing the needles, plus an index file. The following figure shows the layout of the haystack store file:

The first 8KB of the haystack store is occupied by the superblock. Immediately following the superblock are needles, with each needle consisting of a header, the data, and a footer:

A needle is uniquely identified by its (Offset, Key, Alternate Key, Cookie) tuple, where the offset is the needle offset in the haystack store. Haystack doesn’t put any restriction on the values of the keys, and there can be needles with duplicate keys. The following figure shows the layout of the index file:

There is a corresponding index record for each needle in the haystack store file, and the order of the needle index records must match the order of the associated needles in the haystack store file. The index file provides the minimal metadata required to locate a particular needle in the haystack store file. Loading and organizing index records into a data structure for efficient lookup is the responsibility of the Haystack application (Photo Store in our case). The index file is not critical, as it can be rebuilt from the haystack store file if required. The main purpose of the index is to allow quick loading of the needle metadata into memory without traversing the larger Haystack store file, since the index is usually less than 1% the size of the store file.
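
Since the note's layout figures are not reproduced here, the following sketch encodes one plausible needle and index-record layout with Python's struct module. The field order, widths, magic numbers and the CRC32 checksum are assumptions made for illustration, not Haystack's actual on-disk format.

    import struct
    import zlib

    # Hypothetical layouts; only the set of fields follows the note.
    NEEDLE_HEADER = struct.Struct("<IQQIBI")  # magic, cookie, key, alt_key, flags, data_size
    NEEDLE_FOOTER = struct.Struct("<II")      # magic, checksum of the data
    INDEX_RECORD  = struct.Struct("<QIQI")    # key, alt_key, offset in store file, data_size

    HEADER_MAGIC = 0x4E45444C   # arbitrary example values
    FOOTER_MAGIC = 0x464F4F54
    FLAG_DELETED = 0x01

    def pack_needle(cookie, key, alt_key, data, flags=0):
        # Serialize one needle: header, the photo bytes, then a footer with a checksum.
        header = NEEDLE_HEADER.pack(HEADER_MAGIC, cookie, key, alt_key, flags, len(data))
        footer = NEEDLE_FOOTER.pack(FOOTER_MAGIC, zlib.crc32(data))
        return header + data + footer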

Haystack Write Operation

A Haystack write operation synchronously appends new needles to the haystack store file. After the needles are committed to the larger Haystack store file, the corresponding index records are then written to the index file. Since the index file is not critical, the index records are written asynchronously for faster performance.

The index file is also periodically flushed to the underlying storage to limit the extent of the recovery operations caused by hardware failures. In the case of a crash or a sudden power loss, the haystack recovery process discards any partial needles in the store and truncates the haystack store file to the last valid needle. Next, it writes missing index records for any trailing orphan needles at the end of the haystack store file.

Haystack doesn’t allow overwrite of an existing needle offset, so if a needle’s data needs to be modified, a new version of it must be written using the same (Key, Alternate Key, Cookie) tuple. Applications can then assume that among the needles with duplicate keys, the one with the largest offset is the most recent one.
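
A minimal sketch of the write path, reusing the hypothetical pack_needle and INDEX_RECORD definitions above: the needle is appended and fsync'd synchronously, and the index record is then written (the real system defers index writes, since the index can always be rebuilt).

    import os

    def haystack_write(store_f, index_f, cookie, key, alt_key, data):
        # Append-only: the new needle always goes at the end of the store file.
        offset = store_f.seek(0, os.SEEK_END)
        store_f.write(pack_needle(cookie, key, alt_key, data))
        store_f.flush()
        os.fsync(store_f.fileno())      # needle is durable before we acknowledge
        # Index records are not critical; here they are written inline for brevity.
        index_f.write(INDEX_RECORD.pack(key, alt_key, offset, len(data)))
        return offset                   # for duplicate keys, the largest offset wins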

Haystack Read Operation

The parameters passed to the haystack read operation include the needle offset, key, alternate key, cookie and the data size. Haystack then adds the header and footer lengths to the data size and reads the whole needle from the file. The read operation succeeds only if the key, alternate key and cookie match the ones passed as arguments, if the data passes checksum validation, and if the needle has not been previously deleted (see below).
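
Continuing the same hypothetical layout, a read is one seek plus a single sequential read of header, data and footer, followed by the validation steps the note lists:

    def haystack_read(store_f, offset, key, alt_key, cookie, data_size):
        store_f.seek(offset)
        raw = store_f.read(NEEDLE_HEADER.size + data_size + NEEDLE_FOOTER.size)
        _, n_cookie, n_key, n_alt, flags, n_size = NEEDLE_HEADER.unpack_from(raw)
        data = raw[NEEDLE_HEADER.size:NEEDLE_HEADER.size + n_size]
        _, checksum = NEEDLE_FOOTER.unpack_from(raw, NEEDLE_HEADER.size + n_size)
        if flags & FLAG_DELETED:
            raise LookupError("needle has been deleted")
        if (n_key, n_alt, n_cookie) != (key, alt_key, cookie):
            raise LookupError("key/alternate key/cookie mismatch")
        if zlib.crc32(data) != checksum:
            raise IOError("checksum validation failed")
        return data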

Haystack Delete Operation

The delete operation is simple – it marks the needle in the haystack store as deleted by setting a “deleted” bit in the flags field of the needle. However, the associated index record is not modified in any way so an application could end up referencing a deleted needle. A read operation for such a needle will see the “deleted” flag and fail the operation with an appropriate error. The space of a deleted needle is not reclaimed in any way. The only way to reclaim space from deleted needles is to compact the haystack (see below).
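
Under the same assumed layout, a delete is an in-place update of a single flags field; nothing else on disk changes:

    def haystack_delete(store_f, offset):
        store_f.seek(offset)
        fields = list(NEEDLE_HEADER.unpack(store_f.read(NEEDLE_HEADER.size)))
        fields[4] |= FLAG_DELETED            # set the "deleted" bit in the flags field
        store_f.seek(offset)
        store_f.write(NEEDLE_HEADER.pack(*fields))
        store_f.flush()                      # space is not reclaimed here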

Photo Store Server

Photo Store Server is responsible for accepting HTTP requests and translating them to the corresponding Haystack store operations. In order to minimize the number of I/Os required to retrieve photos, the server keeps an in-memory index of all photo offsets in the haystack store file. At startup, the server reads the haystack index file and populates the in-memory index. With hundreds of millions of photos per node (and the number will only grow with larger capacity drives), we need to make sure that the index will fit into the available memory. This is achieved by keeping a minimal amount of metadata in memory, just the information required to locate the images.
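
A sketch of that startup step under the same assumptions: the (much smaller) index file is scanned once and turned into an in-memory map from (key, alternate key) to (offset, size). A plain Python dict stands in here for the compact hash table described below.

    def load_index(index_path):
        index = {}
        with open(index_path, "rb") as f:
            while True:
                raw = f.read(INDEX_RECORD.size)
                if len(raw) < INDEX_RECORD.size:
                    break
                key, alt_key, offset, size = INDEX_RECORD.unpack(raw)
                index[(key, alt_key)] = (offset, size)  # later records override earlier ones
        return index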

When a user uploads a photo, it is assigned a unique 64-bit id. The photo is then scaled down to 4 different sizes. Each scaled image has the same random cookie and 64-bit key, and the logical image size (large, medium, small, thumbnail) is stored in the alternate key. The upload server then calls the photo store server to store all four images in the Haystack.
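
A sketch of that upload flow; photo_store.write and scale are hypothetical helpers, and the numeric alternate-key values are assumptions:

    import secrets

    ALT_KEYS = {"large": 0, "medium": 1, "small": 2, "thumbnail": 3}  # assumed encoding

    def store_upload(photo_store, photo_id, original_bytes, scale):
        # All four scaled copies share the same 64-bit key (the photo id) and the
        # same random cookie; only the alternate key (the logical size) differs.
        cookie = secrets.randbits(64)
        for size_name, alt_key in ALT_KEYS.items():
            photo_store.write(photo_id, alt_key, cookie, scale(original_bytes, size_name))
        return photo_id, cookie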

The in-memory index keeps the following information for each photo:

Haystack uses the open source Google sparse hash data structure to keep the in-memory index small, since it only has 2 bits of overhead per entry.

Photo Store Write/Modify Operation

A write operation writes photos to the haystack and updates the in-memory index with the new entries. If the index already contains records with the same keys, then this is a modification of existing photos, and only the offsets in the index records are modified to reflect the location of the new images in the haystack store file. Photo Store always assumes that if there are duplicate images (images with the same key), it is the one stored at the larger offset which is valid.

Photo Store Read Operation

The parameters passed to a read operation include the haystack id and a photo key, size and cookie. The server performs a lookup in the in-memory index based on the photo key and retrieves the offset of the needle containing the requested image. If found, it calls the haystack read operation to get the image. As noted above, the haystack delete operation doesn’t update the haystack index file record. Therefore a freshly populated in-memory index can contain stale entries for previously deleted photos. A read of a previously deleted photo will fail, and the in-memory index is updated to reflect that by setting the offset of the particular image to zero.
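
A sketch of that lookup, with haystack.read standing in for the Haystack read operation above; an offset of zero marks an image already known to be deleted, and a failed read poisons the stale entry the same way:

    def photo_store_read(index, haystack, key, alt_key, cookie):
        entry = index.get((key, alt_key))
        if entry is None or entry[0] == 0:
            raise LookupError("photo not found or deleted")
        offset, size = entry
        try:
            return haystack.read(offset, key, alt_key, cookie, size)
        except LookupError:
            # Stale entry for a photo deleted before this index was loaded:
            # remember the deletion by zeroing the offset, then fail the request.
            index[(key, alt_key)] = (0, size)
            raise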

Photo Store Delete Operation

After calling the haystack delete operation the in-memory index is updated by setting the image offset to zero signifying that the particular image has been deleted.

Compaction

Compaction is an online operation which reclaims the space used by the deleted and duplicate needles (needles with the same key). It creates a new haystack by copying needles while skipping any duplicate or deleted entries. Once done it swaps the files and in-memory structures.
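
A sketch of compaction under the same assumptions, reusing the earlier hypothetical helpers; because the in-memory index keeps only the newest offset per key, iterating over it naturally drops duplicates, and deleted needles are skipped:

    def compact(old_index, read_raw_needle, new_store_f, new_index_f):
        # read_raw_needle(offset) -> (cookie, data) is a hypothetical helper that
        # raises LookupError if the needle on disk carries the "deleted" flag.
        new_index = {}
        for (key, alt_key), (offset, size) in old_index.items():
            if offset == 0:
                continue                      # already known to be deleted
            try:
                cookie, data = read_raw_needle(offset)
            except LookupError:
                continue                      # deleted on disk, drop it
            new_offset = haystack_write(new_store_f, new_index_f, cookie, key, alt_key, data)
            new_index[(key, alt_key)] = (new_offset, len(data))
        return new_index                      # afterwards the files and in-memory index are swapped in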

HTTP Server

The HTTP framework we use is the simple evhttp server provided with the open source libevent library. We use multiple threads, with each thread being able to serve a single HTTP request at a time. Because our workload is mostly I/O bound, the performance of the HTTP server is not critical.

Summary

Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.

Peter Vajgel, Doug Beaver and Jason Sobel are infrastructure engineers at Facebook.
