1. RAID Principles and Concepts

1.1 Basic RAID Concepts

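RAID stands for Redundant Array of Independent Disks (originally "Inexpensive Disks"). It was proposed in 1987 by researchers at the University of California, Berkeley. In essence, RAID is a multi-disk management technique: several independent disk drives are combined into one disk subsystem that offers higher storage performance and better data redundancy than a single disk. Part of the physical storage space is used to record the redundant information needed to rebuild user data, and data is read and written on multiple disks concurrently to improve the I/O performance of the storage system.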

Figure 1.1 Basic RAID architecture

1.2 Basic RAID Principles

1.2.1 Mirroring


Figure 1.2.1 Mirrored read and write

1.2.2 Data Striping

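A RAID array consists of multiple disks. Data striping distributes data evenly across those disks in blocks so that the data can be processed concurrently. The size of the block assigned to a single disk is called the stripe granularity, and it always contains a whole number of sectors. Reads and writes can then proceed on several disks at the same time, producing very high aggregate I/O, which improves overall I/O performance and scales close to linearly with the number of disks. The benefit is most pronounced for large amounts of data: without striping, data can only be stored sequentially on the disks of the array and read back sequentially when needed, whereas striping can deliver several times the performance of sequential access.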

Figure 1.2.2 Data striping

For example, in a four-disk system using only disk striping (as in RAID level 0), segment 1 is written to disk 1, segment 2 is written to disk 2, and so on. Disk striping enhances performance because multiple drives are accessed simultaneously, but it does not provide data redundancy.

Figure 1.2.3 RAID 0 data striping
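The round-robin placement described above can be sketched in a few lines of Python. This is only an illustration: the function names `locate_strip`, `stripe` and `unstripe` are made up for this example, disks are modelled as in-memory lists, and a real controller works on sectors rather than Python byte strings.

```python
from typing import List, Tuple


def locate_strip(strip_number: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical strip number to (disk index, strip index on that disk)."""
    return strip_number % num_disks, strip_number // num_disks


def stripe(data: bytes, num_disks: int, strip_size: int) -> List[List[bytes]]:
    """Split data into strips and deal them out to the disks round-robin."""
    disks: List[List[bytes]] = [[] for _ in range(num_disks)]
    for n, start in enumerate(range(0, len(data), strip_size)):
        disk, _ = locate_strip(n, num_disks)
        disks[disk].append(data[start:start + strip_size])
    return disks


def unstripe(disks: List[List[bytes]]) -> bytes:
    """Read the strips back row by row to recover the original byte order."""
    out = []
    for row in range(max(len(d) for d in disks)):
        for d in disks:
            if row < len(d):
                out.append(d[row])
    return b"".join(out)


disks = stripe(b"ABCDEFGHIJ", num_disks=4, strip_size=2)
# Disk 0 holds the 1st and 5th strips, disks 1 to 3 hold one strip each.
```

Because consecutive strips land on different disks, a large sequential transfer keeps all four disks busy at once, which is where the aggregate throughput comes from.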

1.2.3 Data Parity


Figure 1.2.4 Data parity

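With data parity, the RAID computes parity while data is being written and stores the resulting parity data on the member disks. Parity data can be kept on one dedicated disk or distributed across several disks, and it can itself be split into blocks; the details differ between RAID levels. When part of the data is lost or corrupted, it can be rebuilt by applying the inverse parity calculation to the remaining data and the parity data. Compared with mirroring, parity saves a great deal of storage overhead. However, every write requires parity calculations, which places heavy demands on compute speed, so a hardware RAID controller is typically used. Rebuilding data from parity is also far more complex and far slower than rebuilding from a mirror.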
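As a minimal sketch of the parity idea, the example below assumes XOR parity (the scheme used by single-parity levels such as RAID 5). The helper name `xor_blocks` and the sample blocks are invented for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR several equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks belonging to one stripe (illustrative values).
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]

# Parity is computed at write time and stored alongside the data.
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: XOR-ing the surviving blocks
# with the parity block reproduces the lost block.
rebuilt = xor_blocks([data[0], data[2], parity])
```

The same function serves both for generating parity and for rebuilding, because XOR is its own inverse. The rebuild has to read every surviving block of the stripe, which is one reason parity rebuilds are slower than copying from a mirror.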

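1.2.4 RAID Group (RG)

The member drives of a RAID group (RG) are specified by the command that creates the VD. Note in particular that an RG can contain multiple VDs, but all VDs in the same RG must have the same RAID level. The following RG information can be queried: total capacity (Total Capacity), free capacity (Free Capacity), the copyback switch state (CopyBack Switch), the overall cache state of the member drives (PD Cache), the logical block size (Logical Block Size), member drive information (including ENC and SLOT), the hot spare drives it contains, and the overall usage of the RAID group capacity (RAID Group Block Infomation).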

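1.2.5 Virtual Drive (VD)

A virtual drive (VD) is a subset of a RAID group (RG). The following VD information can be queried: RAID level (RAID Level), its identifier in the VD list (ID), the identifier of the RAID group it belongs to (RAID ID), VD status (Status), the reported device type (Map Type), the drive letter under the OS (OS Drive Letter), stripe size (Stripe Size), the Write Cache policy, the Real Write Cache policy, the Read Cache policy, and VD initialization information (VD Initialize).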

1.2.6 Background Initialization

Background initialization is a check for media errors on the drives when you create a virtual drive. It is an automatic operation that starts five minutes after you create the virtual drive. This check ensures that striped data segments are the same on all of the drives in the drive group. Background initialization is similar to a consistency check. The difference between the two is that a background initialization is forced on new virtual drives and a consistency check is not.

1.2.7 Write Cache Policy

Write-back: RAID caching is a cost-effective way to improve I/O performance by writing data to a controller's cache before it is written to disk. Write-back cache improves application performance by storing write data in high-performance cache memory during periods of heavy use. When there is a break in user requests, the data is written from the cache memory to the array. During normal write-back operation, data is written to cache (DRAM), the I/O is acknowledged as "complete" to the application that issued the write, and later the write is flushed to disk. If power is lost while write-back cache is enabled, the writes in DRAM may be lost. Since the controller has already acknowledged the I/O as complete, the application is unaware of the data loss. (This is the recommended setting for HDD-based hardware RAID arrays.)

Write-through: In write-through mode, data is written to the cache and to the disks at the same time, and the I/O is acknowledged only after the data has reached the disks. This process is simpler and more reliable, and it helps with data recovery after a power outage or system failure, because acknowledged data never exists only in the cache. That solves the inconsistency problem. Write-through suits workloads with relatively few write operations. The cost is write latency: every write has to go to two locations (cache and disk), which gives up much of the benefit of having a cache for writes, since the point of a cache is to avoid waiting on the slower medium.
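The difference between the two policies shows up most clearly on power loss. The toy model below is a sketch only: the class `CachedController` and its methods are hypothetical, the cache and disk are plain dictionaries, and there is no battery or flash backup for the cache.

```python
class CachedController:
    """Toy controller with a volatile cache in front of a persistent disk."""

    def __init__(self, policy):
        if policy not in ("write-back", "write-through"):
            raise ValueError("unknown policy: %r" % policy)
        self.policy = policy
        self.cache = {}     # volatile: contents vanish on power loss
        self.dirty = set()  # LBAs held in cache but not yet on disk
        self.disk = {}      # persistent

    def write(self, lba, data):
        self.cache[lba] = data
        if self.policy == "write-through":
            self.disk[lba] = data   # reaches the disk before the ack
        else:
            self.dirty.add(lba)     # ack now, flush later
        return "complete"

    def flush(self):
        """Write all dirty cache entries to disk (the idle-time flush)."""
        for lba in sorted(self.dirty):
            self.disk[lba] = self.cache[lba]
        self.dirty.clear()

    def power_loss(self):
        """Drop everything that only existed in volatile cache."""
        self.cache.clear()
        self.dirty.clear()


wb = CachedController("write-back")
wt = CachedController("write-through")
for controller in (wb, wt):
    controller.write(0, b"payload")   # both report "complete"
    controller.power_loss()
# wb.disk is empty (the acknowledged write is gone); wt.disk still has LBA 0.
```

Calling `flush()` before the power loss would have saved the write-back data, which is exactly the window of exposure the text describes.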

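1.2.8 Disk Sanitization

Disk sanitization is the process of physically obliterating data with specified byte patterns or random data so that recovery of the original data becomes impossible. Using the sanitization process ensures that no one can recover the data on the disks. Sanitize supports three types of erase operation: Block Erase, Overwrite, and Crypto Erase.

Block Erase: erases the data on the SSD at the block level, that is, physically and completely.

Overwrite: overwrites user data with a specific data pattern. Overwrite was first used on HDDs, where data is stored on platters with a magnetic coating and new data can simply be written over the old. In the NVMe SSD era, the Sanitize feature was introduced in NVMe 1.3 and the Overwrite method was carried over. However, SSD storage media differ from HDD media: the basic unit of reading and writing is a page, existing data cannot be overwritten in place, new data can only be written (programmed) after an erase, and erases must be performed on whole blocks. Overwrite therefore causes extra erase cycles, which shortens SSD life.

Crypto Erase: for SSDs that support self-encryption, deletes the encryption key so that the encrypted data becomes unreadable.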

1.2.9 Disk Secure Erase

Secure Erase and Sanitize both securely erase the data on an SSD and reset the SSD to factory settings. After you sanitize or secure erase an SSD, all data on the drive is permanently removed and cannot be recovered. There are, however, some differences between the two methods.

Secure Erase only deletes the mapping table and does not erase the blocks that have been written to. Sanitize deletes the mapping table and also erases all blocks that have been written to. Secure Erase therefore completes faster than Sanitize. Not all SSDs support Sanitize. For example, to find out whether a SanDisk SSD supports Sanitize, check the SanDisk SSD Dashboard.

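2. RAID Levels

2.1 RAID 0:

Strictly speaking, RAID 0 is not RAID at all, because it has neither data redundancy nor parity. RAID 0 only implements striping: it splits data into consecutive blocks and reads and writes them across multiple disks in parallel. Because the blocks are stored on different disks in parallel, RAID 0 achieves a very high data transfer rate. And because all the space on the member disks is available for data, RAID 0 also has the highest storage utilization. RAID 0 is therefore suited only to special applications with very strict speed requirements, such as video/audio signal storage and dumping of temporary files. Since there is no redundancy at all, its safety is very low: if any single disk in the array fails, all data is lost. In other words, the more disks a RAID 0 array has, the lower its safety. RAID 0 is not suitable for mission-critical environments, but it is well suited to video and image production and editing.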
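The claim that more disks mean lower safety can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each disk fails independently with the same probability `p` over some period; neither the function name `raid0_loss_probability` nor the sample numbers come from the original text.

```python
def raid0_loss_probability(p: float, num_disks: int) -> float:
    """Probability that a RAID 0 array loses its data.

    The array survives only if every disk survives, so the loss
    probability is 1 - (1 - p) ** num_disks (independent failures assumed).
    """
    return 1.0 - (1.0 - p) ** num_disks


for n in (1, 2, 4, 8):
    print(n, round(raid0_loss_probability(0.02, n), 4))
```

Under this assumption the loss probability grows with every disk added, since any single failure is enough to destroy the whole array.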

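2.2 RAID 1:

RAID 1 achieves data redundancy mainly through mirroring: the same data is written to paired, separate disks that back each other up. RAID 1 is therefore very safe. It can keep working without interruption even when half of the disks fail, although the processing capacity of the system is affected. Because every write has to be performed twice to keep the mirror disk identical to the original at all times, the load on the disk controller is considerable. RAID 1 also wastes a great deal of space: only half of the disk capacity is usable, which makes it the most expensive of the RAID levels. It is chosen only when very high reliability is required, so RAID 1 is commonly used in applications with very strict fault-tolerance requirements.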

2.3 RAID 3:



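2.4 RAID 5:

RAID 5 is also known as striping with distributed parity. Each stripe reserves the equivalent of one block for parity. Unlike RAID 3, which keeps parity on a single dedicated disk, RAID 5 distributes the parity information across all disks, which greatly reduces the load that a dedicated parity disk would carry. Despite some loss of capacity, RAID 5 offers well-balanced overall performance and is therefore a widely used disk array scheme. It suits I/O-intensive applications with a high read-to-write ratio, such as transaction processing. RAID 5 redundancy requires an array of at least three disks. RAID 5 can be implemented in RAID controller hardware or in software by some network operating systems.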
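How the parity block moves from disk to disk can be sketched as follows. The rotation used here (parity starts on the last disk and shifts one disk to the left per stripe) is just one common arrangement chosen for illustration; actual controllers may rotate parity differently, and the function name `raid5_layout` is made up.

```python
def raid5_layout(num_disks: int, num_stripes: int):
    """Return one row per stripe; each row lists what every disk stores.

    "D<n>" is the n-th data block, "P<s>" is the parity block of stripe s.
    """
    if num_disks < 3:
        raise ValueError("RAID 5 needs at least three disks")
    rows = []
    block = 0
    for stripe in range(num_stripes):
        # Assumed rotation: last disk first, then one disk to the left.
        parity_disk = (num_disks - 1 - stripe) % num_disks
        row = []
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append("P%d" % stripe)
            else:
                row.append("D%d" % block)
                block += 1
        rows.append(row)
    return rows


for row in raid5_layout(num_disks=3, num_stripes=3):
    print(row)
```

Over any run of `num_disks` consecutive stripes, each disk holds exactly one parity block, so parity updates are spread across all disks instead of hitting a single one.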

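2.5 RAID 6:

RAID 6 is an independent-disk structure with two kinds of distributed parity. It extends RAID 5 with a second parity scheme stored on different disks, so it can withstand two drives failing at the same time. RAID 6 was originally put forward by some large vendors as a proprietary RAID level; its full name is "Independent Data disks with two independent distributed parity schemes". Because it was developed from RAID 5, it works in much the same way. The difference is that, for each stripe, RAID 5 writes parity to one drive, whereas RAID 6 writes parity to two drives. This improves fault tolerance and raises the number of drives that may fail to two, but the array correspondingly needs at least four drives.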

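2.6 RAID 10:

RAID 10 is also known as a stripe of mirrors (RAID 1+0) and is often loosely referred to as RAID 0+1. Strictly speaking, RAID 0+1 mirrors two stripe sets, whereas RAID 10 stripes across mirrored pairs. RAID 10 provides 100% data redundancy and supports larger volume sizes. Building a RAID 10 array requires at least four disks: in the minimal configuration, data is striped across two disks, which gives RAID 0 read and write performance, and the other two disks mirror them, which keeps a complete copy of the data. RAID 10 can survive multiple drive failures, provided that both disks of the same mirrored pair do not fail, because redundancy is implemented entirely with whole-disk copies. It balances storage performance and data safety: it provides the same data protection as RAID 1 together with storage performance close to that of RAID 0.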

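2.7 RAID 50:

RAID 50 is known as striping across distributed-parity arrays. It consists of two RAID 5 groups (each with at least three disks), each using distributed parity, and the two groups are then combined into a RAID 0 so that data is striped across them. RAID 50 provides reliable data storage and excellent overall performance, and supports larger volume sizes. Even if two physical disks fail (one in each RAID 5 group), the data can still be recovered. RAID 50 requires at least six drives.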
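The capacity and fault-tolerance figures quoted for each level can be collected into one small helper. This is a sketch under stated assumptions: all disks have the same size, RAID 1 and RAID 10 use two-way mirrors, the minimum of two disks for RAID 0 and RAID 1 is assumed rather than taken from the text, and "tolerated failures" means the worst case (the number of failures that is always survivable). The function name `raid_summary` is hypothetical.

```python
def raid_summary(level: str, num_disks: int, groups: int = 2):
    """Return (usable capacity in disks, failures tolerated in the worst case)."""
    n = num_disks
    if level == "0":
        minimum, usable, tolerated = 2, n, 0
    elif level == "1":
        minimum, usable, tolerated = 2, n // 2, 1
    elif level == "5":
        minimum, usable, tolerated = 3, n - 1, 1
    elif level == "6":
        minimum, usable, tolerated = 4, n - 2, 2
    elif level == "10":
        minimum, usable, tolerated = 4, n // 2, 1
    elif level == "50":
        # One disk's worth of parity per RAID 5 group.
        minimum, usable, tolerated = 3 * groups, n - groups, 1
        if n % groups:
            raise ValueError("disks must divide evenly into RAID 5 groups")
    else:
        raise ValueError("unsupported RAID level: %s" % level)
    if n < minimum:
        raise ValueError("RAID %s needs at least %d disks" % (level, minimum))
    if level in ("1", "10") and n % 2:
        raise ValueError("mirrored levels need an even number of disks")
    return usable, tolerated


for level, n in (("0", 4), ("1", 2), ("5", 4), ("6", 4), ("10", 4), ("50", 6)):
    print("RAID", level, n, "disks ->", raid_summary(level, n))
```

RAID 10 and RAID 50 can survive more than one failure when the failed drives fall in different mirrored pairs or different RAID 5 groups, but a second failure in the same pair or group is fatal, which is why the worst-case figure is 1.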



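3. RAID Data Protection

3.1 Hot Spare and Cold Spare Drives:

How does a fault-tolerant RAID array recover when one of its drives fails? Take a two-drive RAID 1 as an example. If one drive fails, only one drive keeps running and the array enters the Degraded state. This means the array no longer has any redundancy: it can continue to run, but the data is no longer safe, and manual intervention is needed to repair it. You only need to pull out the failed drive and insert a good drive of the same capacity; the RAID 1 array then starts the rebuild automatically. Put simply, the data on the surviving drive is copied to the newly inserted drive. Depending on drive capacity, the rebuild can take from around ten hours to several tens of hours.

The replacement drive, whether taken from a drawer or newly bought, is inserted into the array by hand. Before the failure it sat unpowered in the drawer; such a drive is called a cold spare. Can the array instead find a good drive by itself to replace the failed one? Yes, by using a hot spare.

Put simply, after the RAID array has been built, one or more additional drives of the same capacity as the array members are inserted and set to Hot Spare mode. While the array is healthy these drives sit idle: they store no data and receive no read or write access. As soon as a member drive fails and the array becomes Degraded, the RAID controller immediately activates the hot spare and starts rebuilding the array.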


The hot spare can be of two types:

• Global hot spare

Use a global hot spare drive to replace any failed drive in a redundant drive group as long as its capacity is equal to or larger than the coerced capacity of the failed drive. A global hot spare defined on any channel should be available to replace a failed drive on both channels.

Global hot spares can be created without first creating a logical drive. If all logical drives are deleted, global hot spares become unconfigured good.

• Dedicated hot spare (a local hot spare that serves only the RAID group it has been assigned to)

Use a dedicated hot spare to replace a failed drive only in a selected drive group. One or more drives can be designated as members of a spare drive pool. The most suitable drive from the pool is selected for failover. A dedicated hot spare is used before one from the global hot spare pool.

Hot spare drives can be located on any RAID channel. Standby hot spares (not being used in a RAID drive group) are polled every 60 seconds at a minimum, and their status is made available in the drive group management software. RAID controllers offer the ability to rebuild with a disk that is in a system but not initially set to be a hot spare.

Observe the following parameters when using hot spares:

• Hot spares are used only in drive groups with redundancy: RAID levels 1, 5, 6, 10, 50, and 60.

• A hot spare connected to a specific RAID controller can be used to rebuild a drive that is connected only to the same controller.

• You must assign the hot spare to one or more drives through the controller BIOS or must use drive group management software to place it in the hot spare pool.

• A hot spare must have free space equal to or greater than the drive it replaces.

For example, to replace a 500-GB drive, the hot spare must be 500-GB or larger.

• A dedicated hot spare becomes a global hot spare if all the logical drives in the drive group that the hot spare is dedicated to are deleted (the drive group is deleted).

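3.2 Precopy:


  1. During normal operation, drive status is monitored in real time.
  2. When a drive is suspected of failing, the data on it is copied to the hot spare.
  3. After the copy completes, if a new drive replaces the failing drive, the data is migrated back to the new drive (copyback).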

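3.3 Rebuild:


  1. A member drive in the array has failed or its data has become invalid.
  2. The array has a hot spare configured that is not in use by another RAID group, or a new drive has replaced the failed drive.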

3.4 Copyback:

If a member drive of a RAID array with redundancy becomes faulty, the hot spare drive automatically replaces the failed drive and starts data synchronization (this is the rebuild process). After a new data drive is installed to replace the faulty one, data is copied from the hot spare drive to the new data drive. Once the copyback is complete, the hot spare drive returns to its hot spare state.
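The rebuild-then-copyback sequence can be traced with a toy two-drive mirror. Everything here is illustrative: the class `MirrorWithSpare`, the drive names, and the dictionary-based "drives" are invented for the sketch and do not correspond to any controller interface.

```python
class MirrorWithSpare:
    """Toy RAID 1 with one hot spare; each drive is a dict of LBA -> data."""

    def __init__(self, data):
        self.members = {"disk0": dict(data), "disk1": dict(data)}
        self.spares = {"spare0": {}}
        self.active_spare = None

    def fail(self, name):
        """A member fails: the hot spare takes over and is rebuilt."""
        del self.members[name]
        survivor = next(iter(self.members.values()))
        spare_name, spare = self.spares.popitem()  # KeyError if no spare left
        spare.update(survivor)                     # rebuild from the survivor
        self.members[spare_name] = spare
        self.active_spare = spare_name

    def replace_failed(self, new_name):
        """A new drive is installed: copy back, then release the spare."""
        spare = self.members.pop(self.active_spare)
        self.members[new_name] = dict(spare)       # copyback to the new drive
        spare.clear()
        self.spares[self.active_spare] = spare     # spare is idle again
        self.active_spare = None


array = MirrorWithSpare({0: b"boot", 1: b"data"})
array.fail("disk1")             # members: disk0 + spare0 (rebuilt)
array.replace_failed("disk2")   # members: disk0 + disk2, spare0 back in the pool
```

After the copyback the array again consists of two regular members and an empty hot spare, ready for the next failure.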

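3.5 Consistency Check (ccheck):

A consistency check verifies that the redundancy data agrees across the member drives of a virtual drive in a RAID group with a redundant RAID level. It applies to the RAID levels that provide redundancy (RAID 1, RAID 5, RAID 6, RAID 10, RAID 50, and RAID 60) and checks the data in the RAID group for consistency. RAID 0 has no consistency check.

For mirror-based levels such as RAID 1 and RAID 10, if the consistency check finds that the primary and mirror member drives disagree, the inconsistency is recorded but no data is rewritten, because the RAID controller cannot tell which copy is correct. For RAID 5, RAID 6, RAID 50, and RAID 60, the consistency check reads the data from each member drive and computes the parity; if the result differs from the stored parity data, the newly computed value overwrites the stored parity.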
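The two behaviours (report only for mirrors, recompute and overwrite for parity) can be sketched as follows. This is a simplified model: XOR parity is assumed, parity is kept in a separate list instead of being distributed across the members, and the function names `check_mirror` and `check_parity` are hypothetical.

```python
from functools import reduce


def check_mirror(primary, mirror):
    """Return the stripe indices where the two copies differ.

    Nothing is rewritten: the checker cannot know which copy is right.
    """
    return [i for i, (a, b) in enumerate(zip(primary, mirror)) if a != b]


def check_parity(data_disks, parity):
    """Recompute XOR parity per stripe and overwrite stored parity on mismatch.

    Returns the indices of the stripes whose parity was rewritten.
    """
    repaired = []
    for i in range(len(parity)):
        blocks = [disk[i] for disk in data_disks]
        expected = bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))
        if parity[i] != expected:
            parity[i] = expected
            repaired.append(i)
    return repaired


mismatches = check_mirror([b"a", b"b"], [b"a", b"x"])

data_disks = [[b"\x01", b"\x02"], [b"\x03", b"\x04"]]
parity = [b"\x02", b"\x00"]          # the parity of stripe 1 is stale
repaired = check_parity(data_disks, parity)
```

Note the asymmetry: the parity check trusts the data blocks and regenerates parity from them, so it restores consistency but cannot detect which data block, if any, was actually wrong.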

3.6 Patrol Read (patrolread):

Patrol read checks the blocks on the drives. It reviews your system for possible drive errors that could lead to a drive failure and then takes action to correct the errors. The goal is to protect data integrity by detecting drive failure before the failure can damage data. The corrective actions depend on the drive group configuration and the type of errors.

Patrol read cannot be performed on a drive that has any of the following operations in progress:

  • RAID hot spare drive recovery
  • Dynamic drive expansion
  • Full or background initialization
  • Consistency check

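Compared with ccheck: a consistency check tests only the parts of the drives in the array that contain data and parity information, not the empty regions (which hold neither), whereas patrol read checks every sector of the member drives. As drive capacities keep growing, patrol read becomes more important than the consistency check. Patrol read can find errors in regions that currently hold no data or parity information. Such errors would otherwise go unnoticed until those regions are tested or written (note: the benefit is that reads and writes to bad regions are avoided). If enough such regions are read or written within a short time, the limit on the number of bad regions is reached and the RAID fails (especially while it is in the Rebuild state).

In terms of functionality, patrol read can do most of what a consistency check does: inconsistencies caused by bad reads and writes or by drive errors can also be detected by patrol read.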

3.7 Replace:

The Replace operation lets you copy data from a source drive into a destination drive that is not a part of the virtual drive. The Replace operation often creates or restores a specific physical configuration for a drive group, for example, a specific arrangement of drive group members on the device I/O buses. You can run a Replace operation automatically or manually.

Typically, when a drive fails or is expected to fail, the data is rebuilt on a hot spare. The failed drive is replaced with a new disk. Then the data is copied from the hot spare to the new drive, and the hot spare reverts from a rebuild drive to its original hot spare status.

A Replace operation is also initiated when the first SMART error occurs on a drive that is part of a virtual drive. The destination drive is a hot spare that qualifies as a rebuild drive. The drive that has the SMART error is marked as failed only after the successful completion of the Replace operation. This situation avoids putting the drive group in a Degraded status.

The Replace operation runs as a background activity, and the virtual drive is still available online to the host.

3.8 Drive States:

Online: A drive that can be accessed by the RAID controller and is part of the virtual drive.

Unconfigured Good: A drive that is functioning normally but is not configured as a part of a virtual drive or as a hot spare.

Hot Spare: A drive that is powered up and ready for use as a spare in case an online drive fails.

Fault: A drive that was originally configured as Online or Hot Spare, but on which the firmware detects an unrecoverable error.

Unconfigured Bad: A drive on which the firmware detects an unrecoverable error; the drive was Unconfigured Good or the drive could not be initialized.

Offline: A drive that is part of a virtual drive but which has invalid data as far as the RAID configuration is concerned.

3.9 Emergency Hot Spare:

After the emergency spare function is enabled for a RAID array that supports redundancy and has no hot spare drive specified, a drive in the Unconfigured Good state will automatically replace a failed member drive and rebuild data to avoid data loss.


