How to Use Columnstore Indexes to Improve Your Data Warehouse Staging Environment

My team and I were recently tasked with refactoring older data marts, particularly those that were designed with SQL Server 2008 in mind. SQL Server has changed significantly since that release, and one of those changes is the introduction of columnstore as an alternative to the traditional B-tree (rowstore) index. Whilst most of the existing documentation on columnstore seems to focus on its benefits for data warehouse workloads, in this article I argue that columnstore indexes should not be limited to fact and dimension tables; we should introduce them in our data warehouse staging environments too.

Data Warehouse Staging Environment

The staging environment is an important aspect of the data warehouse that is usually located between the source system and a data mart. It is used to temporarily store data extracted from source systems and is also used to conduct data transformations prior to populating a data mart.

For demo purposes, the source data used in this article comes from a CustomerOrders sample database that I received after attending Uwe Ricken’s Johannesburg SQL Saturday Precon event. Figure 2 highlights the properties of the CustomerOrders table that will be used as source data – which is basically a heap table with 2 million rows.

A preview of the CustomerOrders table is shown in Figure 3.

To demonstrate the benefits of columnstore in a staging environment, we have to compare it against its rowstore counterpart. Our sample source data will therefore be extracted into two staging databases: one that uses rowstore (SQLShack_RB) and another that uses columnstore (SQLShack_CB). Initially, the sizes of the data and log files for the two databases are similar, as shown in Figure 4.

The two databases each have a table that will be used to store the source data. As shown in Figure 5, the table in SQLShack_RB has a primary key (clustered index), whereas the table in SQLShack_CB has a clustered columnstore index.
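
As a rough sketch of the setup in Figure 5, the two staging tables could be created along the following lines. The column names are taken from the sample rows shown later in Table 2, and STG_CustomerOrders_RB is the table name used in Script 1. The data types, the choice of Id as the primary key, the constraint and index names, and the columnstore table name STG_CustomerOrders_CB are assumptions made for illustration. A clustered columnstore index requires SQL Server 2014 or later.

```sql
-- Rowstore staging table: the primary key creates a clustered B-tree index.
-- Data types and the key column are assumptions, not taken from the article.
USE SQLShack_RB;
GO
CREATE TABLE dbo.STG_CustomerOrders_RB
(
    Id            INT         NOT NULL,
    Customer_Id   INT         NOT NULL,
    OrderNumber   VARCHAR(20) NOT NULL,
    InvoiceNumber VARCHAR(20) NOT NULL,
    CONSTRAINT PK_STG_CustomerOrders_RB PRIMARY KEY CLUSTERED (Id)
);
GO

-- Columnstore staging table: same columns, no primary key,
-- stored entirely as a clustered columnstore index.
-- The table name STG_CustomerOrders_CB is hypothetical.
USE SQLShack_CB;
GO
CREATE TABLE dbo.STG_CustomerOrders_CB
(
    Id            INT         NOT NULL,
    Customer_Id   INT         NOT NULL,
    OrderNumber   VARCHAR(20) NOT NULL,
    InvoiceNumber VARCHAR(20) NOT NULL
);
GO
CREATE CLUSTERED COLUMNSTORE INDEX CCI_STG_CustomerOrders_CB
    ON dbo.STG_CustomerOrders_CB;
GO
```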

Data Warehouse Staging SSIS Packages

The source data will be extracted using a SQL Server Integration Services (SSIS) package. Figure 6 shows the basic setup of the package, which simply uses a data flow task with OLE DB source and destination components.
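
For readers without SSIS at hand, a roughly equivalent load can be sketched in plain T-SQL. This is not the package used in the article, and its timings would not necessarily match Figure 7. The source name [CustomerOrders].[dbo].[CustomerOrders] (including the dbo schema), the column list, and the target table name STG_CustomerOrders_CB are all assumptions.

```sql
-- Hypothetical T-SQL stand-in for the SSIS data flow (source to staging).
-- Object and column names are assumed; adjust them to your own schema.
INSERT INTO [SQLShack_CB].[dbo].[STG_CustomerOrders_CB] WITH (TABLOCK)
       (Id, Customer_Id, OrderNumber, InvoiceNumber)
SELECT Id, Customer_Id, OrderNumber, InvoiceNumber
FROM   [CustomerOrders].[dbo].[CustomerOrders];
```

The same statement pointed at the rowstore table in SQLShack_RB would give the other half of the comparison.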

Data Warehouse Staging: Rowstore vs Columnstore

Data Extraction and Load

One of the differences you will notice between rowstore and columnstore is the total time each package takes to run. As shown in Figure 7, writing the source data into the table with the columnstore index took almost double the time it took to load the rowstore staging table.

Figure 7: SSIS package execution results

Furthermore, while the data was being written into the two tables, I ran a SQL Server trace; the results are shown in Table 1, which breaks down what was happening in the backend as the two SSIS packages executed. Evidently, writing data into the columnstore table generated more wait types and used more CPU and memory than writing to the rowstore staging table.

Database	Wait types	CPU	Reads	Writes	Physical reads	Used memory
SQLShack_RB	WRITELOG	84	36,709	235	0	627
SQLShack_CB	WRITELOG, PREEMPTIVE_OS_WRITEFILEGATHER, PREEMPTIVE_OS_FLUSHFILEBUFFERS	14,179	3,424,278	17,103	40	28,654

Table 1

However, the fact that the columnstore SSIS package took longer to run is not necessarily a bad thing. For instance, let's simulate storing the first 3 rows shown in Figure 3. The rowstore writes such data one page at a time, and in the end the data is stored as shown in Table 2. Columnstore works differently: for every source column it allocates a segment, and the values of a given row are then stored in their respective segments, as shown in Figure 8.

	Page 1
Id	Customer_Id	OrderNumber	InvoiceNumber
1	74500	98043-GJND	2044-50211
2	25155	49524-BSHG	2046-98664
3	60970	98628-NUVG	2049-16668

Table 2: Typical rowstore data storage

The benefit of the columnstore approach is that data is organised by column, so each segment holds values of a single data type, which makes it easier to compress.

Figure 8: Typical columnstore data storage
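
To see the per-column segments on a real table, the sys.column_store_segments catalog view can be queried. The sketch below assumes the columnstore staging table is called dbo.STG_CustomerOrders_CB, a name that does not appear in the article.

```sql
USE SQLShack_CB;
GO
-- One row per column segment of the (assumed) columnstore staging table.
SELECT  s.column_id,
        s.segment_id,
        s.row_count,
        s.on_disk_size        -- size of the compressed segment in bytes
FROM    sys.column_store_segments AS s
JOIN    sys.partitions AS p
        ON p.hobt_id = s.hobt_id
WHERE   p.object_id = OBJECT_ID(N'dbo.STG_CustomerOrders_CB')
ORDER BY s.column_id, s.segment_id;
```

Each column_id should appear once per row group, which is the physical counterpart of the layout in Figure 8.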

Now that we have an understanding of what happens when data is being written and stored into a columnstore, it is not surprising that the table using a columnstore index used half the disk space compared to its rowstore counterpart as shown in Figure 9.

Figure 9
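
Page counts like those in Figure 9 can be collected from the catalog views. The query below is one possible way to do it and is not necessarily the one used to produce the figure. It is written for the rowstore table; the name of the columnstore table to substitute in SQLShack_CB is left to the reader, since the article does not give it.

```sql
USE SQLShack_RB;   -- repeat in SQLShack_CB with the columnstore table name
GO
SELECT  OBJECT_NAME(p.object_id)       AS table_name,
        SUM(au.total_pages)            AS total_pages,
        SUM(au.used_pages)             AS used_pages,
        SUM(au.used_pages) * 8 / 1024  AS used_mb      -- pages are 8 KB each
FROM    sys.partitions AS p
JOIN    sys.allocation_units AS au
        -- in-row and row-overflow units map to hobt_id, LOB units to partition_id
        ON (au.type IN (1, 3) AND au.container_id = p.hobt_id)
        OR (au.type = 2       AND au.container_id = p.partition_id)
WHERE   p.object_id = OBJECT_ID(N'dbo.STG_CustomerOrders_RB')
GROUP BY p.object_id;
```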

Reading data out of the Staging tables

In addition to the reduced storage space, another benefit of a columnstore index is faster reads. For instance, let's say the CustomerOrders data staged in the two staging databases is used to populate a fictitious FactOrders table. As part of loading the FactOrders table, we need to look up the list of invoice numbers that have so far been issued to a given customer. A T-SQL query that can be used to obtain such a list from the rowstore staging table is shown in Script 1.

```
-- Invoice numbers issued so far to one customer (rowstore staging table)
SELECT [InvoiceNumber]
FROM [SQLShack_RB].[dbo].[STG_CustomerOrders_RB]
WHERE [Customer_Id] = 29248;
```
Script 1

The results of tracing the execution of Script 1 are shown in Figure 10. You will notice that the Reads value of 47840 is close to the used_pages value (47497) shown in Figure 9. This means that the entire rowstore staging table was scanned to return the result set. Depending on the size of the table you are querying, a full table scan is not ideal as it can consume a lot of resources.

Figure 10
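
If setting up a trace is not convenient, similar read and CPU figures can be obtained per statement with SET STATISTICS. This is an alternative to the trace used in the article, so the numbers it reports may not match Figure 10 exactly.

```sql
-- Report logical reads and CPU/elapsed time for each statement
-- in the Messages tab of the query window.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT [InvoiceNumber]
FROM [SQLShack_RB].[dbo].[STG_CustomerOrders_RB]
WHERE [Customer_Id] = 29248;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```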

However, if we adapt Script 1 so that it runs against the columnstore staging table, we get the results shown in Figure 11. You can easily see that reading data out of the columnstore table took significantly less time, used less CPU and read fewer pages than the output shown in Figure 10.

Figure 11

Finally, be aware that the ideal scenario for querying columnstore tables is a query whose filters result in retrieving values from a single column (e.g. InvoiceNumber). Conversely, the more columns you specify in your SELECT clause, the more pages have to be read and the longer the query will run. Thus, remember that columnstore is not a silver bullet for whatever issues you are having in your data warehouse environment; if you plan to replace rowstore staging tables with columnstore, you have to change the way you read data from staging tables.
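
The difference between a narrow and a wide query can be checked with a pair of statements such as the following. The columnstore table name STG_CustomerOrders_CB is an assumption, and no expected figures are given here because they depend on the data and the SQL Server version.

```sql
SET STATISTICS IO ON;

-- Narrow: only the segments for Customer_Id and InvoiceNumber are needed.
SELECT [InvoiceNumber]
FROM [SQLShack_CB].[dbo].[STG_CustomerOrders_CB]
WHERE [Customer_Id] = 29248;

-- Wide: the segments of every column have to be read.
SELECT *
FROM [SQLShack_CB].[dbo].[STG_CustomerOrders_CB]
WHERE [Customer_Id] = 29248;

SET STATISTICS IO OFF;
```

Comparing the reads reported for the two statements shows how much each additional column costs on a given table.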

Conclusion

In this article we have demonstrated that the data warehouse staging environment can benefit from columnstore indexing, as it has several advantages over rowstore. Although it takes longer to write data into a columnstore staging table, columnstore indexing uses less disk space and compresses the data in a manner that makes it faster to later retrieve data out of that staging table.

Translated from: https://www.sqlshack.com/use-columnstore-index-improve-your-data-warehouse-staging-nvironment/
