行存储索引改换成列存储索引

My team and I were recently tasked with refactoring older data marts, particularly those that were created with SQL Server 2008 in mind. As we all know, SQL Server has undergone significant changes since the release of SQL Server 2008. One of those changes relates to the introduction of columnstore as an alternative to the traditional B-tree index (rowstore). Whilst most of the existing documentation relating to columnstore seem to focus on the benefit of columnstore against data warehouse workloads, in this article I argue that the usage of columnstore index should not be limited to facts and dimensions instead let’s introduce it in our data warehouse staging environments too.

我和我的团队最近的任务是重构较旧的数据集市,尤其是那些考虑到SQL Server 2008的数据集市。 众所周知,自SQL Server 2008发布以来,SQL Server进行了重大更改。这些更改之一涉及引入列存储,以替代传统的B树索引(行存储)。 尽管与列存储有关的大多数现有文档似乎都侧重于列存储相对于数据仓库工作负载的好处,但在本文中,我认为列存储索引的用法不应仅限于事实和维度,而应在数据仓库暂存中进行介绍环境。

数据仓库暂存环境 (Data Warehouse Staging Environment)

The staging environment is an important aspect of the data warehouse that is usually located between the source system and a data mart. It is used to temporarily store data extracted from source systems and is also used to conduct data transformations prior to populating a data mart.

登台环境是数据仓库的重要方面,通常位于源系统和数据集市之间。 它用于临时存储从源系统提取的数据,还用于在填充数据集市之前进行数据转换。

For demo purposes, the source data used in this article comes from a CustomerOrders sample database that I received after attending Uwe Ricken’s Johannesburg SQL Saturday Precon event. Figure 2 highlights the properties of the CustomerOrders table that will be used as source data – which is basically a heap table with 2 million rows.

出于演示目的,本文中使用的源数据来自我在参加Uwe Ricken的约翰内斯堡SQL Saturday Precon活动后收到的CustomerOrders示例数据库。 图2突出显示了将用作源数据的CustomerOrders表的属性,该表基本上是具有200万行的堆表。

A preview of CustomerOrders table is shown in Figure 3.

图3显示了CustomerOrders表的预览。

In order to demonstrate the benefits of columnstore in staging environment, we have to compare it against its rowstore counterpart, thus, our sample source data will be extracted into two staging databases; one that uses rowstore (SQLShack_RB) and the other using columnstore (SQLShack_CB). By default, the sizes of data and log files for the two databases are similar, as shown in Figure 4.

为了证明列存储在暂存环境中的好处,我们必须将其与行存储副本进行比较,因此,我们的示例源数据将被提取到两个暂存数据库中。 一个使用行存储( SQLShack_RB ),另一个使用列存储( SQLShack_CB )。 默认情况下,两个数据库的数据和日志文件的大小相似, 如图4所示

The two databases each have a table that will be used to store the source data. As shown in Figure 5, the table in the SQLShack_RB has a primary key (clustered index) whereas SQLShack_CB has a clustered columnstore index.

这两个数据库都有一个表,该表将用于存储源数据。 如图5所示, SQLShack_RB中的表具有主键(聚集索引),而SQLShack_CB具有聚集列存储索引。

数据仓库暂存SSIS包 (Data Warehouse Staging SSIS Packages)

The source data will be extracted using SQL Server Integration Services (SSIS) package. Figure 6 shows a basic setup of the package that will be used – which simply uses a data flow task with an OLE DB source and destination components.

将使用SQL Server集成服务(SSIS)包提取源数据。 图6显示了将使用的程序包的基本设置–该程序仅将数据流任务与OLE DB源和目标组件一起使用。

数据仓库分段:行存储与列存储 (Data Warehouse Staging: Rowstore vs Columnstore)

  1. Data Extraction and Load

    数据提取与加载

    One of the differences that you will notice between rowstore and columnstore is in the total amount of time each packages would have taken to run. As shown in Figure 7, writing source data into the table with columnstore index took almost double the amount of time it took to load the rowstore staging table.

    您将注意到行存储和列存储之间的差异之一是每个程序包运行所花费的总时间。 如图7所示,使用columnstore索引将源数据写入表所花费的时间几乎是加载行存储登台表所花费的时间的两倍。

    Figure 7: SSIS package execution results

    图7:SSIS包执行结果

    Furthermore, as the data was being written into the two tables, I ran a SQL Server trace and the results of that trace are shown in Table 1. Table 1 basically provides a breakdown of what was happening in the backend as the two SSIS packages were executing. Evidently, writing data into the columnstore table generated more wait types, used more CPU and memory than writing to a rowstore staging table.

    此外,当将数据写入两个表中时,我运行了一个SQL Server跟踪,该跟踪的结果显示在表1中表1基本上提供了两个SSIS程序包执行时后端发生的情况的细分。 显然,与写入行存储登台表相比,将数据写入到列存储表中会产生更多的等待类型,使用更多的CPU和内存。

    Database Wait types CPU reads writes Physical reads Used memory
    SQLShack_RB WRITELOG 84 36,709 235 0 627
    SQLShack_CB WRITELOG , PREEMPTIVE_OS_WRITEFILEGATHER, PREEMPTIVE_OS_FLUSHFILEBUFFERS 14,179 3,424,278 17,103 40 28,654
    数据库 等待类型 中央处理器 物理阅读 已用记忆体
    SQLShack_RB 写日志 84 36,709 235 0 627
    SQLShack_CB WRITELOG,PREEMPTIVE_OS_WRITEFILEGATHER,PREEMPTIVE_OS_FLUSHFILEBUFFERS 14,179 3,424,278 17,103 40 28,654

    Table 1

    表格1

    However, just because the columnstore SSIS package took longer to run is not necessary a bad thing. For instance, let’s try to simulate storing the first 3 rows shown Figure 3; the rowstore will write such data one page at a time and at the end the data will be stored as shown in Table 2. Columnstore, however, works differently as for every source column it allocate segments that are then used to store values of a given row into their respective segments as shown Figure 8.

    但是,仅仅因为columnstore SSIS包花费了更长的运行时间就没有必要了。 例如,让我们尝试模拟存储图3所示的前3行; 行存储将一次将这样的数据写入一页,最后将存储数据,如表2所示。 但是,Columnstore的工作方式有所不同,因为它为每个源列分配了一些段,然后这些段用于将给定行的值存储到各自的段中,如图8所示。

    Page 1
    Id Customer_Id OrderNumber InvoiceNumber
    1 74500 98043-GJND 2044-50211
    2 25155 49524-BSHG 2046-98664
    3 60970 98628-NUVG 2049-16668
    第1页
    ID 顾客ID 订单号 发票号码
    1个 74500 98043-GJND 2044-50211
    2 25155 49524-BSHG 2046-98664
    3 60970 98628-NUVG 2049-16668

    Table 2: Typical rowstore data storage

    表2:典型的行存储数据存储

    The benefit of the columnstore approach is that data is organised according to data types thereby making it easier to compress.

    列存储方法的好处是,数据是根据数据类型进行组织的,从而使其更易于压缩。

    Figure 8: Typical columnstore data storage

    图8:典型的列存储数据存储

    Now that we have an understanding of what happens when data is being written and stored into a columnstore, it is not surprising that the table using a columnstore index used half the disk space compared to its rowstore counterpart as shown in Figure 9.

    现在我们已经了解了将数据写入并存储到列存储中时会发生什么,毫不奇怪的是,使用列存储索引的表使用的磁盘空间是行存储对应表的一半, 如图9所示。

    Figure 9

    图9

  2. Reading data out of the Staging tables

    从暂存表中读取数据

    In addition to the reduced storage space, another benefit of columnstore index relates to the improved speed of reading data. For instance, let’s say the CustomerOrders data staged in the two staging databases is used to populate a fictitious FactOrders table. As part of loading FactOrders table, we need to look-up a list of invoice numbers that have so far been issued to a given customer. A T-SQL query that can be used obtain such a list against the rowstore staging table is shown in Script 1.

    除了减少存储空间外,列存储索引的另一个好处还涉及读取数据的速度提高。 例如,假设在两个登台数据库中登台的CustomerOrders数据用于填充虚构的FactOrders表。 作为加载FactOrders表的一部分,我们需要查找到目前为止已发给给定客户的发票编号列表。 脚本1中显示了一个T-SQL查询,该查询可用于针对行存储库暂存表获取此类列表。

    
    SELECT [InvoiceNumber]
    FROM [SQLShack_RB].[dbo].[STG_CustomerOrders_RB]
    WHERE Customer_id=29248

    Script 1

    脚本1

    The results of tracing the execution of Script 1 are shown in Figure 10. You will notice that the Reads value of 47840 is almost similar to the used_pages value (47497) shown in Figure 9. This means that the entire rowstore staging table was scanned as part of returning the result set. Depending on the size of the table you are querying, conducting a full table scan is not ideal as it can consume lots of resources.

    跟踪脚本1执行的结果如图10所示。 您将注意到Reads值47840几乎与图9中显示的used_pa​​ges值(47497)相似。 这意味着作为返回结果集的一部分,已扫描了整个行存储区临时表。 根据要查询的表的大小,进行全表扫描并不理想,因为它会消耗大量资源。

    Figure 10

    图10

    However, if customise the script shown in Script 1 so that it can be run against the columnstore staging table, we get the results shown in Figure 11. You can easily see that reading data out of the columnstore table took significantly lesser time, lesser CPU and read fewer pages compared to the output shown Figure 10.

    但是,如果自定义脚本1中显示的脚本,以便可以针对columnstore临时表运行该脚本,我们将得到如图11所示的结果。 您可以很容易地看到,与图10所示的输出相比,从columnstore表中读取数据花费的时间明显更少,CPU占用更少,页面读取更少。

    Figure 11

    图11

    Finally, just beware that an ideal scenario for querying columnstore tables is when you have filters in your query that results into retrieving values off a single column (i.e. Invoice Number). Conversely, the more columns you specify in your SELECT clause, the more pages would have to be read which means the longer the query will run. Thus, remember that columnstore is not a silver bullet to whatever issues you are having in your data warehouse environment; if you plan to replace rowstore staging tables with columnstore, you have to change the way you are reading data from staging tables.

    最后,请注意,查询列存储表的理想方案是在查询中使用过滤器时导致从单个列中检索值(即发票编号)。 相反,您在SELECT子句中指定的列越多,则必须读取的页面就越多,这意味着查询将运行的时间越长。 因此,请记住,列存储并不是解决数据仓库环境中出现的任何问题的灵丹妙药。 如果计划用列存储替换行存储过渡表,则必须更改从过渡表读取数据的方式。

结论 (Conclusion)

In this article we have demonstrated that the data warehouse environment can benefit from columnstore indexing as it has several advantages over rowstore. Although it takes longer to write data into a staging environment, columnstore indexing uses less disk space and compresses database files in a manner that makes it convenient to later retrieve data out of that staging table.

在本文中,我们证明了数据仓库环境可以从列存储索引中受益,因为它比行存储具有多个优势。 尽管将数据写入暂存环境需要更长的时间,但列存储索引却使用较少的磁盘空间,并以方便以后从该暂存表中检索数据的方式压缩数据库文件。

翻译自: https://www.sqlshack.com/use-columnstore-index-improve-your-data-warehouse-staging-nvironment/

行存储索引改换成列存储索引

行存储索引改换成列存储索引_如何使用列存储索引来改善数据仓库登台环境相关推荐

  1. 行存储索引改换成列存储索引_列存储索引增强功能–数据压缩,估计和节省

    行存储索引改换成列存储索引 Data compression is required to reduce database storage size as well as improving perf ...

  2. 行存储索引改换成列存储索引_索引策略–第2部分–内存优化表和列存储索引

    行存储索引改换成列存储索引 In the first part we started discussion about choosing the right table structure and d ...

  3. 用python提取不同的两列数据对比_比较两列数据fram中的值

    另一种方法是使用pandas.DataFrame的.loc方法,该方法返回符合布尔索引条件的行的索引位置:df.loc[(df['256'] != df['Z'])].index 输出:Int64In ...

  4. mysql 按两列排序吗_按两列排序MySQL表

    噜噜哒 这可能有助于某人正在寻找通过两列排序表的方法,但是以相似的方式.这意味着使用聚合排序功能组合两种排序.例如,在使用全文搜索检索文章以及文章发布日期时,它非常有用.这只是一个例子,但是如果你理解 ...

  5. 五大列级庄_法国波尔多列级名庄之五级庄介绍

    对于法国波尔多的列级名庄,大家最熟悉的可能是一级庄,但对五级庄可能却不是特别了解.下面就来介绍一下法国波尔多的五级名庄. 五级庄介绍 1.庞特卡奈古堡红葡萄酒 庞特卡奈古堡(Chateau Ponte ...

  6. 五大列级庄_什么是列级酒庄

    展开全部 列级酒庄就是在分级体系中占有一席之地的酒庄,法国各地都有不同62616964757a686964616fe58685e5aeb931333330343139的分级体系,所以列级名庄也就非常之 ...

  7. mysql列别名引用_引用聚合列的MySQL别名

    继续我的问题 summarizing-two-conditions-on-the-same-sql-table,我添加了一个RATIO列,它只是一个SUM(-)列除以第二个SUM(-)列: SELEC ...

  8. 五大列级庄_什么是“列级庄”

    作为中国一家列级酒庄的品牌运营,上一篇文章跟大家分享了中国的列级庄,今天我和大家分享列级庄的来源以及更多葡萄酒酒庄的等级制度. 列级庄来源的故事: 1855年,巴黎举行了世界博览会.当时,法国国王拿破 ...

  9. python两列时间间隔计算器_计算两列之间的Pandas DataFrame时间差异(以小时和分钟为单位)...

    熊猫的时间戳差异返回datetime.timedelta对象.可以使用* as_type *方法将其轻松转换为小时,就像这样 import pandas df = pandas.DataFrame(c ...

最新文章

  1. final关键字的几大特征
  2. {网络编程}和{多线程}应用:基于UDP协议【实现多发送方发送数据到同一个接收者】--练习
  3. C++实现计数排序(附完整源码)
  4. 分享20个常用的Python函数,轻松玩转Pandas!!
  5. 每日一linux命令
  6. openGauss 学习环境部署(docker方式),并使用dbeaver进行连接
  7. MariaDB-5.5.56 主主复制+keepalived高可用
  8. 通过条形码扫描器攻击工控系统
  9. vue引入jsmind(右键菜单)
  10. IDEA MyEclipse Eclipse 快捷键大全(最终版)
  11. 台式计算机怎么安装无线网卡,台式机无线网卡怎么用 台式机USB无线网卡安装使用教程...
  12. 国内常见的日内CTA策略介绍以及实现
  13. mysql中有关视图的概念、操作及作用
  14. 中国大学计算机专业排名教育部,中国校友会网2018中国大学计算机类各本科专业排行榜...
  15. 针对不同场景的Python合并多个Excel方法
  16. postman 传 map数据怎么传
  17. 阿里品牌数据品牌银行分析师认证真题资料库整理答案
  18. ubuntu介绍以及使用
  19. 循环神经网络中梯度爆炸的原因
  20. GIT提示Another git process seems to be running in this repository

热门文章

  1. Excel中复杂跨行跨列数据
  2. Java的jdk1.6与jre1.8中存在的差异
  3. pta 习题集5-19 列车厢调度
  4. PHP加密解密函数之Base64
  5. 第五章、使用复合赋值和循环语句
  6. Android的图片压缩并上传
  7. Spring2.5事务配置的5种方法
  8. css命中与jquery命中
  9. 计算机网络学习笔记(23. HTTP连接类型)
  10. Web前端经典面试题-JavaScript