Working with Ragged Right Formatted Files in SSIS

In SSIS development, flat files are often a better data source than non-Microsoft relational databases. With flat files you rarely have to worry about the driver support and compatibility issues on your SSIS development or server machine that often come with non-Microsoft database vendors. In fact, I’ve been in several situations where we could not upgrade to a newer version of SSIS (i.e. BIDS to SSDT) because the external vendor’s drivers were not compatible with the newer version.

Although flat files are preferred, the flat file format should be delimited wherever possible. Properly delimited files reduce the time spent on column mapping configuration because SSIS is able to detect both row and column delimiters. You may not appreciate having SSIS do the column mappings for you until you are sitting with a fixed width dataset that you have to break down manually into 100+ output columns. Figure 1 illustrates a perfect world scenario: a fictitious Fruit Sales transaction file whose rows are separated by carriage return/line feed and whose columns are delimited by a vertical bar (also known as a pipe).
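
To make the benefit of delimiters concrete, here is a minimal Python sketch of how a pipe-delimited file with CRLF row delimiters splits into columns without any manual column limits. It is not SSIS itself, and the header names and values are hypothetical, since the file from Figure 1 is not reproduced here.

```python
import csv
import io

# Hypothetical sample: the header names and values are made up for illustration.
sample = (
    "ItemNo|TransactionDate|Fruit|Quantity\r\n"
    "01|2015-03-01|Apples|12\r\n"
    "02|2015-03-02|Bananas|7\r\n"
)

# newline="" lets the csv module handle the CRLF row delimiter itself.
reader = csv.reader(io.StringIO(sample, newline=""), delimiter="|")
header, *rows = list(reader)

print(header)
for row in rows:
    # Each value lands in its own column purely because of the delimiter.
    print(dict(zip(header, row)))
```

Because the delimiter marks every column boundary, no widths had to be configured by hand.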

Unfortunately, as ETL developers, we often have to extract data from an imperfect world in which flat files are formatted in all sorts of ways. Figure 2 illustrates another representation of our fictitious Fruit Sales dataset, one in which SSIS is unable to detect column delimiters. Whenever you encounter such files, you are left with two options in SSIS: treat them as Fixed width formatted or as Ragged right formatted. The aim of this article is to explore different ways of working with the latter option, Ragged right.

Defining Ragged Right Format

Whether you import flat files using SQL Server Integration Services (SSIS) or SQL Server Management Studio (SSMS), the Ragged right option sits at the bottom of the Format drop-down box in both tools, as shown in Figure 3. This should not be mistaken for an order of preference; it is nothing more than an alphabetical list of the available file formats. In fact, I strongly believe that Ragged right is a hybrid of the Delimited and Fixed width formats. Unlike delimited files, where both column and row delimiters are required, in Ragged right format only row delimiters are auto-detected, and the content of each row is stored in a single column. Should you want to break the row into several columns, you have to specify the column limits manually. This is in contrast to the Fixed width format, where you have to specify not only the column limits but the row boundaries too.
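
The default Ragged right behaviour described above can be sketched in a few lines of Python. This only illustrates the idea and is not how SSIS is implemented: the function name `read_single_column` and the sample rows are hypothetical, and `Column 0` is used as the column name because the article refers to it later.

```python
def read_single_column(raw_text, row_delimiter="\r\n"):
    """Split text on the row delimiter and keep each row as one column."""
    rows = raw_text.split(row_delimiter)
    # A trailing delimiter leaves an empty string at the end; drop it.
    if rows and rows[-1] == "":
        rows.pop()
    return [{"Column 0": row} for row in rows]


# Made-up rows: only the CRLF row delimiter is present, no column delimiter.
sample = "01150301Apples\r\n02150302Bananas\r\n"

for record in read_single_column(sample):
    print(record)
```

Only the row boundaries are recognised here; everything between two row delimiters stays in one column until column limits are added by hand.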

Configuring Columns

Column configuration is a critical part of the entire ETL process because it builds the mapping metadata for your ETL. In fact, regardless of whether or not SSIS/SSMS can detect delimiters, if you skip the Column Mapping section, your ETL will fail validation. To clarify how Ragged right formatted files work, I have taken a step back and used Figure 4 to display a preview of our fictitious Fruit Sales transaction dataset in Notepad++. Notepad++ already shows that the file only has a row delimiter, in the form of CRLF.

When the file shown in Figure 4 is imported into SSIS, it looks as shown in Figure 5, and the row delimiter is detected automatically.

At this point, we can deal with column configuration in one of two ways:

  1. Single Column Mapping

    Single column mapping is available by default when working with Ragged right format. This default mapping is the simplest yet the most error prone. It is simpler because you don’t have to spend time breaking data rows down into two or more columns. However, this approach often results in a text truncation error similar to the one shown in Figure 6.

    Figure 6: Text Truncation Error

    Text truncation is one of the common errors you are likely to come across when importing files. It doesn’t matter whether you are dealing with Delimited, Fixed width or Ragged right files. In a flat file connection, column length is never dynamic: if you configured your connection so that [Column 0] has a length of 5 and you later load a file containing a value that is 6 characters long, you will run into a truncation error.

    The easiest way to avoid truncation issues is to add some headroom to your column length (i.e. if you know the length of a given column is 75, set it to 100).

    Another disadvantage of mapping every row to a single column is that, in order to break the row values down into several columns, you have to perform an additional transformation (either within SSIS using transformation tasks or in the SQL Server database using T-SQL).

    For instance, the script in Figure 7 shows the T-SQL substring logic that was used to break the values of the single column ([Column 0]) into separate columns.

    Figure 7: Single Column T-SQL Transformation

    The results of execution of the script in Figure 7 are shown in Figure 8.

    Figure 8: Execution Result of Single Column T-SQL Transformation Script

  2. Multiple Column Mappings

    An alternative to single column mapping is to use markers to specify several column limits. For instance, looking at our fictitious dataset in Figure 4, we can guess several column headings: at the beginning of every row, the first two characters look like item numbers, the next 6 characters look like a transaction date, and these are followed by what look like fruit names.

    The rest of the column limits are shown in Figure 9.

    Figure 9: Multiple Column Limit

    The biggest benefit of specifying column limits is that, when it comes to data transformation, there is less data manipulation to do than with single column mapping. Figure 10 shows that we didn’t have to manipulate the first three columns because we already knew what they represented; instead, we focused on splitting the data in [Column 3].

    Figure 10: Multiple Column Limit Data Transformation

    Just like single column mapping, specifying column limits has its own shortcomings. For instance, if the length of the values mapped against [Column 2] suddenly increases, the extra characters are carried over into the next column (should there be no subsequent column, a truncation error is raised). Figure 11 shows a revised fictitious dataset in which the Banana fruit names in rows 9-10 have been renamed to SIFISOBananas. Unlike before, [Column 2] can no longer hold the extended fruit name, so the name is split and its last three characters (i.e. nas) end up in [Column 3].

    Figure 11: Increased Length of Expected Value
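
The behaviour of both mapping approaches can be imitated outside SSIS with a short Python sketch. It is an illustration under stated assumptions, not the scripts from Figure 7 or Figure 10: the function names and sample rows are hypothetical, the widths 2 and 6 follow the description of the item number and transaction date, and the width of 10 for the fruit name is an assumption inferred from the fact that only "nas" spills over from SIFISOBananas.

```python
# Assumed column limits: item number (2), transaction date (6), fruit name (10).
# The value 10 is inferred from the article's "nas" example, not read from Figure 9.
WIDTHS = [2, 6, 10]


def split_ragged_right(row, widths=WIDTHS):
    """Cut a row at fixed offsets; the last column takes whatever is left."""
    columns = []
    start = 0
    for width in widths:
        columns.append(row[start:start + width])
        start += width
    columns.append(row[start:])  # ragged column: runs up to the row delimiter
    return columns


def too_long(rows, column_length):
    """Return the rows that would not fit in a column of the given length."""
    return [row for row in rows if len(row) > column_length]


# Made-up rows; the real dataset from Figure 4 is not reproduced here.
normal = "01" + "150301" + "Bananas".ljust(10) + "0012"
renamed = "01" + "150301" + "SIFISOBananas" + "0012"

print(split_ragged_right(normal))   # fruit name fits inside its 10 characters
print(split_ragged_right(renamed))  # "nas" spills into the last column

# Single column mapping: a length of 5 cannot hold a 6-character value.
print(too_long(["12345", "123456"], column_length=5))
```

The slicing in `split_ragged_right` plays the same role as substring logic applied to [Column 0], while `too_long` shows the kind of check that can flag a value before it triggers a truncation error.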

Conclusion

In this article we established that we are often not presented with well formatted flat files and instead have to deal with Ragged right formatted flat files. We identified two ways of dealing with column configuration in Ragged right format: row data can be mapped either against a single column or across multiple columns.

Download article scripts/demo material here

Reference:

  • SQL Server Import and Export Wizard
  • Flat File Connection Manager
  • T-SQL Substring Syntax

Translated from: https://www.sqlshack.com/working-with-ragged-right-formatted-files-in-ssis/
