Using SQL Bulk Insert With Strict Business Rules

This article covers deterministic outcomes and responses for SQL bulk insert operations, ranging from allowing no bad data at all to allowing all data to be inserted regardless of errors.

As we import our data and migrate it to feeds, applications, APIs, or other downstream reports, we want to apply our business rules to the data as early in our flow as possible. One of the main business rules that affects any data flow covers how we address errors during our data import and what we do with those errors. The cost of catching an error later in a flow can be enormous.

This helps increase performance and also helps us catch errors early, at a point where we can ask a data vendor for clarification or avoid using the data in question altogether. With SQL bulk insert we can enforce business rules early in our data flow to reduce resource usage and get results with our data.

3 Possible Data Scenarios Early In A Flow

Using the premise of “having no data is better than having bad data” (bad data here could be inaccurate or meaningless for our business context), we want to consider how we structure our data flow design in a manner that eliminates bad data as early as possible. We should consider three possible scenarios related to this as this step may involve one or more of the three scenarios.

  1. No bad data (errors) allowed: if a data point out of an entire set is invalid or meaningless for our business context, does it invalidate the entire set? For this scenario it would. This may be useful to us (see scenario three), but we would treat it as an invalidation of the data set
  2. Error checks: are we using a data vendor (or source) we can contact if we find bad data following a SQL bulk insert? For this scenario, getting feedback quickly is key
  3. Keep all data: if we determine that data are invalid (in either scenario one or two), could we use this information? This also applies to situations where we uncover inaccurate data in commonly used or well-respected data sources that the vendor or source may not be aware of, and we want this information for other business reasons

Depending on our answers to the above questions, we may choose some options during our file load. For example, if one bad data point invalidates an entire data set, we may choose to fail the entire transaction on the first failure by not allowing any errors. Likewise, if we have strict rules, we also want to ensure that our table design and the validation following our file load eliminate any data that don't fit what we expect. Finally, we can use some of the features we've covered to even automate reaching out to vendors.
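
For instance, a minimal sketch of that first choice – failing the whole load as soon as a single bad row appears – is to set MAXERRORS to zero explicitly (BULK INSERT otherwise tolerates up to 10 syntax errors by default); the table and file names below simply reuse this series' examples:

---- A hedged sketch: reject the whole load on the first bad row by allowing zero errors
BULK INSERT etlImport6
FROM 'C:\ETL\Files\Read\Bulk\data_20180101.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2, MAXERRORS = 0)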

Scenario 1: Allowing No Errors

In most cases, we want data without errors, and preventing errors early in our data flow can save us time and keep us from importing bad data. There are some rare situations where we may identify a bad data set and want to keep it (see point 3, which I cover in more detail in a section below). In this scenario, we want SQL bulk insert to allow the lowest number of errors possible – for our example purposes that will be zero. Also, we want our table to have the exact structure of data that we expect from the file, so that any data that fail to comply with our business rules fail the file insert.

In the below example, our SQL bulk insert allows no errors – any value in the file that does not match what we expect in our table will fail. Additionally, we don't specify options like KEEPNULLS because we don't want any null values. We'll set a business rule that DataY must always be greater than DataX, which we'll check in a query. Both values must also be numeric values in the 0 to 9 range. Because of this business rule, we're using the smallest whole-number integer type for our columns in the table – tinyint – and following our insert, we're querying for any records that fall outside the 0 to 9 range. In addition, we look for records where DataX exceeds the value of DataY.

CREATE TABLE etlImport6 (DataX TINYINT NOT NULL, DataY TINYINT NOT NULL)

BULK INSERT etlImport6
FROM 'C:\ETL\Files\Read\Bulk\data_20180101.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)

SELECT *
FROM etlImport6
WHERE DataX > DataY OR DataX > 9 OR DataY > 9 OR DataX < 0 OR DataY < 0
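
If we want the DataY-greater-than-DataX rule enforced during the load itself rather than in a follow-up query, one hedged alternative (the table and constraint names here are hypothetical) is a CHECK constraint combined with the CHECK_CONSTRAINTS option, since bulk insert ignores CHECK and FOREIGN KEY constraints unless told otherwise:

---- Hypothetical variant: push the business rule into the table definition
CREATE TABLE etlImport6_Checked (
    DataX TINYINT NOT NULL,
    DataY TINYINT NOT NULL,
    CONSTRAINT chkDataYGreaterThanDataX CHECK (DataY > DataX)
)

---- CHECK_CONSTRAINTS makes the load honor the constraint instead of skipping it
BULK INSERT etlImport6_Checked
FROM 'C:\ETL\Files\Read\Bulk\data_20180101.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2, CHECK_CONSTRAINTS)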

From the table design to the specifications in SQL bulk insert to the following validation queries, we want to apply our business rules as strictly as possible to filter out erroneous data points early. We can take this further by reducing part of our WHERE clause and steps, assuming storage is not a concern. In the below code, we execute the same steps, except our table structure limits us to decimal values of one whole digit only, meaning only -9 to 9 will be allowed.

CREATE TABLE etlImport6_Strict (DataX DECIMAL(1,0) NOT NULL, DataY DECIMAL(1,0) NOT NULL)

BULK INSERT etlImport6_Strict
FROM 'C:\ETL\Files\Read\Bulk\data_20180101.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)

SELECT *
FROM etlImport6_Strict
WHERE DataX > DataY OR DataX < 0 OR DataY < 0

The data type tinyint costs less (1 byte) but allows a broader range. Our decimal specification is stricter – it throws an arithmetic overflow error if we go over, even at a slightly higher storage cost. We see that between strict SQL bulk insert specifications, table design, and queries, we can reduce the number of steps needed to ensure that our data are correct as early as possible in our data flow.
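
To illustrate the difference with made-up values: a two-digit value fails outright against DECIMAL(1,0), whereas tinyint accepts it and leaves it for the validation query to flag.

---- Hypothetical values: fails with an arithmetic overflow error, since DECIMAL(1,0) holds a single digit
INSERT INTO etlImport6_Strict (DataX, DataY) VALUES (12, 15)

---- The same values fit in tinyint, so only the follow-up validation query would catch them
INSERT INTO etlImport6 (DataX, DataY) VALUES (12, 15)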

Scenario 2: SQL Bulk Insert with Error Checks

We'll now look at a scenario that applies when we have a contact for the source if we experience errors and want to quickly determine more details about those errors. In my experience finding errors in data where there is a contact for the source, the fastest way to get information is to have the bad data on hand so you can send it quickly. In our example, we'll use a feature of our tool – its ability to generate an error file and insert that error file's data for reporting, such as an email sent to us with the vendor copied in whenever errors occur. Using the same file, we'll add an X,Y value to the file (row 8), which doesn't match our required format, and we'll have this error logged and imported.

CREATE TABLE etlImport6Error (ErrorFileData VARCHAR(MAX))

---- Clear from previous example
TRUNCATE TABLE etlImport6

BULK INSERT etlImport6
FROM 'C:\ETL\Files\Read\Bulk\data_20180101.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2,
      MAXERRORS = 100, ERRORFILE = 'C:\ETL\Files\Read\Bulk\data_20180101')

BULK INSERT etlImport6Error
FROM 'C:\ETL\Files\Read\Bulk\data_20180101.Error.txt'

SELECT * FROM etlImport6
SELECT * FROM etlImport6Error

DECLARE @message VARCHAR(MAX) = '<p>We uncovered the below errors in the file data_20180101.txt:</p>'
SELECT @message += '<p>' + ErrorFileData + '</p>' FROM etlImport6Error

---- Only alert when the error table actually captured rows
IF EXISTS (SELECT 1 FROM etlImport6Error)
BEGIN
    SELECT @message
    EXEC msdb.dbo.sp_send_dbmail @profile_name = 'datasourceOne', @recipients = 'vendor@vendorweuse.com',
         @body = @message, @subject = 'Errors found'
END

The above design calls SQL bulk insert twice – once to get the file's data and a second time to insert the errors. With this etlImport6Error table, we then send an email alert to our vendor with the details (or we could use the query in a report we send). Our business rules fundamentally determine whether errors make it worth contacting the vendor. I would suggest keeping this process as quick as possible, as trying to manually keep a list of errors will consume extra time. This assumes that our business requirements expect us to get insight into data like these.

Scenario 3: Allowing Any Data (Including Inaccurate Data)

In some cases, we want to allow anything when we insert a file's data. Even though having no data is better than having bad data, an inaccurate data source can still tell us something about the source, the industry, or other related information when we identify errors, especially if the source is commonly used. In these cases, following a SQL bulk insert we want to keep the data for reports and analysis even if we know the file has bad data, since the purpose is not to use the file as a data source, but to determine whether we can see these data being used elsewhere.

CREATE TABLE etlImport6_Allow (DataX VARCHAR(MAX), DataY VARCHAR(MAX))

BULK INSERT etlImport6_Allow
FROM 'C:\ETL\Files\Read\Bulk\data_20180101.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)

SELECT * FROM etlImport6_Allow

---- Failure: Conversion failed when converting the varchar value 'X' to data type tinyint.
SELECT CAST(DataX AS TINYINT), CAST(DataY AS TINYINT)
FROM etlImport6_Allow

---- Tracking records
SELECT *
FROM etlImport6_Allow
WHERE DataX NOT LIKE '[0-9]' OR DataY NOT LIKE '[0-9]'

When we look at the results, we see everything from the file, and since the varchar(max) specification allows up to 2 gigabytes per value, it's unlikely that our load will fail for space reasons. We may also specify smaller column sizes if we don't expect extreme values – like a varchar of 500. After our insert into the varchar columns, when we try to convert the data to what we would expect – tinyint – we get a conversion error. From here, we can use a strict search with numeric patterns to identify the records that aren't numbers.
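
As a side note, on SQL Server 2012 and later, TRY_CAST offers another way to isolate the offending rows without triggering the conversion failure – a minimal sketch, not part of the original example:

---- TRY_CAST returns NULL instead of erroring, so we can list only the rows that won't convert
SELECT DataX, DataY
FROM etlImport6_Allow
WHERE TRY_CAST(DataX AS TINYINT) IS NULL OR TRY_CAST(DataY AS TINYINT) IS NULL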

Generally, if we are going to allow anything from a file source into our environment for the purpose of tracking this information, it will happen after we identify a faulty data source. This follows sequentially from one of the earlier steps, such as running the SQL bulk insert in our first section, finding that our source is erroneous, and realizing that this finding may help our business. From a business perspective, this can be very useful. Faulty data sources can help us predict events in our industry, so if we discover a faulty data set, we may want to keep aggregates, summaries, or even the entire data set.
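
For example, a minimal sketch of keeping a summary alongside (or instead of) the raw rows – counting how much of the file violated our numeric rule:

---- Hedged sketch: retain an aggregate picture of how bad the feed was
SELECT COUNT(*) AS TotalRows,
       SUM(CASE WHEN DataX NOT LIKE '[0-9]' OR DataY NOT LIKE '[0-9]' THEN 1 ELSE 0 END) AS InvalidRows
FROM etlImport6_Allow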

Conclusion

Businesses can have different rules for data, and the context and industry can matter. I've imported thousands of data sets, and these three scenarios can apply to any file source – some businesses have different rules for the same data sets than other businesses do. As we see with SQL bulk insert, we can use it with a combination of data structures for any of these scenarios.

Table of contents

T-SQL’s Native Bulk Insert Basics
Working With Line Numbers and Errors Using Bulk Insert
Considering Security with SQL Bulk Insert
SQL Bulk Insert Concurrency and Performance Considerations
Using SQL Bulk Insert With Strict Business Rules

Translated from: https://www.sqlshack.com/using-sql-bulk-insert-with-strict-business-rules/
