How to Import / Export CSV Files with R in SQL Server 2016

Introduction

Importing and exporting CSV files is a routine task for DBAs.

For import, we can use the following methods:

BCP utility
Bulk Insert
OpenRowset with the Bulk option
Writing a CLR stored procedure or using PowerShell

For export, we can use the following methods:

BCP utility
Writing a CLR stored procedure or using PowerShell

But to do both import and export inside T-SQL, the only way so far has been a custom CLR stored procedure.

This has changed with the release of SQL Server 2016, which has R integrated. In this article, we will demonstrate how to use R embedded inside T-SQL to do import / export work.

R Integration in SQL Server 2016

To use R inside SQL Server 2016, we first need to install R Services (In-Database). For detailed installation steps, see Set up SQL Server R Services (In-Database).

T-SQL integrates R via a new stored procedure: sp_execute_external_script.

The main purpose of the R language is data analysis, especially statistical analysis. However, since any data analysis work naturally needs to deal with external data sources, CSV files among them, we can use this capability to our advantage.

What is more interesting is that SQL Server R Services is installed with RevoScaleR, an R package enhanced and tailored for SQL Server 2016, which contains some handy functions.

Environment Preparation

Let's first prepare some real-world CSV files. I recommend downloading CSV files from 193,992 datasets found.

We will download the CSV files of the first two datasets, "College Scorecard" and "Demographic Statistics By Zip Code". Click the links for these two datasets, and two CSV files will be downloaded.

After downloading the two files, we can move "Demographic_Statistics_By_Zip_Code.csv" and "Most-Recent-Cohorts-Scorecard-Elements.csv" to a designated folder. In my case, I created a folder C:\RData and put them there.

These two files are fairly typical: Demographic_Statistics_By_Zip_Code.csv contains only numeric values, and the other file has a large number of columns, 122 to be exact.

I will load these two files into the [TestDB] database on my local SQL Server 2016 instance, i.e. [localhost\sql2016].

Data Import / Export Requirements

We will do the following to meet the import / export requirements:

Import the two csv files into staging tables in [TestDB]. The input parameter is a csv file name
Export the staging tables back to a csv file. The input parameters are the staging table name and the csv file name
Import / export should be done inside T-SQL

Implementation of Import

In most data loading work, we first create staging tables and then start loading. However, with some handy functions in the RevoScaleR package, the staging table creation step can be omitted because the R function creates the staging table automatically. This is a real relief when we have to handle a CSV file with 100+ columns.

The implementation is straightforward:

Read the csv file with the read.csv R function into variable c, which will be the source
From the csv file full path, extract the file name (without directory and suffix) using basename and substr; this name is used as the staging table name
Create a SQL Server connection string in variable conn
Create a destination SQL Server data source using the RxSqlServerData function
Use the rxDataStep function to import the source into the destination
To import a different csv file, just change the first line to assign the proper value to @filepath
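The table-name step above can be tried on its own in plain R. The following is a minimal sketch, assuming the same folder layout as in this article; the variable names stem, tblname and stem2 are illustrative and not part of the original script.

```r
# Path in the same style as the article (forward slashes instead of backslashes).
filepath <- "c:/rdata/Demographic_Statistics_By_Zip_Code.csv"

# Strip the directory, then the 4-character ".csv" suffix.
filename <- basename(filepath)
stem <- substr(filename, 1, nchar(filename) - 4)

# Bracket-quote the name so characters such as "-" stay legal in T-SQL.
tblname <- paste("dbo.[", stem, "]", sep = "")
print(tblname)  # "dbo.[Demographic_Statistics_By_Zip_Code]"

# Alternative: tools::file_path_sans_ext() drops an extension of any length.
stem2 <- tools::file_path_sans_ext(basename(filepath))
```

The substr approach relies on the suffix being exactly four characters long, so it fits ".csv" but would need adjusting for other extensions; file_path_sans_ext avoids that assumption.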

One special note here: the conn variable defines a connection string, and at this moment it seems we need a User ID (UID) and Password (PWD) to avoid problems. If we use Trusted_Connection = True, there can be problems. So in this case, I created a login XYZ and assigned it as a db_owner user in [TestDB].

After this is done, we can check what the new staging table looks like.

We notice that all columns are created using the original names in the source csv file, with the proper data types.

After assigning @filepath = 'c:/rdata/Most-Recent-Cohorts-Scorecard-Elements.csv' and re-running the script, we can see that a new table [Most-Recent-Cohorts-Scorecard-Elements] is created with 122 columns.

However, there is a problem with this csv file import: some csv columns are treated as integers. For example, [OPEID] and [OPEID6] should be treated as strings instead, because treating them as integers drops the leading 0s.

Looking at what is inside the table confirms that in such a scenario we cannot rely on automatic table creation.

To correct this, we have to instruct the R read.csv function by specifying the data type for the two columns through its colClasses argument, as done in the second import script in the complete T-SQL script.
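The effect of colClasses can be reproduced outside SQL Server with a tiny file. This is a sketch only: the rows are made up for illustration, and only the column names OPEID and OPEID6 come from the article.

```r
# Write a small sample CSV to a temporary file (made-up rows).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("OPEID,OPEID6,INSTNM",
             "00100200,001002,Sample College A",
             "00105200,001052,Sample College B"), csv_path)

# Default type guessing: the ID columns become integers and lose leading zeros.
guessed <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Pinning the two columns to character keeps the text exactly as written.
# Columns not named in colClasses are still guessed as usual.
fixed <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE,
                  colClasses = c("OPEID" = "character", "OPEID6" = "character"))

print(guessed$OPEID)  # 100200 105200
print(fixed$OPEID)    # "00100200" "00105200"

unlink(csv_path)
```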

We can now see the correct values in the [OPEID] and [OPEID6] columns.

Implementation of Export

If we want to dump the data from a table to a csv file, we need to define two input parameters: one is the destination csv file path and the other is a query to select from the table.

The beauty of sp_execute_external_script is that it can run a query against a table inside SQL Server via its @input_data_1 parameter, and then pass the result to the R script as a named variable via its @input_data_1_name parameter.

So here are the details:

Define the csv file full path in @dest_filepath; it is passed to the embedded R script through the @params definition and consumed by write.csv
Define a query in @query to retrieve data from the table; it is passed in via @input_data_1
Name the query result SrcTable via @input_data_1_name; the embedded R script consumes it under that name
Call write.csv to generate the csv file

We can modify @query to export whatever we want, such as a query with a where clause, or a query that selects only some columns instead of all columns.
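The write.csv call used for the export can be tried in plain R first. In this sketch, SrcTable is a hand-built data frame standing in for the result that @input_data_1 would supply, and the rows and the temporary destination path are made up for illustration.

```r
# Stand-in for the query result that SQL Server would pass in as SrcTable.
SrcTable <- data.frame(zip = c("10001", "10002"),
                       count = c(44L, 35L),
                       stringsAsFactors = FALSE)

# Hypothetical destination; the article writes to c:/rdata/ instead.
dest_filepath <- tempfile(fileext = ".csv")

# Same arguments as the embedded script: no quoting, no row-name column.
write.csv(SrcTable, file = dest_filepath, quote = FALSE, row.names = FALSE)

lines <- readLines(dest_filepath)
print(lines)  # "zip,count" "10001,44" "10002,35"
```

One design point to keep in mind: with quote = FALSE, a value that itself contains a comma is written unquoted and will shift columns when the file is read back, so quote = TRUE may be the safer setting for free-text columns.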

The complete T-SQL script is shown here:


-- import data 1: import from csv file by using default configurations
-- the only input parameter needed is the full path of the source csv file
declare @filepath varchar(100) = 'c:/rdata/Demographic_Statistics_By_Zip_Code.csv'  -- using / to replace \
declare @tblname varchar(100);
declare @svr varchar(100) = @@servername;

exec sp_execute_external_script @language = N'R'
, @script = N'
c <- read.csv(filepath, sep = ",", header = T)
filename <- basename(filepath)
filename <- paste("dbo.[", substr(filename,1, nchar(filename)-4), "]", sep="") #remove .csv suffix
conn <- paste("SERVER=", server, "; DATABASE=", db, ";UID=xyz;PWD=h0rse;", sep = "")
destDB <- RxSqlServerData(table = filename, connectionString = conn);
rxDataStep(inData=c, outFile = destDB, rowsPerRead=1000, overwrite = T );
'
, @params = N'@filepath varchar(100), @server varchar(100), @db varchar(100)'
, @filepath = @filepath
, @server = @svr
, @db = 'TestDB';
go

-- import data 2: assign data type to some columns using colClasses in read.csv function
-- the only input parameter needed is the full path of the source csv file
declare @filepath varchar(100) = 'c:/rdata/Most-Recent-Cohorts-Scorecard-Elements.csv'  -- using / to replace \
declare @tblname varchar(100);
declare @svr varchar(100) = @@servername;

exec sp_execute_external_script @language = N'R'
, @script = N'
c <- read.csv(filepath, sep = ",", header = T, colClasses = c("OPEID" = "character", "OPEID6"="character"))
filename <- basename(filepath)
filename <- paste("dbo.[", substr(filename,1, nchar(filename)-4), "]", sep="") #remove .csv suffix
conn <- paste("SERVER=", server, "; DATABASE=", db, ";UID=xyz;PWD=h0rse;", sep = "")
destDB <- RxSqlServerData(table = filename, connectionString = conn);
rxDataStep(inData=c, outFile = destDB, rowsPerRead=1000, overwrite = T );
'
, @params = N'@filepath varchar(100), @server varchar(100), @db varchar(100)'
, @filepath = @filepath
, @server = @svr
, @db = 'TestDB';
go

-- export data:
-- two input parameters are needed, one is the destination csv file path
-- and the other is a query to select the source table
declare @dest_filepath varchar(100), @query nvarchar(1000) ;

select @dest_filepath = 'c:/rdata/Most-Recent-Cohorts-Scorecard-Elements_copy.csv' -- using / to replace \
, @query = 'select * from [TestDB].[dbo].[Most-Recent-Cohorts-Scorecard-Elements]'

/*
-- for the Demographic table, use the following setting to replace the above two lines
select @dest_filepath = 'c:/rdata/Demographic_Statistics_By_Zip_Code_copy.csv'
, @query = 'select * from [TestDB].[dbo].[Demographic_Statistics_By_Zip_Code]'
*/

exec sp_execute_external_script @language = N'R'
, @script = N'write.csv(SrcTable, file=dest_filepath, quote=F, row.names=F)'
, @input_data_1 = @query
, @input_data_1_name  = N'SrcTable'
, @params = N'@dest_filepath varchar(100)'
, @dest_filepath = @dest_filepath
go

After running the whole script, I can see that the new files have been created.

Summary

In this article, we discussed how to import and export SQL Server tables via R inside T-SQL. This is totally different from our traditional approaches. The new method is easy and can handle tough CSV files, such as a CSV file with column values that span multiple lines.
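As a small illustration of the multi-line case, the sketch below reads a made-up CSV in which one quoted value contains a line break. It runs in plain R with base read.csv and is not taken from the original scripts.

```r
# Made-up sample: the comment on the first data row spans two physical lines.
csv_path <- tempfile(fileext = ".csv")
writeLines(c('id,comment',
             '1,"first line',
             'second line"',
             '2,plain text'), csv_path)

# read.csv treats the quoted value as one field, so the file yields two rows.
df <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

print(nrow(df))
print(df$comment[1])

unlink(csv_path)
```

This works only when the multi-line value is enclosed in double quotes in the file, which is the usual CSV convention.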

This new approach does not require any additional R packages, and the script can run in SQL Server 2016 with the default R installation, which already contains the RevoScaleR package.

During my tests with various CSV files, I noticed that reading a big CSV file needs a lot of memory; otherwise, there can be errors. However, when the R script is run directly (i.e. not embedded in T-SQL, for example in RStudio), the memory requirement is still there but the script can finish without error, while the same R script run inside sp_execute_external_script fails.

No doubt, the current R integration with T-SQL is just Version 1, and there are some wrinkles in the implementation. But it is definitely a great feature that opens another door for DBAs / developers to tackle lots of work. It is worth our while to understand and learn it.

Next Steps

R has lots of useful third-party packages (most of them open source), and we can do lots of additional work with them, such as importing / exporting Excel files (especially .xlsx files) or working with regular expressions. It is really fun to play with these packages, and I will share my exploration journey in the future.

Translated from: https://www.sqlshack.com/how-to-import-export-csv-files-with-r-in-sql-server-2016/