SQL Replace: How to Replace ASCII Special Characters in SQL Server

One of the important steps in an ETL process involves the transformation of source data. This could involve looking up foreign keys, converting values from one data type into another, or simply conducting data clean-ups by removing trailing and leading spaces. One aspect of transforming source data that could get complicated relates to the removal of ASCII special characters such as new line characters and the horizontal tab. In this article, we take a look at some of the issues you are likely to encounter when cleaning up source data that contains ASCII special characters and we also look at the user-defined function that could be applied to successfully remove such characters.

Replacing ASCII Printable Characters

The American Standard Code for Information Interchange (ASCII) is one of the generally accepted standardized numeric codes for representing character data in a computer. For instance, the ASCII numeric code associated with the backslash (\) character is 92. Many software vendors abide by ASCII and thus represent character codes according to the ASCII standard. Likewise, SQL Server, which uses ANSI code pages (supersets of ASCII), ships with a built-in CHAR function that converts an ASCII numeric code back to its original character (or symbol). Script 1 shows how the ASCII numeric code 92 can be converted back into a backslash character, as shown in Figure 1.

SELECT CHAR(92);
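
As a quick sanity check of this mapping, the built-in ASCII function goes the other way and returns the numeric code for a character. A minimal sketch; the column aliases are illustrative and not part of the original scripts:

```sql
-- CHAR turns a numeric code into a character; ASCII turns a character back into its code.
SELECT CHAR(92)        AS CharacterFromCode,  -- returns \
       ASCII('\')      AS CodeFromCharacter,  -- returns 92
       ASCII(CHAR(92)) AS RoundTrip;          -- returns 92
```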

The backslash character falls into a category of ASCII characters known as ASCII Printable Characters, which basically refers to characters visible to the human eye. Table 1 shows a sample of five ASCII Printable Characters.

| Numeric Code | Character | Description |
|---|---|---|
| 33 | ! | Exclamation mark |
| 35 | # | Number sign |
| 36 | $ | Dollar sign |
| 37 | % | Percent sign |
| 38 | & | Ampersand |

When it comes to addressing data quality issues in SQL Server, it’s easy to clean most of the ASCII Printable Characters by simply applying the REPLACE function. Say for instance that source data contains an email address for John Doe that has several invalid special characters as shown in Script 2.

DECLARE @email VARCHAR(55) = 'johndoe@a!b#c.com$';

We could eliminate such characters by applying the REPLACE T-SQL function as shown in Script 3.

SELECT REPLACE(REPLACE(REPLACE(@email, '!', ''), '#', ''), '$', '');

Executing Script 3 results in a correctly formatted email address, as shown in Figure 2.
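
Nesting one REPLACE per character becomes unwieldy as the list of unwanted characters grows. A possible alternative, assuming SQL Server 2017 or later (where the TRANSLATE function is available), is to first map every unwanted character onto a single one and then remove that character with one REPLACE. This is a sketch of a different technique, not the approach used in Script 3:

```sql
DECLARE @email VARCHAR(55) = 'johndoe@a!b#c.com$';

-- TRANSLATE maps '#' and '$' onto '!', so a single REPLACE removes all three characters.
SELECT REPLACE(TRANSLATE(@email, '#$', '!!'), '!', '') AS CleanedEmail;  -- returns johndoe@abc.com
```

The second and third arguments of TRANSLATE must have the same length, which is why the characters are mapped onto '!' rather than onto an empty string.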

Replacing ASCII Control Characters

In addition to ASCII Printable Characters, the ASCII standard further defines a list of special characters collectively known as ASCII Control Characters. Such characters typically are not easy to detect (to the human eye) and thus not easily replaceable using the REPLACE T-SQL function. Table 2 shows a sample list of the ASCII Control Characters.

| Numeric Code | Character | Description |
|---|---|---|
| 0 | NUL | null |
| 1 | SOH | start of header |
| 2 | STX | start of text |
| 3 | ETX | end of text |
| 4 | EOT | end of transmission |

To demonstrate the challenge of cleaning up ASCII Control Characters, I have written a C# Console application shown in Script 4 that generates an output.txt text file that contains different variations of John Doe’s email address (only the first line has John Doe’s email address in the correct format).

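using System.IO;

// Write four variations of the email address; only the first line is clean.
using (StreamWriter writer = new StreamWriter(@"C:\temp\output.txt"))
{
    string vd = "johndoe@abc.com";
    writer.WriteLine(vd);                      // correct format
    writer.WriteLine((char)1 + vd);            // leading start of header (code 1)
    writer.WriteLine((char)9 + vd + (char)1);  // leading tab (code 9), trailing start of header
    writer.WriteLine((char)9 + vd);            // leading tab
}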

A preview of the output.txt text file populated by Script 4 is shown using the Windows Notepad.exe program in Figure 3.

As can be seen, there seem to be spaces in email addresses 2 to 4, but it’s difficult to tell whether these gaps are created by the tab character or the space character. Furthermore, if you go back to Script 4, you will recall that for the 3rd email address I included the start of header character at the end of the email address, but looking at the data in Figure 3, that character is not easily visible at the end of the 3rd email address. In fact, it looks like email addresses 3 and 4 have the same number of characters, which is not true. Only with advanced text editors such as Notepad++ are we able to visualize the special characters in the data, as shown in Figure 4.

When it comes to SQL Server, cleaning and removing ASCII Control Characters is a bit tricky. For instance, say we have successfully imported the data from the output.txt text file into a SQL Server database table. If we run the REPLACE T-SQL function against the data as we did in Script 3, we can see in Figure 5 that it was unsuccessful: the length of the data in the original column is identical to the length calculated after applying both the REPLACE and TRIM functions.

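SELECT [id],
       [Column 0],
       LEN([Column 0]) AS OriginalLength,
       -- Trimming and replacing spaces only affects CHAR(32); CHAR(1) and CHAR(9) stay in place.
       LEN(REPLACE(REPLACE(LTRIM(RTRIM([Column 0])), ' ', ''), '  ', '')) AS NewLength
FROM [SQLShack].[dbo].[OLE DB Destination];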
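
The same kind of inspection that Notepad++ offers can be approximated inside SQL Server by looking at the raw bytes of a value. The sketch below rebuilds the 3rd email address from Script 4 in a variable (the variable name and column aliases are illustrative) and casts it to VARBINARY so the hidden characters show up as hexadecimal codes:

```sql
-- Rebuild the 3rd email address: leading tab (code 9), trailing start of header (code 1).
DECLARE @sample VARCHAR(55) = CHAR(9) + 'johndoe@abc.com' + CHAR(1);

SELECT LEN(@sample)                   AS CharacterCount,  -- returns 17, not 15
       ASCII(LEFT(@sample, 1))        AS FirstCharCode,   -- returns 9
       ASCII(RIGHT(@sample, 1))       AS LastCharCode,    -- returns 1
       CAST(@sample AS VARBINARY(55)) AS RawBytes;        -- begins with 0x09 and ends with 01
```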

So how do we replace what we cannot see?

  1. Replace String using Character Codes

    The simplest way to replace what we cannot see is to stop hardcoding the literal character in the REPLACE function and instead pass its ASCII numeric code through the CHAR function. Thus, instead of providing an exclamation mark as the string to replace, we can hardcode the ASCII numeric code for the exclamation mark, which is 33, and convert that code back to a character using the CHAR function. Our script changes from:

    DECLARE @email VARCHAR(55)= 'johndoe@a!bc.com';
    SELECT REPLACE(@email, '!', '');
    

    to:

    DECLARE @email VARCHAR(55)= 'johndoe@a!bc.com';
    SELECT REPLACE(@email, CHAR(33), '');
    

    Script 6
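
    The same pattern applies to the new line and horizontal tab characters mentioned in the introduction. A minimal sketch, assuming a made-up @note value: the carriage return (code 13) is removed, while the line feed (code 10) and the tab (code 9) are replaced with a space so the words stay separated.

    ```sql
    DECLARE @note VARCHAR(100) = 'line one' + CHAR(13) + CHAR(10) + 'line two' + CHAR(9) + 'end';

    -- Remove CR, then turn LF and tab into single spaces.
    SELECT REPLACE(REPLACE(REPLACE(@note, CHAR(13), ''), CHAR(10), ' '), CHAR(9), ' ') AS Flattened;
    -- returns: line one line two end
    ```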

    Now going back to cleaning email address data out of the output.txt text file, we can rewrite our script to what is shown in Script 7.

    SELECT [id],[Column 0],LEN([Column 0]) OriginalLength,LEN(REPLACE(REPLACE([Column 0], CHAR(1), ''), CHAR(9), '')) NewLength
    FROM [SQLShack].[dbo].[OLE DB Destination];
    

    Script 7

    After executing Script 7, we can see in Figure 6 that the length of every email address row now matches the length of row 1, which held the correctly formatted email address. Thus, we have successfully removed the “invisible” special characters.

    Figure 6
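
    To try the Script 7 approach without importing output.txt, the four email address variations can be rebuilt in a table variable. This is a sketch under that assumption; @emails and its column names are hypothetical stand-ins for the imported table:

    ```sql
    -- Hypothetical stand-in for the imported table, holding the four rows written by Script 4.
    DECLARE @emails TABLE (id INT, email VARCHAR(55));

    INSERT INTO @emails (id, email)
    VALUES (1, 'johndoe@abc.com'),
           (2, CHAR(1) + 'johndoe@abc.com'),
           (3, CHAR(9) + 'johndoe@abc.com' + CHAR(1)),
           (4, CHAR(9) + 'johndoe@abc.com');

    SELECT id,
           LEN(email) AS OriginalLength,                                        -- 15, 16, 17, 16
           LEN(REPLACE(REPLACE(email, CHAR(1), ''), CHAR(9), '')) AS NewLength  -- 15 for every row
    FROM @emails
    ORDER BY id;
    ```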

  2. Dynamically Detect and Replace ASCII Characters

    One noticeable limitation of Script 7 is that we have hardcoded the list of ASCII numeric values. This means that if the email address data contained special characters with ASCII numeric value 8, we wouldn’t have removed them, because the script looks specifically for CHAR(1) and CHAR(9). Therefore, there is a need for a mechanism that automatically detects the ASCII Control Characters contained in a given string and then automatically replaces them. Script 8 provides such a mechanism in the form of a WHILE loop within a user-defined function that iterates through a given string to identify and replace ASCII Control Characters.

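    CREATE FUNCTION [dbo].[ReplaceASCII](@inputString VARCHAR(8000))
    RETURNS VARCHAR(8000) -- widened from VARCHAR(55) so longer cleaned strings are not truncated
    AS
    BEGIN
        DECLARE @badString VARCHAR(1);
        -- Walk from the last character to the first: REPLACE shortens the string,
        -- and scanning backwards ensures no shifted character is skipped.
        DECLARE @position INT = DATALENGTH(@inputString);
        WHILE @position >= 1
        BEGIN
            -- Codes below 33 cover the control characters (0-31) and the space (32).
            IF ASCII(SUBSTRING(@inputString, @position, 1)) < 33
            BEGIN
                SET @badString = CHAR(ASCII(SUBSTRING(@inputString, @position, 1)));
                SET @inputString = REPLACE(@inputString, @badString, '');
            END;
            SET @position = @position - 1;
        END;
        RETURN @inputString;
    END;
    GO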
    

    Script 8

    The application of the function is shown in Script 9.

    SELECT [id],[Column 0],LEN([Column 0]) OriginalLength,LEN([SQLShack].[dbo].[ReplaceASCII]([Column 0])) NewLength
    FROM [SQLShack].[dbo].[OLE DB Destination];
    

    Script 9
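
    A variation on the same idea, shown here only as a sketch and not as the function above, is to loop over the control character codes instead of over the positions in the string. The @value and @code names are illustrative, and code 0 is deliberately left out of this sketch:

    ```sql
    DECLARE @value VARCHAR(8000) = CHAR(9) + CHAR(1) + 'johndoe@abc.com' + CHAR(8);
    DECLARE @code INT = 1;  -- starts at 1: CHAR(0) is not handled in this sketch

    -- Strip every occurrence of each control character code from 1 to 31.
    WHILE @code <= 31
    BEGIN
        SET @value = REPLACE(@value, CHAR(@code), '');
        SET @code = @code + 1;
    END;

    SELECT @value AS CleanedValue, LEN(@value) AS NewLength;  -- johndoe@abc.com, 15
    ```

    The loop always runs 31 times regardless of the string length, and unlike a check for codes below 33 it leaves ordinary spaces untouched.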

Conclusion

Every now and then T-SQL developers are faced with cleaning the data they have imported, usually by applying the REPLACE T-SQL function. However, removing ASCII Control Characters can be tricky and frustrating. Fortunately, SQL Server ships with additional built-in functions such as CHAR and ASCII that can assist in automatically detecting and replacing ASCII Control Characters.

Translated from: https://www.sqlshack.com/replace-ascii-special-characters-sql-server/
