sed in Depth

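To my mind, the hardest questions you eventually face once you push sed far enough are these:

How fast is sed when substituting across a million lines of text?

How can sed serve as an ETL tool, hooked up to MySQL, Oracle and the like for interactive work?

Can sed fail, and if so how should that be handled, for example when a run over a million records breaks?

And all of that is only the beginning!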

Substitute - the s command in detail

sed 's/pattern/replacement/' inputfile

That is the classic usage.

In practice, though, it does not behave quite the way we might expect:


[root@centos00 _data]# cat hw.txt
this is the profession tool on the professional platform
this is the man on the earth
[root@centos00 _data]# sed 's/the/a/' hw.txt
this is a profession tool on the professional platform
this is a man on the earth
[root@centos00 _data]#

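Even though the pattern matches several times on a line, s replaced only the first match on each line.

That is what the flags of the s command are for:

s/pattern/replacement/flag

A number: replace only that occurrence of the pattern on each line;

g: replace every occurrence of the pattern;

p: print the pattern space when a substitution was made (normally combined with -n, so that only the changed lines are shown);

w filename: write the lines on which a substitution was made to the given file.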
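The four flags can be tried side by side. The following is only a sketch: it rebuilds the two-line hw.txt in a scratch directory (the directory and the out.txt file name are made up for illustration) and captures what each flag produces.

```shell
# Recreate the two-line sample file in a scratch directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' \
  'this is the profession tool on the professional platform' \
  'this is the man on the earth' > "$tmp/hw.txt"

# Numeric flag: replace only the 2nd match on each line.
second=$(sed 's/the/a/2' "$tmp/hw.txt")

# g flag: replace every match on each line.
all=$(sed 's/the/a/g' "$tmp/hw.txt")

# p flag with -n: print only the lines on which a substitution happened.
changed=$(sed -n 's/platform/PLATFORM/p' "$tmp/hw.txt")

# w flag: additionally write the changed lines to a file.
sed 's/man/woman/w '"$tmp/out.txt" "$tmp/hw.txt" > /dev/null
written=$(cat "$tmp/out.txt")

printf '%s\n' "$second" "$all" "$changed" "$written"
rm -rf "$tmp"
```

Only the line containing man ends up in out.txt, which shows that w records changed lines rather than the whole output.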

Replacing every match of the pattern:


[root@centos00 _data]# sed 's/the/a/g' hw.txt
this is a profession tool on a professional platform
this is a man on a earth

Writing the result to another text file:


[root@centos00 _data]# sed 's/the/a/w dts.txt' hw.txt
this is a profession tool on the professional platform
this is a man on the earth
[root@centos00 _data]# cat dts.txt
this is a profession tool on the professional platform
this is a man on the earth
[root@centos00 _data]#

Changing the delimiter:


[root@centos00 _data]# sed 's!/bin/bash!/bin/csh!' /etc/passwd
root:x:0:0:root:/root:/bin/csh
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync

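A ! works as the delimiter too. Because / is also the path separator, sticking with / means escaping every slash with a \, and the result is hard to read.

An @ can serve as the delimiter as well: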


[root@centos00 _data]# sed 's@/bin/bash@/bin/csh@' /etc/passwd
root:x:0:0:root:/root:/bin/csh
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync

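Which raises the question: how many characters can actually serve as the delimiter?

According to the official documentation, any single character can (POSIX excludes only backslash and newline). Whatever character follows s is taken as the delimiter: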

https://www.gnu.org/software/sed/manual/html_node/The-_0022s_0022-Command.html


[root@centos00 _data]# sed 's6a6the6g' dts.txt
this is the profession tool on the professionthel plthetform
this is the mthen on the etherth
[root@centos00 _data]#

See, it works. The first character after the s command, here the digit 6, is used as the delimiter.
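To see why this matters for paths, here is a small sketch comparing the default delimiter with two alternatives. The passwd-style line is made up for illustration.

```shell
# Sample passwd-style line (hypothetical user, for illustration only).
line='alice:x:1000:1000::/home/alice:/bin/bash'

# Default delimiter: every / inside the pattern and replacement must be escaped.
with_slash=$(printf '%s\n' "$line" | sed 's/\/bin\/bash/\/bin\/csh/')

# Alternative delimiters: the paths can be written as they are.
with_pipe=$(printf '%s\n' "$line" | sed 's|/bin/bash|/bin/csh|')
with_comma=$(printf '%s\n' "$line" | sed 's,/bin/bash,/bin/csh,')

printf '%s\n' "$with_slash" "$with_pipe" "$with_comma"
```

All three commands produce the same line; the only difference is how readable the script is.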

The following write-up goes a little deeper:


There are two levels of interpretation here: the shell, and sed.

In the shell, everything between single quotes is interpreted literally, except for single quotes themselves. You can effectively have a single quote between single quotes by writing '\'' (close single quote, one literal single quote, open single quote).

Sed uses basic regular expressions. In a BRE, in order to have them treated literally, the characters $.*[\]^ need to be quoted by preceding them by a backslash, except inside character sets ([…]). Letters, digits and (){}+?| must not be quoted (you can get away with quoting some of these in some implementations). The sequences \(, \), \n, and in some implementations \{, \}, \+, \?, \| and other backslash+alphanumerics have special meanings. You can get away with not quoting $^] in some positions in some implementations.

Furthermore, you need a backslash before / if it is to appear in the regex outside of bracket expressions. You can choose an alternative character as the delimiter by writing, e.g., s~/dir~/replacement~ or \~/dir~p; you'll need a backslash before the delimiter if you want to include it in the BRE. If you choose a character that has a special meaning in a BRE and you want to include it literally, you'll need three backslashes; I do not recommend this, as it may behave differently in some implementations.

In a nutshell, for sed 's/…/…/':

Write the regex between single quotes.

Use '\'' to end up with a single quote in the regex.

Put a backslash before $.*/[\]^ and only those characters (but not inside bracket expressions).

Inside a bracket expression, for - to be treated literally, make sure it is first or last ([abc-] or [-abc], not [a-bc]).

Inside a bracket expression, for ^ to be treated literally, make sure it is not first (use [abc^], not [^abc]).

To include ] in the list of characters matched by a bracket expression, make it the first character (or first after ^ for a negated set): []abc] or [^]abc] (not [abc]] nor [abc\]]).

In the replacement text:

& and \ need to be quoted by preceding them by a backslash, as do the delimiter (usually /) and newlines.

\ followed by a digit has a special meaning. \ followed by a letter has a special meaning (special characters) in some implementations, and \ followed by some other character means \c or c depending on the implementation.

With single quotes around the argument (sed 's/…/…/'), use '\'' to put a single quote in the replacement text.

If the regex or replacement text comes from a shell variable, remember that

The regex is a BRE, not a literal string.

In the regex, a newline needs to be expressed as \n (which will never match unless you have other sed code adding newline characters to the pattern space). But note that it won't work inside bracket expressions with some sed implementations.

In the replacement text, &, \ and newlines need to be quoted.

The delimiter needs to be quoted (but not inside bracket expressions).

Use double quotes for interpolation: sed -e "s/$BRE/$REPL/".
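The advice about shell variables can be turned into two small helpers. This is a sketch under stated assumptions, not a complete solution: escape_bre and escape_repl are names made up here, they assume / is the delimiter of the final s command, and they assume single-line values (embedded newlines are not handled).

```shell
# Escape a literal string so it can be used as a BRE between / delimiters.
# (escape_bre is a hypothetical helper name; single-line input is assumed.)
escape_bre() {
  printf '%s\n' "$1" | sed 's:[][\\/.^$*]:\\&:g'
}

# Escape a literal string so it can be used as replacement text between / delimiters.
# (escape_repl is a hypothetical helper name; single-line input is assumed.)
escape_repl() {
  printf '%s\n' "$1" | sed 's:[\\/&]:\\&:g'
}

search='1.5*(a/b)'
replace='R&D/ops'

# Double quotes let the shell interpolate the escaped values into the script.
result=$(printf '%s\n' 'cost = 1.5*(a/b) total' |
  sed "s/$(escape_bre "$search")/$(escape_repl "$replace")/")

# Without escaping, the dot is a wildcard and matches any character.
unescaped=$(printf '%s\n' '1x5' | sed 's/1.5/X/')

printf '%s\n' "$result" "$unescaped"
```

The helpers use : as their own delimiter so that / can sit inside the bracket expression without any special treatment.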

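Using addresses

Line addressing:

The first form is numeric addressing: an explicit line number such as 1, 2 or 4 selects the line the command applies to: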


[root@centos00 _data]# sed '1s6a6the6g' dts.txt
this is the profession tool on the professionthel plthetform
this is a man on the earth
[root@centos00 _data]# sed '2s6a6the6g' dts.txt
this is a profession tool on the professional platform
this is the mthen on the etherth
[root@centos00 _data]#

The second form uses a regular expression, which is of course more flexible:


[root@centos00 _data]# sed '/platform/s6a6the6g' dts.txt
this is the profession tool on the professionthel plthetform
this is a man on the earth
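The address forms can be compared on a small made-up input. The sketch below also uses a line range (2,3) and the $ address for the last line, which the text above does not show but which follow the same syntax.

```shell
# Four-line sample input (made up for illustration).
input=$(printf '%s\n' 'alpha one' 'beta two' 'gamma three' 'delta four')

# Single line number: change only line 2.
by_number=$(printf '%s\n' "$input" | sed '2s/two/2/')

# Line range: change lines 2 through 3.
by_range=$(printf '%s\n' "$input" | sed '2,3s/a/A/g')

# Regex address: change only lines that contain "gamma".
by_regex=$(printf '%s\n' "$input" | sed '/gamma/s/three/3/')

# $ addresses the last line of input.
last_only=$(printf '%s\n' "$input" | sed '$s/four/4/')

printf '%s\n' "$by_number" "$by_range" "$by_regex" "$last_only"
```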

Running several commands on the same address:


[root@centos00 _data]# sed '/platform/{
s6a6the6g
s6on6above6g
}
' dts.txt
this is the professiabove tool above the professiabovethel plthetform
this is a man on the earth
[root@centos00 _data]# sed '
/platform/
{s6a6the6g
s6on6above6g
}
' dts.txt
sed: -e expression #1, char 11: unknown command: `
'
[root@centos00 _data]#

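Single commands were covered above; applying several commands to the same address works a little differently. Where the braces go matters: the opening { must follow the address on the same line, otherwise sed takes the line break after /platform/ as the command and reports an unknown command. As Capote put it, a single misplaced punctuation mark can change the meaning of a sentence, so take care here.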
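Two layouts that keep the opening brace attached to the address are sketched below. The input line is the one from dts.txt; the use of several -e fragments is an alternative I am adding here, not something the transcripts above show.

```shell
line='this is a profession tool on the professional platform'

# Form 1: opening brace on the same line as the address, one command per line.
multi_line=$(printf '%s\n' "$line" | sed '/platform/{
s/a/the/g
s/on/above/g
}')

# Form 2: the same group split across several -e fragments,
# which sed joins with newlines before parsing.
multi_e=$(printf '%s\n' "$line" |
  sed -e '/platform/{' -e 's/a/the/g' -e 's/on/above/g' -e '}')

printf '%s\n' "$multi_line" "$multi_e"
```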

The official documentation has a section on how sed works, which I find quite interesting:

6.1 How sed Works
sed maintains two data buffers: the active pattern space, and the auxiliary hold space. Both are initially empty.

sed operates by performing the following cycle on each line of input: first, sed reads one line from the input stream, removes any trailing newline, and places it in the pattern space. Then commands are executed; each command can have an address associated to it: addresses are a kind of condition code, and a command is only executed if the condition is verified before the command is to be executed.

When the end of the script is reached, unless the -n option is in use, the contents of pattern space are printed out to the output stream, adding back the trailing newline if it was removed. Then the next cycle starts for the next input line.

Unless special commands (like ‘D’) are used, the pattern space is deleted between two cycles. The hold space, on the other hand, keeps its data between cycles (see commands ‘h’, ‘H’, ‘x’, ‘g’, ‘G’ to move data between both buffers).

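While processing text line by line, sed keeps two buffers: the pattern space and the hold space.

The pattern space holds the current line with its trailing newline removed. Once the script has run for that line, the pattern space is printed (unless -n is given), emptied, and refilled with the next line. It is a scratch area whose contents are discarded at every new cycle.

The hold space, by contrast, keeps its contents from one cycle to the next. sed never fills it on its own; it holds only what commands such as h, H and x put there, for example the previous line.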
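A minimal sketch of the two buffers working together: every line is saved into the hold space with h, so that when a line containing ERROR arrives, the previous line can be swapped back in with x and printed first. The log lines are made up for illustration.

```shell
# Print each ERROR line together with the line that preceded it.
with_context=$(printf '%s\n' 'init' 'start job' 'ERROR disk full' 'cleanup' 'done' 'ERROR timeout' |
  sed -n '
/ERROR/{
# swap in the previous line (saved in the hold space) and print it
x
p
# swap back and print the current line
x
p
}
# save the current line for the next cycle
h
')

printf '%s\n' "$with_context"
```

The pattern space alone could not do this, because the previous line is gone by the time the ERROR line is read.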

The advanced section that follows makes gradual use of both the pattern space and the hold space.

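Advanced sed

#### Multiline commands

Searching for a pattern across an entire text file means dealing with multiple lines. The pattern may not sit on a single line: it may be split across two adjacent lines, or the search may need to cover a wider range, up to the whole document. That is why multiline handling is a must.

A hard-coded multiline example, stepping through lines with n;n;…: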


[root@centos00 _data]# sed  '{/professional/{n;d}}' dts.txt
this is a profession tool on the professional platform
this is a man on the earth

i like better man
[root@centos00 _data]#

This finds the line containing professional and deletes the line that follows it.

Here n; only serves to make the targeting more precise. Without n;, deleting blank lines with ^$ would remove every blank line in the file:



[root@centos00 _data]# sed  '{/^$/d}' dts.txt
this is a profession tool on the professional platform
this is a man on the earth
i like better man
[root@centos00 _data]#
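The difference between the two deletions is easier to see on a small input where the blank lines are explicit. This is a sketch on made-up data, with a ; before the closing brace so that it does not depend on GNU sed.

```shell
# Five lines: three words separated by empty lines.
input=$(printf '%s\n' 'first' '' 'second' '' 'third')

# n prints the matching line and loads the next one; d then deletes only that line.
after_match=$(printf '%s\n' "$input" | sed '/first/{n;d;}')

# /^$/ matches every empty line, so all of them are deleted.
no_blank=$(printf '%s\n' "$input" | sed '/^$/d')

printf '%s\n' "$after_match" '---' "$no_blank"
```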

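A regular expression appeared here, so a few words on that:

A regular expression is a tool that filters text through pattern matching.

Linux tools use two flavors of regular expression:

BRE - Basic Regular Expressions

ERE - Extended Regular Expressions

sed uses BRE by default, and a fairly small subset of it at that, which keeps matching fast but limits what a pattern can express (GNU sed switches to ERE with -E);

gawk uses ERE. It is the heavy-weaponry editing tool (in fact a programming language), so its expressions are richer, though it may be slower.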
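The practical difference between the two flavors shows up in grouping and repetition. The sketch below assumes a sed that accepts -E (GNU and BSD sed do; some older versions only know -r), and the input line is made up.

```shell
line='error: code=404 code=500'

# BRE: grouping and interval braces are written with backslashes.
bre=$(printf '%s\n' "$line" | sed 's/code=\([0-9]\{3\}\)/status \1/g')

# ERE (-E): the same expression without the backslashes.
ere=$(printf '%s\n' "$line" | sed -E 's/code=([0-9]{3})/status \1/g')

printf '%s\n' "$bre" "$ere"
```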

Anchor characters:

Start of line: ^

End of line: $

Blank line: ^$

Multiline matching



[root@centos00 Documents]# sed '/first/{N;s/\n/ /;s/line/user/g}' MultiLine.txt
this is the header line
this is the first user  this is the second user
this is the third line
this is the end
[root@centos00 Documents]# sed '/first/{N;s/\n/ /;s/first.*second/user/g}' MultiLine.txt
this is the header line
this is the user line
this is the third line
this is the end
[root@centos00 Documents]#

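In the first example we find the line containing first, then append the next line to it (both now sit in the pattern space), and replace the embedded newline (\n) with a space, since otherwise the two lines would still be printed as two lines. Finally every line is replaced with user.

The second example is more interesting: besides joining the two lines, it uses the wildcard .* to replace everything from first to second, which amounts to a search across two lines.

Three lines can be searched together in the same way: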



[root@centos00 Documents]# sed '/first/{N;N;s/\n/ /g;s/first.*third/user/g}' MultiLine.txt
this is the header line
this is the user line
this is the end
[root@centos00 Documents]#

Now imagine doing the same for the whole text file.
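One common way to do that, sketched here with a loop built from a label and the b command, is to keep appending lines with N until the last line has been read and only then substitute. The label name a is arbitrary, and the input reproduces the five lines of MultiLine.txt.

```shell
joined=$(printf '%s\n' \
  'this is the header line' \
  'this is the first line' \
  'this is the second line' \
  'this is the third line' \
  'this is the end' |
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' -e 's/first.*third/user/')

printf '%s\n' "$joined"
```

$!ba jumps back to :a on every line except the last, so the substitutions run once on the whole file. Since the entire input ends up in the pattern space, this is best kept for files that fit comfortably in memory.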

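Reversing the order of lines

Reversing the lines of a text file requires two concepts:

The hold space

The negation command !

The hold space is an interesting idea. Like the pattern space, sed uses it to store temporary data. Unlike the pattern space, which is emptied before the next line is read, the hold space keeps its data for longer. Data can also be moved between the two.

The hold space commands of the sed editor:

h: copy the pattern space to the hold space

H: append the pattern space to the hold space

g: copy the hold space to the pattern space

G: append the hold space to the pattern space

x: exchange the contents of the pattern space and the hold space

Reversing the lines of a file: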



[root@centos00 Documents]# cat seqnumber.txt
1
2
3
4
5
6
[root@centos00 Documents]# sed -n '{G;h;s/\n//g;$p}' seqnumber.txt
654321
[root@centos00 Documents]#

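In this example, G;h; are the pattern space and hold space commands that move data between the two buffers.

Pay particular attention to the $ in $p. Any command can be prefixed with an address, and $ addresses the last line of input.

The negation command:

It does two things: lines that match the address do not run the command, and lines that do not match the address do run it.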


[root@centos00 Documents]# sed -n '{G;h;$p}' seqnumber.txt
6
5
4
3
2
1

[root@centos00 Documents]# sed -n '{1!G;h;$p}' seqnumber.txt
6
5
4
3
2
1
[root@centos00 Documents]#

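1!G means that G is skipped on the first line only. When the first line is read the hold space is still empty (see the first result, which ends with an empty line), so only h runs there. Every other line runs G;h; in turn, and the last line additionally runs p.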
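The same reversal can be written without -n by deleting every pattern space except the last one. The sketch below shows both variants plus a plain use of ! on its own; the variable names are only for illustration.

```shell
numbers=$(printf '%s\n' 1 2 3 4 5 6)

# Variant from the text: suppress automatic printing, print only on the last line.
reversed=$(printf '%s\n' "$numbers" | sed -n '1!G;h;$p')

# Variant without -n: delete on every line that is not the last one.
reversed_alt=$(printf '%s\n' "$numbers" | sed '1!G;h;$!d')

# Negation by itself: print every line except the first.
without_first=$(printf '%s\n' "$numbers" | sed -n '1!p')

printf '%s\n' "$reversed" '---' "$reversed_alt" '---' "$without_first"
```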

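Changing the flow:

The branch command:

[address]b [label]

[address] is an address expression, and label is a marker that names the position in the script from which execution continues.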


[root@centos00 Documents]# cat MultiLine.txt
this is the header line
this is the first line
this is the second line
this is the third line
this is the end
[root@centos00 Documents]# sed '{ /second/bchg;s/[ ]is[ ]/ was /g;:chg s/line/user/ }' MultiLine.txt
this was the header user
this was the first user
this is the second user
this was the third user
this was the end
[root@centos00 Documents]#

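Note that the commands still run in order; lines matching the address simply skip ahead to the label. In the script above, is is replaced by was only on lines that do not contain second, while the replacement of line by user runs on every line.

For readability, a space may be placed between b and the label in [address]b [label]: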


[root@centos00 Documents]# sed '{ /second/b chg;s/[ ]is[ ]/ was /g;:chg s/line/user/ }' MultiLine.txt
this was the header user
this was the first user
this is the second user
this was the third user
this was the end
[root@centos00 Documents]#

If no label follows the branch command, matching lines skip all remaining commands and jump straight to the end of the script, so nothing is done to them!


[root@centos00 Documents]# sed '{ /second/b;s/[ ]is[ ]/ was /g;:chg s/line/user/ }' MultiLine.txt
this was the header user
this was the first user
this is the second line
this was the third user
this was the end
[root@centos00 Documents]#

A label does not have to be at the end; it can also be placed before the commands, in which case branching to it creates a loop:


[root@centos00 Documents]# echo 'this,is,a,header,line,' | sed ':rmc s/,/ / ; b rmc ;'
^C
[root@centos00 Documents]# echo 'this,is,a,header,line,' | sed ':rmc s/,/ / ; /,/b rmc ;'
this is a header line
[root@centos00 Documents]#

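To avoid an infinite loop, add a condition, for example whether anything is left to replace (whether a comma remains), so that the loop stops.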
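Writing a label and further commands on one line, as in the transcript, relies on how a particular sed parses label names. A more portable sketch gives the label and the branch their own -e fragments; the g flag version is included only to confirm that the loop reaches the same result.

```shell
# Each -e fragment ends at the label name, so nothing after ":rmc" or "b rmc"
# can be mistaken for part of the label.
looped=$(printf '%s\n' 'this,is,a,header,line,' |
  sed -e ':rmc' -e 's/,/ /' -e '/,/b rmc')

# For this particular task the g flag does the same job without a loop.
with_g=$(printf '%s\n' 'this,is,a,header,line,' | sed 's/,/ /g')

printf '[%s]\n' "$looped" "$with_g"
```

The brackets in the final printf make the trailing space visible, which is what the last comma turns into.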

The test command:


[root@centos00 Documents]# cat sed_t.sed
{
s/second/sec/
t
s/[ ]is[ ]/ was /;
}
[root@centos00 Documents]# sed -f sed_t.sed MultiLine.txt
this was the header line
this was the first line
this is the sec line
this was the third line
this was the end
[root@centos00 Documents]#

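The test command gives an if-then-else structure:

if s/second/sec/

else s/[ ]is[ ]/ was /

If s/second/sec/ made no substitution, s/[ ]is[ ]/ was / is run.

t is written in the same style as b:

[address]t [label]

The difference is in what triggers the jump. b always branches on the lines its address selects, whereas t branches only if an s/// command has made a substitution since the current line was read (or since the last t):

s/second/sec/

t [label]

That is the complete form. The earlier example left out the label, so t jumps to the end of the script and nothing further happens to that line.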


[root@centos00 Documents]# cat sed_t_header.sed
{
s/header/beginning/
t chg
s/line/user/
:chg
s/beginning/beginning header/
}
[root@centos00 Documents]# sed -f sed_t_header.sed MultiLine.txt
this is the beginning header line
this is the first user
this is the second user
this is the third user
this is the end
[root@centos00 Documents]#

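Note that in a script using t the commands also run in order. The command after :chg is applied to every line; it simply has no effect on lines that do not contain beginning.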
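Two typical uses of t, as a sketch on made-up input: an if/else that tags lines, and a loop that keeps substituting until nothing changes any more (here, inserting thousands separators into a number at the start of the line). The label name a is arbitrary.

```shell
# if/else: lines starting with "fail:" get one tag, all other lines get another.
classified=$(printf '%s\n' 'ok: backup' 'fail: sync' |
  sed -e 's/^fail:/[ERROR]/' -e 't' -e 's/^/[info] /')

# Loop: t branches back to :a for as long as the substitution keeps succeeding.
grouped=$(printf '%s\n' '1234567' '999' '1000' |
  sed -e ':a' -e 's/^\([0-9][0-9]*\)\([0-9]\{3\}\)/\1,\2/' -e 'ta')

printf '%s\n' "$classified" "$grouped"
```

Because t resets its flag each time it is evaluated, the loop ends on the first pass where the substitution fails.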

Pattern substitution

The ampersand (&) operator


[root@centos00 Documents]# echo 'the cat is sleeping in his hat' | sed 's/.at/"&"/g'
the "cat" is sleeping in his "hat"
[root@centos00 Documents]#

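The . stands for any single character, so both cat and hat match. & represents the whole string matched by the pattern, which here gets wrapped in double quotes.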
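Two more small cases, sketched on made-up input: & reused around a match of varying length, and a literal ampersand, which has to be escaped in the replacement.

```shell
# Wrap every number in brackets; & stands for whatever the pattern matched.
bracketed=$(printf '%s\n' 'ports 80 and 443' | sed 's/[0-9][0-9]*/[&]/g')

# A literal ampersand in the replacement must be written as \&.
literal=$(printf '%s\n' 'R and D' | sed 's/ and / \& /')

printf '%s\n' "$bracketed" "$literal"
```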

Referring to sub-patterns with ()


[root@centos00 Documents]# sed 's/this\(.*line\)/that\1/;p;' -n MultiLine.txt
that is the header line
that is the first line
that is the second line
that is the third line
this is the end
[root@centos00 Documents]#

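The interesting part is that \1, \2, \3 and so on (up to \9) refer to the sub-patterns marked with \( \), numbered from left to right. In the replacement, text referenced through \1, \2… is kept as it was, while the rest of the match that is not referenced is replaced.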
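Back-references also allow the captured pieces to be reordered. A sketch with two made-up inputs, one swapping two fields and one rearranging a date:

```shell
# Swap "last, first" into "first last".
swapped=$(printf '%s\n' 'Capote, Truman' | sed 's/^\([^,]*\), *\(.*\)$/\2 \1/')

# Rearrange year-month-day into day/month/year (the / in the replacement is escaped).
date_dmy=$(printf '%s\n' '2018-09-14' |
  sed 's/^\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)$/\3\/\2\/\1/')

printf '%s\n' "$swapped" "$date_dmy"
```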

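A worked case:

Adding a line number to every line (the trailing 6 and 7 in the output indicate that the file ends with two empty lines):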


[root@centos00 Documents]# cat MultiLine.txt
this is the header line
this is the first line
this is the second line
this is the third line
this is the end
[root@centos00 Documents]# sed ' = ' MultiLine.txt | sed 'N;s/\n//g'
1this is the header line
2this is the first line
3this is the second line
4this is the third line
5this is the end
6
7
[root@centos00 Documents]#
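Replacing the newline with a separator instead of an empty string keeps the number apart from the text. A sketch with a three-line made-up input and ": " chosen as the separator:

```shell
# The first sed prints each line number on its own line before the line itself;
# the second sed joins every number/line pair with ": ".
numbered=$(printf '%s\n' 'this is the header line' 'this is the first line' 'this is the end' |
  sed = | sed 'N;s/\n/: /')

printf '%s\n' "$numbered"
```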

From the ITPUB blog, link: http://blog.itpub.net/31553767/viewspace-2214255/. Please credit the source when reposting; otherwise legal action will be pursued.

