Linux Journey 9: Regular Expressions and File Formatting

August 16, 2021

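Common regular expressions

What is a regular expression

A regular expression (regex for short) can be thought of as a tiny markup language: a set of symbols that lets you flexibly define a matching pattern, apply it to a target string, pull out the part you want, and then move on to the next processing step.

Its purpose is quite specific: string matching. Programs that use regexes usually also offer replacement or deletion on top of matching, but those can all be seen as actions taken after the regex has produced a match.

As I said earlier, wildcards can be seen as a stripped-down regex, since both define a pattern to match against. Wildcards are far simpler, though, while regexes have a complete and complex syntax. Honestly, getting a solid grip on regexes is not easy for a beginner.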

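What regexes are good for

Regexes are nonetheless extremely useful and will save you a lot of work. The most common case in web programming is validating an input field: is it a valid phone number, a valid ID card number, and so on. Writing a string checking function from scratch for each of these is hard to imagine, whereas if you know regexes you only need to write a pattern following a few rules, or, more often, copy one from the web.

Another example is a small tool I use a lot, EasyPub, which converts txt e-books into mobi, the Kindle's proprietary format. During conversion it detects chapters automatically and splits on them, but sometimes the chapter headings in a txt file are not the usual 第一章 xxx (Chapter 1 xxx) but something like ### 1 xxx ###, and the default chapter splitting stops working. Fortunately the tool supports splitting chapters by regex, so as long as you know regexes and the chapter headings follow some pattern, anything can be handled.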

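How to learn

This article approaches the topic through Bash commands that support regular expressions, showing how some simple regexes are used in practice. If you want to study regexes systematically, the major vendors each publish their own regex tutorial or reference manual, so pick whichever you like.

If you want to find out where a regex you wrote goes wrong, you can use the tool debuggex.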

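Basic regular expressions

Regex syntax is divided into a basic part and an extended part (which can be seen as the advanced syntax). Most mainstream programming languages I have used, such as JavaScript, Java, PHP, and Python, support the full regex syntax, but some Linux tools do not by default: grep, for example, only understands the basic syntax unless you pass the extra option -E (grep -E). The syntax introduction here is therefore split into basic and extended parts as well.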

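How encoding affects regular expressions

Some regex syntax depends on how characters are ordered. [a-z], for example, means any one character between a and z in that ordering. In standard ASCII the range is exactly the lowercase letters, because they are contiguous in ASCII (codes 97 to 122), and the same holds when characters are ordered by code point in an ASCII-compatible encoding such as UTF-8. The ordering can also come from the locale's collation rules, however. Under a locale such as zh_TW.big5 (Traditional Chinese), the order is:

0 1 2 3 4 ... a A b B c C d D ... z Z

Under such an ordering, [a-z] covers a A b B c C d D ... z, which is clearly not what was intended.

For more on the ASCII character set, see ASCII.

Fortunately, older encodings such as GB2312 and big5 are used less and less, and operating systems and applications increasingly use UTF-8, a newer, unified encoding that is compatible with ASCII, so most of the time there is nothing to worry about. If in doubt, setting LC_ALL=C for the command forces plain ASCII ordering.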

Besides writing [a-z] to specify lowercase letters, you can also use the predefined character classes:

Character class	Definition
`[:lower:]`	Lower-case letters, i.e. a-z
`[:upper:]`	Upper-case letters, i.e. A-Z
`[:alpha:]`	All letters, i.e. a-zA-Z (Google's parent company is called `Alphabet`; `alpha` here is short for alphabetic)
`[:digit:]`	Digits, i.e. 0-9
`[:alnum:]`	Letters and digits (alphabetic + numeric)
`[:blank:]`	Blank characters: space and tab
`[:cntrl:]`	Control characters, including `CR, LF, Tab, Del`, etc.
`[:punct:]`	Punctuation characters, such as `! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ { | } ~` and the backtick
`[:graph:]`	Graphical characters: `[:alnum:]` plus `[:punct:]`, i.e. letters, digits, and punctuation (characters with a visible glyph, as opposed to whitespace; for Unicode character sets this may include some unusual characters such as smiley faces or chess pieces)
`[:print:]`	Printable characters: `[:alnum:]`, `[:punct:]`, and space
`[:space:]`	Whitespace characters: tab, newline, vertical tab, form feed, carriage return, and space
`[:xdigit:]`	Hexadecimal digits: `0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f`

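Note that these classes work because the locale carries its own database marking which characters are digits, which are letters, which are printable, and so on, so the actual result may differ from one character set to another.

For the same reason, classes have an extra advantage over spelling out 0-9 or a-z: even if lowercase letters are not contiguous in the current ordering and uppercase letters are interleaved, a class such as [:lower:] still matches correctly.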
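As a small illustration of the two spellings, the sketch below extracts the lowercase letters from a made-up sample string, once with the class and once with the range under a forced C locale. The variable names and the sample text are hypothetical, and grep -o (print only the matched text) is assumed to be available, as it is in GNU and BSD grep.

```shell
# Hypothetical sample text; only 'a' and 'z' are lowercase letters.
sample='aB3 z'

# POSIX class: does not depend on how the locale orders letters.
lower_class=$(printf '%s\n' "$sample" | grep -o '[[:lower:]]' | tr -d '\n')

# Range form, with the C locale forced so the range follows ASCII order.
lower_range=$(printf '%s\n' "$sample" | LC_ALL=C grep -o '[a-z]' | tr -d '\n')

echo "class: $lower_class  range: $lower_range"
```

Both variables should end up holding az; the difference is that only the first spelling stays correct without the LC_ALL=C override.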

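Some more advanced grep options

We covered some grep usage earlier: it can select the matching lines or, inverted, the non-matching ones. grep can also print the line number of each match: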

[icexmoon@xyz ~]$ dmesg | grep -n ens33
1821:[   19.585777] IPv6: ADDRCONF(NETDEV_UP): ens33: link is not ready
1822:[   19.620521] IPv6: ADDRCONF(NETDEV_UP): ens33: link is not ready
1828:[   21.601791] e1000: ens33 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
1829:[   21.608352] IPv6: ADDRCONF(NETDEV_CHANGE): ens33: link becomes ready
1830:[   21.708661] e1000: ens33 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

It can even print the context around each match:

[icexmoon@xyz ~]$ dmesg | grep -n ens33 -A 2 -B2
1819-[   16.342351] Bluetooth: BNEP socket layer initialized
1820-[   19.334452] ip6_tables: (C) 2000-2006 Netfilter Core Team
1821:[   19.585777] IPv6: ADDRCONF(NETDEV_UP): ens33: link is not ready
1822:[   19.620521] IPv6: ADDRCONF(NETDEV_UP): ens33: link is not ready
1823-[   19.626516] Ebtables v2.0 registered
1824-[   19.820016] Netfilter messages via NETLINK v0.30.
--
1826-[   20.609962] nf_conntrack version 0.5.0 (7778 buckets, 31112 max)
1827-[   21.259096] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
1828:[   21.601791] e1000: ens33 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
1829:[   21.608352] IPv6: ADDRCONF(NETDEV_CHANGE): ens33: link becomes ready
1830:[   21.708661] e1000: ens33 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
1831-[   26.904983] tun: Universal TUN/TAP device driver, 1.6
1832-[   26.904986] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>

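The first two matching lines are 1821 and 1822, but with -A (After) and -B (Before) the output also includes lines 1819 and 1820 as well as 1823 and 1824. In the output, matching lines carry a : after the line number, context lines carry a -, and separate groups are divided by --. This helps when you need the surrounding context to analyze a problem.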
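The same options can be tried without dmesg. This is a minimal sketch on five invented lines (the text and variable names are assumptions, not real kernel output), asking for one line of context on each side:

```shell
# Five made-up lines standing in for a log; only the third mentions ens33.
log_lines=$(printf '%s\n' 'line one' 'line two' 'ens33 up' 'line four' 'line five')

# -n numbers the lines, -B 1 / -A 1 add one line of context before and after.
context=$(printf '%s\n' "$log_lines" | grep -n -B 1 -A 1 'ens33')

echo "$context"
# 2-line two
# 3:ens33 up
# 4-line four
```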

Basic regular expression practice

We practice on the text file provided by 《鸟哥的私房菜》 (VBird's Linux book):

[icexmoon@xyz ~]$ cd /tmp
[icexmoon@xyz tmp]$ wget http://linux.vbird.org/linux_basic/0330regularex/regular_express.txt
--2021-08-16 15:38:50--  http://linux.vbird.org/linux_basic/0330regularex/regular_express.txt
正在解析主机 linux.vbird.org (linux.vbird.org)... 140.116.44.180
正在连接 linux.vbird.org (linux.vbird.org)|140.116.44.180|:80... 已连接。
已发出 HTTP 请求，正在等待回应... 200 OK
长度：650 [text/plain]
正在保存至: “regular_express.txt”

100%[===========================================================================================>] 650         --.-K/s 用时 0s

2021-08-16 15:38:56 (21.5 MB/s) - 已保存 “regular_express.txt” [650/650])

Simple search

[icexmoon@xyz tmp]$ grep -n the regular_express.txt
8:I can't finish the test.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
18:google is the best tools for search keyword.

Inverted match

[icexmoon@xyz tmp]$ grep -nv the regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
5:However, this dress is about $ 3183 dollars.
6:GNU is free air not free beer.
7:Her hair is very beauty.
9:Oh! The soup taste good.
10:motorcycle is cheap than car.
11:This window is clear.
13:Oh!  My god!
14:The gd software is a library for drafting programs.
17:I like dog.
19:goooooogle yes!
20:go! go! Let's go.
21:# I am VBird
22:

Using []

In a regex, [] specifies one character from a given set. For example, to find the strings tast or test:

[icexmoon@xyz tmp]$ grep -n 't[ae]st' regular_express.txt
8:I can't finish the test.
9:Oh! The soup taste good.
[icexmoon@xyz tmp]$

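t[ae]st means the second character is a or e, so it matches tast or test.

In my testing, running the command above without wrapping the regex in ' returned only one line. A likely cause is that the shell treats an unquoted t[ae]st as a glob and expands it to a matching file name in the current directory before grep ever sees it, so it is best to always quote a regex.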
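The effect of leaving a pattern unquoted can be reproduced in isolation. The sketch below assumes, purely for illustration, that the working directory contains a file named test; it uses a throwaway directory and echo to show what the shell would hand to grep in each case.

```shell
# Throwaway directory containing a single file named "test" (an assumption
# made for this demonstration, not something taken from the original session).
workdir=$(mktemp -d)
touch "$workdir/test"

# Unquoted: the shell expands the bracket glob to the matching file name.
unquoted=$(cd "$workdir" && echo t[ae]st)

# Quoted: the pattern reaches the command untouched.
quoted=$(cd "$workdir" && echo 't[ae]st')

rm -r "$workdir"
echo "unquoted: $unquoted  quoted: $quoted"
```

With the file present, the unquoted form becomes the literal word test, so grep would no longer find taste.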

Find lines containing oo:

[icexmoon@xyz tmp]$ grep -n 'oo' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!
[icexmoon@xyz tmp]$

To exclude matches like goo:

[icexmoon@xyz tmp]$ grep -n '[^g]oo' regular_express.txt
2:apple is my favorite food.
3:Football game is not use feet only.
18:google is the best tools for search keyword.
19:goooooogle yes!
[icexmoon@xyz tmp]$

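Note that goooooogle is still in the results: goo itself is not matched, but gooo is, because the ooo inside it satisfies the pattern on its own (a character that is not g, followed by oo). Line 18 is kept for a similar reason: google does not match, but tools does.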
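To see exactly which substrings satisfy the pattern, grep -o can print only the matched text. This is a small sketch on two lines copied from the practice file; the variable names are arbitrary.

```shell
# Which parts of "goooooogle" match '[^g]oo'? Each match is an o followed by oo.
in_run=$(printf '%s\n' 'goooooogle yes!' | grep -o '[^g]oo')

# In line 18 the match comes from "tools", not from "google".
in_tools=$(printf '%s\n' 'google is the best tools for search keyword.' | grep -o '[^g]oo')

echo "$in_run"
echo "$in_tools"
```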

To match oo preceded by something other than a lowercase letter:

[icexmoon@xyz tmp]$ grep -n '[^a-z]oo' regular_express.txt
3:Football game is not use feet only.

Instead of a-z, you can also use the character classes mentioned earlier:

[icexmoon@xyz tmp]$ grep -n '[^[:lower:]]oo' regular_express.txt
3:Football game is not use feet only.

Match any digit:

[icexmoon@xyz tmp]$ grep -n '[0-9]' regular_express.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.

Likewise, 0-9 can be replaced with a character class:

[icexmoon@xyz tmp]$ grep -n '[[:digit:]]' regular_express.txt
5:However, this dress is about $ 3183 dollars.
15:You are the best is mean you are the no. 1.

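Using anchors

^ and $ are called anchors in regexes: ^ marks the start of the string being matched and $ marks the end.

For example, to match lines that begin with the: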

[icexmoon@xyz tmp]$ grep -n '^the' regular_express.txt
12:the symbol '*' is represented as start.

grep matches line by line, so ^the matches lines that start with the.

Similarly, we can match lines starting with a lowercase letter:

[icexmoon@xyz tmp]$ grep -n '^[[:lower:]]' regular_express.txt
2:apple is my favorite food.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
12:the symbol '*' is represented as start.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.

Or, equivalently:

[icexmoon@xyz tmp]$ grep -n '^[a-z]' regular_express.txt
2:apple is my favorite food.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
12:the symbol '*' is represented as start.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.

To match lines that do not start with a letter:

[icexmoon@xyz tmp]$ grep -n '^[^a-zA-Z]' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
21:# I am VBird

Or:

[icexmoon@xyz tmp]$ grep -n '^[^[:alpha:]]' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
21:# I am VBird

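The nested form [^[:alpha:]] is a little hard to read at first. It becomes clear once you see that the outer [] says "one character goes here" while the inner [:alpha:] names the set of alphabetic characters.

To match lines ending with a period (.):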

[icexmoon@xyz tmp]$ grep -n '\.$' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
4:this dress doesn't fit me.
10:motorcycle is cheap than car.
11:This window is clear.
12:the symbol '*' is represented as start.
15:You are the best is mean you are the no. 1.
16:The world <Happy> is the same with "glad".
17:I like dog.
18:google is the best tools for search keyword.
20:go! go! Let's go.

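Because . is itself part of regex syntax, defined as "any single character", matching a literal period requires escaping it with \.

Oddly, several lines that clearly end with . were not matched: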

[icexmoon@xyz tmp]$ cat -An regular_express.txt | head -n 9 | tail -n 5
     5  However, this dress is about $ 3183 dollars.^M$
     6  GNU is free air not free beer.^M$
     7  Her hair is very beauty.^M$
     8  I can't finish the test.^M$
     9  Oh! The soup taste good.^M$

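Showing the special characters explains it: the real last character on these lines (not counting the newline \n) is \r. \r\n is the Windows line ending while Linux uses \n, so when grep splits the input into lines it strips only the \n and leaves the \r behind. These lines therefore end with \r rather than with a period.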
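One possible way around this, sketched here on two invented lines rather than on the practice file, is to strip the carriage returns before matching. The helper name make_sample is hypothetical.

```shell
# Emit one Unix-style line and one DOS-style (CRLF) line, both ending in a period.
make_sample() { printf 'Unix line.\nDOS line.\r\n'; }

# Only the Unix line ends with '.' as far as grep is concerned.
before=$(make_sample | grep -c '\.$')

# After deleting the \r characters, both lines match.
after=$(make_sample | tr -d '\r' | grep -c '\.$')

echo "before: $before  after: $after"
```

On a Linux system this should report 1 before and 2 after.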

To match empty lines:

[icexmoon@xyz tmp]$ grep -n '^$' regular_express.txt
22:

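^$ means there is nothing at all between the start and the end of the line (not even invisible whitespace characters).

Now consider this file: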

[icexmoon@xyz tmp]$ cat -n /etc/rsyslog.conf
     1  # rsyslog configuration file
     2
     3  # For more information see /usr/share/doc/rsyslog-*/rsyslog_conf.html
     4  # If you experience problems, see http://www.rsyslog.com/doc/troubleshoot.html
     5
     6  #### MODULES ####
     7
     8  # The imjournal module bellow is now used as a message source instead of imuxsock.
     9  $ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
    10  $ModLoad imjournal # provides access to the systemd journal
    11  #$ModLoad imklog # reads kernel messages (the same are read from journald)
    12  #$ModLoad immark  # provides --MARK-- message capability
    13
    14  # Provides UDP syslog reception
    15  #$ModLoad imudp

It contains many blank lines and comment lines (starting with #). To see only the meaningful content:

[icexmoon@xyz tmp]$ cat /etc/rsyslog.conf | grep -v '^$' | grep -v '^#'
$ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
$ModLoad imjournal # provides access to the systemd journal
$WorkDirectory /var/lib/rsyslog
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
$IncludeConfig /etc/rsyslog.d/*.conf
$OmitLocalLogging on
$IMJournalStateFile imjournal.state
*.info;mail.none;authpriv.none;cron.none                /var/log/messages
authpriv.*                                              /var/log/secure
mail.*                                                  -/var/log/maillog
cron.*                                                  /var/log/cron
*.emerg                                                 :omusrmsg:*
uucp,news.crit                                          /var/log/spooler
local7.*                                                /var/log/boot.log

Do not use cat -n here: once line numbers are added, ^# and ^$ no longer work, because every line grep sees starts with a line number.
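The two grep -v calls can also be folded into one by giving grep several patterns with -e. A minimal sketch on a made-up configuration fragment (the content of config is an assumption, loosely modeled on the file above):

```shell
# A small stand-in for a config file: comments, blank lines, and two real settings.
config=$(printf '%s\n' '# comment' '' '$ModLoad imuxsock' '' '#disabled' 'mail.* /var/log/maillog')

# -v inverts the match; each -e adds a pattern, and a line matching either is dropped.
active=$(printf '%s\n' "$config" | grep -v -e '^$' -e '^#')

echo "$active"
```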

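. and *

As mentioned, . is part of regex syntax: it is a placeholder for exactly one arbitrary character, much like ? in wildcards.

* is a quantifier. Quantifiers state how many times the preceding character or group occurs, and * means any number of times (including zero).

Regex syntax has other quantifiers such as + and ?, but those belong to the extended syntax rather than the basic one, so we discuss them later.

Let's match strings of the form gxxd: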

[icexmoon@xyz tmp]$ grep -n 'g..d' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
9:Oh! The soup taste good.
16:The world <Happy> is the same with "glad".

To match at least two consecutive o characters, such as oo, ooo, or oooo:

[icexmoon@xyz tmp]$ grep -n 'ooo*' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

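Note that * means zero or more times, and our lower bound is two consecutive o characters, so the regex is ooo* rather than oo*, whose lower bound is a single o.

Similarly, to match gog, goog, gooog, and so on: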

[icexmoon@xyz tmp]$ grep -n 'goo*g' regular_express.txt
18:google is the best tools for search keyword.
19:goooooogle yes!

To match a string that starts with g and ends with g, with any characters in between or none at all (such as gg):

[icexmoon@xyz tmp]$ grep -n 'g.*g' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
14:The gd software is a library for drafting programs.
18:google is the best tools for search keyword.
19:goooooogle yes!
20:go! go! Let's go.

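Here .* means an arbitrary character (.) occurring zero or more times (*).

{m,n}

{m,n} is also a quantifier: the preceding character or group occurs m to n times. It can be written {m}, meaning exactly m times, or {m,}, meaning m or more times.

In practice, to match two consecutive o characters (a longer run such as ooo still contains such a match, so those lines are printed too):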

[icexmoon@xyz tmp]$ grep -n 'o\{2\}' regular_express.txt
1:"Open Source" is a good mechanism to develop programs.
2:apple is my favorite food.
3:Football game is not use feet only.
9:Oh! The soup taste good.
18:google is the best tools for search keyword.
19:goooooogle yes!

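Note that in grep's basic syntax unescaped { and } are ordinary characters, so the quantifier must be written with \{ and \}. The o{2} form familiar from other programming languages is the extended syntax and works with grep -E.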
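The two spellings can be compared side by side. This sketch uses three invented words; the last command adds anchors and surrounding letters to show one way of demanding exactly two o characters rather than at least two.

```shell
# Three sample words with one, two, and three o characters.
words=$(printf '%s\n' 'god' 'good' 'goood')

# Basic syntax: braces escaped.
bre=$(printf '%s\n' "$words" | grep 'o\{2\}')

# Extended syntax: braces unescaped, enabled with -E.
ere=$(printf '%s\n' "$words" | grep -E 'o{2}')

# Pinning both ends of the word makes the count exact.
exact=$(printf '%s\n' "$words" | grep -E '^go{2}d$')

echo "$bre"; echo "$ere"; echo "$exact"
```

The first two commands select good and goood; the anchored one selects only good.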

To match strings like goog, with 2 to 5 o characters between the two g's:

[icexmoon@xyz tmp]$ grep -n 'go\{2,5\}g' regular_express.txt
18:google is the best tools for search keyword.

Line 19 has more than 5 o characters between the g's, so it is excluded.

To match 2 or more o characters:

[icexmoon@xyz tmp]$ grep -n 'go\{2,\}g' regular_express.txt
18:google is the best tools for search keyword.
19:goooooogle yes!

sed

sed (stream editor) is a pipeline command that can analyze and process stdin, replacing, deleting, or adding content.

Let's see it in use:

[icexmoon@xyz tmp]$ nl /etc/passwd | sed '2,5d' | head -n 10
     1  root:x:0:0:root:/root:/bin/bash
     6  sync:x:5:0:sync:/sbin:/bin/sync
     7  shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
     8  halt:x:7:0:halt:/sbin:/sbin/halt
     9  mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
    10  operator:x:11:0:operator:/root:/sbin/nologin
    11  games:x:12:100:games:/usr/games:/sbin/nologin
    12  ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
    13  nobody:x:99:99:Nobody:/:/sbin/nologin
    14  systemd-network:x:192:192:systemd Network Management:/:/sbin/nologin

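Here sed reads /etc/passwd with line numbers added by nl, deletes lines 2 to 5 (2,5d), and prints the result to the screen.

Besides d for deleting lines, other commands are available:

a: append, insert new lines after the specified line
c: change, replace the content of the specified lines
d: delete the specified lines
i: insert, insert new lines before the specified line
p: print the specified lines, usually together with the -n option
s: substitute a string within the specified lines, optionally using a regex, e.g. 1,20s/old/new/g

These commands are really the value of the -e option, so the formal way to write this is sed -e '2,5d', although with a single command -e can be omitted.

Inserting after a specified line: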

[icexmoon@xyz tmp]$ nl /etc/passwd | sed "2a drink tea" | head -n 10
     1  root:x:0:0:root:/root:/bin/bash
     2  bin:x:1:1:bin:/bin:/sbin/nologin
drink tea
     3  daemon:x:2:2:daemon:/sbin:/sbin/nologin
     4  adm:x:3:4:adm:/var/adm:/sbin/nologin
     5  lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
     6  sync:x:5:0:sync:/sbin:/bin/sync
     7  shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
     8  halt:x:7:0:halt:/sbin:/sbin/halt
     9  mail:x:8:12:mail:/var/spool/mail:/sbin/nologin

Inserting before a specified line:

[icexmoon@xyz tmp]$ nl /etc/passwd | sed "2i drink tea" | head -n 10
     1  root:x:0:0:root:/root:/bin/bash
drink tea
     2  bin:x:1:1:bin:/bin:/sbin/nologin
     3  daemon:x:2:2:daemon:/sbin:/sbin/nologin
     4  adm:x:3:4:adm:/var/adm:/sbin/nologin
     5  lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
     6  sync:x:5:0:sync:/sbin:/bin/sync
     7  shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
     8  halt:x:7:0:halt:/sbin:/sbin/halt
     9  mail:x:8:12:mail:/var/spool/mail:/sbin/nologin

Inserting multiple lines:

[icexmoon@xyz tmp]$ nl /etc/passwd | sed "2i drink tea\nand coffee" | head -n 10
     1  root:x:0:0:root:/root:/bin/bash
drink tea
and coffee
     2  bin:x:1:1:bin:/bin:/sbin/nologin
     3  daemon:x:2:2:daemon:/sbin:/sbin/nologin
     4  adm:x:3:4:adm:/var/adm:/sbin/nologin
     5  lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
     6  sync:x:5:0:sync:/sbin:/bin/sync
     7  shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
     8  halt:x:7:0:halt:/sbin:/sbin/halt

Here \n is the newline character.

Replacing the content of specified lines:

[icexmoon@xyz tmp]$ nl /etc/passwd | sed "2,5c 2~5 lines contents disapeared" | head -n 10
     1  root:x:0:0:root:/root:/bin/bash
2~5 lines contents disapeared
     6  sync:x:5:0:sync:/sbin:/bin/sync
     7  shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
     8  halt:x:7:0:halt:/sbin:/sbin/halt
     9  mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
    10  operator:x:11:0:operator:/root:/sbin/nologin
    11  games:x:12:100:games:/usr/games:/sbin/nologin
    12  ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
    13  nobody:x:99:99:Nobody:/:/sbin/nologin

Printing specified lines:

[icexmoon@xyz tmp]$ nl /etc/passwd | sed "2,5p" -n
     2  bin:x:1:1:bin:/bin:/sbin/nologin
     3  daemon:x:2:2:daemon:/sbin:/sbin/nologin
     4  adm:x:3:4:adm:/var/adm:/sbin/nologin
     5  lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin

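Note the -n option, which tells sed not to print the processed input to stdout automatically. By default sed prints every input line after processing it, and 2,5p only asks sed to print lines 2 to 5 once more. Without -n the effect would be the whole input plus a second copy of lines 2 to 5, which is clearly not what we want, so -n is used to suppress the default output.

sed combined with regexes can handle more complex matching and replacement:

sed "1,20s/string to replace/new string/g"

The string to replace can be a regex. The leading s is sed's substitute command, and the trailing g is a flag of that command meaning global: replace every match on a line rather than only the first.

A practical example: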

[icexmoon@xyz tmp]$ ifconfig ens33
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.105  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 2409:8a7a:8ca0:c030:c1f:b27c:56dc:a1b  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::d602:a3fd:7e74:dc5e  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:72:06:e1  txqueuelen 1000  (Ethernet)
        RX packets 4625  bytes 404999 (395.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2709  bytes 522061 (509.8 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[icexmoon@xyz tmp]$ ifconfig ens33 | grep inet
        inet 192.168.1.105  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 2409:8a7a:8ca0:c030:c1f:b27c:56dc:a1b  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::d602:a3fd:7e74:dc5e  prefixlen 64  scopeid 0x20<link>
[icexmoon@xyz tmp]$ ifconfig ens33 | grep inet | head -n 1
        inet 192.168.1.105  netmask 255.255.255.0  broadcast 192.168.1.255
[icexmoon@xyz tmp]$ ifconfig ens33 | grep inet | head -n 1 | sed '1s/^.*inet //g'
192.168.1.105  netmask 255.255.255.0  broadcast 192.168.1.255
[icexmoon@xyz tmp]$ ifconfig ens33 | grep inet | head -n 1 | sed -e '1s/^.*inet //g' -e '1s/ *netmask.*//g'
192.168.1.105

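Here we first pick out the line holding the host's IP from the ifconfig output, then use sed with the regexes ^.*inet and  *netmask.* to match the text before and after the IP and replace it with the empty string, that is, delete it, leaving only the IP.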
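A possible alternative, offered only as a sketch, does the selection and the extraction in a single sed call by capturing the address in a \( \) group and printing just that group. The sample lines are copied from the ifconfig output above; the variable names are arbitrary.

```shell
# Two lines in the shape of the ifconfig output shown above.
sample_output=$(printf '%s\n' \
  '        inet 192.168.1.105  netmask 255.255.255.0  broadcast 192.168.1.255' \
  '        inet6 fe80::d602:a3fd:7e74:dc5e  prefixlen 64  scopeid 0x20<link>')

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded. \1 is the text captured by \( \).
ip=$(printf '%s\n' "$sample_output" | sed -n 's/^ *inet \([0-9.]*\) .*$/\1/p')

echo "$ip"
# 192.168.1.105
```

The space required after inet keeps the inet6 line from matching, so grep and head are not needed in this variant.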

When working with regexes you can proceed like this, adjusting the pattern step by step until it matches what you need.

Another example:

[icexmoon@xyz tmp]$ cat /etc/man_db.conf | grep MAN
# MANDATORY_MANPATH                     manpath_element
# MANPATH_MAP           path_element    manpath_element
# MANDB_MAP             global_manpath  [relative_catpath]
# every automatically generated MANPATH includes these fields
#MANDATORY_MANPATH                      /usr/src/pvm3/man
MANDATORY_MANPATH                       /usr/man
MANDATORY_MANPATH                       /usr/share/man
MANDATORY_MANPATH                       /usr/local/share/man
# set up PATH to MANPATH mapping
#               *PATH*        ->        *MANPATH*
MANPATH_MAP     /bin                    /usr/share/man
MANPATH_MAP     /usr/bin                /usr/share/man
MANPATH_MAP     /sbin                   /usr/share/man
MANPATH_MAP     /usr/sbin               /usr/share/man
MANPATH_MAP     /usr/local/bin          /usr/local/man
MANPATH_MAP     /usr/local/bin          /usr/local/share/man
MANPATH_MAP     /usr/local/sbin         /usr/local/man
MANPATH_MAP     /usr/local/sbin         /usr/local/share/man
MANPATH_MAP     /usr/X11R6/bin          /usr/X11R6/man
MANPATH_MAP     /usr/bin/X11            /usr/X11R6/man
MANPATH_MAP     /usr/games              /usr/share/man
MANPATH_MAP     /opt/bin                /opt/man
MANPATH_MAP     /opt/sbin               /opt/man
#               *MANPATH*     ->        *CATPATH*
MANDB_MAP       /usr/man                /var/cache/man/fsstnd
MANDB_MAP       /usr/share/man          /var/cache/man
MANDB_MAP       /usr/local/man          /var/cache/man/oldlocal
MANDB_MAP       /usr/local/share/man    /var/cache/man/local
MANDB_MAP       /usr/X11R6/man          /var/cache/man/X11R6
MANDB_MAP       /opt/man                /var/cache/man/opt

We pick out the lines containing MAN from /etc/man_db.conf and then delete the comment lines:

[icexmoon@xyz tmp]$ cat /etc/man_db.conf | grep MAN | sed -e '/^#.*$/d'
MANDATORY_MANPATH                       /usr/man
MANDATORY_MANPATH                       /usr/share/man
MANDATORY_MANPATH                       /usr/local/share/man
MANPATH_MAP     /bin                    /usr/share/man
MANPATH_MAP     /usr/bin                /usr/share/man
MANPATH_MAP     /sbin                   /usr/share/man
MANPATH_MAP     /usr/sbin               /usr/share/man
MANPATH_MAP     /usr/local/bin          /usr/local/man
MANPATH_MAP     /usr/local/bin          /usr/local/share/man
MANPATH_MAP     /usr/local/sbin         /usr/local/man
MANPATH_MAP     /usr/local/sbin         /usr/local/share/man
MANPATH_MAP     /usr/X11R6/bin          /usr/X11R6/man
MANPATH_MAP     /usr/bin/X11            /usr/X11R6/man
MANPATH_MAP     /usr/games              /usr/share/man
MANPATH_MAP     /opt/bin                /opt/man
MANPATH_MAP     /opt/sbin               /opt/man
MANDB_MAP       /usr/man                /var/cache/man/fsstnd
MANDB_MAP       /usr/share/man          /var/cache/man
MANDB_MAP       /usr/local/man          /var/cache/man/oldlocal
MANDB_MAP       /usr/local/share/man    /var/cache/man/local
MANDB_MAP       /usr/X11R6/man          /var/cache/man/X11R6
MANDB_MAP       /opt/man                /var/cache/man/opt

As you can see, d can also be combined with a regex to delete specific lines.

sed can also process a file and replace its content in place:

[icexmoon@xyz tmp]$ sed -i -e 's/\.$/!/g' regular_express.txt
[icexmoon@xyz tmp]$ cat regular_express.txt | grep '!$'
"Open Source" is a good mechanism to develop programs!
apple is my favorite food!
Football game is not use feet only!
this dress doesn't fit me!
motorcycle is cheap than car!
This window is clear!
the symbol '*' is represented as start!
Oh!     My god!
You are the best is mean you are the no. 1!
The world <Happy> is the same with "glad"!
I like dog!
google is the best tools for search keyword!
goooooogle yes!
go! go! Let's go!
[icexmoon@xyz tmp]$ cat regular_express.txt | grep '\.$'

As shown, the -i option writes the processed content back to the file.

This is risky; when processing an important file, back it up first.
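One way to get that backup automatically is to attach a suffix to -i, in which case sed keeps the original under the suffixed name. The sketch below works on a temporary file rather than on regular_express.txt; the .bak suffix and variable names are arbitrary choices.

```shell
# Work on a scratch file so nothing important is touched.
tmpfile=$(mktemp)
printf 'I like dog.\n' > "$tmpfile"

# -i.bak edits in place and saves the original as "$tmpfile.bak".
sed -i.bak 's/\.$/!/' "$tmpfile"

new_content=$(cat "$tmpfile")
old_content=$(cat "$tmpfile.bak")
rm -f "$tmpfile" "$tmpfile.bak"

echo "new: $new_content"
echo "old: $old_content"
```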

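Extended regular expression syntax

A brief look at the most important additions:

Symbol	Meaning
`+`	Quantifier: one or more, i.e. at least one
`?`	Quantifier: zero or one
`|`	Alternation: `abc|123` matches `abc` or `123`
`()`	Grouping: a sub-expression can be used inside `()`, e.g. `g(abc|123)d` matches `gabcd` or `g123d`

Extended syntax must be used with a command that supports it: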

[icexmoon@xyz tmp]$ grep 'go+g' regular_express.txt
[icexmoon@xyz tmp]$ egrep 'go+g' regular_express.txt
google is the best tools for search keyword!
goooooogle yes!

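grep does not support the extended syntax by default, so it finds nothing here, whereas egrep does.

Alternatively, use grep -E, which is equivalent: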

[icexmoon@xyz tmp]$ grep -E 'go+g' regular_express.txt
google is the best tools for search keyword!
goooooogle yes!

File formatting

This part uses a sample file that I put on gitee; it can be downloaded like this:

[icexmoon@xyz tmp]$ git clone https://gitee.com/icexmoon/linux_roads.git
正克隆到 'linux_roads'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), done.

gitee, annoyingly, is just not as pleasant to use as GitHub.

printf

The file students has fields separated by spaces:

[icexmoon@xyz tmp]$ cd linux_roads/
[icexmoon@xyz linux_roads]$ cat students
Name Chinese English Math Average
DmTsai 80 60 92 77.33
VBird 75 55 80 70.00
Ken 60 90 70 73.33

That is not very readable, so we format it with the printf command:

[icexmoon@xyz linux_roads]$ cat students | xargs printf '%s\t%s\t%s\t%s\t%s\n'
Name    Chinese English Math    Average
DmTsai  80      60      92      77.33
VBird   75      55      80      70.00
Ken     60      90      70      73.33

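What happens here is that xargs splits the input on whitespace (spaces, tabs, and newlines) and passes every item to printf as an argument. printf then consumes the arguments in order and reuses the format string as many times as needed, so each group of five fields is printed in the format the user specified.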
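The reuse of the format string is easy to observe on its own. In this sketch a two-field format is given four made-up arguments, first directly and then through xargs; both routes should produce the same two lines.

```shell
# Four arguments, a format with two %s: the format is applied twice.
direct=$(printf '%s-%s\n' a b c d)

# The same arguments arriving via xargs, which ignores the line structure
# of its input and simply passes all the words along.
via_xargs=$(printf 'a b\nc d\n' | xargs printf '%s-%s\n')

echo "$direct"
# a-b
# c-d
```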

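%s is a format specifier meaning a string. Similar ones include:

%s: string
%i: integer
%f: floating-point number

In fact, from its name to its arguments, printf closely resembles the C function of the same name.

Using these specifiers, you can stitch the fields of each line together to get the output you want.

Also, because printf is not a pipeline command and cannot read stdin, xargs is used here.

The output is already decent, but we can get fancier: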

[icexmoon@xyz linux_roads]$ cat students | tail -n 3 | xargs printf '%10s %5i %5i %5i %5.2f\n'
    DmTsai    80    60    92 77.33
     VBird    75    55    80 70.00
       Ken    60    90    70 73.33

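This uses the %ns form, where n sets the field width: %10s is a string field 10 characters wide. For floating-point numbers, %5.2f means a minimum total width of 5 characters with 2 digits after the decimal point; the decimal point takes 1, leaving 2 for the integer part. The width is only a minimum, as in C's printf, so a longer number is printed in full rather than truncated.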
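A quick check of the width rule, using two arbitrary numbers: one shorter than five characters, which gets padded on the left, and one longer, which is printed in full. LC_ALL=C is set here only so that the decimal point is a period regardless of the system locale.

```shell
# %5.2f pads 7.5 to five characters and lets 12345.678 exceed the width.
formatted=$(LC_ALL=C printf '%5.2f|%5.2f\n' 7.5 12345.678)

echo "$formatted"
#  7.50|12345.68
```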

Finally, let's combine the header and the content:

[icexmoon@xyz linux_roads]$ cat students | head -n 1 | xargs printf '%10s %10s %10s %10s %10s\n';\
> cat students | tail -n 3 | xargs printf '%10s %10i %10i %10i %10.2f\n'
      Name    Chinese    English       Math    Average
    DmTsai         80         60         92      77.33
     VBird         75         55         80      70.00
       Ken         60         90         70      73.33

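The field widths were adjusted here to make the output look better.

Besides formatted display, printf can also print the character for a given character code: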

[icexmoon@xyz linux_roads]$ printf '\x45\n'
E

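x45 (hexadecimal) is the ASCII code for E, and the UTF-8 I use is ASCII-compatible, so the result is also E.

awk

awk is a data-processing tool. The sed commands shown earlier act on whole lines, whereas awk automatically splits each line into fields on whitespace, and the user can then process the data field by field.

The general form of the command is:

 awk 'condition1{action1} condition2{action2} ...' filename

A concrete example: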

[icexmoon@xyz linux_roads]$ last -n 5
icexmoon pts/0        icexmoon-book    Mon Aug 16 15:14   still logged in
reboot   system boot  3.10.0-1160.el7. Mon Aug 16 13:55 - 19:17  (05:21)
icexmoon pts/2        :0               Sun Aug 15 15:03 - 15:03  (00:00)
icexmoon :0           :0               Sun Aug 15 14:56 - crash  (22:59)
icexmoon tty2                          Sun Aug 15 14:54 - 14:54  (00:00)

wtmp begins Sat Jul 24 14:47:46 2021
[icexmoon@xyz linux_roads]$ last -n 5 | awk '{print $1 "\t" $3}'
icexmoon        icexmoon-book
reboot  boot
icexmoon        :0
icexmoon        :0
icexmoon        Sun

wtmp    Sat

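Here awk prints only the first and third columns of the last output, separated by \t. $1 is the first field and $3 the third. awk also provides these variables:

$0: the whole line
$n: the n-th field
NF: number of fields in the current line
NR: number of the line currently being processed
FS: the field separator in use (whitespace by default)

To print the line number and field count as well: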

[icexmoon@xyz linux_roads]$ last -n 5 | awk '{print "No." NR "\t" "fields " NF "\t" $1 "\t" $3}'
No.1    fields 10       icexmoon        icexmoon-book
No.2    fields 11       reboot  boot
No.3    fields 10       icexmoon        :0
No.4    fields 10       icexmoon        :0
No.5    fields 9        icexmoon        Sun
No.6    fields 0
No.7    fields 7        wtmp    Sat

awk also supports conditions built from comparison and logical operators:

[icexmoon@xyz linux_roads]$ cat /etc/passwd | awk '{FS=":"} $3<10 {print $1 "\t" $3}'
root:x:0:0:root:/root:/bin/bash
bin     1
daemon  2
adm     3
lp      4
sync    5
shutdown        6
halt    7
mail    8

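In this example {FS=":"} first switches awk's separator to : so that it can split the passwd file, then the condition $3<10 filters the lines, and only lines that satisfy it run the following {print} for formatted output.

The logic seems sound, but oddly the first line is wrong: it was not formatted.

This is because FS still has its default value (whitespace) when the first line is read. {FS=":"} runs only after the first line has already been split, which produces this odd result.

To get the intended result, do this: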

[icexmoon@xyz linux_roads]$ cat /etc/passwd | awk 'BEGIN {FS=":"} $3<10 {print $1 "\t" $3}'
root    0
bin     1
daemon  2
adm     3
lp      4
sync    5
shutdown        6
halt    7
mail    8

Statements after BEGIN run before any input line is read and split, so the problem above does not occur.
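awk also accepts the separator on the command line through -F, which takes effect before the first line is read just like the BEGIN block. The sketch below runs both forms on three passwd-style lines (a shortened, illustrative sample) and should give identical results.

```shell
# Three lines in /etc/passwd format; only the first two have a UID below 10.
passwd_sample=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'bin:x:1:1:bin:/bin:/sbin/nologin' \
  'nobody:x:99:99:Nobody:/:/sbin/nologin')

# Separator set in a BEGIN block.
with_begin=$(printf '%s\n' "$passwd_sample" | awk 'BEGIN {FS=":"} $3<10 {print $1 "\t" $3}')

# Separator set with the -F option.
with_flag=$(printf '%s\n' "$passwd_sample" | awk -F: '$3<10 {print $1 "\t" $3}')

echo "$with_begin"
echo "$with_flag"
```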

A more complex example:

Suppose we have this file:

[icexmoon@xyz linux_roads]$ cat pay.txt
Name    1st     2nd     3th
VBird   23000   24000   25000
DMTsai  21000   20000   23000
Bird2   43000   42000   41000

Say we want awk to add a total column to this table:

[icexmoon@xyz linux_roads]$ cat pay.txt|awk 'NR==1 {printf "%10s %10s %10s %10s %10s\n",$1,$2,$3,$4,"TOTAL"} NR>1 {total=$2+$3+$4; printf "%10s %10i %10i %10i %10i\n",$1,$2,$3,$4,total}'
      Name        1st        2nd        3th      TOTAL
     VBird      23000      24000      25000      72000
    DMTsai      21000      20000      23000      64000
     Bird2      43000      42000      41000     126000

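Pretty impressive. Note that when {} contains more than one statement, they must be separated with ;.

File comparison tools

diff

If you have used version control tools such as svn or git, the diff operation will be familiar. The Linux diff command does exactly that: it compares different versions of a text file and shows the differences.

A practical case: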

[icexmoon@xyz tmp]$ cp /etc/passwd ./passwd.old
[icexmoon@xyz tmp]$ cat passwd.old | sed -e '4d' -e '6c 6 line is disapeared' > passwd.new
[icexmoon@xyz tmp]$ diff passwd.old passwd.new
4d3
< adm:x:3:4:adm:/var/adm:/sbin/nologin
6c5
< sync:x:5:0:sync:/sbin:/bin/sync
---
> 6 line is disapeared

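In the diff output, 4d3 means line 4 of the left file was deleted, with line 3 of the right file as the reference point. 6c5 means line 6 of the left file was changed, corresponding to line 5 of the right file.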
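The notation is easier to read on a tiny case. In this sketch two throwaway files (names and content invented) differ by one deleted line; diff returns a non-zero exit status when the files differ, which is why || true is appended.

```shell
# Two scratch files: "new" is "old" with its second line removed.
workdir=$(mktemp -d)
printf '%s\n' one two three > "$workdir/old"
printf '%s\n' one three > "$workdir/new"

# 2d1: line 2 of the left file was deleted; the right file is at line 1 there.
delta=$(diff "$workdir/old" "$workdir/new" || true)

rm -r "$workdir"
echo "$delta"
# 2d1
# < two
```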

Besides files, diff can also compare directories:

[icexmoon@xyz tmp]$ diff /etc/rc0.d /etc/rc5.d
只在 /etc/rc0.d 存在：K90network
只在 /etc/rc5.d 存在：S10network

cmp

diff compares versions of a text file line by line, while cmp can also compare binary files, because it compares the two files byte by byte:

[icexmoon@xyz tmp]$ cmp passwd.old passwd.new
passwd.old passwd.new 不同：第 106 字节，第 4 行

patch

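With svn we often revert code, rolling it back to some historical version kept in the repository. diff combined with patch can achieve the same effect.

Before reverting, a patch file recording the differences must first be produced with diff: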

[icexmoon@xyz tmp]$ diff -Naur passwd.old passwd.new > passwd.patch
[icexmoon@xyz tmp]$ cat passwd.patch
--- passwd.old  2021-08-16 19:54:28.119225249 +0800
+++ passwd.new  2021-08-16 19:55:58.301230385 +0800
@@ -1,9 +1,8 @@
 root:x:0:0:root:/root:/bin/bash
 bin:x:1:1:bin:/bin:/sbin/nologin
 daemon:x:2:2:daemon:/sbin:/sbin/nologin
-adm:x:3:4:adm:/var/adm:/sbin/nologin
 lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
-sync:x:5:0:sync:/sbin:/bin/sync
+6 line is disapeared
 shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
 halt:x:7:0:halt:/sbin:/sbin/halt
 mail:x:8:12:mail:/var/spool/mail:/sbin/nologin

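The resulting patch file passwd.patch looks very much like the records shown when comparing code with svn: symbols such as + and - mark which lines were added, which were deleted, and which were modified.

Next we use this patch file to update and revert. patch needs to be installed first: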

[icexmoon@xyz tmp]$ sudo yum install patch

First we use the patch file to update passwd.old to the content of passwd.new:

[icexmoon@xyz tmp]$ patch -p0 < passwd.patch
patching file passwd.old
[icexmoon@xyz tmp]$ ll passwd*
-rw-r--r--. 1 icexmoon icexmoon 2271 8月  15 18:50 passwd
-rw-rw-r--. 1 icexmoon icexmoon 2271 8月  15 18:50 passwd2
-rw-rw-r--. 1 icexmoon icexmoon 2223 8月  16 19:55 passwd.new
-rw-r--r--. 1 icexmoon icexmoon 2223 8月  16 20:10 passwd.old
-rw-rw-r--. 1 icexmoon icexmoon  489 8月  16 20:06 passwd.patch

Then we roll passwd.old back to its previous content:

[icexmoon@xyz tmp]$ patch -R -p0 < passwd.patch
patching file passwd.old
[icexmoon@xyz tmp]$ ll passwd*
-rw-r--r--. 1 icexmoon icexmoon 2271 8月  15 18:50 passwd
-rw-rw-r--. 1 icexmoon icexmoon 2271 8月  15 18:50 passwd2
-rw-rw-r--. 1 icexmoon icexmoon 2223 8月  16 19:55 passwd.new
-rw-r--r--. 1 icexmoon icexmoon 2271 8月  16 20:11 passwd.old
-rw-rw-r--. 1 icexmoon icexmoon  489 8月  16 20:06 passwd.patch

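patch only needs the patch file to update or roll back; no target file name is required, because the patch file records the names of the old and new files.

-p0 tells patch to strip zero leading directory components from the file names recorded in the patch file, that is, to use them unchanged. This works here because the patch was created and applied in the same directory.

pr

pr (printing) prepares a file for printing, adding auxiliary information such as a header with the date, file name, and page number: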

[icexmoon@xyz tmp]$ pr /etc/man_db.conf


2018-10-31 04:26                /etc/man_db.conf                 第 1 页


#
#
# This file is used by the man-db package to configure the man and cat paths.
# It is also used to provide a manpath for those without one by examining
# their PATH environment variable. For details see the manpath(5) man page.
#
# Lines beginning with `#' are comments and are ignored. Any combination of