Boost::Regex Usage (in English)
Excerpted from: Beyond.the.C.plus.plus.Standard.Library.An.Introduction.to.Boos

Usage

The first thing you need to do is to declare a variable of type basic_regex (boost::regex is a typedef for basic_regex<char>). This is one of the core classes in the library, and it's the one that stores the regular expression. Creating one is simple; just pass the constructor a string containing the regular expression you want to use.

    boost::regex reg("(A.*)");

This regular expression contains three interesting features of regular expressions. The first is the enclosing of a subexpression within parentheses; this makes it possible to refer to that subexpression later on in the same regular expression or to extract the text that matches it. We'll talk about this in detail later on, so don't worry if you don't yet see how that's useful. The second feature is the wildcard character, the dot. The wildcard has a very special meaning in regular expressions; it matches any character. Finally, the expression uses a repeat, *, called the Kleene star, which means that the preceding expression may match zero or more times. This regular expression is ready to be used in one of the algorithms, like so:

    bool b=boost::regex_match(

As you can see, you pass the regular expression and the string to be parsed to the algorithm regex_match. The result of calling the function is true if there is an exact match for the regular expression; otherwise, it is false. In this case, the result is false, because regex_match only returns true when all of the input data is successfully matched by the regular expression. Do you see why that's not the case for this code? Look again at the regular expression. The first character is a capital A, so that's obviously the first character that could ever match the expression. So, a part of the input, "A and beyond.", does match the expression, but it does not exhaust the input. Let's try another input string.

    bool b=boost::regex_match(

This time, regex_match returns true.
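The difference between a partial and a complete match can be sketched as follows. This is a minimal illustration, not the book's listing: std::regex stands in for Boost.Regex so the sketch builds without Boost (an assumption), the `rx` alias and the function name `matches_whole` are hypothetical, and the input strings in the usage note are made up.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex. With Boost installed,
// include <boost/regex.hpp> and write `namespace rx = boost;` instead.
namespace rx = std;

// Returns true only when the *entire* input is matched by "(A.*)".
bool matches_whole(const std::string& input) {
    static const rx::regex reg("(A.*)");
    return rx::regex_match(input, reg);
}
```

With this sketch, `matches_whole("From A and beyond.")` is false because the leading "From " is never consumed, while `matches_whole("A and beyond.")` is true because the match starts at the first character and `.*` takes the rest.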
When the regular expression engine matches the A, it then goes on to see what should follow. In our regex, A is followed by the wildcard, to which we have applied the Kleene star, meaning that any character matches any number of times. Thus, the engine starts to consume the rest of the input string, and matches all of it. Next, let's see how we can put regexes and regex_match to work with data validation.

Validating Input

    boost::regex reg("\\d{3}");

Note that we need to escape the escape character, so the shortcut \d becomes \\d in our string. This is because the compiler consumes the first backslash as an escape character; we need to escape the backslash so that a backslash actually appears in the regular expression string.

Next, we need a way to define a word, that is, a sequence of characters ended by any character that is not a letter. There is more than one way of accomplishing this, but we will do it using the regular expression features character classes (also called character sets) and ranges. A character class is an expression enclosed in square brackets. For example, a character class that matches any one of the characters a, b, and c looks like this: [abc]. Using a range to accomplish the same thing, we write it like so: [a-c]. For a character class that encompasses all letters, we could go slightly crazy and write it like [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ], but we won't; we'll use ranges instead: [a-zA-Z]. It should be noted that using ranges like this can make one dependent on the locale that is currently in use, if the basic_regex::collate flag is turned on for the regular expression. Using these tools and the repeat +, which means that the preceding expression can be repeated but must occur at least once, we're now ready to describe a word.

    boost::regex reg("[a-zA-Z]+");

That regular expression works, but because it is so common, there is an even simpler way to represent a word: \w.
That operator matches all word characters, not just the ASCII ones, so not only is it shorter, it is also better for internationalization purposes. The next character should be exactly one of any character, which we know is the purpose of the dot.

    boost::regex reg(".");

The next part of the input is 2 digits or the string "N/A". To match that, we need to use a feature called alternatives. Alternatives match one of two or more subexpressions, with each alternative separated from the others by |. Here's how it looks:

    boost::regex reg("(\\d{2}|N/A)");

Note that the expression is enclosed in parentheses, to make sure that the full expressions are considered as the two alternatives. Adding a space to the regular expression is simple; there's a shortcut for it: \s. Putting together everything we have so far gives us the following expression:

    boost::regex reg("\\d{3}[a-zA-Z]+.(\\d{2}|N/A)\\s");

Now things get a little trickier. We need a way to validate that the next word in the input data exactly matches the first word (the one we capture using the expression [a-zA-Z]+). The key to accomplishing this is to use a back reference, which is a reference to a previous subexpression. For us to be able to refer to the expression [a-zA-Z]+, we must first enclose it in parentheses. That makes the expression ([a-zA-Z]+) the first subexpression in our regular expression, and we can therefore create a back reference to it using the index 1. That gives us the full regular expression for the format: 3 digits, a word, any character, 2 digits or the string "N/A", a space, then the first word again.

    boost::regex reg("\\d{3}([a-zA-Z]+).(\\d{2}|N/A)\\s\\1");

Good work! Here's a simple program that makes use of the expression with the algorithm regex_match, validating two sample input strings.
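A minimal sketch of such a validation, assuming std::regex as a stand-in for Boost.Regex so that it builds without Boost. The `rx` alias and the helper name `is_valid_record` are hypothetical, and the second input in the usage note is an illustrative string rather than one taken from the text.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex. With Boost installed,
// include <boost/regex.hpp> and write `namespace rx = boost;` instead.
namespace rx = std;

// Validates: 3 digits, a word, any character, 2 digits or "N/A",
// a whitespace character, then the first word again (back reference \1).
bool is_valid_record(const std::string& input) {
    static const rx::regex reg("\\d{3}([a-zA-Z]+).(\\d{2}|N/A)\\s\\1");
    return rx::regex_match(input, reg);
}
```

Here `is_valid_record("123Hello N/A Hello")` is true, while `is_valid_record("123Hello 12 hello")` is false: matching is case-sensitive by default, so the back reference to "Hello" does not accept "hello".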
    #include <iostream> int main() { std::string correct="123Hello N/A Hello"; assert(boost::regex_match(correct,reg)==true);

The first string, 123Hello N/A Hello, is correct: 123 is 3 digits, Hello is a word, followed by any character (a space), then N/A, another space, and finally the word Hello is repeated. The second string is incorrect, because the word Hello is not repeated exactly. By default, regular expressions are case-sensitive, and the back reference therefore does not match.

One of the keys in crafting regular expressions is successfully decomposing the problem. The final expression that you just created can seem quite intimidating to the untrained eye. However, when the expression is decomposed into smaller components, it's not very complicated at all.

Searching

    boost::regex reg("(new)|(delete)");

There are two reasons for us to enclose the subexpressions in parentheses: one is that we must do so in order to form the two groups for our alternatives. The other reason is that we will want to refer to these subexpressions when calling regex_search, to enable us to determine which of the alternatives was actually matched. We will use an overload of regex_search that also accepts an argument of type match_results. When regex_search performs its matching, it reports subexpression matches through an object of type match_results. The class template match_results is parameterized on the type of iterator that applies to the input sequence.

    template <class Iterator, typedef match_results<const char*> cmatch;

We will use std::string, and are therefore interested in the typedef smatch, which is short for match_results<std::string::const_iterator>. When regex_search returns true, the reference to match_results that is passed to the function contains the results of the subexpression matches. Within match_results, there are indexed sub_matches for each of the subexpressions in the regular expression.
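The indexed sub_match lookup can be sketched like this. It is an illustration under assumptions: std::regex stands in for Boost.Regex so the sketch builds without Boost, and the `rx` alias and the function name `first_allocation_keyword` are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex. With Boost installed,
// include <boost/regex.hpp> and write `namespace rx = boost;` instead.
namespace rx = std;

// Reports which alternative of "(new)|(delete)" is found first in s,
// or an empty string if neither occurs.
std::string first_allocation_keyword(const std::string& s) {
    static const rx::regex reg("(new)|(delete)");
    rx::smatch m;  // match_results<std::string::const_iterator>
    if (rx::regex_search(s, m, reg)) {
        // m[0] is the whole match; m[1] and m[2] are the two subexpressions.
        if (m[1].matched) return "new";
        if (m[2].matched) return "delete";
    }
    return "";
}
```

Only one of the two subexpressions participates in any given match, which is why checking the `matched` member of each sub_match tells the alternatives apart.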
Let's see what we have so far that can help our confused programmer assess the calls to new and delete.

    boost::regex reg("(new)|(delete)"); if (boost::regex_search(s,m,reg)) {

The preceding program searches the input string for new or delete, and reports which one it finds first. By passing an object of type smatch to regex_search, we gain access to the details of how the algorithm succeeded. In our expression, there are two subexpressions, and we can thus get to the subexpression for new by the index 1 of match_results. We then hold an instance of sub_match, which contains a Boolean member, matched, that tells us whether the subexpression participated in the match. So, given the preceding input, running this code would output "The expression (new) matched!\n".

Now, you still have some more work to do. You need to continue applying the regular expression to the remainder of the input, and to do that, you use another overload of regex_search, which accepts two iterators denoting the character sequence to search. Because std::string is a container, it provides iterators. Now, for each match, you must update the iterator denoting the beginning of the range to refer to the end of the previous match. Finally, add two variables to hold the counts for new and delete. Here's the complete program:

    #include <iostream> int main() { while (boost::regex_search(it,end,m,reg)) { if (new_counter!=delete_counter)

Note that the program always sets the iterator it to m[0].second. match_results[0] returns a reference to the submatch that matched the whole regular expression, so we can be sure that the end of that match is always the correct location to start the next run of regex_search. Running this program outputs "Leak detected!", because there are two occurrences of new, and only one of delete. Of course, this check is naive: one variable could be deleted twice, there could be calls to new[] and delete[], and so forth. By now, you should have a good understanding of how subexpression grouping works.
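The search loop described above can be sketched as follows. This is a hedged reconstruction of the idea, not the book's listing: std::regex stands in for Boost.Regex so the sketch builds without Boost, and the `rx` alias, the `AllocationCounts` struct, and the function name are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex. With Boost installed,
// include <boost/regex.hpp> and write `namespace rx = boost;` instead.
namespace rx = std;

struct AllocationCounts {
    int news = 0;
    int deletes = 0;
};

// Repeatedly searches [it, end) and restarts after each whole match.
AllocationCounts count_allocations(const std::string& s) {
    static const rx::regex reg("(new)|(delete)");
    AllocationCounts counts;
    std::string::const_iterator it = s.begin();
    const std::string::const_iterator end = s.end();
    rx::smatch m;
    while (rx::regex_search(it, end, m, reg)) {
        if (m[1].matched) {
            ++counts.news;      // first alternative took part in the match
        } else {
            ++counts.deletes;   // otherwise it was the second one
        }
        it = m[0].second;       // continue just past the whole match
    }
    return counts;
}
```

A caller could then compare `counts.news` with `counts.deletes` and report a possible leak when they differ, with the same caveats about new[] and delete[] noted above.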
It's time to move on to the final algorithm in Boost.Regex, one that is used to perform substitutions.

Replacing

In the introduction to this chapter, I gave you the example of changing the British spelling of colour to the U.S. spelling of color. Changing the spelling without using regular expressions is very tedious, and extremely error prone. The problem is that there might be different capitalization, and a lot of words that are affected, for example, colourize. To properly attack this problem, we need to split the regular expression into three subexpressions.

    boost::regex reg("(Colo)(u)(r)",

We have isolated the villain, the letter u, in order to surgically remove it from any matches. Also note that this regex is case-insensitive, which we achieve by passing the format flag boost::regex::icase to the constructor of regex. Note that you must also pass any other flags that you want to be in effect. A common user error when setting format flags is to omit the ones that regex turns on by default, but that doesn't work: you must always apply all of the flags that should be set.

When calling regex_replace, we are expected to provide a format string as an argument. This format string determines how the substitution will work. In the format string, it's possible to refer to subexpression matches, and that's precisely what we need here. You want to keep the first matched subexpression and the third, but let the second (u) silently disappear. The expression $N, where N is the index of a subexpression, expands to the match for that subexpression. So our format string becomes "$1$3", which means that the replacement text is the concatenation of the first and the third subexpression matches. By referring to the subexpression matches, we are able to retain any capitalization in the matched text, which would not be possible if we were to use a string literal as the replacement text. Here's a complete program that solves the problem.
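A minimal sketch of that substitution, assuming std::regex as a stand-in for Boost.Regex so that it builds without Boost. The `rx` alias and the function name `americanize` are hypothetical. The grammar flag is passed explicitly next to icase, in the spirit of the advice above about supplying every flag that should be in effect.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex. With Boost installed,
// include <boost/regex.hpp> and write `namespace rx = boost;` instead.
namespace rx = std;

// Drops the "u" from any capitalization of "colour", keeping the case of
// the surrounding letters by re-emitting subexpressions 1 and 3.
std::string americanize(const std::string& s) {
    static const rx::regex reg("(Colo)(u)(r)",
                               rx::regex::icase | rx::regex::ECMAScript);
    return rx::regex_replace(s, reg, "$1$3");
}
```

Because the replacement is built from what was actually matched, "Colour" keeps its capital C and "colourize" keeps its suffix.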
    #include <iostream> int main() { std::string s="Colour, colours, color, colourize"; s=boost::regex_replace(s,reg,"$1$3");

The output of running this program is "Color, colors, color, colorize". regex_replace is enormously useful for applying substitutions like this.

A Common User Misunderstanding

    boost::regex reg("\\d*");

Rest assured that this call never results in a successful match. All of the input must be consumed for regex_match to return true! Almost all of the users asking why this doesn't work should use regex_search rather than regex_match.

    boost::regex reg("\\d*");

This most definitely yields true. It is worth noting that it's possible to make regex_search behave like regex_match, using special buffer operators. \A matches the start of a buffer, and \Z matches the end of a buffer, so if you put \A first in your regular expression, and \Z last, you'll make regex_search behave exactly like regex_match; that is, it must consume all input for a successful match. The following regular expression always requires that the input be exhausted, regardless of whether you are using regex_match or regex_search.

    boost::regex reg("\\A\\d*\\Z");

Please understand that this does not imply that regex_match should not be used; on the contrary, using it should be a clear indication that the semantics we just talked about (that all of the input must be consumed) are in effect.

About Repeats and Greed

    boost::regex reg("(.*)(\\d{2})");

This regular expression succeeds, but it might not match the subexpressions that you think it should! The expression .* happily eats everything that following subexpressions don't match. Here's a sample program that exhibits this behavior:

    int main() {

In this program, we are using another parameterization of match_results, through the type cmatch.
It is a typedef for match_results<const char*>, and the reason we must use it rather than the type smatch we've been using before is that we're now calling regex_search with a string literal rather than an object of type std::string. What do you expect the output of running this program to be? Typically, users new to regular expressions first think that both m[1].matched and m[2].matched will be true, and that the result of the second subexpression will be "31". Next, after realizing the effects of greedy repeats (that they consume as much input as possible), they tend to think that only the first subexpression can be true, that is, that the .* has successfully eaten all of the input. Finally, new users come to the conclusion that the expression will match both subexpressions, but that the second expression will match the last possible sequence. Here, that means that the first subexpression will match "Note that I'm 31 years old, not" and the second will match "32".

So, what do you do when what you actually want is a repeat followed by the first occurrence of another subexpression? Use non-greedy repeats. By appending ? to the repeat, it becomes non-greedy. This means that the expression tries to find the shortest possible match that doesn't prevent the rest of the expression from matching. So, to make the previous regex work correctly, we need to update it like so.

    boost::regex reg("(.*?)(\\d{2})");

If we change the program to use this regular expression, both m[1].matched and m[2].matched will still be true. The expression .*? consumes as little of the input as it can, which means that it stops at the first character 3, because that's what the expression needs in order to successfully match. Thus, the first subexpression matches "Note that I'm" and the second matches "31".

A Look at regex_iterator

    boost::regex reg("(\\d+),?");

Adding the repeat ?
(match zero or one times) to the end of the regular expression ensures that the last digit will be successfully parsed, even if the input sequence does not end with a comma. Further, we are using another repeat, +. This repeat ensures that the expression matches one or more times. Now, rather than doing multiple calls to regex_search, we create a regex_iterator, call the algorithm for_each, and supply it with a function object to call with the result of dereferencing the iterator. Here's a function object that accepts any form of match_results due to its parameterized function call operator. All the work it performs is to add the value of the current match to a total (in our regular expression, the first subexpression is the one we're interested in).

    class regex_callback { template <typename T> void operator()(const T& what) { int sum() const {

You now pass an instance of this function object to std::for_each, which results in an invocation of the function call operator for every dereference of the iterator it; that is, it is invoked every time there is a match of a subexpression in the regex.

    int main() { boost::sregex_iterator it(s.begin(),s.end(),reg); regex_callback c;

As you can see, the past-the-end iterator passed to for_each is simply a default-constructed instance of regex_iterator. Also, the type of it and end is boost::sregex_iterator, which is a typedef for regex_iterator<std::string::const_iterator>. Using regex_iterator this way is a much cleaner way of matching multiple times than what we did previously, where we manually had to advance the starting iterator and call regex_search in a loop.

Splitting Strings with regex_token_iterator

    boost::regex reg("/");

The regex matches the separator of items. To use it for splitting the input, simply pass the special index -1 to the constructor of regex_token_iterator.
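Both iterator types can be sketched side by side. This is an illustration under assumptions, not the book's listings: std::regex stands in for Boost.Regex so the sketch builds without Boost, a lambda replaces the regex_callback function object, and the `rx` alias and both function names are hypothetical. The token iterator is given the submatch index -1, which selects the text between matches rather than the matches themselves.

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

// Assumption: std::regex stands in for Boost.Regex. With Boost installed,
// include <boost/regex.hpp> and write `namespace rx = boost;` instead.
namespace rx = std;

// Sums every number in a comma-separated list using regex_iterator.
int sum_numbers(const std::string& s) {
    static const rx::regex reg("(\\d+),?");
    rx::sregex_iterator it(s.begin(), s.end(), reg);
    rx::sregex_iterator end;  // default-constructed: past-the-end
    int total = 0;
    std::for_each(it, end, [&total](const rx::smatch& what) {
        total += std::stoi(what[1].str());  // subexpression 1 holds the digits
    });
    return total;
}

// Splits s on '/' using regex_token_iterator with submatch index -1.
std::vector<std::string> split_on_slash(const std::string& s) {
    static const rx::regex reg("/");
    rx::sregex_token_iterator it(s.begin(), s.end(), reg, -1);
    rx::sregex_token_iterator end;
    return std::vector<std::string>(it, end);
}
```

For an input without a leading or trailing separator, the number of pieces returned by `split_on_slash` is one more than the number of slashes.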
Here is the complete program:

    int main() { assert(vec.size()==std::count(s.begin(),s.end(),'/')+1);

Similar to regex_iterator, regex_token_iterator is a template class parameterized on the iterator type for the sequence it wraps. Here, we're using sregex_token_iterator, which is a typedef for regex_token_iterator<std::string::const_iterator>. Each time the iterator it is dereferenced, it returns the current sub_match, and when the iterator is advanced, it tries to match the regular expression again. These two iterator types, regex_iterator and regex_token_iterator, are very useful; you'll know that you need them when you are considering calling regex_search multiple times!

More Regular Expressions

    boost::regex reg1("\\d{5}");

The first regex matches exactly 5 digits. The second matches 2, 3, or 4 digits. The third matches 2 or more digits, without an upper limit.

Another important regular expression feature is the negated character class, formed with the metacharacter ^. You use it to form character classes that match any character that is not part of the character class, that is, the complement of the elements you list in the character class. For example, consider this regular expression.

    boost::regex reg("[^13579]");

It contains a negated character class that matches any character that is not one of the odd digits. Take a look at the following short program, and try to figure out what the output will be.

    int main() { while (it!=end)

Did you figure it out? The output is "02468", that is, all of the even digits. Note that this character class does not only match even digits; had the input string been "AlfaBetaGamma", that would have matched just fine too.

The metacharacter we've just seen, ^, serves another purpose too. Outside a character class, it is used to denote the beginning of a line. The metacharacter $ denotes the end of a line.
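The negated character class discussed above can be tried out with a sketch like the following. It assumes std::regex as a stand-in for Boost.Regex so that it builds without Boost; the `rx` alias, the function name `strip_odd_digits`, and the inputs in the usage note are hypothetical.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex. With Boost installed,
// include <boost/regex.hpp> and write `namespace rx = boost;` instead.
namespace rx = std;

// Collects every character that is NOT one of 1, 3, 5, 7, 9.
std::string strip_odd_digits(const std::string& s) {
    static const rx::regex reg("[^13579]");
    std::string kept;
    rx::sregex_iterator it(s.begin(), s.end(), reg);
    rx::sregex_iterator end;
    while (it != end) {
        kept += it->str();  // each match is a single character
        ++it;
    }
    return kept;
}
```

Calling `strip_odd_digits("0123456789")` yields "02468", and `strip_odd_digits("AlfaBetaGamma")` returns the input unchanged, since letters are also outside the listed set.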
Bad Regular Expressions

If all of your regular expressions are hardcoded into your application, you may be safe from having to deal with bad expressions, but if you're accepting user input in the form of regexes, you must be prepared to handle errors. Here's a program that prompts the user to enter a regular expression, followed by a string to be matched against the regex. As always, when there's user input involved, there's a chance that the input will be invalid.

    int main() { std::getline(std::cin,s); if (boost::regex_match(s,reg))

To protect the application and the user, a try/catch block ensures that if boost::regex throws upon construction, an informative message will be printed, and the application will shut down gracefully. Putting this program to the test, let's begin with some reasonable input.

    Enter a regular expression:

Now, here's grief coming your way, in the form of a very poor attempt at a regular expression.

    Enter a regular expression:

An exception is thrown when the regex reg is constructed, because the regular expression cannot be compiled. Consequently, the catch handler is invoked, and the program prints an error message and exits. There are only three places where you need to be aware of potential exceptions being thrown. One is when constructing a regular expression, similar to the example you just saw; another is when assigning regular expressions to a regex, using the member function assign. Finally, the regex iterators and the algorithms can also throw exceptions, if memory is exhausted or if the complexity of the match grows too quickly.
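The guarded construction can be sketched as follows. This is a minimal illustration under assumptions: std::regex stands in for Boost.Regex so the sketch builds without Boost, the `rx` alias, the `Outcome` enum, and the function name `classify` are hypothetical, and the interactive prompting from the text is left out so that the error path is easy to exercise.

```cpp
#include <regex>
#include <string>

// Assumption: std::regex stands in for Boost.Regex. With Boost installed,
// include <boost/regex.hpp> and write `namespace rx = boost;` instead.
namespace rx = std;

enum class Outcome { Match, NoMatch, BadPattern };

// Compiles a user-supplied pattern and matches it against input,
// turning a compilation failure into a value instead of a crash.
Outcome classify(const std::string& pattern, const std::string& input) {
    try {
        rx::regex reg(pattern);  // may throw if the pattern is malformed
        return rx::regex_match(input, reg) ? Outcome::Match : Outcome::NoMatch;
    } catch (const rx::regex_error&) {
        return Outcome::BadPattern;
    }
}
```

A pattern with an unclosed parenthesis, such as "(unclosed", ends up in the catch handler, where a real program would print a message for the user before exiting or asking again.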
Reposted from: https://www.cnblogs.com/cy163/archive/2010/11/18/1881270.html