scala正则表达式 findFirstIn findAllIn findFirstMatchIn findAllMatchIn Match MatchData 提取分组

项目github地址：bitcarmanlee easy-algorithm-interview-and-practice
欢迎大家star，留言，一起学习进步

0.引子

节前最后一个工作日，在编写一个简单的正则表达式的时候，卡了比较长的时间。后来总结发现，还是对正则表达式的理解不是很深刻，于是利用假期的时间，特意比较详细地看了一下正则表达式相关内容并加以记录。

1.findFirstIn findFirstMatchIn

正则表达式中常用的方法包括findFirstIn，findFirstMatchIn等类似的方法。先来看个例子，通过例子我们来看两者区别。

  @Testdef test() = {val s = "你好，今天是2021年1月2日18点30分"val pattern = """今天是\d+年\d+月\d+日""".rval result1 = pattern.findFirstIn(s)println(result1)val result2 = pattern.findFirstMatchIn(s) match {case Some(data) => {println("data type is: " + data.getClass.getSimpleName)data group 0}case _ => "empty"}println(result2)}

输出结果：

Some(今天是2021年1月2日)
data type is: Match
今天是2021年1月2日

简单看下源码

  /** Return an optional first matching string of this `Regex` in the given character sequence,*  or None if there is no match.**  @param source The text to match against.*  @return       An [[scala.Option]] of the first matching string in the text.*  @example      {{{"""\w+""".r findFirstIn "A simple example." foreach println // prints "A"}}}*/def findFirstIn(source: CharSequence): Option[String] = {val m = pattern.matcher(source)if (m.find) Some(m.group) else None}

firdFirstIn是scala.util.matching.Regex的方法。该方法的输入是一个source，source类型为CharSequence接口，最常见的实现类为字符串。
返回值为Option[String]。在我们的例子中，因为匹配上了，所以返回的值为Some[String]。

  /** Return an optional first match of this `Regex` in the given character sequence,*  or None if it does not exist.**  If the match is successful, the [[scala.util.matching.Regex.Match]] can be queried for*  more data.**  @param source The text to match against.*  @return       A [[scala.Option]] of [[scala.util.matching.Regex.Match]] of the first matching string in the text.*  @example      {{{("""[a-z]""".r findFirstMatchIn "A simple example.") map (_.start) // returns Some(2), the index of the first match in the text}}}*/def findFirstMatchIn(source: CharSequence): Option[Match] = {val m = pattern.matcher(source)if (m.find) Some(new Match(source, m, groupNames)) else None}

findFirstMatchIn看源码与firdFirstIn差别不大，最大的不同在于返回的类型为Option[Match]。

2.Match MatchData

看下Match的源码

  /** Provides information about a successful match. */class Match(val source: CharSequence,private[matching] val matcher: Matcher,val groupNames: Seq[String]) extends MatchData {/** The index of the first matched character. */val start = matcher.start/** The index following the last matched character. */val end = matcher.end/** The number of subgroups. */def groupCount = matcher.groupCountprivate lazy val starts: Array[Int] =((0 to groupCount) map matcher.start).toArrayprivate lazy val ends: Array[Int] =((0 to groupCount) map matcher.end).toArray/** The index of the first matched character in group `i`. */def start(i: Int) = starts(i)/** The index following the last matched character in group `i`. */def end(i: Int) = ends(i)/** The match itself with matcher-dependent lazy vals forced,*  so that match is valid even once matcher is advanced.*/def force: this.type = { starts; ends; this }}

第一行注释非常关键，告诉了我们Match类最重要的作用：Provides information about a successful match。如果匹配成功，这个类会给我们提供一些匹配成功的信息，包括匹配成功的起始位置等。
Match类继承了MatchData，我们再看看MatchData的源码

 trait MatchData {/** The source from which the match originated */val source: CharSequence/** The names of the groups, or an empty sequence if none defined */val groupNames: Seq[String]/** The number of capturing groups in the pattern.*  (For a given successful match, some of those groups may not have matched any input.)*/def groupCount: Int/** The index of the first matched character, or -1 if nothing was matched */def start: Int/** The index of the first matched character in group `i`,*  or -1 if nothing was matched for that group.*/def start(i: Int): Int.../** The matched string in group `i`,*  or `null` if nothing was matched.*/def group(i: Int): String =if (start(i) >= 0) source.subSequence(start(i), end(i)).toStringelse null.../** Returns the group with given name.**  @param id The group name*  @return   The requested group*  @throws   NoSuchElementException if the requested group name is not defined*/def group(id: String): String = nameToIndex.get(id) match {case None => throw new NoSuchElementException("group name "+id+" not defined")case Some(index) => group(index)}

MatchData里面用得最多，最重要的方法应该就是group了，group最大的作用，就是用来提起分组。

3.提取分组

  @Testdef test() = {val s = "你好，今天是2021年1月2日18点30分"val pattern = """今天是(\d+)年(\d+)月(\d+)日""".rval result = pattern.findFirstMatchIn(s)val year = result match {case Some(data) => data group 1case _ => "-1"}println(year)  // 结果为 2021}

上面的例子就是提取分组的一个典型例子，就是利用findFirstMatchIn的group方法，提取匹配结果的第一个分组，就得到了年份数据。

4.提取分组的另外一种方式

实际中提取分组还有另外一种常用方式。

  @Testdef test() = {val s = "你好，今天是2021年1月2日18点30分"val pattern = """今天是(\d+)年(\d+)月(\d+)日""".rval pattern(year, month, day) = sprintln(s"year is $year.\n" +f"month is $month.\n" + raw"day is $day")}

上面的代码看起来很正常，完全没毛病，但实际上却会报错有问题，本人就是在这里被卡了很长时间。

scala.MatchError: 你好，今天是2021年1月2日18点30分 (of class java.lang.String)at com.xiaomi.mifi.pdata.common.T4.t8(T4.scala:114)at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)at java.lang.reflect.Method.invoke(Method.java:498)...

当时百思不得其解，不知道问题出在哪里。仔细看了源码以后，才明白什么情况。如果我们在IDE中点击
val pattern(year, month, day) = s
这一行查看源码，会发现调用的其实是unapplySeq方法。

  def unapplySeq(s: CharSequence): Option[List[String]] = s match {case null => Nonecase _    =>val m = pattern matcher sif (runMatcher(m)) Some((1 to m.groupCount).toList map m.group)else None}

这个方法上面有一段关键的注释

  /** Tries to match a [[java.lang.CharSequence]].**  If the match succeeds, the result is a list of the matching*  groups (or a `null` element if a group did not match any input).*  If the pattern specifies no groups, then the result will be an empty list*  on a successful match.**  This method attempts to match the entire input by default; to find the next*  matching subsequence, use an unanchored `Regex`.

这个方法默认是匹配整个输出，如果是要匹配子串，需要用unanchored这种方式。

将上面的代码稍作改动

  @Testdef test() = {val s = "你好，今天是2021年1月2日18点30分"val pattern = """今天是(\d+)年(\d+)月(\d+)日""".r.unanchoredval pattern(year, month, day) = sprintln(s"year is $year.\n" +f"month is $month.\n" + raw"day is $day")}

可以得到我们预期的结果

year is 2021.
month is 1.
day is 2

5.findAllIn findAllMatchIn

findAllIn与firdFirstIn对应，findAllMatchIn与findFirstMatchIn对应，表示所有匹配结果。
先来看一个例子

  @Testdef t9() = {val dateRegex =  """(\d{4})-(\d{2})-(\d{2})""".rval dates = "dates in history: 2004-01-20, 2005-02-28, 1998-01-15, 2009-10-25"val result =  dateRegex.findAllIn(dates)val array =  for (each <- result) yield eachprintln(array)println(array.mkString("\t"))}

non-empty iterator
2004-01-20  2005-02-28  1998-01-15  2009-10-25

findAllIn的方法签名如下：

  /** Return all non-overlapping matches of this `Regex` in the given character *  sequence as a [[scala.util.matching.Regex.MatchIterator]],*  which is a special [[scala.collection.Iterator]] that returns the*  matched strings but can also be queried for more data about the last match,*  such as capturing groups and start position.....def findAllIn(source: CharSequence) = new Regex.MatchIterator(source, this, groupNames)

返回的是一个MatchIterator，根据注释信息可以看出来MatchIterator是scala.collection.Iterator的一个特例，所以直接println(array)得到的信息是一个non-empty iterator。

如果我们想得到所有能匹配上的年份，则可以使用findAllMatchIn方法。该方法可以得到先得到所有的Match对象，然后再分组提取出年份即可。

  @Testdef t10() = {val dateRegex =  """(\d{4})-(\d{2})-(\d{2})""".rval dates = "dates in history: 2004-01-20, 2005-02-28, 1998-01-15, 2009-10-25"val result = dateRegex.findAllMatchIn(dates)val array = for(each <- result) yield each.group(1)println(array)println(array.mkString("\t"))}

最后的输出结果为

non-empty iterator
2004    2005    1998    2009