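Segmentation Rule
Every language has its own segmentation rules.
For example, languages written in the Latin script break sentences at the full stop (.), question mark (?), colon (:), and exclamation mark (!), whereas Chinese breaks at the full-width full stop (。), question mark (?), colon (:), and exclamation mark (!).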

For example, the paragraph:
The Chinese nation is a great nation. With a history of more than 5,000 years, China has made indelible contributions to the progress of human civilization. After the Opium War of 1840, however, China was gradually reduced to a semi-colonial, semi-feudal society and suffered greater ravages than ever before. The country endured intense humiliation, the people were subjected to great pain, and the Chinese civilization was plunged into darkness. Since that time, national rejuvenation has been the greatest dream of the Chinese people and the Chinese nation.

is split into the following segments, each paired with its Chinese counterpart:

The Chinese nation is a great nation. 中华民族是世界上伟大的民族
With a history of more than 5,000 years, China has made indelible contributions to the progress of human civilization. 有着5000多年源远流长的文明历史,为人类文明进步作出了不可磨灭的贡献。
After the Opium War of 1840, however, China was gradually reduced to a semi-colonial, semi-feudal society and suffered greater ravages than ever before. 1840年鸦片战争以后,中国逐步成为半殖民地半封建社会
The country endured intense humiliation, the people were subjected to great pain, and the Chinese civilization was plunged into darkness. 国家蒙辱、人民蒙难、文明蒙尘,中华民族遭受了前所未有的劫难
Since that time, national rejuvenation has been the greatest dream of the Chinese people and the Chinese nation. 从那时起,实现中华民族伟大复兴,就成为中国人民和中华民族最伟大的梦想。
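
The Chinese delimiters named above are not followed by a space, so a splitter can break immediately after them. The following is a minimal sketch under that assumption; the function name `split_chinese` is hypothetical, and the delimiter set is limited to the four full-width marks listed in the introduction.

```python
import re

# Full-width delimiters named in the text: 。 ? : !
# A segment is a run of non-delimiter characters plus the delimiters that end it.
CHINESE_SEGMENT = re.compile(r"[^。?:!]+[。?:!]*")


def split_chinese(text):
    """Split Chinese text right after each full-width delimiter (illustrative only)."""
    return [s.strip() for s in CHINESE_SEGMENT.findall(text) if s.strip()]


if __name__ == "__main__":
    for segment in split_chinese("你好吗?我很好。"):
        print(segment)
```

A production splitter would also need to handle closing quotation marks and brackets that follow the delimiter, which this sketch ignores.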

Common English segmentation rules

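  • Full Stop
    |Before|Delimiter|After|
    |-|-|-|
    ||.|non-printing character (including a space)|

    Before break: \.+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*
    After break: \s

    Exception: lower-case letter exception (no break when a lower-case letter follows the whitespace)

    |Before|Delimiter|After|
    |-|-|-|
    ||.|\s\p{Ll}|

    Before break: \.+[\p{Pe}\p{Pf}\p{Po}"]*
    After break: \s\p{Ll}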
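  • Other
    |Before|Delimiter|After|
    |-|-|-|
    ||?!|non-printing character (including a space)|

    Before break: [!?]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*
    After break: \s

    Exception: lower-case letter exception (no break when a lower-case letter follows the whitespace)

    |Before|Delimiter|After|
    |-|-|-|
    ||.|\s\p{Ll}|

    Before break: \.+[\p{Pe}\p{Pf}\p{Po}"]*
    After break: \s\p{Ll}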
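  • Colon
    |Before|Delimiter|After|
    |-|-|-|
    ||:|non-printing character (including a space)|

    Before break: [:]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*
    After break: \s

    Exception: lower-case letter exception (no break when a lower-case letter follows the whitespace)

    |Before|Delimiter|After|
    |-|-|-|
    ||.|\s\p{Ll}|

    Before break: \.+[\p{Pe}\p{Pf}\p{Po}"]*
    After break: \s\p{Ll}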
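
The break rules and the lower-case letter exception above can be sketched in a few lines of Python. This is a simplified illustration, not the patterns themselves: Python's built-in `re` module supports neither `\p{...}` classes nor character-class subtraction, so the trailing-punctuation class is reduced to a handful of ASCII closers, and `\p{Ll}` is approximated with `str.islower()`. The name `segment_english` is hypothetical.

```python
import re

# Break candidates: a run of ".", of "!"/"?", or of ":", then optional closing
# punctuation, then whitespace. The closer set is an ASCII-only stand-in
# (an assumption) for [\p{Pe}\p{Pf}\p{Po}"] minus the excluded characters.
BREAK = re.compile(r"""(?:\.+|[!?]+|:+)["')\]]*(?=\s)""")


def segment_english(text):
    """Split text at break candidates unless the lower-case exception applies."""
    segments, start = [], 0
    for match in BREAK.finditer(text):
        end = match.end()
        # Lower-case letter exception: one whitespace character followed by a
        # lower-case letter means this candidate is not a segment break.
        if text[end + 1:end + 2].islower():
            continue
        segments.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


if __name__ == "__main__":
    sample = "It costs approx. five dollars. Is that OK? Yes: it is. Note: Done."
    for segment in segment_english(sample):
        print(segment)
```

In the sample, "approx. five" and "Yes: it" are kept together by the exception, while "Note: Done." is split at the colon because an upper-case letter follows.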

SRX 2.0 April 7, 2008 | GALA Global (gala-global.org)
SRX (Segmentation Rules eXchange) is an XML-based format for recording segmentation rules.
XML Schema for SRX

<?xml version="1.0" encoding="UTF-8"?>
<!--
Document    : srx20.xsd
Version     : 2.0
Created on  : December 26, 2006
Authors     : dpooley@sdl.com
              rmraya@maxprograms.com
Description : This XML Schema defines the structure of SRX 2.0
Status      : OSCAR recommendation

Copyright © The Localisation Industry Standards Association [LISA] 2006. All Rights Reserved.
-->
<!-- History of modifications (latest first):
Jul-08-2008 by RMR: made foreign elements optional in <header>
Jan-13-2008 by RMR: Permitted elements from foreign namespaces in <header> element
Dec-26-2006 by RMR: Fixed namespace handling
                    Changed version to "2.0"
                    Removed "cascade" attribute from <languagemap>
                    Removed <maprule> element
                    Adjusted attributes to match the specification document
Jun-21-2006 by DRP: Change version number to "1.2" in readiness to move to "2.0"
                    Make the cascade attribute mandatory (required) on the <header> element
                    Add enumerations where necessary and some brief documentation for elements and attributes
Jun-15-2006 by DRP: Change version number to "1.1" in readiness to move to "2.0"
Mar-10-2006 by DRP: Add "cascade" attribute to <header>, <maprule> and <languagemap> elements
Apr-21-2004 by DRP: Convert to version 1.0.
Mar-22-2004 by DRP: Eighth draft version.
                    Ensure the <excludeexception> element is removed
                    Update version number
Mar-17-2004 by DRP: Seventh draft version.
                    Remove <exceptions>, <exception>, <endrules>, <endrule> and <excludeexception> elements
                    Add <rule> element
                    Update version number
Feb-02-2004 by DRP: Sixth draft version.
                    Update version number
Oct-27-2003 by DRP: Fifth draft version.
                    Removed includeformatting attribute from <header> element
                    Added <formathandle> element to the <header>
                    Removed priority attribute from <endrule> and <exception> elements
                    Added name attribute to <exception> element
                    Added <excludeexception> element to the <endrule> element
Oct-10-2003 by DRP: Fourth draft version.
                    Removed <classdefinitions> and <classdefinition> elements
                    Removed classdefinitionname attribute
                    Removed <digitcharacters>, <whitespacecharacters> and <wordcharacters>
                    Added priority attribute to <endrule> and <exception> elements
                    Added includeformatting attribute to <header> element
Jul-24-2003 by DRP: Third draft version.
                    Removed <charsets> and <charset> to be replaced with <classdefinitions> and <classdefinition>
                    Renamed <digits> to <digitcharacters>
                    Renamed <whitespace> to <whitespacecharacters>
                    Renamed <wordchars> to <wordcharacters>
                    <digitcharacters>, <whitespacecharacters> and <wordcharacters> are now optional
                    Renamed <langrules> to <languagerules>
                    Renamed <langrule> to <languagerule>
                    Renamed <langmap> to <languagemap>
                    Renamed langrulename to languagerulename
                    Renamed langpattern to languagepattern
Jun-19-2003 by DRP: Second draft version.
                    Removed the <codepage> element.
                    Added <header> and <body> elements.
Nov-22-2002 by DRP: First draft version
-->
<xs:schema xmlns:srx="http://www.lisa.org/srx20" targetNamespace="http://www.lisa.org/srx20" xml:lang="en" xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/2001/xml.xsd"/>
  <xs:element name="afterbreak">
    <xs:annotation>
      <xs:documentation>Contains the regular expression to match before the segment break</xs:documentation>
    </xs:annotation>
    <xs:complexType mixed="true"/>
  </xs:element>
  <xs:element name="beforebreak">
    <xs:annotation>
      <xs:documentation>Contains the regular expression to match after the segment break</xs:documentation>
    </xs:annotation>
    <xs:complexType mixed="true"/>
  </xs:element>
  <xs:element name="body">
    <xs:annotation>
      <xs:documentation>SRX body</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:languagerules"/>
        <xs:element ref="srx:maprules"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="formathandle">
    <xs:annotation>
      <xs:documentation>Determines which side of the segment break that formatting information goes</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:attribute name="include" use="required">
        <xs:annotation>
          <xs:documentation>A value of "no" indicates that the format code does not belong to the segment being created. A value of "yes" indicates that the format code belongs to the segment being created.</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="type" use="required">
        <xs:annotation>
          <xs:documentation>The type of format for which behaviour is being defined. Can be "start", "end" or "isolated".</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="start"/>
            <xs:enumeration value="end"/>
            <xs:enumeration value="isolated"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="header">
    <xs:annotation>
      <xs:documentation>SRX header</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:formathandle" minOccurs="0" maxOccurs="3"/>
        <xs:any minOccurs="0" maxOccurs="unbounded" namespace="##other" processContents="lax"/>
      </xs:sequence>
      <xs:attribute name="segmentsubflows" use="required">
        <xs:annotation>
          <xs:documentation>Determines whether text subflows should be segmented</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="cascade" use="required">
        <xs:annotation>
          <xs:documentation>Determines whether a matching &lt;languagemap&gt; element should terminate the search</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagemap">
    <xs:annotation>
      <xs:documentation>Maps one or more languages to a set of rules</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:attribute name="languagerulename" type="xs:string" use="required">
        <xs:annotation>
          <xs:documentation>The name of the language rule to use when the languagepattern regular expression is satisfied</xs:documentation>
        </xs:annotation>
      </xs:attribute>
      <xs:attribute name="languagepattern" type="xs:string" use="required">
        <xs:annotation>
          <xs:documentation>The regular expression pattern match for the language code</xs:documentation>
        </xs:annotation>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagerule">
    <xs:annotation>
      <xs:documentation>A set of rules for a logical set of languages</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:rule" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="languagerulename" type="xs:string" use="required">
        <xs:annotation>
          <xs:documentation>The name of the language rule</xs:documentation>
        </xs:annotation>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagerules">
    <xs:annotation>
      <xs:documentation>Contains all the logical sets of rules</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:languagerule" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="maprules">
    <xs:annotation>
      <xs:documentation>A set of language maps</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:languagemap" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="rule">
    <xs:annotation>
      <xs:documentation>A break/no break rule</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:beforebreak" minOccurs="0"/>
        <xs:element ref="srx:afterbreak" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="break">
        <xs:annotation>
          <xs:documentation>Determines whether this is a segment break or an exception rule</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="srx">
    <xs:annotation>
      <xs:documentation>OSCAR Segmentation Rules eXchange</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:header"/>
        <xs:element ref="srx:body"/>
      </xs:sequence>
      <xs:attribute name="version" use="required">
        <xs:annotation>
          <xs:documentation>The version of SRX</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="2.0"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>
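
To make the element structure in the schema concrete, the sketch below embeds a small hand-written SRX document and reads its rules with Python's standard `xml.etree.ElementTree`. The sample document, the rule name "Default", the patterns inside it, and the names `SRX_SAMPLE` and `load_rules` are all illustrative assumptions; only the element and attribute names come from the schema. Treating a missing `break` attribute as "yes" is also an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the schema's targetNamespace.
NS = {"srx": "http://www.lisa.org/srx20"}

# Hand-written illustrative document (not an official sample).
SRX_SAMPLE = r"""<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no">
    <formathandle type="start" include="no"/>
    <formathandle type="end" include="yes"/>
    <formathandle type="isolated" include="no"/>
  </header>
  <body>
    <languagerules>
      <languagerule languagerulename="Default">
        <rule break="no">
          <beforebreak>\.+</beforebreak>
          <afterbreak>\s\p{Ll}</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern=".*" languagerulename="Default"/>
    </maprules>
  </body>
</srx>
"""


def load_rules(srx_text):
    """Return {languagerulename: [(is_break, beforebreak, afterbreak), ...]}."""
    root = ET.fromstring(srx_text)
    rules = {}
    for language_rule in root.iterfind("srx:body/srx:languagerules/srx:languagerule", NS):
        name = language_rule.get("languagerulename")
        rules[name] = [
            (
                rule.get("break", "yes") == "yes",  # assumed default when omitted
                rule.findtext("srx:beforebreak", default="", namespaces=NS),
                rule.findtext("srx:afterbreak", default="", namespaces=NS),
            )
            for rule in language_rule.iterfind("srx:rule", NS)
        ]
    return rules


if __name__ == "__main__":
    for name, rule_list in load_rules(SRX_SAMPLE).items():
        print(name, rule_list)
```

The exception rule (break="no") is listed before the break rule in the sample so that a processor applying rules in document order would meet the exception first.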
