Segmentation Rules

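Each language has its own segmentation rules.
For example, languages written in the Latin script break sentences at the full stop (.), question mark (?), colon (:), and exclamation mark (!), whereas Chinese breaks at the full-width full stop (。), question mark (？), colon (：), and exclamation mark (！).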

For example, the following paragraph:
The Chinese nation is a great nation. With a history of more than 5,000 years, China has made indelible contributions to the progress of human civilization. After the Opium War of 1840, however, China was gradually reduced to a semi-colonial, semi-feudal society and suffered greater ravages than ever before. The country endured intense humiliation, the people were subjected to great pain, and the Chinese civilization was plunged into darkness. Since that time, national rejuvenation has been the greatest dream of the Chinese people and the Chinese nation.

This paragraph is split into:

The Chinese nation is a great nation.	中华民族是世界上伟大的民族
With a history of more than 5,000 years, China has made indelible contributions to the progress of human civilization.	有着5000多年源远流长的文明历史，为人类文明进步作出了不可磨灭的贡献。
After the Opium War of 1840, however, China was gradually reduced to a semi-colonial, semi-feudal society and suffered greater ravages than ever before.	1840年鸦片战争以后，中国逐步成为半殖民地半封建社会
The country endured intense humiliation, the people were subjected to great pain, and the Chinese civilization was plunged into darkness.	国家蒙辱、人民蒙难、文明蒙尘，中华民族遭受了前所未有的劫难
Since that time, national rejuvenation has been the greatest dream of the Chinese people and the Chinese nation.	从那时起，实现中华民族伟大复兴，就成为中国人民和中华民族最伟大的梦想。
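The split above can be approximated with a single regular expression. The following is a minimal sketch, not the method any particular translation tool uses: the name `naive_split` and the sample strings are assumptions for illustration (the Chinese sample is adapted from the table above). It breaks after a Latin-script terminator when whitespace follows, and directly after a full-width Chinese terminator.

```python
import re

# Break after . ? ! : when whitespace follows, or directly after the
# full-width Chinese terminators (Chinese has no space between sentences).
BREAK = re.compile(r"(?<=[.?!:])\s+|(?<=[。？！：])")


def naive_split(text):
    """Split text into segments at sentence-ending punctuation (hypothetical helper)."""
    return [part.strip() for part in BREAK.split(text) if part.strip()]


english = (
    "The Chinese nation is a great nation. With a history of more than "
    "5,000 years, China has made indelible contributions to the progress "
    "of human civilization."
)
chinese = "国家蒙辱、人民蒙难、文明蒙尘。从那时起，实现中华民族伟大复兴，就成为中国人民和中华民族最伟大的梦想。"

for segment in naive_split(english) + naive_split(chinese):
    print(segment)
```

A rule this simple also breaks after abbreviations such as "approx." or "e.g.", which is why practical rule sets pair every break rule with exception rules.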

Common English segmentation rules

Full Stop

Before	Separator	After
	.	Whitespace (non-printing) character, including space

Before	After
\.+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*	\s

Exception: Lower-case letter exception

Before	Separator	After
	.	\s\p{Ll}

Before	After
\.+[\p{Pe}\p{Pf}\p{Po}"]*	\s\p{Ll}
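The full stop rule and its lower-case exception can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of any tool: Python's standard `re` module does not support `\p{...}` classes or the `[...-[...]]` character-class subtraction used above, so the sketch tests Unicode categories with `unicodedata` instead. The names `is_trailing_punct` and `split_on_full_stop` are hypothetical, and only a handful of the excluded code points are listed.

```python
import unicodedata

# Subset of the code points removed by the character-class subtraction
# in the break rule (assumption: the full list is abbreviated here).
EXCLUDED = {",", ":", ";", "\u3001", "\uFF0C", "\uFF1A", "\uFF1B"}


def is_trailing_punct(ch):
    """Approximate [\\p{Pe}\\p{Pf}\\p{Po}"-[...]] for a single character."""
    if ch in EXCLUDED:
        return False
    return ch == '"' or unicodedata.category(ch) in {"Pe", "Pf", "Po"}


def split_on_full_stop(text):
    """Break after '.' plus trailing punctuation when whitespace follows,
    unless that whitespace is followed by a lower-case letter."""
    segments, start, i, n = [], 0, 0, len(text)
    while i < n:
        if text[i] != ".":
            i += 1
            continue
        j = i
        while j < n and text[j] == ".":            # \.+
            j += 1
        while j < n and is_trailing_punct(text[j]):  # trailing punctuation
            j += 1
        if j < n and text[j].isspace():            # after-break: \s
            lower_follows = (
                j + 1 < n and unicodedata.category(text[j + 1]) == "Ll"
            )
            if not lower_follows:                  # exception: \s\p{Ll}
                segments.append(text[start:j])
                start = j
        i = j
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


print(split_on_full_stop('It costs approx. five dollars. She said "Go." Done.'))
```

In the sample sentence, the break after "approx." is suppressed because a lower-case letter follows the space, while the closing quotation mark after "Go." stays attached to its segment.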

Other
Before	Separator	After
	?!	Whitespace (non-printing) character, including space

Before	After
[!?]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*	\s

Exception: Lower-case letter exception

Before	Separator	After
	?!	\s\p{Ll}

Before	After
[!?]+[\p{Pe}\p{Pf}\p{Po}"]*	\s\p{Ll}

Colon

Before	Separator	After
	:	Whitespace (non-printing) character, including space

Before	After
[:]+[\p{Pe}\p{Pf}\p{Po}"-[\u002C\u003A\u003B\u055D\u060C\u061B\u0703\u0704\u0705\u0706\u0707\u0708\u0709\u07F8\u1363\u1364\u1365\u1366\u1802\u1804\u1808\u204F\u205D\u3001\uA60D\uFE10\uFE11\uFE13\uFE14\uFE50\uFE51\uFE54\uFE55\uFF0C\uFF1A\uFF1B\uFF64]]*	\s

Exception: Lower-case letter exception

Before	Separator	After
	:	\s\p{Ll}

Before	After
[:]+[\p{Pe}\p{Pf}\p{Po}"]*	\s\p{Ll}

SRX 2.0 April 7, 2008 | GALA Global (gala-global.org)
SRX (Segmentation Rules eXchange) is an XML-based format for recording segmentation rules.
XML Schema for SRX

<?xml version="1.0" encoding="UTF-8"?>
<!--
Document    : srx20.xsd
Version     : 2.0
Created on  : December 26, 2006
Authors     : dpooley@sdl.com
              rmraya@maxprograms.com
Description : This XML Schema defines the structure of SRX 2.0
Status      : OSCAR recommendation
Copyright © The Localisation Industry Standards Association [LISA] 2006. All Rights Reserved.
-->
<!-- History of modifications (latest first):
Jul-08-2008 by RMR: made foreign elements optional in <header>
Jan-13-2008 by RMR: Permitted elements from foreign namespaces in <header> element
Dec-26-2006 by RMR: Fixed namespace handling
                    Changed version to "2.0"
                    Removed "cascade" attribute from <languagemap>
                    Removed <maprule> element
                    Adjusted attributes to match the specification document
Jun-21-2006 by DRP: Change version number to "1.2" in readiness to move to "2.0"
                    Make the cascade attribute mandatory (required) on the <header> element
                    Add enumerations where necessary and some brief documentation for elements and attributes
Jun-15-2006 by DRP: Change version number to "1.1" in readiness to move to "2.0"
Mar-10-2006 by DRP: Add "cascade" attribute to <header>, <maprule> and <languagemap> elements
Apr-21-2004 by DRP: Convert to version 1.0.
Mar-22-2004 by DRP: Eighth draft version.
                    Ensure the <excludeexception> element is removed
                    Update version number
Mar-17-2004 by DRP: Seventh draft version.
                    Remove <exceptions>, <exception>, <endrules>, <endrule> and <excludeexception> elements
                    Add <rule> element
                    Update version number
Feb-02-2004 by DRP: Sixth draft version.
                    Update version number
Oct-27-2003 by DRP: Fifth draft version.
                    Removed includeformatting attribute from <header> element
                    Added <formathandle> element to the <header>
                    Removed priority attribute from <endrule> and <exception> elements
                    Added name attribute to <exception> element
                    Added <excludeexception> element to the <endrule> element
Oct-10-2003 by DRP: Fourth draft version.
                    Removed <classdefinitions> and <classdefinition> elements
                    Removed classdefinitionname attribute
                    Removed <digitcharacters>, <whitespacecharacters> and <wordcharacters>
                    Added priority attribute to <endrule> and <exception> elements
                    Added includeformatting attribute to <header> element
Jul-24-2003 by DRP: Third draft version.
                    Removed <charsets> and <charset> to be replaced with <classdefinitions> and <classdefinition>
                    Renamed <digits> to <digitcharacters>
                    Renamed <whitespace> to <whitespacecharacters>
                    Renamed <wordchars> to <wordcharacters>
                    <digitcharacters>, <whitespacecharacters> and <wordcharacters> are now optional
                    Renamed <langrules> to <languagerules>
                    Renamed <langrule> to <languagerule>
                    Renamed <langmap> to <languagemap>
                    Renamed langrulename to languagerulename
                    Renamed langpattern to languagepattern
Jun-19-2003 by DRP: Second draft version.
                    Removed the <codepage> element.
                    Added <header> and <body> elements.
Nov-22-2002 by DRP: First draft version
-->
<xs:schema xmlns:srx="http://www.lisa.org/srx20" targetNamespace="http://www.lisa.org/srx20" xml:lang="en" xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:import namespace="http://www.w3.org/XML/1998/namespace" schemaLocation="http://www.w3.org/2001/xml.xsd"/>
  <xs:element name="afterbreak">
    <xs:annotation>
      <xs:documentation>Contains the regular expression to match before the segment break</xs:documentation>
    </xs:annotation>
    <xs:complexType mixed="true"/>
  </xs:element>
  <xs:element name="beforebreak">
    <xs:annotation>
      <xs:documentation>Contains the regular expression to match after the segment break</xs:documentation>
    </xs:annotation>
    <xs:complexType mixed="true"/>
  </xs:element>
  <xs:element name="body">
    <xs:annotation>
      <xs:documentation>SRX body</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:languagerules"/>
        <xs:element ref="srx:maprules"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="formathandle">
    <xs:annotation>
      <xs:documentation>Determines which side of the segment break that formatting information goes</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:attribute name="include" use="required">
        <xs:annotation>
          <xs:documentation>A value of "no" indicates that the format code does not belong to the segment being created. A value of "yes" indicates that the format code belongs to the segment being created.</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="type" use="required">
        <xs:annotation>
          <xs:documentation>The type of format for which behaviour is being defined. Can be "start", "end" or "isolated".</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="start"/>
            <xs:enumeration value="end"/>
            <xs:enumeration value="isolated"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="header">
    <xs:annotation>
      <xs:documentation>SRX header</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:formathandle" minOccurs="0" maxOccurs="3"/>
        <xs:any minOccurs="0" maxOccurs="unbounded" namespace="##other" processContents="lax"/>
      </xs:sequence>
      <xs:attribute name="segmentsubflows" use="required">
        <xs:annotation>
          <xs:documentation>Determines whether text subflows should be segmented</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
      <xs:attribute name="cascade" use="required">
        <xs:annotation>
          <xs:documentation>Determines whether a matching &lt;languagemap&gt; element should terminate the search</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagemap">
    <xs:annotation>
      <xs:documentation>Maps one or more languages to a set of rules</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:attribute name="languagerulename" type="xs:string" use="required">
        <xs:annotation>
          <xs:documentation>The name of the language rule to use when the languagepattern regular expression is satisfied</xs:documentation>
        </xs:annotation>
      </xs:attribute>
      <xs:attribute name="languagepattern" type="xs:string" use="required">
        <xs:annotation>
          <xs:documentation>The regular expression pattern match for the language code</xs:documentation>
        </xs:annotation>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagerule">
    <xs:annotation>
      <xs:documentation>A set of rules for a logical set of languages</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:rule" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="languagerulename" type="xs:string" use="required">
        <xs:annotation>
          <xs:documentation>The name of the language rule</xs:documentation>
        </xs:annotation>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="languagerules">
    <xs:annotation>
      <xs:documentation>Contains all the logical sets of rules</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:languagerule" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="maprules">
    <xs:annotation>
      <xs:documentation>A set of language maps</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:languagemap" minOccurs="1" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="rule">
    <xs:annotation>
      <xs:documentation>A break/no break rule</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:beforebreak" minOccurs="0"/>
        <xs:element ref="srx:afterbreak" minOccurs="0"/>
      </xs:sequence>
      <xs:attribute name="break">
        <xs:annotation>
          <xs:documentation>Determines whether this is a segment break or an exception rule</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="yes"/>
            <xs:enumeration value="no"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
  <xs:element name="srx">
    <xs:annotation>
      <xs:documentation>OSCAR Segmentation Rules eXchange</xs:documentation>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="srx:header"/>
        <xs:element ref="srx:body"/>
      </xs:sequence>
      <xs:attribute name="version" use="required">
        <xs:annotation>
          <xs:documentation>The version of SRX</xs:documentation>
        </xs:annotation>
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:enumeration value="2.0"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>
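To show how the elements defined by the schema fit together, here is a minimal sketch that parses a small SRX document and applies its rules. The sample document, the helper names `load_rules` and `segment`, and the simplified patterns are all assumptions made for illustration. Python's standard `re` module does not support `\p{...}`, so the exception rule uses `[a-z]` in place of `\p{Ll}`. The sketch also assumes that rules are tried in document order and the first rule matching at a position decides whether to break, that `languagepattern` must match the whole language code, and that only the first matching `<languagemap>` is used.

```python
import re
import xml.etree.ElementTree as ET

NS = {"srx": "http://www.lisa.org/srx20"}

# Hypothetical SRX document: one exception rule followed by one break rule.
SAMPLE_SRX = r"""<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <header segmentsubflows="yes" cascade="no"/>
  <body>
    <languagerules>
      <languagerule languagerulename="English">
        <rule break="no">
          <beforebreak>\.</beforebreak>
          <afterbreak>\s[a-z]</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!:]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
    <maprules>
      <languagemap languagepattern="[Ee][Nn].*" languagerulename="English"/>
    </maprules>
  </body>
</srx>"""


def load_rules(srx_text, language_code):
    """Return (is_break, before, after) tuples for the first matching languagemap."""
    root = ET.fromstring(srx_text)
    for lang_map in root.findall("srx:body/srx:maprules/srx:languagemap", NS):
        if not re.fullmatch(lang_map.get("languagepattern"), language_code):
            continue
        name = lang_map.get("languagerulename")
        rules = []
        for lang_rule in root.findall("srx:body/srx:languagerules/srx:languagerule", NS):
            if lang_rule.get("languagerulename") != name:
                continue
            for rule in lang_rule.findall("srx:rule", NS):
                before = rule.findtext("srx:beforebreak", default="", namespaces=NS)
                after = rule.findtext("srx:afterbreak", default="", namespaces=NS)
                rules.append((
                    rule.get("break", "yes") == "yes",
                    re.compile("(?:%s)\\Z" % before),  # must end at the break position
                    re.compile(after),                 # must start at the break position
                ))
        return rules
    return []


def segment(text, rules):
    """Apply the rules at every position; the first matching rule decides."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            if before.search(text[:pos]) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break
    segments.append(text[start:])
    return [s.strip() for s in segments if s.strip()]


rules = load_rules(SAMPLE_SRX, "en-US")
print(segment("It costs approx. five dollars. Is it cheap? Yes.", rules))
```

Placing the `break="no"` rule before the `break="yes"` rule is what lets the exception win: at the position after "approx." the exception matches first, so the break rule is never consulted there.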
