I have recently been reading a wonderful book, Scientific Writing and Communication from Oxford University Press. As I read, I checked my own manuscript-in-progress for the clichéd constructions and redundant phrases that the book flags in English scientific writing. The first chapter turned out to be remarkably practical: for each rule, one can simply search for the listed phrase and delete or replace it. After a round of manual search-and-replace this began to feel tedious, so I decided to collect the rules into a dictionary and check the whole paper programmatically, which could also help other researchers. I spent one evening collecting the phrase rules from the first two chapters and built a checker. The later chapters focus on sentence, paragraph, and document structure; those rules are harder to encode and would require natural language processing and deep learning. Still, the first step is the hardest one; the rest can be filled in over time.

Update: the first two chapters of the book cover Principles 1 through 6, which recommend deleting redundant words and replacing complex or imprecise ones. These rules are implemented in rule1to6_words.py. The latest version additionally implements Principle 10 from Chapter 3, which checks whether a sentence is top-heavy, i.e., has an overly long subject. To measure subject length, the program uses syntactic/dependency parsing from natural language processing.

Update: the latest version adds grammar and spell checking.

Update: added preprocessing for LaTeX files. Equations, figures, and tables are skipped; only textual parts such as the abstract, body, and figure/table captions are analyzed. Code: LongGang Pang / proofread on gitlab.com

Part 1: Introduction

Usage:

git clone https://gitlab.com/snowhitiger/proofread.git

cd proofread

python proofread.py path/*.tex

Example results:

By default, proofread.py only checks for the words that the rule dictionary marks for deletion or replacement.

Other features can be enabled with command-line options. To list the available options, run:

python proofread.py --help

The output is:

usage: proofread.py [-h] [--report_long_subjects] [--report_interruptions]
                    [--report_noun_clusters] [--check_grammars] [--brief]
                    f [f ...]

Proofread scientific papers to make them brief and simple

positional arguments:
  f                       one file or a list of files for analysis

optional arguments:
  -h, --help              show this help message and exit
  --report_long_subjects  Principle 10: avoid overly long subjects (off by default)
  --report_interruptions  Principle 11: avoid long separation between subject and verb (off by default)
  --report_noun_clusters  Principle 18: avoid long noun clusters (off by default)
  --check_grammars        grammar and spell checking (off by default)

For example, to run grammar checking on the file example.tex and also report overly long subjects, use:

python proofread.py example.tex --report_long_subjects --check_grammars

Note that grammar checking is slow, so enable it with care. The latest version automatically parses LaTeX files and skips equations, figures, tables, and other non-text content.
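The LaTeX preprocessing can be sketched as follows. This is a minimal illustration of the idea, not the actual code in proofread.py; the environment list and the function name are assumptions:

```python
import re

# Environments whose contents are not prose and should be skipped (assumed list).
SKIP_ENVS = ["equation", "align", "figure", "table", "tabular", "eqnarray"]

def strip_latex(text):
    """Remove comments, math, and non-text environments, keeping prose and captions."""
    # Save caption text before dropping the figure/table environments that hold it.
    captions = re.findall(r"\\caption\{([^}]*)\}", text)
    # Drop line comments (unescaped % to end of line).
    text = re.sub(r"(?<!\\)%.*", "", text)
    # Drop the skipped environments (non-greedy, DOTALL so they may span lines).
    for env in SKIP_ENVS:
        text = re.sub(r"\\begin\{%s\*?\}.*?\\end\{%s\*?\}" % (env, env),
                      "", text, flags=re.DOTALL)
    # Drop display and inline math.
    text = re.sub(r"\$\$.*?\$\$", "", text, flags=re.DOTALL)
    text = re.sub(r"\$[^$]*\$", "", text)
    # Re-append the captions so they are still proofread as text.
    return text + "\n" + "\n".join(captions)
```

Nested braces inside captions and custom environments would need a real LaTeX parser; the regex version only gives the flavor of the step.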

The replacement dictionary:

The book's intent is that we internalize these rules so that they gradually change how we write, rather than apply them as edits after the fact. I am publishing the collected dictionary here so that you can commit the rules to memory and avoid these phrases while writing. If that is too much effort, or you do not want to check a student's first draft against the dictionary entry by entry, just use the proofread.py script above to automate the search-and-suggest process. The dictionary is a JSON file; this first version contains only a set of manually curated rules:

{"to_remove": ["actually",

"appear",

"absolutely",

"at all times",

"commonly",

"clearly",

"certainly",

"completely",

"much",

"usually",

"perhaps",

"possibly",

"definitely",

"basically",

"essentially",

"fairly",

"very",

"just",

"really",

"seem",

"practically",

"quite",

"rather",

"virtually",

"as it is well known",

"for all intents and purposes",

"it is well known",

"it is shown that",

"it is shown to",

"it is observed that",

"it is reasonable to assume that",

"evidence is presented that shows that",

"it is speculated that",

"it is found that",

"it is demonstrated that",

"it is reported that",

"it has long been known that",

"is observed to",

"is determined to",

"note that",

"needless to say",

"it is noted that",

"it is also worth noting",

"it is worth noting",

"respectively",

"so-called",

"putative",

"in shape",

"in a previous study, it is demonstrated that",

"it is important to realize",

"it is important to note",

"i believe",

"we believe",

"in my opinion",

"in our opinion",

"there are many papers stating"],

"to_replace": {

"a large majority of":"most",

"a very large number of":"many",

"a large number of":"many",

"a considerable number of":"many",

"an adequate amount of":"enough",

"are in agreement":"agree",

"an example of this is the fact that":"for example",

"as a consequence of":"because",

"at no time":"never",

"at the present time":"at present",

"already existing":"existing",

"all of the":"all the",

"as to whether":"whether",

"basic fundamentals":"fundamentals",

"blue in color":"blue",

"based on the fact that":"because",

"by means of":"by",

"considerable amount of":"much",

"cold temperature":"cold",

"currently underway":"underway",

"completely eliminate":"eliminate",

"conduct research":"research",

"make a decision":"decide",

"do not overlook":"note",

"despite the fact that":"although",

"during the time that":"while/when",

"employ":"use",

"elucidate":"show",

"etiology":"cause",

"each individual":"each",

"end result":"result",

"evolve over time":"evolve",

"first of all":"first",

"for the purpose of":"to",

"has the capability of":"can/is able to",

"has the capacity to":"can",

"has the ability to":"can",

"in a careful manner":"carefully",

"in light of the fact that":"because",

"in many cases":"often",

"in order to":"to",

"in some cases":"sometimes",

"in the absence of":"without",

"in the event that":"if",

"in terms of":"regarding",

"in view of the fact that":"because/since",

"it is of interest to note that":"note that",

"it is often the case that":"often",

"in the process of":"while/when",

"it is possible that":"restructure using can/could/may/might",

"the fact that":"delete and restructure",

"may indicate":"suggest",

"majority of":"most",

"methodology":"method",

"no later than":"by",

"not different":"similar",

"not infrequently":"frequently",

"not many":"few",

"not the same":"different",

"not unimportant":"important",

"on the basis of":"by",

"referred to as":"called",

"regardless of the fact that":"even though",

"so as to":"to",

"small in size":"small",

"large in size":"large",

"take advantage of":"use",

"take into consideration":"consider",

"the question as to whether":"whether",

"there is no doubt but that":"doubtless",

"this is a subject that":"this subject",

"here is":"Delete and restructure to create a stronger active subject/verb.",

"utilize":"use",

"used for * purposes":"used for *",

"utilization":"use",

"with reference to":"about (or omit)",

"with regard to":"about (or omit)",

"with the exception of":"except",

"whether or not":"whether",

"prior to":"before",

"past history":"past",

"plays a key role in":"is essential to",

"prioritized":"given priority",

"subsequent to":"after",

"performed an analysis":"analyzed",

"period of time":"period",

"at this point in time":"now",

"due to the fact that":"because",

"a new initiative":"an initiative",

"nearly unique":"unique/rare",

"both * were equally affected":"the * were equally affected",

"in spite of the fact that":"though",

"owing to the fact that":"because",

"enhanced":"increased",

"man":"humans",

"quite sufficiently large":"large",

"reflect deviations":"deviate",

"estimated roughly at":"estimated at",

"first and foremost":"first",

"final outcome":"outcome",

"frequently":"make it more precise, e.g., frequently ---> every 12 hours",

"future plans":"plans",

"main essentials":"essentials",

"still persists":"persists",

"never before":"never",

"reason is because":"reason is",

"situation":"can be omitted, perhaps",

"several":"make it more precise, e.g., several hours ---> 6 hours",

"true facts":"facts",

"the nature of":"can be omitted, perhaps"},

"other_rules":{"there is":"Delete and restructure to create a stronger active subject/verb. E.g., There have been many discussions among the scientific community about ethical boundaries in gene-splicing research. → The scientific community has frequently discussed the ethical boundaries in gene-splicing research.",

"even":"sometimes can omit",

"often":"sometimes can omit",

"with respect to":"about"}

}
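To illustrate how a dictionary like this can drive the checker, here is a minimal sketch. The real matching in rule1to6_words.py is more elaborate (it also handles capitalization and generated variants); the embedded mini-dictionary below is only a sample:

```python
import json
import re

# A tiny sample in the same format as the full rules JSON above.
rules = json.loads("""{
  "to_remove": ["it is well known", "very"],
  "to_replace": {"in order to": "to", "utilize": "use"}
}""")

def suggest(text):
    """Return (phrase, suggestion) pairs for every rule that matches the text."""
    suggestions = []
    for phrase in rules["to_remove"]:
        if re.search(r"\b%s\b" % re.escape(phrase), text, re.IGNORECASE):
            suggestions.append((phrase, "delete"))
    for phrase, repl in rules["to_replace"].items():
        if re.search(r"\b%s\b" % re.escape(phrase), text, re.IGNORECASE):
            suggestions.append((phrase, "replace with '%s'" % repl))
    return suggestions

print(suggest("In order to measure it, we utilize a very precise clock."))
```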

Part 2: Implementation

Rule-based replacement suggestions

This part uses rule matching to suggest deleting redundant words and replacing imprecise ones. For example:

1. "As it is well known": suggest deleting.

2. "It is shown that": suggest deleting.

3. "is related to": suggest replacing with "is associated with".

4. "provides a review": suggest replacing with "reviews".

Each of these rules has many variants, and the rule-based module tries to cover all of them with corresponding replacement suggestions. For example, only "As it is well known" appears in the dictionary; the program generates further rules through data augmentation, such as "As it was well known", "As it has been well known", "As it have been well known", and "As it had been well known".
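The augmentation idea can be sketched like this (illustrative only; the actual generation in rule1to6_words.py may differ):

```python
import re

# Auxiliary-verb groups used to expand one template rule into many variants.
# The substitution is deliberately naive, so some variants are ungrammatical
# but still useful for catching mistakes in drafts.
AUX_VARIANTS = ["is", "was", "are", "were", "has been", "have been", "had been"]

def augment(phrase):
    """Swap the first standalone 'is' in a phrase for each auxiliary variant."""
    return [re.sub(r"\bis\b", aux, phrase, count=1) for aux in AUX_VARIANTS]

print(augment("as it is well known"))
```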

rule1to6_words.py implements all the word deletion and replacement suggestions. Running proofread.py directly applies every rule defined in this script by default.

Grammar checks based on natural language processing

Avoid overly long subjects and excessive modifiers before the subject

Keep the subject short and close to the verb; otherwise the sentence becomes hard to follow. This is Principle 10 of scientific writing in Scientific Writing and Communication.

For example:

Bad: Due to the nonlinear and hence complex nature of ocean currents, modeling these currents in the tropical Pacific is difficult.

Good: Modeling ocean currents in the tropical Pacific is difficult due to their nonlinear and hence complex nature.

The current implementation uses the spaCy library to run a syntactic analysis of every sentence and locate the subject and verb. spaCy is a Python natural language processing library similar to NLTK. If the part before the verb is so long that it hinders comprehension, the program suggests a revision. The implementation is in rule10_long_subject.py.
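The core of the check can be illustrated without running spaCy itself: given a dependency parse, count the tokens that precede the root verb. Below, the parse is hand-written for illustration; in rule10_long_subject.py the tokens would come from spaCy's parser, and the exact threshold and labels are assumptions:

```python
# Each token is (text, dependency_label); "ROOT" marks the main verb.
def subject_span_length(parsed_tokens):
    """Number of tokens before the root verb; a large value means a top-heavy sentence."""
    for i, (_, dep) in enumerate(parsed_tokens):
        if dep == "ROOT":
            return i
    return 0

def too_top_heavy(parsed_tokens, threshold=7):
    """Flag the sentence if too many tokens pile up before the main verb."""
    return subject_span_length(parsed_tokens) > threshold

# Hand-made parse of the counterexample sentence (labels are illustrative).
sentence = [("Due", "prep"), ("to", "pcomp"), ("the", "det"),
            ("nonlinear", "amod"), ("and", "cc"), ("hence", "advmod"),
            ("complex", "conj"), ("nature", "pobj"), ("of", "prep"),
            ("ocean", "compound"), ("currents", "pobj"), (",", "punct"),
            ("modeling", "csubj"), ("these", "det"), ("currents", "dobj"),
            ("is", "ROOT"), ("difficult", "acomp")]
print(too_top_heavy(sentence))  # the 15 tokens before "is" exceed the threshold
```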

By default, proofread.py does not check for overly long subjects; enable the check with the --report_long_subjects option on the command line:

python proofread.py example.txt --report_long_subjects

If spaCy is not installed, then the first time --report_long_subjects is enabled, Python automatically runs pip install spacy --user and downloads the English language model it needs.
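The auto-install step amounts to a try-import-then-pip pattern, sketched generically here (how proofread.py does it exactly is an assumption):

```python
import importlib
import subprocess
import sys

def ensure_module(name):
    """Import a module, installing it with pip on first use if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # Install into the user site-packages, then retry the import.
        subprocess.check_call([sys.executable, "-m", "pip", "install", "--user", name])
        return importlib.import_module(name)
```

With `ensure_module("spacy")`, the library would also still need its English model (downloaded separately via spaCy's own download command).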

Semantic analysis based on natural language processing

ToDo

For words and phrases, some cases require semantic judgment and may call for additional NLP techniques. For example, "since" is usually used in time-related clauses, but it is sometimes used to mean "because". If we want to replace all non-temporal uses with "because" or "as", we need some machine learning or AI to decide whether the sentence expresses "from some moment onward". Context-dependent word vectors could then tell whether a given "since" carries a temporal component. (Note: BERT word vectors are not directly usable here because of their position embeddings.)
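Until such a contextual-embedding classifier exists, a crude heuristic conveys the flavor of the problem. This is purely illustrative and not part of the current code; the cue list is an assumption:

```python
import re

# Words after "since" that suggest a temporal rather than causal reading.
TEMPORAL_CUES = re.compile(
    r"\bsince\b\s+(\d{4}|then|the\s+(start|beginning)|last|early|late)\b",
    re.IGNORECASE)

def since_is_temporal(sentence):
    """Guess whether 'since' marks time ('since 2015') rather than cause."""
    return bool(TEMPORAL_CUES.search(sentence))

print(since_is_temporal("The field has grown since 2015."))        # True
print(since_is_temporal("Since the data are sparse, we smooth."))  # False
```

A contextual model would replace the cue list with a learned decision, which is exactly why this item remains a ToDo.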

For sentences, paragraphs, and document structure, natural language processing is indispensable. Consider rules such as:

place new information at the end of a sentence;

use negative or passive constructions sparingly;

make sure the first and last sentences of a paragraph match;

turn nouns into verbs to make sentences more active.

Building these rules requires understanding and judging meaning: for instance, deciding which sentence in a paragraph introduces new information, or where a negative or passive construction can be replaced by a synonym. Possible methods range from simple hidden Markov models or conditional random fields, to RNNs, to more complex models such as BERT. Difficult, but not impossible; as natural language understanding matures, these sentence- and paragraph-level rules may gradually be built into the code.

For sentence- and paragraph-level rules, the hardest part is obtaining training data. If a pretrained language model such as BERT can reduce the data needed for fine-tuning to on the order of 10 or 100 examples, then a small set of samples would be enough to train a neural network to enforce these grammar and paragraph rules one at a time.

Rule resources
