Building an English corpus in Python: creating a new corpus with NLTK

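After a few years of figuring out how it works, here is the updated tutorial on

How to create an NLTK corpus from a directory of text files?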

The main idea is to make use of the nltk.corpus.reader package. If you have a directory of text files in English, it is best to use PlaintextCorpusReader.

If you have a directory that looks like this:

    newcorpus/
        file1.txt
        file2.txt
        ...

Just use these lines of code and you get a corpus:

    import os
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    corpusdir = 'newcorpus/'  # Directory of corpus.
    newcorpus = PlaintextCorpusReader(corpusdir, '.*')
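The second argument, '.*', is a regular expression that selects which files belong to the corpus. The sketch below uses only the standard re module to illustrate the effect of a narrower pattern. It assumes the pattern has to match the whole file name relative to the corpus directory; select_fileids and the extra file names are hypothetical and not part of NLTK.

```python
import re

# Hypothetical directory listing; only the two .txt names come from the example above.
filenames = ["file1.txt", "file2.txt", "README", ".DS_Store"]

def select_fileids(pattern, names):
    """Return the names that the pattern matches in full (illustrative helper)."""
    return [name for name in names if re.fullmatch(pattern, name)]

everything = select_fileids(r".*", filenames)      # matches every name
only_txt = select_fileids(r".*\.txt", filenames)   # keeps only names ending in .txt

print(everything)
print(only_txt)
```

If the directory may contain stray files, a pattern such as r'.*\.txt' is a safer choice than '.*'.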

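Note: PlaintextCorpusReader splits your texts into sentences and words with its default tokenizers (an English Punkt sentence tokenizer and a WordPunctTokenizer for words, comparable to nltk.tokenize.sent_tokenize() and nltk.tokenize.word_tokenize()). These are built for English and may not work for all languages.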

Here is the full code that creates the test text files, builds a corpus with NLTK, and accesses the corpus at different levels:

    import os
    from nltk.corpus.reader.plaintext import PlaintextCorpusReader

    # Let's create a corpus with 2 texts in different text files.
    txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
    txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
    corpus = [txt1, txt2]

    # Make a new dir for the corpus.
    corpusdir = 'newcorpus/'
    if not os.path.isdir(corpusdir):
        os.mkdir(corpusdir)

    # Output the files into the directory.
    filename = 0
    for text in corpus:
        filename += 1
        with open(corpusdir + str(filename) + '.txt', 'w') as fout:
            print(text, file=fout)

    # Check that our corpus exists and the files are correct.
    assert os.path.isdir(corpusdir)
    for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
        with open(corpusdir + infile, 'r') as fin:
            assert fin.read().strip() == text.strip()

    # Create a new corpus by specifying the parameters
    # (1) directory of the new corpus
    # (2) the fileids of the corpus
    # NOTE: in this case the fileids are simply the filenames.
    newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

    # Access each file in the corpus.
    for infile in sorted(newcorpus.fileids()):
        print(infile)  # The fileid of each file.
        with newcorpus.open(infile) as fin:  # Opens the file.
            print(fin.read().strip())  # Prints the content of the file.
    print()

    # Access the plaintext; outputs a plain string.
    print(newcorpus.raw().strip())
    print()

    # Access paragraphs in the corpus. (list of list of list of strings)
    # NOTE: NLTK automatically applies the reader's sentence and word tokenizers.
    #
    # Each element in the outermost list is a paragraph, and
    # each paragraph contains sentence(s), and
    # each sentence contains token(s).
    print(newcorpus.paras())
    print()

    # To access paragraphs of a specific fileid.
    print(newcorpus.paras(newcorpus.fileids()[0]))

    # Access sentences in the corpus. (list of list of strings)
    # NOTE: the texts are flattened into sentences that contain tokens.
    print(newcorpus.sents())
    print()

    # To access sentences of a specific fileid.
    print(newcorpus.sents(newcorpus.fileids()[0]))

    # Access just tokens/words in the corpus. (list of strings)
    print(newcorpus.words())

    # To access tokens of a specific fileid.
    print(newcorpus.words(newcorpus.fileids()[0]))
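The three access levels, paras(), sents(), and words(), nest inside one another. As a rough sketch of that relationship, the snippet below flattens a hand-written nested list with the standard library. The data is an illustrative stand-in for the shape of paras(), not actual NLTK output, and the variable names are chosen only for this example.

```python
from itertools import chain

# Hand-written stand-in for the shape returned by paras():
# a list of paragraphs, each a list of sentences, each a list of tokens.
paras = [
    [["This", "is", "a", "foo", "bar", "sentence", "."],
     ["And", "this", "is", "the", "first", "txtfile", "in", "the", "corpus", "."]],
    [["Are", "you", "a", "foo", "bar", "?"],
     ["Yes", "I", "am", "."]],
]

# Dropping the paragraph level gives the shape of sents().
sents = list(chain.from_iterable(paras))

# Dropping the sentence level as well gives the shape of words().
words = list(chain.from_iterable(sents))

print(len(paras), len(sents), len(words))
```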

Finally, to read a directory of texts in another language and create an NLTK corpus from it, you must first make sure you have Python-callable word tokenization and sentence tokenization modules that take string input and produce output like this:

    >>> from nltk.tokenize import sent_tokenize, word_tokenize
    >>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
    >>> sent_tokenize(txt1)
    ['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
    >>> word_tokenize(sent_tokenize(txt1)[0])
    ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
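As a minimal sketch of what such tokenizers could look like for another language, the classes below split Spanish text with plain regular expressions. Both class names and the splitting rules are assumptions made for illustration; they are deliberately naive (for example, a decimal point would end a sentence) and are no substitute for a tokenizer trained on the target language. Each class exposes a tokenize() method, which is the interface NLTK's own tokenizer objects provide.

```python
import re

class RegexSentenceTokenizer:
    """Hypothetical sentence splitter: a sentence ends at '.', '!' or '?'."""
    _sentence = re.compile(r"[^.!?]+[.!?]*")

    def tokenize(self, text):
        # Keep each run of text together with its closing punctuation.
        return [s.strip() for s in self._sentence.findall(text) if s.strip()]

class RegexWordTokenizer:
    """Hypothetical word splitter: runs of word characters, or single punctuation marks."""
    _token = re.compile(r"\w+|[^\w\s]")

    def tokenize(self, text):
        return self._token.findall(text)

sent_tok = RegexSentenceTokenizer()
word_tok = RegexWordTokenizer()

text = "¿Cómo estás? Estoy bien."
for sentence in sent_tok.tokenize(text):
    print(word_tok.tokenize(sentence))
```

Objects like these could then be handed to PlaintextCorpusReader through its word_tokenizer and sent_tokenizer keyword arguments in place of the English defaults.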
