Natural Language Processing: Processing Raw Text

This post covers several ways to access text on the web programmatically.

1. Accessing web resources

>>> from urllib import urlopen
>>> url='http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.astype.html'
>>> raw=urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
16429
>>> raw[:75]
'\n\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n  "http://'
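
The session above is Python 2. In Python 3, urlopen lives in urllib.request and read() returns bytes rather than str, so the body has to be decoded. A minimal sketch under that assumption follows; the helper name fetch_text and its defaults are illustrative and do not come from the original post.

```python
from urllib.request import urlopen


def fetch_text(url, default_encoding="utf-8", timeout=10):
    """Download `url` and return the body decoded to str (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset declared in the Content-Type header when present.
        charset = response.headers.get_content_charset() or default_encoding
    return raw.decode(charset, errors="replace")


# A data: URL keeps this demo offline; an http:// URL is used the same way.
demo = fetch_text("data:text/plain;charset=utf-8,hello%20world")
print(type(demo), len(demo), demo[:75])
```

With a real page, demo[:75] plays the same role as raw[:75] in the session above.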


If Python cannot detect the Internet proxy automatically, you can specify it manually as follows.

>>> proxies={'http': 'http://www.someproxy.com:3128'}
>>> raw=urlopen(url, proxies=proxies).read()
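
The proxies keyword belongs to Python 2's urllib.urlopen. Python 3's urllib.request.urlopen has no such parameter, and the usual route there is a ProxyHandler installed in an opener. The sketch below assumes Python 3; make_proxy_opener is a hypothetical helper name, and the proxy address is the placeholder from the text.

```python
from urllib.request import ProxyHandler, build_opener


def make_proxy_opener(proxies):
    """Return an opener that routes requests through `proxies` (hypothetical helper)."""
    return build_opener(ProxyHandler(proxies))


proxies = {"http": "http://www.someproxy.com:3128"}  # placeholder proxy from the text
opener = make_proxy_opener(proxies)

# The actual request is left commented out because it needs a reachable proxy:
# raw = opener.open(url).read()
```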

2. Accessing blogs

With the third-party Python library Universal Feed Parser (feedparser), you can read the content of a blog.

>>> import feedparser
>>> llog=feedparser.parse('http://weibo.com/ttarticle/p/show?id=2309404116343489194022')
>>> llog.keys()
['feed', 'status', 'version', 'encoding', 'bozo', 'headers', 'href', 'namespaces', 'entries', 'bozo_exception']
>>> type(llog['feed'])
<class 'feedparser.FeedParserDict'>
>>> llog['feed'].keys()
['meta', 'summary']
>>> llog['feed']['meta']
{'content': u'text/html; charset=gb2312', 'http-equiv': u'Content-type'}
>>> llog['feed']['summary']
u'<span id="message"></span>\n\n&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;&amp;'
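
The result above contains a bozo_exception key and a feed holding only meta and summary. That is what feedparser tends to return when the URL points at an ordinary HTML page rather than an RSS or Atom feed, which is probably the case here. The sketch below shows one way to check for this before reading entries. The helper name list_entries is hypothetical, and the dictionary is a hand-written stand-in for a feedparser.parse() result so that the example runs without the library or a network connection.

```python
def list_entries(parsed):
    """Return (title, link) pairs from a feedparser-style result (hypothetical helper)."""
    entries = parsed.get("entries", [])
    # feedparser sets `bozo` when the input was not a well-formed feed;
    # treat a flagged result with no entries as unusable.
    if parsed.get("bozo") and not entries:
        return []
    return [(entry.get("title", ""), entry.get("link", "")) for entry in entries]


# Stand-in for feedparser.parse(<feed URL>); the values are made up for illustration.
parsed = {
    "bozo": 0,
    "entries": [{"title": "First post", "link": "http://example.com/1"}],
}
print(list_entries(parsed))
```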


3. Processing HTML

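There are generally three approaches: regular expression matching, nltk.clean_html(), and BeautifulSoup. Regular expressions are tedious to get right, and nltk.clean_html() is no longer supported, so the simplest and most common choice is the BeautifulSoup package.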
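
To illustrate why the regular expression route is tedious, here is a rough Python 3 sketch of it. The function name strip_tags_with_regex is hypothetical. Even this short version needs separate rules for script and style blocks, comments, tags, entities, and whitespace, and it will still mishandle unusual markup.

```python
import re
from html import unescape


def strip_tags_with_regex(html):
    """Crude tag stripper built from regular expressions (hypothetical helper)."""
    # Drop script and style elements together with their contents.
    html = re.sub(r"(?is)<(script|style)\b[^>]*>.*?</\1\s*>", " ", html)
    # Drop HTML comments.
    html = re.sub(r"(?s)<!--.*?-->", " ", html)
    # Replace every remaining tag with a space.
    html = re.sub(r"(?s)<[^>]+>", " ", html)
    # Decode entities such as &amp; and collapse runs of whitespace.
    return re.sub(r"\s+", " ", unescape(html)).strip()


sample = '<p class="title"><b>The Dormouse\'s story</b></p><script>var x = 1;</script>'
print(strip_tags_with_regex(sample))
```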

from bs4 import BeautifulSoup

html_doc = '''<html><head><title>The Document's story</title></head>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>
'''
# Parse with the built-in 'html.parser' backend and keep only the text nodes.
soup = BeautifulSoup(html_doc, 'html.parser')
content = soup.get_text()
print(content)

The output is as follows:

runfile('D:/my project/e_book/XXMLV-2/4.Python_代码/test.py', wdir='D:/my project/e_book/XXMLV-2/4.Python_代码')
The Document's story
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...
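
In the output above, the link text runs into the neighbouring words (wereElsie, andTillie) because the text nodes are joined with no separator. BeautifulSoup's get_text() accepts separator and strip arguments for this case. The sketch below shows the same idea using only the standard library's html.parser module in Python 3. The names TextExtractor and html_to_text are hypothetical, and this is a simplified stand-in rather than what BeautifulSoup does internally.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty text nodes of an HTML document (hypothetical class)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_text(html, separator=" "):
    """Join the text nodes of `html` with `separator` (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


snippet = '<p class="story">their names were<a href="http://example.com/elsie">Elsie</a> and<a href="http://example.com/tillie">Tillie</a></p>'
print(html_to_text(snippet))
```

Passing a different separator, for example "\n", puts each text node on its own line.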

Reposted from: https://www.cnblogs.com/no-tears-girl/p/6964600.html
