Crawl Short Sentence

Scrape some elegant short sentences that come in both Chinese and English.

Find a website

http://www.siandian.com/haojuzi/1574.html

Use the link above as the example

# Fetch a web page by URL
import urllib.request


def get_html(url):
    # Set a request header so the server does not recognize the request as coming from a program
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    req = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(req)
    # Decode the response bytes as GBK
    html_content = response.read().decode("gbk")
    return html_content

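The code imports urllib and targets Python 3, where Python 2's urllib and urllib2 were merged into the single urllib package.
Setting headers makes the request look like one from an ordinary user, so the server is less likely to block it.
Finally, the function returns the content of the page.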
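
The header and decoding steps described above can be checked without touching the network. The following is a minimal sketch, not part of the original code: build_request and decode_body are hypothetical helper names, and errors="replace" is an added assumption so that a stray byte does not abort decoding. Note that urllib.request.Request stores header names capitalized, so the value is looked up as "User-agent".

```python
import urllib.request

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def build_request(url):
    """Build (but do not send) a request that carries a browser-like User-Agent."""
    return urllib.request.Request(url=url, headers={'User-Agent': USER_AGENT})


def decode_body(raw_bytes, encoding="gbk"):
    """Decode response bytes; errors='replace' is an assumption, not in the original."""
    return raw_bytes.decode(encoding, errors="replace")


req = build_request("http://www.siandian.com/haojuzi/1574.html")
# Request normalizes header names with str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen would then send the request exactly as get_html does.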

Analyze the page

Find the pattern that identifies the content to scrape. On the site above it looks like this:

<p><br />1、我的世界不允许你的消失,不管结局是否完美。<br />No matter the ending is perfect or not, you cannot disappear from my world.</p>
<p>2、爱情是一个精心设计的谎言。<br />Love is a carefully designed lie.</p>

Every phrase starts with "<p>" and ends with "</p>".

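Extract the sentences

import re


def get_sentence(content):
    # re.S lets '.' match newlines, so one <p> block may span several lines
    content_list = re.findall('<p>.*?</p>', content, re.S)
    return content_list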

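The re module is imported; it gives Python full regular-expression support.
The findall function is used: the first argument is the regular expression, the second is the text to search, and the third is the flags.
The regular expression here is simple: it matches content that starts with "<p>" and ends with "</p>". The ".*?" part means:
1. '.' matches any character (newlines included, because re.S is passed)
2. '*' matches the preceding character zero or more times
3. '?' selects non-greedy mode, matching as few characters as possible while still satisfying the pattern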
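
To see what the non-greedy quantifier and the re.S flag each contribute, here is a small sketch run on a made-up two-paragraph string (the sample text is an assumption for illustration, not taken from the site).

```python
import re

# Hypothetical sample: the first paragraph contains a newline
sample = "<p>first<br />\nline</p><p>second</p>"

# Non-greedy with re.S: each <p>...</p> block is matched separately
non_greedy = re.findall(r"<p>.*?</p>", sample, re.S)

# Greedy with re.S: '.*' runs to the last </p>, giving one big match
greedy = re.findall(r"<p>.*</p>", sample, re.S)

# Non-greedy without re.S: '.' stops at the newline, so the first block is missed
no_dotall = re.findall(r"<p>.*?</p>", sample)

print(len(non_greedy), len(greedy), len(no_dotall))  # prints: 2 1 1
```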

Analyze the extracted sentences

type1:

'<p>\r\n\t<br />\r\n\t1、我的世界不允许你的消失,不管结局是否完美。<br />\r\n\tNo matter the ending is perfect or not, you cannot disappear from my world.</p>'

type2:

"<p>\r\n\t66、<a href='http://www.siandian.com/lianaijiqiao/' target='_blank'><u>恋爱</u></a>中，干傻事总是让人感到十分美妙。<br />\r\n\tIn love folly is always sweet.</p>"
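
The two shapes differ only in whether the paragraph contains an inline link. The sketch below tells them apart with the same "<a href=" check that the cleaning code in this post relies on. It also shows one possible alternative, which is not what the original code does: a hypothetical strip_inline_tags helper that removes the <a> and <u> tags while keeping their text, so a type2 sentence could be kept instead of discarded.

```python
import re

TYPE1 = ('<p>\r\n\t<br />\r\n\t1、我的世界不允许你的消失,不管结局是否完美。<br />\r\n\t'
         'No matter the ending is perfect or not, you cannot disappear from my world.</p>')
TYPE2 = ("<p>\r\n\t66、<a href='http://www.siandian.com/lianaijiqiao/' target='_blank'>"
         "<u>恋爱</u></a>中，干傻事总是让人感到十分美妙。<br />\r\n\tIn love folly is always sweet.</p>")


def classify(fragment):
    """Return 'type2' if the fragment contains an inline link, otherwise 'type1'."""
    return "type2" if "<a href=" in fragment else "type1"


def strip_inline_tags(fragment):
    """Hypothetical helper: remove <a> and <u> tags but keep the text inside them."""
    return re.sub(r"</?(?:a|u)\b[^>]*>", "", fragment)


print(classify(TYPE1), classify(TYPE2))  # prints: type1 type2
```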

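Clean the data

def clean_sentence(item_temp):
    # Strip the paragraph tags and mark the Chinese/English boundary with "&&"
    item_temp = (item_temp.replace("<p>\r\n\t<br />", "")
                 .replace("<br />\r\n\t", "&&")
                 .replace("</p>", "")
                 .replace("<p>", "")
                 .replace("\r\n\t", ""))
    # Drop the leading number such as "1、"; a fragment that does not split
    # into exactly two parts is discarded
    item_temp = item_temp.split('、')
    if len(item_temp) == 2:
        item_temp = item_temp[1]
    else:
        # print(item_temp)
        return ''
    # Skip sentences that contain an inline link (type2)
    if "<a href=" not in item_temp:
        return item_temp + " &$\n"
    return ''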

After cleaning, a sentence looks like this (&& and &$ are added so that the Chinese and English parts can be split apart later):

我的世界不允许你的消失,不管结局是否完美。&&No matter the ending is perfect or not, you cannot disappear from my world. &$
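
The post does not show the later splitting step, so the following is only a sketch of how the two markers might be used. split_pair is a hypothetical name, and it assumes a line shaped exactly like the cleaned example above: Chinese text, "&&", English text, then a trailing " &$".

```python
def split_pair(line):
    """Split one cleaned line into (chinese, english); return None if '&&' is missing."""
    body = line.strip()
    # Remove the end-of-sentence marker if it is present
    if body.endswith("&$"):
        body = body[:-2]
    chinese, sep, english = body.partition("&&")
    if not sep:
        return None
    return chinese.strip(), english.strip()


cleaned = ("我的世界不允许你的消失,不管结局是否完美。&&No matter the ending is perfect or not, "
           "you cannot disappear from my world. &$\n")
pair = split_pair(cleaned)
```

Here pair[0] holds the Chinese sentence and pair[1] the English one.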

Complete code

# -*- coding: UTF-8 -*-
import re
import urllib.request

websites = ["http://www.siandian.com/haojuzi/1574.html"]


# Fetch a web page by URL
def get_html(url):
    # Set a request header so the server does not treat the request as a bot
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    req = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(req)
    html_content = response.read().decode("gbk")
    return html_content


# Extract the sentences with a regular expression
def get_sentence(content):
    content_list = re.findall('<p>.*?</p>', content, re.S)
    sentence_list = []
    for item_loop in content_list:
        item_loop = clean_sentence(item_loop)
        if len(item_loop) > 0:
            sentence_list.append(item_loop)
    for show in sentence_list:
        print(show)
    return sentence_list


# Clean a sentence
def clean_sentence(item_temp):
    item_temp = (item_temp.replace("<p>\r\n\t<br />", "")
                 .replace("<br />\r\n\t", "&&")
                 .replace("</p>", "")
                 .replace("<p>", "")
                 .replace("\r\n\t", ""))
    item_temp = item_temp.split('、')
    if len(item_temp) == 2:
        item_temp = item_temp[1]
    else:
        # print(item_temp)
        return ''
    if "<a href=" not in item_temp:
        return item_temp + " &$\n"
    return ''


if __name__ == '__main__':
    html = get_html(websites[0])
    get_sentence(html)
