Python: Scraping a Foreign-Language Website, Translating It into Chinese, and Verifying with Baidu Search

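Today I'm sharing a simple crawler example. The goal is to scrape the blog listings from a foreign-language website, translate them into Chinese with Google Translate, and then verify the translations with a Baidu search.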

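This post only implements the basic functionality. When I have time, I'll write a more complete post that combines it with the Scrapy framework.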
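
Before the full script, here is a minimal sketch of how the three steps (scrape, translate, verify) might be tied together. It is not part of the original code: `annotate_records` and the `title_zh` field are hypothetical names, and the translator and checker are passed in as plain callables so that the sketch runs offline with stand-ins.

```python
from typing import Callable, Dict, Iterable, List


def annotate_records(
    records: Iterable[Dict[str, str]],
    translate: Callable[[str], str],
    check: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Translate each record's title and attach a yes/no verification flag."""
    annotated = []
    for record in records:
        item = dict(record)  # copy so the caller's data is left untouched
        item["title_zh"] = translate(item["title"])   # hypothetical field name
        item["Compliance"] = check(item["title_zh"])  # "yes" or "no"
        annotated.append(item)
    return annotated


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any network access.
    demo = annotate_records(
        [{"id": "1", "title": "Contoh judul"}],
        translate=lambda text: "示例标题",
        check=lambda text: "yes" if len(text) >= 4 else "no",
    )
    print(demo)
```

In the real script, `translate` would presumably be `Google_Translator` and `check` something like `lambda text: Check_frequency(text, 8)`, both defined in the code that follows.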

# _*_ coding:utf-8 _*_
# @Time      : 15:52
# @Author    :baizhou

from googletrans import Translator
import requests
from lxml import etree

# Request headers shared by every request in this script
HEADERS = {
    'User-Agent': "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"}


def Google_Translator(text):
    '''
    Translate text into Simplified Chinese with Google Translate.
    :param text: source text
    :return: translated text
    '''
    translator = Translator(service_urls=['translate.google.cn'])
    result = translator.translate(text, dest="zh-CN").text
    return result


def Check_frequency(text, count):
    '''
    Search Baidu for the text and check whether the results match it.
    :param text: search keyword
    :param count: minimum length of a highlighted match
    :return: "yes" or "no"
    '''
    # Build the search URL
    urls = "http://www.baidu.com/s?ie=UTF-8&wd={}".format(text)
    # Request Baidu once and reuse the response
    response = requests.get(urls, headers=HEADERS)
    if response.status_code == 200:
        # A 200 status code means the request succeeded
        mytree = etree.HTML(response.content.decode("utf-8"))
        contentList = mytree.xpath(".//*[@id='content_left']/div")
        for content in contentList:
            # <em> text inside each result title
            emList = content.xpath("./h3/a/em/text()")
            for i in emList:
                if len(str(i)) >= count:
                    return "yes"
    return "no"


# Fetch the blog listings from https://keuangan.kontan.co.id/
"""
1) Required fields: id (sequence number), title (article title), url (article URL), typ (article
category), publish_time (publication time), article_id (article ID), Compliance (whether it meets the requirement)
2) Find the pagination mechanism
3) Analyze the page structure
"""


def judge_month(month_str):
    # The site prints Indonesian month names (e.g. "Juli"), so match on their
    # three-letter prefixes. This assumes the names are spelled out in full.
    month_list = ["Jan", "Feb", "Mar", "Apr", "Mei", "Jun",
                  "Jul", "Agu", "Sep", "Okt", "Nov", "Des"]
    prefix = month_str.strip()[:3]
    if prefix in month_list:
        return str(month_list.index(prefix) + 1)


def str_to_time(time_str):
    '''
    Convert the site's timestamp string to "year/month/day hours:minutes".
    :param time_str: e.g. "|Selasa, 30 Juli 2019 / 20: 03 WIB"
    :return: e.g. "2019/7/30 20:03"
    '''
    str_list = time_str.strip().split(" ")
    print(str_list)
    hours = str_list[5] + str_list[6]
    year = str_list[3]
    day = str_list[1]
    month = judge_month(str_list[2])
    publish_time = year + "/" + month + "/" + day + " " + hours
    print(publish_time)
    # Return the result so callers can store it (the original only printed it)
    return publish_time


def get_info_spider():
    urls = r"https://keuangan.kontan.co.id/"
    response = requests.get(urls, headers=HEADERS)
    if response.status_code == 200:
        mytree = etree.HTML(response.content.decode('utf-8'))
        info_list = mytree.xpath(".//*[@id='list-news']/li")
        id = 0
        info_cnblog_list = []
        for info in info_list:
            info_dict = {}
            id += 1
            title = info.xpath("./a/div/img/@title")[0]
            article = r"https://keuangan.kontan.co.id" + info.xpath('./a/@href')[0]
            typ = info.xpath("./div[1]/div[1]/span[1]/a/text()")[0]
            publish_time = info.xpath("./div[1]/div[1]/span[2]/text()")[0]
            info_dict["id"] = str(id)
            info_dict["title"] = title
            info_dict["article"] = article
            info_dict["typ"] = typ
            info_dict["publish_time"] = str_to_time(publish_time)
            info_cnblog_list.append(info_dict)
        return info_cnblog_list


def page_turning():
    '''
    Pagination: a POST request with three parameters.
    '''
    urls = r"https://keuangan.kontan.co.id/ajax/more_news"
    # page = 1
    # while True:
    data = {
        "offset": 15,
        "id_rubrik": "terbaru",
        "kanal_name": "keuangan"
    }
    res = requests.post(urls, data=data, headers=HEADERS).content.decode("utf-8")
    mytree = etree.HTML(res)
    info_list = mytree.xpath("//li")
    info_cnblog_list = []
    id = 0
    for info in info_list:
        info_dict = {}
        id += 1
        title = info.xpath("./a/div/img/@title")[0]
        article = r"https://keuangan.kontan.co.id" + info.xpath('./a/@href')[0]
        typ = info.xpath("./div[1]/div[1]/span[1]/a/text()")[0]
        publish_time = info.xpath("./div[1]/div[1]/span[2]/text()")[0]
        info_dict["id"] = str(id)
        info_dict["title"] = title
        info_dict["article"] = article
        info_dict["typ"] = typ
        info_dict["publish_time"] = str_to_time(publish_time)
        info_cnblog_list.append(info_dict)
        print(title)
        # print(len(title))
    # Return the parsed records, mirroring get_info_spider()
    return info_cnblog_list


if __name__ == "__main__":
    # res = Check_frequency("深度学习入门", 8)
    # print(res)
    # get_info_spider()
    # time_str = r"|Selasa, 30 Juli 2019 / 20: 03 WIB"
    # str_to_time(time_str)
    page_turning()
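
`str_to_time` depends on the exact position of every space in the timestamp, so it breaks if the site changes the spacing. As a hedged alternative, the sketch below parses the same format with a regular expression and returns a `datetime`. `parse_publish_time`, `MONTH_PREFIXES` and `TIMESTAMP_RE` are hypothetical names, not part of the original script, and the month table assumes the site spells out Indonesian month names in full, as in the example string `Juli`.

```python
import re
from datetime import datetime
from typing import Optional

# Three-letter prefixes of the Indonesian month names (Januari ... Desember).
MONTH_PREFIXES = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "mei": 5, "jun": 6,
    "jul": 7, "agu": 8, "sep": 9, "okt": 10, "nov": 11, "des": 12,
}

# day, month name, year, then "HH: MM" with optional spaces around the colon
TIMESTAMP_RE = re.compile(
    r"(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\s*/\s*(\d{1,2})\s*:\s*(\d{2})"
)


def parse_publish_time(raw: str) -> Optional[datetime]:
    """Parse strings such as "|Selasa, 30 Juli 2019 / 20: 03 WIB".

    Returns None when the string does not match or the month is unknown.
    The "WIB" time-zone suffix is ignored, so the result is a naive datetime.
    """
    match = TIMESTAMP_RE.search(raw)
    if match is None:
        return None
    day, month_name, year, hour, minute = match.groups()
    month = MONTH_PREFIXES.get(month_name[:3].lower())
    if month is None:
        return None
    return datetime(int(year), month, int(day), int(hour), int(minute))


if __name__ == "__main__":
    parsed = parse_publish_time("|Selasa, 30 Juli 2019 / 20: 03 WIB")
    print(parsed.strftime("%Y/%m/%d %H:%M"))  # prints 2019/07/30 20:03
```

Returning `None` for unrecognized input lets the caller decide whether to skip the record or keep the raw string, instead of failing halfway through a page of results.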

A book recommendation: an honest, well-made textbook on web scraping, 《Python爬虫开发与项目实战》, by 范传辉 (Fan Chuanhui).
