Python Web Scraper: 《神雕侠侣》 (The Return of the Condor Heroes)

Python 3.5

Scraping 《神雕侠侣》 from http://www.kanunu8.com/wuxia/201102/1610.html

I'm a wuxia fan, so I enjoy scraping wuxia novels.

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Note: webdriver.PhantomJS and the find_element(s)_by_xpath helpers belong to
# older Selenium releases; they are no longer available in Selenium 4.
import os
import re

from docx import Document
from selenium import webdriver


class House():
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'}
        self.baseUrl = 'http://www.kanunu8.com/wuxia/201102/1610.html'
        self.basePath = os.path.dirname(os.path.abspath(__file__))

    def makedir(self, name):
        path = os.path.join(self.basePath, name)
        if not os.path.exists(path):
            os.makedirs(path)
            print('Directory has been created.')
        else:
            print('The directory already exists.')
        # Switch into that directory
        os.chdir(path)

    def connect(self, url):
        try:
            driver = webdriver.PhantomJS()
            driver.get(url)
            return driver
        except Exception:
            print('This page does not exist.')
            return None

    # Collect the link to every chapter listed on the index page
    def getBookLinkList(self, url):
        driver = self.connect(url)
        bookLinkList = []
        if driver is None:
            return bookLinkList
        # Regular expression that picks out the chapter links
        pattern = re.compile(r".+/[0-9]{5}\.html$")
        try:
            # Find every <a> element on the page
            for link in driver.find_elements_by_xpath("//a"):
                temp = link.get_attribute('href')
                print(temp)
                # href is None for anchors that have no link target
                if temp and pattern.match(temp):
                    print('ok')
                    bookLinkList.append(temp)
        except Exception:
            print('Error')
        finally:
            driver.quit()
        return bookLinkList

    # Fetch the title and body text of a single chapter
    def getBookDetail(self, url):
        # Defaults so the return statement still works if scraping fails
        title, content = '', ''
        driver = self.connect(url)
        if driver is None:
            return title, content
        try:
            # Locate the title and the chapter text
            title = driver.find_element_by_xpath('//h2').text
            content = driver.find_element_by_xpath('//p').text
            print(title)
            print(content)
        except Exception:
            print('Error.')
        finally:
            driver.quit()
        return title, content

    def getData(self):
        doc = Document()
        self.makedir('StoryFiles')
        bookLinkList = self.getBookLinkList(self.baseUrl)
        for linkUrl in bookLinkList:
            # add_paragraph expects a string, so unpack the (title, content) tuple
            title, content = self.getBookDetail(linkUrl)
            doc.add_heading(title, level=1)
            doc.add_paragraph(content)
        doc.save('神雕侠侣.docx')


if __name__ == '__main__':
    house = House()
    house.getData()
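
The chapter-link filter in getBookLinkList can be tried on its own, without starting a browser. The sketch below is only an illustration: the helper name filter_chapter_links and the sample hrefs are made up, and the chapter URL layout is an assumption rather than something confirmed against the site. It applies the same regular expression (a path ending in five digits plus ".html") and skips anchors whose href is None.

```python
import re

# Same pattern as in getBookLinkList: a URL whose last path segment
# is exactly five digits followed by ".html".
CHAPTER_LINK = re.compile(r".+/[0-9]{5}\.html$")


def filter_chapter_links(hrefs):
    """Return the hrefs that look like chapter pages (hypothetical helper).

    None or empty values, which Selenium returns for anchors without an
    href attribute, are skipped instead of raising a TypeError.
    """
    return [href for href in hrefs if href and CHAPTER_LINK.match(href)]


# Hypothetical sample hrefs, invented for illustration only.
sample_hrefs = [
    "http://www.kanunu8.com/wuxia/201102/1610.html",        # index page: only 4 digits
    "http://www.kanunu8.com/wuxia/201102/1610/12345.html",  # assumed chapter layout
    None,                                                   # anchor without an href
    "http://www.kanunu8.com/",                              # site root
]

chapter_links = filter_chapter_links(sample_hrefs)
print(chapter_links)
```

Compiling the pattern once at module level avoids rebuilding it for every link, and the explicit None check replaces the inner try/except that the original loop relied on.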

Reposted from: https://www.cnblogs.com/fredkeke/p/7761100.html
