Crawler project: scraping pages and importing them into an Excel spreadsheet following the page layout

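The requirement is as follows:
Scrape more than 4,000 pages like http://sgv.gurdon.cam.ac.uk/gene_page.php?gene=Del1_TDA8 (a real chore), extract the fields shown on each page, and write them to an Excel spreadsheet.
The difficulty is that the fields displayed differ from one URL to the next, so the code has to check what each cell contains, and the output has to stay aligned when it is written to Excel.
Here is the code: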

# @brief: scrape page text and write it to an Excel spreadsheet
# author: ytouch
# email: 942840260@qq.com
import requests
from bs4 import BeautifulSoup
from xlwt import Workbook  # xlwt handles writing the .xls file

# Cell texts that start a new row in column 0: the two genotype labels,
# the ' - - -  ' separator, and the Roman numerals I through XX.
ROW_LABELS = {
    'Homozygous', 'Heterozygous', ' - - -  ',
    'I', 'II', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X',
    'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX',
}


def getUrlList(filepath):
    '''@brief: read the URLs from a text file, one per line'''
    list_url = []
    with open(filepath) as file:
        for line in file:
            cur_url = line.strip()  # strip surrounding whitespace
            if cur_url:  # skip blank lines, which would make requests.get fail
                list_url.append(cur_url)
    return list_url


def getDataFromUrl(data_url, index, in_col):
    '''
    @brief: fetch one page and write its fields to the sheet
    @param data_url: page URL
    @param index: current row number
    @param in_col: current column number
    @return: the row number after the last row written
    '''
    book_sheet.write(index, 0, label=data_url)  # write the URL to the sheet
    index = index + 1
    response = requests.get(data_url)
    soup = BeautifulSoup(response.text, 'lxml')
    # The title is the first <h1> text followed by the first <h2> text
    tag_title = soup.find_all('h1')[0].get_text() + soup.find_all('h2')[0].get_text()
    list_query = soup.find_all('tr', style="vertical-align: top")
    book_sheet.write(index, 0, label=tag_title)  # write the title to the sheet
    index = index + 1
    str_tag_data = list_query[0].find_all('td')
    for s_data in str_tag_data:
        # Only handle leaf cells: skip any <td> that contains nested <td> elements
        if len(s_data.find_all('td')) < 1:
            curdata_str = s_data.get_text()
            # Skip cells that contain 'Secondary mutations'
            if curdata_str.find('Secondary mutations') < 0:
                if curdata_str in ROW_LABELS:
                    # A row label starts a new row in column 0
                    book_sheet.write(index, 0, label=curdata_str)
                    index = index + 1
                    in_col = 0
                    continue
                # Any other value goes into the next column of the row just started
                in_col = in_col + 1
                book_sheet.write(index - 1, in_col, curdata_str)
    return index


# Load the URL list
list_cur_url = getUrlList('test.txt')  # list_cur_url: the URLs to request
# Excel setup
book = Workbook(encoding='utf-8')  # set the encoding
book_sheet = book.add_sheet('gurdon')  # add a new sheet named 'gurdon'
cur_count_index = 0  # next free row in the sheet
for in_url in list_cur_url:
    cur_cols = 0
    cur_count_index = getDataFromUrl(in_url, cur_count_index, cur_cols)
    cur_count_index = cur_count_index + 1  # leave one blank row between pages
book.save('gurdon.xls')  # save the generated .xls file
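
The row-alignment rule in getDataFromUrl can also be written as a pure function, which makes it easy to check without network access or a workbook. The following is only a sketch, not the author's code: `group_cells` is a hypothetical name, the label set is assumed to be the two genotype labels, the ' - - -  ' separator and the Roman numerals I to XX, and values that appear before any label are given a row of their own with an empty first column (a choice made for this sketch).

```python
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X",
         "XI", "XII", "XIII", "XIV", "XV", "XVI", "XVII", "XVIII", "XIX", "XX"]
ROW_LABELS = {"Homozygous", "Heterozygous", " - - -  "} | set(ROMAN)


def group_cells(cells, skip_marker="Secondary mutations"):
    """Group a flat list of cell texts into rows.

    A row label starts a new row; every other text is appended to the
    most recent row. Texts containing skip_marker are dropped.
    """
    rows = []
    for text in cells:
        if skip_marker in text:
            continue
        if text in ROW_LABELS:
            rows.append([text])
        elif rows:
            rows[-1].append(text)
        else:
            # Value seen before any label: keep column 0 empty (sketch assumption)
            rows.append(["", text])
    return rows


if __name__ == "__main__":
    sample = ["Homozygous", "a", "b", "Secondary mutations: x", "IV", "c"]
    for row in group_cells(sample):
        print(row)
```

Each inner list then maps directly to one spreadsheet row, with the list position as the column number.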

All of the URLs are read from a txt file; typing them into the script one by one would be crazy.
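
With several thousand URLs in one text file, blank lines and accidental duplicates are easy to introduce. A minimal sketch of a more defensive reader is shown here; `read_urls` is a hypothetical helper, and treating lines that start with `#` as comments is an assumption of this sketch rather than something the original file format specifies.

```python
def read_urls(path):
    """Return the URLs in a text file, one per line, in first-seen order.

    Blank lines and lines starting with '#' are skipped, and repeated
    URLs are kept only once.
    """
    seen = set()
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()
            if not url or url.startswith("#"):
                continue
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

Calling `read_urls('test.txt')` would then give a list that can be passed to the scraping loop in place of the raw file contents.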
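The scraped result basically meets the requirements. Of the libraries used, the most practical one is xlwt. Its most annoying limitation is that it is write-only: you cannot read cells back from a workbook while you are writing it. Apart from that it works fine.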
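
One way to work around a write-only library is to keep the cells in memory, where they can be read and revised, and copy them to the sheet in a single pass at the end. The sketch below illustrates that idea under stated assumptions: `SheetBuffer` is a hypothetical class, and `flush` only assumes that the target object has a `write(row, col, value)` method, which is how an xlwt worksheet is called in the code above.

```python
class SheetBuffer:
    """In-memory grid of cell values keyed by (row, col)."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        # Overwrites any value already stored at this position
        self.cells[(row, col)] = value

    def read(self, row, col, default=None):
        return self.cells.get((row, col), default)

    def row_count(self):
        # Number of rows up to and including the last one written
        return max((row for row, _ in self.cells), default=-1) + 1

    def flush(self, sheet):
        """Copy every buffered cell to sheet, in row-then-column order."""
        for (row, col), value in sorted(self.cells.items()):
            sheet.write(row, col, value)
```

With xlwt, `buffer.flush(book.add_sheet('gurdon'))` followed by `book.save('gurdon.xls')` would write the buffered cells in one go, and each cell is written only once even if it was revised in the buffer.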
This code is worth a few hundred yuan; I hope it gives you some ideas for your own approach.
