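When crawling a certain site, I first collected the ids of the targets and then built the URLs from those ids. On a second run, however, I found that the URLs built from the ids had changed. Here are the ways I worked around it.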

First, open each list page and visit its links right away, so the site has no time to change the page ids.

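Second, after opening each list page, save every target page URL to a list, which makes the later steps more efficient. In essence this is the same as the first approach.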

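Third, crawl all the residential community names first, then search for them one by one with selenium. This is probably the most robust approach, but it is slow. It is better to crawl the community names first, use the first two approaches, and fall back to this one only for whatever is still missing. In other words, combining the approaches works best.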
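The second approach can be sketched roughly as follows. This is a minimal illustration, not the code used on the real site: the function name `collect_detail_urls`, the sample HTML, and the `example.com` URLs are all made up, and the regex assumes each link sits directly inside a `div` with class `title`. It works on list pages that have already been downloaded and keeps each detail URL only once.

```python
import re


def collect_detail_urls(list_pages):
    """Gather detail-page URLs from downloaded list pages, keeping first-seen order."""
    seen = set()
    urls = []
    for page_html in list_pages:
        # Assumed markup: <div class="title"><a href="...">name</a></div>
        for href in re.findall(r'<div class="title">\s*<a href="(.*?)"', page_html, re.S):
            if href not in seen:  # the same item may appear on several pages
                seen.add(href)
                urls.append(href)
    return urls


# Hypothetical list pages; the second page repeats one entry from the first.
sample_pages = [
    '<div class="title"><a href="https://example.com/xq/1/">A</a></div>'
    '<div class="title"><a href="https://example.com/xq/2/">B</a></div>',
    '<div class="title"><a href="https://example.com/xq/2/">B</a></div>'
    '<div class="title"><a href="https://example.com/xq/3/">C</a></div>',
]
detail_urls = collect_detail_urls(sample_pages)
print(detail_urls)
```

Storing the URLs up front means the detail pages can be fetched in one pass afterwards, without rebuilding any address from an id.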

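The code below uses bs4, which I figured out after a few experiments on a plane. I later read that xml parsing (presumably lxml) is more efficient and I plan to switch, but bs4 is written in pure Python after all, so I will keep using it.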

Here is the code:

# -*- coding: utf-8 -*-
import random
import re
from time import sleep

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Request headers: one User-Agent is picked at random for every request.
hds = [
    {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'},
    {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.12 Safari/535.11'},
    {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)'},
    {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:34.0) Gecko/20100101 Firefox/34.0'},
    {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/44.0.2403.89 Chrome/44.0.2403.89 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'},
    {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'},
    {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'},
    {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'},
    {'User-Agent': 'Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11'},
    {'User-Agent': 'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11'},
]

# Proxy. Sources for usable proxies: http://www.gatherproxy.com/zh/  https://www.zhihu.com/question/23825711
# (learning this properly might also be a way around the firewall)
proxie = {'http': 'http://221.195.53.117:8998'}

# Give up on crawling ids first: open each list page, then visit every link on it right away.
for i in range(1, 2):
    sleep(1)
    url = 'fill-this-in-yourself' + str(i) + '/'
    # html = requests.get(url, headers=random.choice(hds), proxies=proxie, allow_redirects=False)
    html = requests.get(url, headers=random.choice(hds), allow_redirects=False)
    html.encoding = 'utf-8'
    # re.findall returns a list, and passing a list to BeautifulSoup raises
    # "expected string or bytes-like object", so join the matches into one string first.
    bigger = ''.join(re.findall(r'class="listContent"(.*?)class="contentBottom clear"', html.text, re.S))
    soup = BeautifulSoup(bigger, 'html.parser')
    newurls = soup.select('div[class="title"]')
    # A list comprehension also works here: list_a = [tag.get('href') for tag in soup.select('a[href]')]
    for each in newurls:
        xqurl = re.findall(r'href="(.*?)"', str(each), re.S)
        if not xqurl:  # skip title blocks that contain no link
            continue
        sleep(random.randint(6, 12))
        xqhtml = requests.get(xqurl[0], headers=random.choice(hds), allow_redirects=False)
        xqhtml.encoding = 'utf-8'
        xy = re.findall(r"resblockPosition:'(.*?)',", xqhtml.text, re.S)
        soup1 = BeautifulSoup(xqhtml.text, 'html.parser')
        name = soup1.find(class_=['fill-this-in-yourself']).string
        price = soup1.find(class_=['fill-this-in-yourself']).string
        other = soup1.find_all(class_=['fill-this-in-yourself'])
        print(name, price, xy[0], other[0], other[1], other[2], other[3],
              other[4], other[5], other[6], other[7], xqurl[0])

Once the data is ready, it can be written to a CSV file or stored in a database.
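Writing the records to CSV could look like the sketch below. It uses the standard `csv` module rather than pandas so that it runs without extra dependencies. The field names in `FIELDS` and the sample record are invented for illustration and are not the real columns.

```python
import csv
import io

# Hypothetical column names; adjust them to the fields actually scraped.
FIELDS = ["name", "price", "position", "url"]


def write_records(records, fileobj):
    """Write a list of dict records to an open text file object as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


records = [
    {
        "name": "Example Community",
        "price": "50000",
        "position": "116.40,39.90",  # contains a comma, so csv quotes it
        "url": "https://example.com/xq/1/",
    }
]

# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
write_records(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real file, opening it with `open(path, "w", newline="", encoding="utf-8-sig")` usually keeps Chinese text readable when the CSV is opened in Excel.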

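The page-by-page scan above has a bug: different pages contain duplicates, so some entries in the list never get crawled. The results therefore have to be compared against the list, and the entries that were missed are looked up individually. An unexpected bonus is that the same method can also be used to update the data (both adding new communities and refreshing community details).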
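The comparison step can be sketched like this. It is only an illustration under assumptions: records are taken to be dicts with a `name` key, and `find_missing` is a hypothetical helper name. It drops duplicate records and returns the names that still need an individual search.

```python
def find_missing(all_names, crawled_records):
    """Deduplicate crawled records by name and list the names not crawled yet."""
    crawled = {}
    for record in crawled_records:
        # Keep only the first record seen for each name.
        crawled.setdefault(record["name"], record)
    missing = [name for name in all_names if name not in crawled]
    return list(crawled.values()), missing


# Hypothetical data: "B" was scraped twice and "C" was never reached.
all_names = ["A", "B", "C"]
crawled_records = [
    {"name": "A", "price": "1"},
    {"name": "B", "price": "2"},
    {"name": "B", "price": "2"},
]
unique_records, missing_names = find_missing(all_names, crawled_records)
print(unique_records)
print(missing_names)
```

The names in `missing_names` are the ones worth feeding to the slower selenium search. Running the same comparison against a refreshed name list is one way to pick up newly added communities.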

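I originally wanted to do everything with selenium, but switching windows turned out to be fairly complicated, so it was easier to use requests for the rest. All that matters is getting a good result.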

The code is as follows:

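# -*- coding: utf-8 -*-
import re
from time import sleep  # waiting is extremely important

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # simulates key presses

x = 'search term'  # placeholder: the name to search for
browser = webdriver.Chrome()  # on macOS, putting chromedriver in /usr/local/bin/ is enough; no need to disable SIP
browser.get('URL')  # placeholder: the site's address
# find_element_by_id was removed in recent Selenium 4 releases; find_element(By.ID, ...) replaces it
elem = browser.find_element(By.ID, "searchInput")
sleep(5)
elem.send_keys(x)
elem.send_keys(Keys.RETURN)
sleep(5)
# Pass the result page's URL to requests instead of switching windows in selenium
html = requests.get(browser.current_url)
html.encoding = 'utf-8'
targeturl = re.findall(r'<a class="img" href="(.*?)" target="_blank" rel="nofollow">', html.text, re.S)
xqhtml = requests.get(targeturl[0])
xqhtml.encoding = 'utf-8'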

