Crawling 58.com used-car data with Celery, and a few open questions

Today I'd like to share how to crawl 58.com used-car listings (hereafter "58") in a distributed fashion with Celery.

Anti-crawling measures

58's anti-crawling measures consist mainly of font obfuscation and captcha verification.
Font obfuscation first. The real font file is base64-encoded and embedded in the page source. Extract it with a regex, decode it, and build a dictionary that maps each glyph (identified by its coordinate data) to the real digit. Because the font file is different on every request, you have to extract it every time, look up the real values through that dictionary, replace the obfuscated characters in the HTML, and only then parse out the data you need. That takes care of the font obfuscation. The captcha problem deserves a more detailed discussion.
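A condensed sketch of that substitution step is below. It assumes html is the page source and reuses the same extraction regex and the dic_font table (md5 of a glyph's outline coordinates mapped to its real digit) that appear in download_v3.py further down; the helper name restore_digits is mine.

import re
import base64
from hashlib import md5
from fontTools.ttLib import TTFont

def restore_digits(html, dic_font):
    # Pull the base64-encoded TTF out of the page source and write it to disk
    b64 = re.search(r"charset=utf-8;base64,(.*?)//wAP", html, re.S).group(1)
    with open('tc.ttf', 'wb') as f:
        f.write(base64.b64decode(b64 + '='))
    # Identify each glyph by the md5 of its outline coordinates, then replace
    # the HTML entity used in the page (&#x....) with the real digit
    font = TTFont('tc.ttf')
    for i, name in enumerate(font.getGlyphOrder()):
        if i == 0:  # skip .notdef
            continue
        digest = md5(str(font['glyf'][name].coordinates).encode()).hexdigest()
        entity = name.lower().replace('uni00', '&#x').replace('uni', '&#x')
        html = html.replace(entity, str(dic_font[digest]))
    return html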

Captcha

During crawling, the site will from time to time redirect to a verification page. The main check is a slider captcha; occasionally a click-select captcha pops up after the slider, which would probably require training a darknet model to recognize, so I won't cover it here.

I originally wanted to solve the slider with the method from my previous article, but awkwardly, after pre-processing 58's captcha image there were too many contours interfering with the detection and several attempts failed; on top of that, the orientation of the small puzzle piece (Figure 1) is random each time. So I switched to template matching with aircv: convert the small puzzle piece to grayscale and binarize it as well, then crop out a characteristic patch of it as the sub-image. The full captcha image, after grayscale conversion and binarization (with a threshold of roughly 185), looks like Figure 2 and almost always contains the complete outline of the puzzle piece, so it serves as the template. The remaining question is which patch of the puzzle piece to crop as the sub-image. Since the piece's orientation and shape change on every verification, after some comparison and experimentation the region circled in red in Figure 1 gave a relatively high success rate, though it is probably still not optimal. The cropped, processed patch looks roughly like Figure 3; it is not very clear here, but it actually contains both the white area outside the piece's outline in Figure 2 and the black outline itself. Matching with this sub-image almost always locates the piece's coordinates in the captcha, but the matched x coordinate may fall on either the left or the right edge of the piece's outline, so an offset has to be applied and the verification attempted twice.
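A minimal sketch of that matching step, assuming the binarized captcha has been saved as the template image and the cropped feature patch as the sub-image (the complete version, including the Selenium drag, is verifycaptcha.py below):

import aircv as ac
from PIL import Image

def binarize(src_path, dst_path, threshold=185):
    # Grayscale then binarize so that mostly the puzzle-piece outline survives
    img = Image.open(src_path).convert('L')
    table = [0 if i < threshold else 1 for i in range(256)]
    img.point(table, '1').save(dst_path)

def find_gap(template_path, patch_path, confidence=0.5):
    # Template-match the feature patch against the binarized captcha.
    # The returned x may sit on either edge of the piece's outline, so the
    # caller still has to try it with more than one offset.
    result = ac.find_template(ac.imread(template_path), ac.imread(patch_path), confidence)
    return result['result'][0] if result else None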


Code

1. Captcha handling: verifycaptcha.py

from time import sleep
import random, math
import aircv as ac
import cv2
import numpy as np
from PIL import Image
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

SCRIPT = 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined})'


class VerifyCaptcha:
    def __init__(self, threshold=180):
        # Binarization threshold
        self.threshold = 175
        # imgsrc = source image, imgobj = image to look for
        self.imgsrc = "./imgs/handled.png"
        self.imgobj = "./imgs/handleds.png"

    def getTrack(self, gap, offset):
        # Build the slide track
        track = []
        gap = gap + offset
        # Current displacement
        current = 0
        # Deceleration threshold
        mid = gap * 4 / 5  # accelerate over the first 4/5, decelerate over the last 1/5
        # Time step
        t = 0.2
        # Initial velocity
        v = random.randint(1, 4)
        while current < gap:
            if current < mid:
                a = 3  # acceleration +3
            else:
                a = -3  # acceleration -3
            # Initial velocity v0
            v0 = v
            # Current velocity
            v = v0 + a * t
            # Distance moved in this step
            move = v0 * t + math.sin(1 / 2 * a * t * t) * 15
            # Current displacement
            current += move
            # Append to the track
            track.append(round(move))
        return track

    # Pre-process an image: grayscale + binarize
    def handle(self, path, threshold):
        img = Image.open(path)
        img = img.convert('L')
        table = []
        for i in range(256):
            if i < threshold:
                table.append(0)
            else:
                table.append(1)
        bim = img.point(table, '1')
        bim.save('./imgs/handled.png')

    def getImg(self, url):
        global browser
        browser = webdriver.Chrome()
        wait = WebDriverWait(browser, 10)
        browser.get(url)
        # Switch to the account login tab
        btn_ver = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.code_num > input')))
        btn_ver.click()
        # Grab the captcha image and the slider handle
        canvas = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.dvc-captcha__bgImg')))
        btn = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.dvc-slider__handler')))
        self.handle_slider(canvas, browser, btn, wait)
        sleep(3)
        browser.close()

    def handle_slider(self, canvas, browser, btn, wait):
        canvas.screenshot('./imgs/1.png')
        self.cropImg()
        self.handle('./imgs/croped.png', self.threshold)
        gap = self.matchImg()
        print(gap)
        # Three offsets are tried here to raise the success rate
        for offset in [57, 13, 11]:
            # Build the slide track
            track = self.getTrack(gap, offset)
            # Drag the slider along the track
            ActionChains(browser).click_and_hold(btn).perform()
            for x in track:
                y = random.uniform(-3, 3)
                ActionChains(browser).move_by_offset(xoffset=x, yoffset=y).perform()
            ActionChains(browser).release(btn).perform()
            btn = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.dvc-slider__handler')))
            if btn:
                if offset == 11:
                    # If all three attempts failed, start over and verify again
                    return self.handle_slider(canvas, browser, btn, wait)
                continue
            else:
                return

    def cropImg(self):
        # Crop the screenshot so the small thumbnail does not interfere with contour detection
        base_crop = 70
        img1 = Image.open('./imgs/1.png')
        box = (base_crop, 30, img1.size[0] - 20, img1.size[1] - 25)
        img = img1.crop(box)
        img.save('./imgs/croped.png')

    # Take a small patch of the puzzle piece as the sub-image and template-match it
    # against the captcha image to find the piece's coordinates
    def matchImg(self, confidencevalue=0.5):
        imsrc = ac.imread(self.imgsrc)
        imobj = ac.imread(self.imgobj)
        match_result = ac.find_template(imsrc, imobj, confidencevalue)
        if match_result is not None:
            match_result['shape'] = (imsrc.shape[1], imsrc.shape[0])
            return match_result['result'][0]


if __name__ == "__main__":
    vc = VerifyCaptcha()
    # Link to a captcha page, for testing
    url = 'https://callback.58.com/antibot/verifycode?serialId=52e2ca88845f157e29a6d26349ef0344_6a059b1477814bb8baf7ee04e2b61764&code=22&sign=235448960cb8f4d6cd710b06c61dc57a&namespace=usdt_infolist_car&url=https%3A%2F%2Finfo5.58.com%3A443%2Ftj%2Fershouche%2F%3FPGTID%3D0d100000-0001-2e0e-cd43-870ea87150b4%26ClickID%3D8'
    # url = 'https://bj.58.com/ershouche/pn2'
    vc.getImg(url)

2. Celery task producer: celery_client.py

from celery_con import crawl, app
from city import generate_url

for url in generate_url():
    app.send_task('celery_con.crawl', args=(url,))

3. Celery app: celery_con.py

from celery import Celery
from city import generate_url
from download_v3 import CarSpider
import celeryconfig

app = Celery('tasks')
app.config_from_object('celeryconfig')
carspider = CarSpider()


@app.task
def crawl(url):
    carspider.run(url)
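With the broker configured (next section), a worker is started on each machine with something like celery -A celery_con worker -l info (the exact flags depend on your Celery version), and celery_client.py is then run once to push the crawl tasks onto the queue.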

4. Celery configuration: celeryconfig.py

BROKER_URL = 'amqp://admin:yourpassword@ip:5672/'
CELERY_RESULT_BACKEND = 'amqp://'

5. URL generation: city.py

CITY = ['bj', 'sh', 'tj', 'cq', 'hf', 'wuhu', 'bengbu', 'fy', 'hn', 'anqing', 'suzhou', 'la', 'huaibei', 'chuzhou', 'mas', 'tongling', 'xuancheng', 'bozhou', 'huangshan', 'chizhou', 'ch', 'hexian', 'hq', 'tongcheng', 'ningguo', 'tianchang', 'dongzhi', 'wuweixian', 'fz', 'xm', 'qz', 'pt', 'zhangzhou', 'nd', 'sm', 'np', 'ly', 'wuyishan', 'shishi', 'jinjiangshi', 'nananshi', 'longhai', 'shanghangxian', 'fuanshi', 'fudingshi', 'anxixian', 'yongchunxian', 'yongan', 'zhangpu', 'sz', 'gz', 'dg', 'fs', 'zs', 'zh', 'huizhou', 'jm', 'st', 'zhanjiang', 'zq', 'mm', 'jy', 'mz', 'qingyuan', 'yj', 'sg', 'heyuan', 'yf', 'sw', 'chaozhou', 'taishan', 'yangchun', 'sd', 'huidong', 'boluo', 'haifengxian', 'kaipingshi', 'lufengshi', 'nn', 'liuzhou', 'gl', 'yulin', 'wuzhou', 'bh', 'gg',
'qinzhou', 'baise', 'hc', 'lb', 'hezhou', 'fcg', 'chongzuo', 'guipingqu', 'beiliushi', 'bobaixian', 'cenxi', 'gy', 'zunyi', 'qdn', 'qn', 'lps', 'bijie', 'tr', 'anshun', 'qxn', 'renhuaishi', 'qingzhen', 'lz', 'tianshui', 'by', 'qingyang', 'pl', 'jq', 'zhangye', 'wuwei', 'dx', 'jinchang', 'ln', 'linxia', 'jyg', 'gn', 'dunhuang', 'haikou', 'sanya', 'wzs', 'sansha', 'qh', 'wenchang', 'wanning', 'tunchang', 'qiongzhong', 'lingshui', 'df', 'da', 'cm', 'baoting', 'baish', 'danzhou', 'zz', 'luoyang', 'xx', 'ny', 'xc', 'pds', 'ay', 'jiaozuo', 'sq', 'kaifeng', 'puyang', 'zk', 'xy', 'zmd', 'luohe', 'smx', 'hb', 'jiyuan', 'mg', 'yanling', 'yuzhou', 'changge', 'lingbaoshi', 'qixianqu', 'ruzhou', 'xiangchengshi', 'yanshiqu', 'changyuan', 'huaxian', 'linzhou', 'qinyang', 'mengzhou', 'wenxian', 'weishixian', 'lankaoxian', 'tongxuxian', 'lyxinan', 'yichuan', 'mengjinqu', 'lyyiyang', 'wugang', 'yongcheng', 'suixian', 'luyi', 'yingchixian', 'shenqiu', 'taikang', 'shangshui', 'qixianq', 'junxian', 'fanxian', 'gushixian', 'huaibinxian', 'dengzhou', 'xinye', 'hrb', 'dq', 'qqhr', 'mdj', 'suihua', 'jms', 'jixi', 'sys', 'hegang', 'heihe', 'yich', 'qth', 'dxal', 'shanda', 'shzhaodong', 'zhaozhou', 'wh', 'yc', 'xf', 'jingzhou', 'shiyan', 'hshi', 'xiaogan', 'hg', 'es', 'jingmen', 'xianning', 'ez', 'suizhou', 'qianjiang', 'tm', 'xiantao', 'snj', 'yidou', 'hanchuan', 'zaoyang', 'wuxueshi', 'zhongxiangshi', 'jingshanxian', 'shayangxian', 'songzi', 'guangshuishi', 'chibishi', 'laohekou', 'gucheng', 'yichengshi', 'nanzhang', 'yunmeng', 'anlu', 'dawu', 'xiaochang', 'dangyang', 'zhijiang', 'jiayuxian', 'suixia', 'cs', 'zhuzhou', 'yiyang', 'changde', 'hy', 'xiangtan', 'yy', 'chenzhou', 'shaoyang', 'hh', 'yongzhou', 'ld', 'xiangxi', 'zjj', 'liling', 'lixian', 'czguiyang', 'zixing', 'yongxing', 'changningshi', 'qidongxian', 'hengdong', 'lengshuijiangshi', 'lianyuanshi', 'shuangfengxian', 'shaoyangxian', 'shaodongxian', 'yuanjiangs', 'nanxian', 'qiyang', 'xiangyin', 'huarong', 'cilixian', 'zzyouxian', 'sjz', 'bd', 'ts', 'lf', 'hd', 'qhd', 'cangzhou', 'xt', 'hs', 'zjk', 'chengde', 'dingzhou', 'gt', 'zhangbei', 'zx', 'zd', 'qianan', 'renqiu', 'sanhe', 'wuan', 'xionganxinqu', 'lfyanjiao', 'zhuozhou', 'hejian', 'huanghua', 'cangxian', 'cixian', 'shexian', 'bazhou', 'xianghe', 'lfguan', 'zunhua', 'qianxixian', 'yutianxian', 'luannanxian', 'shaheshi', 'su', 'nj', 'wx', 'cz', 'xz', 'nt', 'yz', 'yancheng', 'ha', 'lyg', 'taizhou', 'suqian', 'zj', 'shuyang', 'dafeng', 'rugao', 'qidong', 'liyang', 'haimen', 'donghai', 'yangzhong', 'xinghuashi', 'xinyishi', 'taixing', 'rudong', 'pizhou', 'xzpeixian', 'jingjiang', 'jianhu', 'haian', 'dongtai', 'danyang', 'baoyingx', 'guannan', 'guanyun', 'jiangyan', 'jintan', 'szkunshan', 'sihong', 'siyang', 'jurong', 'sheyang', 'funingxian', 'xiangshui', 'xuyi', 'jinhu', 'jiangyins', 'nc', 'ganzhou', 'jj', 'yichun', 'ja', 'sr', 'px', 'fuzhou', 'jdz', 'xinyu', 'yingtan', 'yxx', 'lepingshi', 'jinxian', 'fenyi', 'fengchengshi', 'zhangshu', 'gaoan', 'yujiang', 'nanchengx', 'fuliangxian', 'cc', 'jl', 'sp', 'yanbian', 'songyuan', 'bc', 'th', 'baishan', 'liaoyuan', 'gongzhuling', 'meihekou', 'fuyuxian', 'changlingxian', 'huadian', 'panshi', 'lishu', 'sy', 'dl', 'as', 'jinzhou', 'fushun', 'yk', 'pj', 'cy', 'dandong', 'liaoyang',
'benxi', 'hld', 'tl', 'fx', 'pld', 'wfd', 'dengta', 'fengcheng', 'beipiao', 'kaiyuan', 'yinchuan', 'wuzhong', 'szs', 'zw', 'guyuan', 'hu', 'bt', 'chifeng', 'erds', 'tongliao', 'hlbe', 'bycem', 'wlcb', 'xl', 'xam', 'wuhai', 'alsm', 'hlr', 'xn', 'hx', 'haibei', 'guoluo', 'haidong', 'huangnan', 'ys', 'hainan', 'geermushi', 'qd', 'jn', 'yt', 'wf', 'linyi', 'zb', 'jining', 'ta', 'lc', 'weihai', 'zaozhuang', 'dz', 'rizhao', 'dy', 'heze', 'bz', 'lw', 'zhangqiu', 'kl', 'zc', 'shouguang', 'longkou', 'caoxian', 'shanxian', 'feicheng', 'gaomi', 'guangrao', 'huantaixian', 'juxian', 'laizhou', 'penglai', 'qingzhou', 'rongcheng', 'rushan', 'tengzhou', 'xintai', 'zhaoyuan', 'zoucheng', 'zouping', 'linqing', 'chiping', 'hzyc', 'boxing', 'dongming', 'juye', 'wudi', 'qihe', 'weishan', 'yuchengshi', 'linyixianq', 'leling', 'laiyang', 'ningjin', 'gaotang', 'shenxian', 'yanggu', 'guanxian', 'pingyi', 'tancheng', 'yiyuanxian', 'wenshang', 'liangshanx', 'lijin', 'yinanxian', 'qixia', 'ningyang', 'dongping', 'changyishi', 'anqiu', 'changle', 'linqu', 'juancheng', 'ty', 'linfen', 'dt', 'yuncheng', 'jz', 'changzhi', 'jincheng', 'yq', 'lvliang', 'xinzhou', 'shuozhou', 'linyixian', 'qingxu', 'liulin', 'gaoping', 'zezhou', 'xiangyuanxian', 'xiaoyi', 'xa', 'xianyang', 'baoji', 'wn', 'hanzhong', 'yl', 'yanan', 'ankang', 'sl', 'tc', 'shenmu', 'hancheng', 'fugu', 'jingbian', 'dingbian', 'cd', 'mianyang', 'deyang', 'nanchong', 'yb', 'zg', 'ls', 'luzhou', 'dazhou', 'scnj', 'suining', 'panzhihua', 'ms', 'ga', 'zy', 'liangshan', 'guangyuan', 'ya', 'bazhong', 'ab', 'ganzi', 'anyuexian', 'guanghanshi', 'jianyangshi', 'renshouxian', 'shehongxian', 'dazu', 'xuanhan', 'qux', 'changningx', 'xj', 'changji', 'bygl', 'yili', 'aks', 'ks', 'hami', 'klmy', 'betl', 'tlf', 'ht', 'shz', 'kzls', 'ale', 'wjq', 'tmsk', 'kel', 'alt', 'tac', 'lasa', 'rkz', 'sn', 'linzhi', 'changdu', 'nq', 'al', 'rituxian', 'gaizexian', 'km', 'qj', 'dali', 'honghe', 'yx', 'lj', 'ws', 'cx', 'bn', 'zt', 'dh', 'pe', 'bs', 'lincang', 'diqing', 'nujiang', 'milexian', 'anningshi', 'xuanwushi', 'hz', 'nb', 'wz', 'jh', 'jx', 'tz', 'sx', 'huzhou', 'lishui',
'quzhou', 'zhoushan', 'yueqingcity', 'ruiancity', 'yiwu', 'yuyao', 'zhuji', 'xiangshanxian', 'wenling', 'tongxiang', 'cixi', 'changxing', 'jiashanx', 'haining', 'deqing', 'dongyang', 'anji', 'cangnanxian', 'linhai', 'yongkang', 'yuhuan', 'pinghushi', 'haiyan', 'wuyix', 'shengzhou', 'xinchang', 'jiangshanshi', 'pingyangxian']
URL_TEMPLATE = 'https://{addr}.58.com/ershouche/pn{page}'
def city_list():
    # For testing, only use the first ten cities
    for i in CITY[:10]:
        yield i


def generate_url():
    for i in CITY[:10]:
        # For testing, only crawl the first few pages per city
        for p in range(1, 10):
            yield URL_TEMPLATE.format(addr=i, page=p)
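For the first ten city slugs, generate_url() yields URLs such as https://bj.58.com/ershouche/pn1 through https://bj.58.com/ershouche/pn9, then the same range for 'sh', 'tj', and so on.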

6. Crawler code without Celery: download_v4.py

import requests
import re
import base64
from fontTools.ttLib import TTFont
from hashlib import md5
from time import sleep
from bs4 import BeautifulSoup
import csv
import aiohttp
import asyncio
from city import city_list
from pybloom_live import BloomFilter
from log_write import SpiderLog
from threading import Thread


class CarSpider:
    def __init__(self):
        self.spiderlog = SpiderLog()
        self.bf = BloomFilter(capacity=100000, error_rate=0.01)
        self.url_template = 'https://{addr}.58.com/ershouche/pn{page}'
        # md5 of each glyph's outline coordinates -> real digit
        self.dic_font = {'856c80c30a9c2100282e94be2ef01a1a': 3, '4c12e2ca6ab31a1832549d3a2661cee9': 2,
                         '221ce0f06ec2094938778887f59c096c': 1, '0edc309270450f4e144f1fa90a633a72': 0,
                         'a06d9a83fde2ea9b2fd4b8c0e92da4d9': 7, 'fe91949296531c26783936c17da4c896': 6,
                         '0d0fd3a2d04e61526662b13c2db00537': 5, '0958ad9f2976dce5451697bef0227a0f': 4,
                         'bf3f23b53cb12e04d67b3f141771508d': 9, '9de9732e406d7025c0005f2f9cec817a': 8}
        self.headers = {
            'Origin': 'https://tj.58.com',
            'Referer': 'https://c.58cdn.com.cn/escstatic/upgrade/zhuzhan_pc/ershouche/ershouche_list_v20200622145811.css',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
        }
        self.thread_loop = asyncio.new_event_loop()

    async def downHTML(self, url, session):
        try:
            count = 0
            await asyncio.sleep(5)
            async with session.get(url, headers=self.headers) as resp:
                if str(resp.status)[0] == '2':
                    if url not in self.bf:
                        self.bf.add(url)
                        self.spiderlog.info('正在爬取:{url}'.format(url=url))
                        await asyncio.sleep(3)
                        return await resp.text()
                else:
                    count += 1
                    if count < 3:
                        await self.downHTML(url, session)
                    return None
        except Exception as e:
            self.spiderlog.info(e)

    def getTempdict(self, html):
        # Extract the base64-encoded font file, decode it, and build the
        # entity -> real digit mapping for this particular response
        pattern = re.compile(r"charset=utf-8;base64,(.*?)//wAP", re.S)
        try:
            ttf_url = re.search(pattern, html)
            content = base64.b64decode(ttf_url.group(1) + '=')
            with open('tc.ttf', 'wb') as f:
                f.write(content)
            font = TTFont('tc.ttf')
            temp_dict = {}
            for i, k in enumerate(font.getGlyphOrder()):
                if i == 0:
                    continue
                coor = font['glyf'][k].coordinates
                m = md5(str(coor).encode()).hexdigest()
                k = k.lower().replace('uni00', '&#x')
                k = k.replace('uni', '&#x')
                temp_dict[k.lower()] = self.dic_font[m]
            return temp_dict
        except Exception as e:
            self.spiderlog.info(e)

    # e.g. /.&#x4e07,¥时.&#x2d
    def parseHtml(self, html, temp_dict):
        # Replace the obfuscated entities with the real digits before parsing
        for k, v in temp_dict.items():
            html = html.replace(k, str(v))
        try:
            soup = BeautifulSoup(html, 'lxml')
            city = re.search(r'<title>【(.*?)二手车.*?二手车交易市场.*?58同城</title>', html).group(1)
            prices = soup.select('.info--price b')
            info = soup.select('.info_params')
            title = soup.select('.info_title>span')
            tag = soup.select('div.info--desc div:nth-of-type(1)')
        except Exception as e:
            self.spiderlog.info(e)
        for p, i, t, ta in zip(prices, info, title, tag):
            item = {}
            item['城市'] = city
            item['价格'] = p.get_text().replace(';', '')
            item['车型'] = t.get_text().split('急')[0].strip()
            i = i.get_text("\n", strip=True).split('\n')
            ta = '_'.join(ta.get_text().strip().split('\n'))
            item['上牌时间'] = i[0]
            item['里程'] = i[2]
            item['tag'] = ta
            yield item

    async def save(self, item):
        with open('car.csv', 'a', encoding='utf-8') as f:
            fieldname = ['城市', '价格', '车型', '上牌时间', '里程', 'tag']
            writer = csv.DictWriter(f, fieldnames=fieldname)
            writer.writerow({'城市': '城市', '价格': '价格(万)', '车型': '车型',
                             '上牌时间': '上牌时间', '里程': '里程', 'tag': 'tag'})
            for i in item:
                writer.writerow(i)

    async def main(self, url):
        async with aiohttp.ClientSession() as session:
            html_detail = await self.downHTML(url, session)
            if html_detail:
                temp_dict = self.getTempdict(html_detail)
                item = self.parseHtml(html_detail, temp_dict)
                await self.save(item)

    async def add_task(self, url):
        asyncio.run_coroutine_threadsafe(self.main(url), self.thread_loop)

    def start_loop(self, loop):
        asyncio.set_event_loop(loop)
        loop.run_forever()

    def run(self, url):
        athread = Thread(target=self.start_loop, args=(self.thread_loop,))
        athread.start()
        loop = asyncio.get_event_loop()
        loop.run_until_complete(self.add_task(url))


if __name__ == "__main__":
    cs = CarSpider()
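Note that, as posted, the __main__ block only instantiates CarSpider and never starts a crawl; to run this version standalone you would call something like cs.run(url) for each URL produced by generate_url() in city.py (my assumption of the intended usage, since run() takes a single URL).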

7. Crawler code with Celery: download_v3.py

import requests
import re
import base64
from fontTools.ttLib import TTFont
from hashlib import md5
from time import sleep
from bs4 import BeautifulSoup
import csv
import aiohttp
import asyncio
from city import city_list
from pybloom_live import BloomFilter
from log_write import SpiderLog
from threading import Thread


class CarSpider:
    def __init__(self):
        self.spiderlog = SpiderLog()
        self.bf = BloomFilter(capacity=100000, error_rate=0.01)
        self.url_template = 'https://{addr}.58.com/ershouche/pn{page}'
        # md5 of each glyph's outline coordinates -> real digit
        self.dic_font = {'856c80c30a9c2100282e94be2ef01a1a': 3, '4c12e2ca6ab31a1832549d3a2661cee9': 2,
                         '221ce0f06ec2094938778887f59c096c': 1, '0edc309270450f4e144f1fa90a633a72': 0,
                         'a06d9a83fde2ea9b2fd4b8c0e92da4d9': 7, 'fe91949296531c26783936c17da4c896': 6,
                         '0d0fd3a2d04e61526662b13c2db00537': 5, '0958ad9f2976dce5451697bef0227a0f': 4,
                         'bf3f23b53cb12e04d67b3f141771508d': 9, '9de9732e406d7025c0005f2f9cec817a': 8}
        self.headers = {
            'Origin': 'https://tj.58.com',
            'Referer': 'https://c.58cdn.com.cn/escstatic/upgrade/zhuzhan_pc/ershouche/ershouche_list_v20200622145811.css',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'
        }
        self.thread_loop = asyncio.new_event_loop()

    async def downHTML(self, url, session):
        try:
            count = 0
            await asyncio.sleep(5)
            async with session.get(url, headers=self.headers) as resp:
                if str(resp.status)[0] == '2':
                    if url not in self.bf:
                        self.bf.add(url)
                        self.spiderlog.info('正在爬取:{url}'.format(url=url))
                        await asyncio.sleep(3)
                        return await resp.text()
                else:
                    count += 1
                    if count < 3:
                        await self.downHTML(url, session)
                    return None
        except Exception as e:
            self.spiderlog.info(e)

    def getTempdict(self, html):
        # Extract the base64-encoded font file, decode it, and build the
        # entity -> real digit mapping for this particular response
        pattern = re.compile(r"charset=utf-8;base64,(.*?)//wAP", re.S)
        try:
            ttf_url = re.search(pattern, html)
            content = base64.b64decode(ttf_url.group(1) + '=')
            with open('tc.ttf', 'wb') as f:
                f.write(content)
            font = TTFont('tc.ttf')
            temp_dict = {}
            for i, k in enumerate(font.getGlyphOrder()):
                if i == 0:
                    continue
                coor = font['glyf'][k].coordinates
                m = md5(str(coor).encode()).hexdigest()
                k = k.lower().replace('uni00', '&#x')
                k = k.replace('uni', '&#x')
                temp_dict[k.lower()] = self.dic_font[m]
            return temp_dict
        except Exception as e:
            self.spiderlog.info(e)

    # e.g. /.&#x4e07,¥时.&#x2d
    def parseHtml(self, html, temp_dict):
        # Replace the obfuscated entities with the real digits before parsing
        for k, v in temp_dict.items():
            html = html.replace(k, str(v))
        try:
            soup = BeautifulSoup(html, 'lxml')
            city = re.search(r'<title>【(.*?)二手车.*?二手车交易市场.*?58同城</title>', html).group(1)
            prices = soup.select('.info--price b')
            info = soup.select('.info_params')
            title = soup.select('.info_title>span')
            tag = soup.select('div.info--desc div:nth-of-type(1)')
        except Exception as e:
            self.spiderlog.info(e)
        for p, i, t, ta in zip(prices, info, title, tag):
            item = {}
            item['城市'] = city
            item['价格'] = p.get_text().replace(';', '')
            item['车型'] = t.get_text().split('急')[0].strip()
            i = i.get_text("\n", strip=True).split('\n')
            ta = '_'.join(ta.get_text().strip().split('\n'))
            item['上牌时间'] = i[0]
            item['里程'] = i[2]
            item['tag'] = ta
            yield item

    async def save(self, item):
        with open('car.csv', 'a', encoding='utf-8') as f:
            fieldname = ['城市', '价格', '车型', '上牌时间', '里程', 'tag']
            writer = csv.DictWriter(f, fieldnames=fieldname)
            writer.writerow({'城市': '城市', '价格': '价格(万)', '车型': '车型',
                             '上牌时间': '上牌时间', '里程': '里程', 'tag': 'tag'})
            for i in item:
                writer.writerow(i)

    async def main(self, url):
        async with aiohttp.ClientSession() as session:
            html_detail = await self.downHTML(url, session)
            if html_detail:
                temp_dict = self.getTempdict(html_detail)
                item = self.parseHtml(html_detail, temp_dict)
                await self.save(item)

    async def add_task(self, url):
        asyncio.run_coroutine_threadsafe(self.main(url), self.thread_loop)

    def start_loop(self, loop):
        asyncio.set_event_loop(loop)
        loop.run_forever()

    def run(self, url):
        athread = Thread(target=self.start_loop, args=(self.thread_loop,))
        athread.start()
        loop = asyncio.get_event_loop()
        loop.run_until_complete(self.add_task(url))


if __name__ == "__main__":
    cs = CarSpider()
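This file is otherwise the same as download_v4.py above; in the Celery setup its __main__ block is never used, because the worker imports CarSpider and the crawl task in celery_con.py calls carspider.run(url) for each queued URL.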

8. Logging code: log_write.py

import logging
import getpass
import sys


class SpiderLog(object):
    # The logging helper uses the singleton pattern
    def __new__(cls):
        if not hasattr(cls, '_instance'):
            cls._instance = super(SpiderLog, cls).__new__(cls)
        return cls._instance

    def __init__(self):
        self.user = getpass.getuser()
        self.logger = logging.getLogger(self.user)
        self.logger.setLevel(logging.DEBUG)
        # Log file name: <script name>.log
        self.logFile = sys.argv[0][0:-3] + '.log'
        self.formatter = logging.Formatter('%(asctime)-12s %(levelname)-8s %(name)-10s %(message)-12s\r\n')
        # Write to the log file
        self.logHand = logging.FileHandler(self.logFile, encoding='utf8')
        self.logHand.setFormatter(self.formatter)
        self.logHand.setLevel(logging.DEBUG)
        # Attach the handler
        self.logger.addHandler(self.logHand)

    def info(self, msg):
        self.logger.info(msg)


if __name__ == '__main__':
    spiderlog = SpiderLog()
    spiderlog.info("test")

Problems

All of the code is above, but a few problems remain.
1. When the code reaches the verification step, even after the slider is verified successfully the browser does not redirect back to the 58 listing page, although it does when I complete the verification manually.
2. 58 also uses a click-select captcha at times, which is not handled here.
3. After sending requests with aiohttp.ClientSession.get, about 40 threads are suddenly created (the debugger shows them as ThreadPoolExecutor threads). Testing this separately, the more tasks there are, the more threads get created. This leads to another problem: when one response requires verification, the later threads have already gone through session.get, i.e. their requests have already been sent and their responses already contain the "please enter the verification code" prompt, so once the first verification finishes they queue up to verify one after another. Without special handling, every one of those dozens of threads would have to go through the slider verification. My workaround is to take the URL that triggered the verification plus the following 39 URLs (those 39 skip verification and abandon their current responses) and re-add them all to the task queue via run_coroutine_threadsafe.
What puzzles me is: is aiohttp simply a poor fit for sites that can demand verification mid-crawl, or is my code at fault? If the code is fine, what is a better way to handle this? I would appreciate answers to these questions and welcome any discussion.

Closing remarks

The above is for learning and discussion purposes only. I hope someone will discuss it with me, and all suggestions are welcome. Please credit the source if you repost.
