Scraping Qunar data with Python_User view: Qichacha's data-crawling technique and scraping Qichacha data with Python...

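The underlying data comes from the national industry-and-commerce credit website (全国工商信用网), but every province runs its own registry system, so a separate crawler has to be written for each province. The CAPTCHAs also differ from province to province and have to be handled one by one.

Qichacha does not crawl proactively. When someone looks up a company that is not yet in its own database, its crawler goes to the registry system and fetches the record. This step is very slow: a single company profile takes about 40 seconds. Once the record has been fetched it is stored in Qichacha's own database, and the next query for the same company takes only a few milliseconds.

Given this model, it is unlikely that Qichacha built a CAPTCHA recognition module for each province; more likely it is hooked up to a CAPTCHA-solving platform. Only CAPTCHA solving can explain why collecting one company's information is this slow. Data gathered this way may be incomplete, but no money is spent on solving CAPTCHAs for companies nobody looks up, which keeps costs very low.

I plan to build standalone CAPTCHA recognition modules for every province. So far I have finished only one province, with a 100% recognition rate. For that province, every newly registered business can be retrieved each day, sole proprietorships included.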
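The on-demand model described above amounts to a read-through cache: the slow source is consulted only on a miss, and the result is stored for later queries. Below is a minimal sketch of that idea, not Qichacha's actual implementation. The class `OnDemandCompanyStore`, the function `fake_registry_fetch`, and the in-memory dict standing in for the database are all hypothetical names made up for illustration.

```python
import time


class OnDemandCompanyStore:
    """Read-through cache: run the slow fetch only when a record is missing."""

    def __init__(self, fetch_from_registry):
        # fetch_from_registry is a hypothetical callable standing in for the
        # slow registry crawl (CAPTCHA solving included).
        self._fetch = fetch_from_registry
        self._db = {}         # stands in for the site's own database
        self.fetch_count = 0  # how many times the slow path ran

    def lookup(self, company_name):
        record = self._db.get(company_name)
        if record is None:
            # Cache miss: pay the slow crawl once, then remember the result.
            record = self._fetch(company_name)
            self.fetch_count += 1
            self._db[company_name] = record
        return record


def fake_registry_fetch(company_name):
    # Placeholder for the real crawl; sleeps briefly instead of ~40 seconds.
    time.sleep(0.01)
    return {"name": company_name, "source": "registry"}


store = OnDemandCompanyStore(fake_registry_fetch)
first = store.lookup("Example Co.")   # miss: runs the slow fetch
second = store.lookup("Example Co.")  # hit: served from the local store
```

Companies that nobody queries never reach the slow path, which is where the cost saving mentioned above would come from.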

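That is how Qichacha itself collects data. There are also users who scrape data from Qichacha.

I needed Qichacha data for work, so I modified code written by others before me; this version can scrape all of the data.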

First, Python must already be installed on your computer. On top of that, install the requests, lxml, BeautifulSoup and xlwt modules.

The code is as follows:

# -*- coding: utf-8 -*-
import requests
import lxml  # not called directly; imported to confirm the 'lxml' parser is installed
from bs4 import BeautifulSoup
import xlwt

def craw(url, key_word):
    # Fetch one search-result page and append its companies to the
    # module-level result lists created in the __main__ block.
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'
    # Headers copied from a browser session. The Cookie value belongs to that
    # session, so replace it with one from your own logged-in browser.
    headers = {
        'Host': 'www.qichacha.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': user_agent,
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'Referer': 'http://www.qichacha.com/search?key=' + key_word,
        'Cookie': r'zg_did=%7B%22did%22%3A%20%2215fa403cb9d15f-036a7756df6645-173b7740-100200-15fa403cb9e2c6%22%7D; zg_de1d1a35bfa24ce29bbf2c7eb17e6c4f=%7B%22sid%22%3A%201510617843749%2C%22updated%22%3A%201510626410277%2C%22info%22%3A%201510285233062%2C%22superProperty%22%3A%20%22%7B%7D%22%2C%22platform%22%3A%20%22%7B%7D%22%2C%22utm%22%3A%20%22%7B%7D%22%2C%22referrerDomain%22%3A%20%22www.baidu.com%22%2C%22cuid%22%3A%20%22f62f1e8a5eaa98a4fdd7be63f003baf3%22%7D; UM_distinctid=15fa403d33a86-04e900fd4e3572-173b7740-100200-15fa403d33b3d; CNZZDATA1254842228=770755339-1510283630-https%253A%252F%252Fwww.baidu.com%252F%7C1510623074; _uab_collina=151028523773154434859974; _umdata=2BA477700510A7DFF3E360D067D6CBF26EBF4D0B7616E2F668ACF5B05BA3A15BB7B2A5C9048062DECD43AD3E795C914C698D4F63619694FD3C24BCCF0E0016EF; PHPSESSID=tm27c7utiff9j5iqbh4g1cg0l5; acw_tc=AQAAAIyPMmgK2AUA4oumtwogJ3fbLlic; hasShow=1',
        # The original dict listed Cache-Control twice; the later value won.
        'Cache-Control': 'no-cache',
    }
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        # Nothing useful to parse on a failed request, so skip this page.
        print(response.status_code)
        print('ERROR')
        return
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    com_names = soup.find_all(class_='ma_h1')    # company names
    peo_names = soup.find_all(class_='a-blue')   # legal representatives
    peo_phones = soup.find_all(class_='m-t-xs')  # detail lines (phone, address, ...)
    print('Scraping data, do not open the Excel file')
    for i, (com_name, peo_name) in enumerate(zip(com_names, peo_names)):
        # The index arithmetic assumes three consecutive 'm-t-xs' elements per company.
        base = 3 * i
        try:
            zhuceziben = peo_phones[base].find(class_='m-l').get_text()  # registered capital
            chenglishijian = peo_phones[base].contents[5].get_text()     # date established
            peo_phone = peo_phones[base + 1].find(string=True).strip()   # phone number
            email = peo_phones[base + 1].contents[1].get_text()          # email
            com_place = peo_phones[base + 2].find(string=True).strip()   # address
        except Exception:
            # Skip the whole record so the seven lists stay aligned row by row.
            print('exception')
            continue
        com_name_list.append(com_name.get_text())
        peo_name_list.append(peo_name.get_text())
        peo_phone_list.append(peo_phone)
        com_place_list.append(com_place)
        zhuceziben_list.append(zhuceziben)
        chenglishijian_list.append(chenglishijian)
        email_list.append(email)

if __name__ == '__main__':
    # Module-level result lists that craw() appends to, one entry per company.
    com_name_list = []
    peo_name_list = []
    peo_phone_list = []
    com_place_list = []
    zhuceziben_list = []
    chenglishijian_list = []
    email_list = []

    key_word = input('Enter the keyword to search for: ')
    print('Searching, please wait')
    # Pages 400-499, as in the original run. The x == 1 branch only applies
    # if the range is changed to start at page 1.
    for x in range(400, 500):
        if x == 1:
            url = r'http://www.qichacha.com/search?key={}#p:{}&'.format(key_word, x)
        else:
            url = r'http://www.qichacha.com/search_index?key={}&ajaxflag=1&p={}&'.format(key_word, x)
        # The keyword goes into the Referer header, so its UTF-8 bytes are
        # re-decoded as Latin-1 to make the header value encodable.
        craw(url, key_word.encode("utf-8").decode("latin1"))

    workbook = xlwt.Workbook()
    # Create a new sheet
    sheet1 = workbook.add_sheet('xlwt', cell_overwrite_ok=True)
    # --- Excel cell style: bold Times New Roman ---
    style = xlwt.XFStyle()
    font = xlwt.Font()
    font.name = 'Times New Roman'
    font.bold = True
    style.font = font
    print('Saving data, do not open the Excel file')
    # Header row
    name_list = ['Company name', 'Legal representative', 'Contact number', 'Registered capital', 'Date established', 'Company address', 'Company email']
    for cc, name in enumerate(name_list):
        sheet1.write(0, cc, name, style)
    # One row per company; zip stops at the shortest list, so a short list
    # cannot raise an IndexError here.
    rows = zip(com_name_list, peo_name_list, peo_phone_list, zhuceziben_list,
               chenglishijian_list, com_place_list, email_list)
    for i, row in enumerate(rows):
        for col, value in enumerate(row):
            sheet1.write(i + 1, col, value, style)
    # Save the workbook, overwriting any existing file with the same name
    workbook.save(r'E:\test.xls')
    print('Excel file saved successfully')
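One detail in the script that is easy to miss is the call `key_word.encode("utf-8").decode("latin1")` before the keyword is handed to `craw`. The keyword ends up in the Referer header, and, as far as I understand, Python's standard HTTP client encodes header values as Latin-1, so a Chinese keyword passed as-is would fail to encode. The sketch below shows what the round trip does; the helper name `header_safe` is hypothetical and not part of the original script.

```python
def header_safe(text):
    # Reinterpret the UTF-8 bytes of `text` as Latin-1 characters, one per byte.
    return text.encode("utf-8").decode("latin1")


keyword = "企查查"
safe = header_safe(keyword)

# Every character in `safe` is below U+0100, so Latin-1 encoding succeeds,
# and the resulting bytes are exactly the UTF-8 bytes of the keyword.
wire_bytes = safe.encode("latin1")
assert wire_bytes == keyword.encode("utf-8")

# The original keyword can be recovered by reversing the two steps.
assert wire_bytes.decode("utf-8") == keyword
```

The string `safe` looks like mojibake if printed, which is expected: it only exists so that the header encoder emits the right bytes.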

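The result of running the code is as follows:

(This article was collected into the "Supercomputing AI Database" (《超算AI数据库》) by Luan Ling, editor-in-chief of China Computing Network (中国计算网). Please credit the source when reposting.)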

