Scraping all app information from Anzhi (anzhi.com) with BeautifulSoup

Development environment:

Python version: Python 2.7

Development tool: Eclipse

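Requirements:

1. Scrape the information for each app on Anzhi: app category, app name, download count, release date, package size, supported OS version, pricing, author, and language.

2. On the right-hand side of the Anzhi apps tab, the page shows top-level categories and subcategories.

3. Every subcategory can be found from its top-level category, so the records can be stored by category.

4. Clicking a subcategory link opens that subcategory's app list.

5. The URLs of the pages within a subcategory reveal how each page URL is constructed.

...............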

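Approach

1. Find the URL pattern of the app categories

First, locate the URL of the category page shown on the right-hand side of the Anzhi apps tab: http://www.anzhi.com/widgetcat_1.html

Then find each subcategory's URL in that page's HTML, e.g. http://www.anzhi.com/sort_49_1_hot.html

Finally, collect the URLs of every subcategory under every top-level category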
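As a minimal sketch of this step, the category id can be pulled out of each subcategory link with a regular expression. The function name `extract_category_ids` and the sample hrefs are assumptions made for illustration; only the `sort_<id>_1_hot.html` shape comes from the URL above.

```python
import re

# Assumed pattern: the category id sits between "sort_" and "_1_hot.html".
SORT_HREF = re.compile(r"/sort_(.+?)_1_hot\.html")

def extract_category_ids(hrefs):
    """Return the category ids found in subcategory links, in order, without duplicates."""
    ids = []
    for href in hrefs:
        # An <a> tag without an href yields None, so fall back to an empty string.
        match = SORT_HREF.search(href or "")
        if match and match.group(1) not in ids:
            ids.append(match.group(1))
    return ids

# Hypothetical hrefs, as they might be read from the <a> tags of the category page.
sample_hrefs = ["/sort_49_1_hot.html", "/widgetcat_1.html", None, "/sort_39_1_hot.html"]
print(extract_category_ids(sample_hrefs))
```

Links that do not match the pattern, and tags without an href, are simply skipped.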

2. Find the pagination URL pattern of each subcategory's app list

Page 1: http://www.anzhi.com/sort_49_1_hot.html

Page 2: http://www.anzhi.com/sort_49_2_hot.html

Page 3: http://www.anzhi.com/sort_49_3_hot.html

..............
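The pattern above suggests that only the middle number changes from page to page. A small sketch of a URL builder under that assumption (`page_url` is a hypothetical helper name, not part of the original script):

```python
BASE_URL = "http://www.anzhi.com"

def page_url(category_id, page):
    """Build a list-page URL following the sort_<id>_<page>_hot.html pattern."""
    return "{0}/sort_{1}_{2}_hot.html".format(BASE_URL, category_id, page)

# Print the first three pages of subcategory 49.
for n in range(1, 4):
    print(page_url(49, n))
```

The scraper can then keep increasing the page number until a page comes back without any apps.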

3. Find the URL pattern of each app's detail link

http://www.anzhi.com + href

where href is the relative path taken from the app's link in the list page
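Plain string concatenation works as long as every href starts with a slash. As a hedged alternative, the standard library's `urljoin` also copes with hrefs that are already absolute. The helper name `app_url` and the sample path `/pkg/example.html` are made up for illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2.7

BASE_URL = "http://www.anzhi.com"

def app_url(href):
    """Turn the href taken from an app link into an absolute URL."""
    return urljoin(BASE_URL, href)

# Hypothetical href; real values come from the <a> tag inside each app_name span.
print(app_url("/pkg/example.html"))
```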

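Next, work out how to extract the download count, release date, package size, supported OS version, pricing, author, and language, collect them into a list, and then wrap that list in a dictionary

Finally, generate an Excel file and store the dictionary data in it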
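One way to picture the list-then-dictionary step is the sketch below. It assumes, as the script further down does, that items 1 to 7 of the detail list hold the seven fields in this order. The field names, the function `build_record`, and the sample texts are all hypothetical.

```python
# Hypothetical field names for items 1..7 of the detail list.
FIELDS = ["download_count", "release_date", "package_size",
          "supported_os", "pricing", "author", "language"]

def build_record(parent, sub, name, li_texts):
    """Map detail items 1..7 onto FIELDS; missing items become empty strings."""
    details = [text.strip() for text in li_texts[1:1 + len(FIELDS)]]
    details += [""] * (len(FIELDS) - len(details))
    record = {"parent_category": parent, "subcategory": sub, "app_name": name}
    record.update(zip(FIELDS, details))
    return record

# Made-up texts standing in for the <li> contents of one app page.
sample = ["category", " 100000+ ", "2018-01-10", "12.3M"]
print(build_record("Apps", "Tools", "Example App", sample))
```

Unlike the script, which blanks all seven fields when any one is missing, this sketch pads only the missing ones. Which behavior is preferable depends on how the data will be used.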

4. Source code

First, create a blank Excel file

#!/usr/bin/python2.7
# encoding: utf-8
'''
Created on 2018-01-12
@author: *********
'''
import xlwt
import time,os
class StatisticsReport(object):
    # Timestamp used as the report file name
    t=time.strftime('%Y%m%d%H%M%S',time.localtime(time.time()))
    # Build a cell style
    def set_style(self,name,height,bold=False):
        # Initialize the style
        style = xlwt.XFStyle()
        # Create a font for the style
        font = xlwt.Font()
        font.name = name
        font.bold = bold
        # xlwt spells this attribute "colour_index"; "color_index" is silently ignored
        font.colour_index = 4
        font.height = height
        style.font = font
        return style
    def __createStatisticsReport__(self):
        RunNo=self.t
        reportname=RunNo+'.xls'
        self.__setreportname__(reportname)
        ReportFile=xlwt.Workbook()
        # Create the single sheet that holds the app information
        ReportFile.add_sheet(u'Android app info',cell_overwrite_ok=True)
        # ------------- Write the header row of the app information sheet
        # Parent category, subcategory, app name, downloads, release date, package size, supported OS version, pricing, author, language
        wr_tree = ReportFile.get_sheet(0)
        row0=[u'Parent category',u'Subcategory',u'App name',u'Downloads',u'Release date',u'Package size',u'Supported OS version',u'Pricing',u'Author',u'Language']
        # Write each header cell in bold
        for i in range(0,len(row0)):
            wr_tree.write(0,i,row0[i],self.set_style('Times New Roman',220,True))
        # The report is saved in the parent directory (Windows path separator)
        reportpath=os.path.abspath("..")+'\\'
        print reportpath+reportname
        ReportFile.save(reportpath+reportname)
    def __setreportname__(self,reportname):
        self.reportname=reportname
    def __getreportname__(self):
        return self.reportname

Then, loop over the apps and store each one's information as soon as it is fetched

#!/usr/bin/python
# encoding: utf-8
'''
Created on 2018-01-12

@author: ********
'''
import urllib2,re
from bs4 import BeautifulSoup
import xlrd,os
from xlutils.copy import copy
from StatisticsReport1 import StatisticsReport

def GetAppinfo(urlhead,page,report):
    # Request headers that imitate a browser
    head = {}
    head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0'
    reportpath=os.path.abspath("..")+'\\'
    reportname=report.__getreportname__()
    # Fetch the category page
    try:
        request=urllib2.Request(url=urlhead,headers = head)
        response=urllib2.urlopen(request)
    except Exception:
        print u"Category page: request failed, stopping"
        return
    # The response behaves like a file object; read() returns the page body
    appi_html=response.read().decode('utf-8')
    appi_htmltables=BeautifulSoup(appi_html,'lxml')
    # Each <dl> holds one top-level category and its subcategory links
    get_linkdl_list=appi_htmltables.find_all('dl')
    # Number of apps recorded so far
    i=0
    for dllink in get_linkdl_list:
        Fatherclassname=dllink.h2.get_text()
        get_linka_list=dllink.find_all('a')
        for alink in get_linka_list:
            href=alink.get('href')
            # Skip links that do not point to a subcategory
            match=re.search(r"/sort_(.+?)_1_hot\.html",href or '')
            if not match:
                continue
            hrefstr=match.group(1)
            subclassname=alink.get_text()
            n=1
            # Walk the pages until one has no apps (the page argument is not used)
            while True:
                get_subclassurl='http://www.anzhi.com'+'/sort_'+hrefstr+'_'+str(n)+'_hot.html'
                n+=1
                try:
                    get_subclassrequest=urllib2.Request(url=get_subclassurl,headers = head)
                    get_subclassresponse=urllib2.urlopen(get_subclassrequest)
                except Exception:
                    print u"Subcategory page: request failed, moving on to the next subcategory"
                    break
                get_app_html=get_subclassresponse.read().decode('utf-8')
                app_subhtmltables=BeautifulSoup(get_app_html,'lxml')
                get_subapp_spanlist=app_subhtmltables.find_all('span',{"class":"app_name"})
                if len(get_subapp_spanlist)==0:
                    print u"No app data on this page, leaving the loop"
                    break
                for get_subapp_span in get_subapp_spanlist:
                    get_apphref=get_subapp_span.find_all('a')[0].get('href')
                    get_appurl="http://www.anzhi.com"+get_apphref
                    appname=get_subapp_span.get_text()
                    try:
                        get_apprequest=urllib2.Request(url=get_appurl,headers = head)
                        get_appresponse=urllib2.urlopen(get_apprequest)
                    except Exception:
                        print u"App page: request failed, skipping this app"
                        continue
                    get_app_html=get_appresponse.read().decode('utf-8')
                    app_apphtmltables=BeautifulSoup(get_app_html,'lxml')
                    get_app_lilist=app_apphtmltables.find_all('ul',attrs={"id":"detail_line_ul"})
                    if len(get_app_lilist)==0:
                        print u"App page: no detail information, leaving the loop"
                        break
                    get_app_infolist=get_app_lilist[0].find_all('li')
                    # Items 1..7: downloads, release date, package size, supported OS version, pricing, author, language
                    details=[li.get_text() for li in get_app_infolist[1:8]]
                    if len(details)<7:
                        # As before, leave all seven fields empty when any of them is missing
                        details=['']*7
                    list1=[Fatherclassname,subclassname,appname]+details
                    key='app_'+str(i+1)
                    dict1={key:list1}
                    # Re-open the report, append the row and save immediately;
                    # formatting_info=True keeps the header style when copying
                    bk=xlrd.open_workbook(reportpath+reportname,formatting_info=True)
                    wb=copy(bk)
                    wa=wb.get_sheet(0)
                    for j in range(0,len(dict1[key])):
                        wa.write(i+1,j,dict1[key][j])
                    i+=1
                    wb.save(reportpath+reportname)
#                     time.sleep(0.001)
            print u"Finished subcategory:",subclassname
    print u'Total apps scraped:',i
# Writes a whole dictionary of rows in one pass (not used by the real-time flow above)
def GenerateReport(report,job_dict):
    reportpath=os.path.abspath("..")+'\\'
    reportname=report.__getreportname__()
    bk=xlrd.open_workbook(reportpath+reportname)
    wb=copy(bk)
    wa=wb.get_sheet(0)
    for i in range(0,len(job_dict)):
        for j in range(0,len(job_dict.values()[i])):
            wa.write(i+1,j,job_dict.values()[i][j])
    wb.save(reportpath+reportname)

if __name__ == '__main__':
    report=StatisticsReport()
    report.__createStatisticsReport__()
    url='http://www.anzhi.com/widgetcat_1.html'
    page=1
    GetAppinfo(url,page,report)
#     GenerateReport(report,app_dict) is not needed here: rows are saved as they are scraped
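The script issues the same kind of request in three places: the category page, the list pages, and the app pages. One possible refactoring, sketched here rather than taken from the original, is a single helper that returns None on failure so each caller can decide whether to stop, break, or continue. The name `fetch_html` and the `timeout` value are assumptions. The import fallback lets the sketch run on both Python 2.7 and Python 3.

```python
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen         # Python 2.7

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) "
                         "Gecko/20100101 Firefox/52.0"}

def fetch_html(url, timeout=10):
    """Return the decoded page body, or None if the request or decoding fails."""
    try:
        response = urlopen(Request(url, headers=HEADERS), timeout=timeout)
        return response.read().decode("utf-8")
    except (IOError, ValueError):
        # URLError and socket errors derive from IOError;
        # malformed URLs and UnicodeDecodeError raise ValueError.
        return None
```

A caller would then write `html = fetch_html(url)` and test `if html is None:` instead of wrapping every request in its own try block.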

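Summary:

This approach is stable, but it uses a lot of CPU and network resources and runs slowly. A faster approach is left for later investigation.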
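Part of the CPU cost probably comes from re-reading and re-saving the whole workbook for every app. As one possible direction, and not something the original script does, each record could be appended to a CSV file instead, which never touches the rows already written. This sketch swaps Excel for CSV and assumes Python 3. The helper name `append_row` is hypothetical.

```python
import csv

def append_row(path, row):
    """Append one record to a CSV file without re-reading the existing rows."""
    with open(path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(row)

# Usage: append_row("apps.csv", [parent, sub, name, downloads, ...])
```

The resulting file opens in Excel, and it can still be converted to .xls in a single pass once scraping has finished.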
