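A simple exercise: scrape 51job postings related to big data and analyze them.
This post is a record of the process.
The main approach comes from a blog post I found online. I have built similar things before and the task itself is not difficult, so here I go through the details carefully to consolidate what I have learned.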

Link to the reference post:
https://blog.csdn.net/lbship/article/details/79452459

Search keywords used here: 数据分析 (data analysis) and 大数据 (big data)

Site to scrape: https://www.51job.com/

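Contents

1. Preparation
- 1.1 Setting up the environment with virtualenv
- 1.2 Inspecting the home page HTML
  - 1.2.1 Inspecting the search page
2. Scraping
- 2.1 Building links from the search page
  - 2.1.1 Analyzing the page HTML to find useful information
  - 2.1.2 Locating elements with XPath
  - 2.1.3 Fetching one record first to see the result
    - 2.1.3.1 Opening up the loop to scrape about 10 records and check the result
  - 2.1.4 Next, storing the scraped data
  - 2.1.5 Next, trying multithreading to speed up the scraping
- 2.2 Starting the scrape
  - 2.2.1 Exception handling
    - Timeouts during page requests
    - The page returned is not a 51job page
    - Checking the result
- 2.3 The program can now run more or less continuously
3. Data processing
- 3.1 Inspecting the scraped data
- 3.2 Checking for duplicates
- True and False in Python
- Result after deduplication
- Adjusting the data format
- Function annotations and parameter annotations
4. Data analysis
- 4.1 Collecting references
- 4.2 Drawing a pie chart
- 4.3 Drawing a bar chart
- 4.4 Drawing a word cloud
- 4.5 Drawing a geographic map
- Troubleshooting
  - Converting a NumPy array to a Python list
- Some other good approaches I came across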

1. Preparation

1.1 Setting up the environment with virtualenv

1.2 Inspecting the home page HTML

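import requests

def get_51job_html():
    url1 = 'https://www.51job.com/'
    req = requests.get(url1)
    # Let requests guess the encoding from the body rather than from the headers
    req.encoding = req.apparent_encoding
    html = req.text
    print(html)
    # Save a copy of the page for offline inspection
    with open('./analyze/html.txt', 'w', encoding='utf-8') as fj:
        fj.write(html)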

Here I ran into a mismatch between the page's encoding and my local encoding.
The source of the requests text property contains this description:

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.

    .........
    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
    .........

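In other words, if you know the page's encoding, you can set it explicitly, for example r.encoding = 'gbk'.

If you cannot determine it reliably, use req.encoding = req.apparent_encoding and let requests detect the encoding from the response body.

Reference material: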
https://blog.csdn.net/weixin_37686372/article/details/79231846
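
To make the effect of the encoding choice concrete, here is a minimal offline sketch. It assumes the page bytes are GBK-encoded (the sample bytes are produced locally, not fetched from 51job) and uses ISO-8859-1, the RFC 2616 default for text content without a declared charset, as the wrong guess.

```python
# Bytes as a GBK-encoded server might send them (sample data, not a real response)
raw = '数据分析'.encode('gbk')

# Wrong guess: decoding as ISO-8859-1 never fails, but the result is mojibake
wrong = raw.decode('iso-8859-1')

# Right guess: decoding with the actual encoding recovers the text
right = raw.decode('gbk')

print(repr(wrong))
print(right)
```

Setting r.encoding before reading r.text amounts to choosing which of these two decodings requests performs.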

1.2.1 Inspecting the search page

https://search.51job.com/list/010000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=
https://search.51job.com/list/020000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=

010000: the city code, here Beijing; 020000 is Shanghai.
%25E6%2595%25B0%25E6%258D%25AE%25E5%2588%2586%25E6%259E%2590: the search keyword, percent-encoded, and in fact encoded twice.

Reading the source of parse.quote() shows:

    reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                  "$" | "," | "~"

    Each of these characters is reserved in some component of a URL,
    but not necessarily in all of them.

    Python 3.7 updates from using RFC 2396 to RFC 3986 to quote URL strings.
    Now, "~" is included in the set of reserved characters.

    By default, the quote function is intended for quoting the path
    section of a URL.  Thus, it will not encode '/'.  This character
    is reserved, but in typical usage the quote function is being
    called on a path where the existing slash characters are used as

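Python 3.7 follows RFC 3986 when quoting URL strings. Chinese characters are not allowed in a URL as-is, so parse.quote is used to percent-encode them (as UTF-8 bytes) into characters the standard permits.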

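from urllib import parse

star = '数据分析'
# Encode twice, as the 51job search URL does
star1 = parse.quote(parse.quote(star))
print(star1)
# Decode: undo both rounds of encoding to recover the original text
star2 = parse.unquote(parse.unquote(star1))
print(star2)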

https://www.cnblogs.com/jessicaxu/p/7977277.html
https://blog.csdn.net/ZTCooper/article/details/80165038

2. Scraping

2.1 Building links from the search page

000000: stands for any city
key: the keyword to search for
str(page): the page number

url ='http://search.51job.com/list/000000,000000,0000,00,9,99,'+key+',2,'+ str(page)+'.html'
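
The URL construction can be wrapped in a small helper. This is a sketch only: the function name build_search_url and its city parameter are my own additions, and the double encoding follows what was observed in the search URLs earlier.

```python
from urllib.parse import quote

def build_search_url(key: str, page: int, city: str = '000000') -> str:
    """Build a 51job search URL for a keyword and page number."""
    # The keyword appears percent-encoded twice in the observed search URLs
    encoded_key = quote(quote(key))
    return ('http://search.51job.com/list/' + city
            + ',000000,0000,00,9,99,' + encoded_key
            + ',2,' + str(page) + '.html')

url = build_search_url('数据分析', 1)
print(url)
```

Keeping the encoding inside the helper means the caller can pass the plain keyword and never has to remember how many times it must be quoted.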

2.1.1 Analyzing the page HTML to find useful information

Part of the page HTML:

<p class="t1 "><em class="check" name="delivery_em" onclick="checkboxClick(this)"></em><input class="checkbox" type="checkbox" name="delivery_jobid" value="114060428" jt="0" style="display:none" /><span><a target="_blank" title="数据分析专员" href="https://jobs.51job.com/wuhan-dxhq/114060428.html?s=01&t=0" onmousedown="">数据分析专员                                </a>

Use a regular expression to extract the useful information:

expression=re.compile(r'class="t1 ">.*? <a target="_blank" title=".*?" href="(.*?)".*? <span class="t2">',re.S)

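Python regular expressions accept the flag re.S (also spelled re.DOTALL). It extends "." so that it matches any character at all, including "\n".
https://www.jb51.net/article/146384.htm
This one is easier to follow:
https://blog.csdn.net/weixin_42781180/article/details/81302806

In other words, with re.S set, a pattern such as .*? can match across line breaks, so the expression is applied to the whole multi-line HTML text instead of being confined to a single line.

1. . matches any character except the newline "\n";
2. * matches the preceding element zero or more times;
3. A ? after + or * makes the match non-greedy, i.e. it matches as little as possible; *? repeats any number of times, but as few as possible;
4. .*? matches any number of characters, using the fewest repetitions that still let the whole pattern match.
For example, a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab and ab.
--------------
Source: https://blog.csdn.net/qq_37699336/article/details/84981687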
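
Both points, non-greedy matching and the effect of re.S, can be checked with a few lines. The HTML string below is a made-up two-line fragment, not taken from 51job.

```python
import re

# Non-greedy matching: each match is as short as possible
short = re.findall(r'a.*?b', 'aabab')
print(short)

html = '<p>line one\nline two</p>'

# Without re.S, "." stops at the newline, so the pattern cannot match
without_flag = re.findall(r'<p>(.*?)</p>', html)
print(without_flag)

# With re.S, "." also matches "\n" and the pattern spans both lines
with_flag = re.findall(r'<p>(.*?)</p>', html, re.S)
print(with_flag)
```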

2.1.2 Locating elements with XPath

'//div[@class="tHeader tHjob"]//p[@class="msg ltype"]/text()'

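//div[@class="tHeader tHjob"] is a relative path: it selects, at any depth in the document, the div element whose class attribute is "tHeader tHjob".

//p[@class="msg ltype"] then selects the p element inside that div, which contains the text we want.
text() returns the text inside the element.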
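
The same path can be tried offline. The sketch below uses the standard library's xml.etree.ElementTree so that it runs without extra packages; this is a substitution on my part, since the scraper itself calls .xpath() on a parsed page. ElementTree supports only a subset of XPath and needs well-formed markup, and the page content here is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the job detail page (made-up content)
page = """
<html><body>
  <div class="tHeader tHjob">
    <div class="cn">
      <h1>Data Analyst</h1>
      <p class="msg ltype">Wuhan | 1 year experience | Bachelor</p>
    </div>
  </div>
  <p class="msg ltype">unrelated paragraph outside the header</p>
</body></html>
"""

root = ET.fromstring(page)
# ElementTree has no text() step, so the text is read from .text instead
nodes = root.findall(".//div[@class='tHeader tHjob']//p[@class='msg ltype']")
texts = [node.text.strip() for node in nodes]
print(texts)
```

Only the paragraph nested under the matching div is returned; the second paragraph has the same class but sits outside that div, which is why the path starts from the div.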

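2.1.3 Fetching one record first to see the result

The extracted text contains many useless characters, so filter it with a regular expression.

Reference material
https://blog.csdn.net/qq_26442553/article/details/82754722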

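# Details: the text between "岗位职责" (job responsibilities) and the trailing empty paragraph
describe = re.findall(re.compile(r'<div class="bmsg job_msg inbox">.*?岗位职责(.*?)<p>&nbsp;</p>', re.S), req1.text)[0]
# Note: this is a character class, so it removes each of the characters : < p > / & n b s ; r individually
r1 = re.compile(r'[:<p></p>&nbs;r]+')
s2 = re.sub(r1, '', describe)
print('岗位职责： ', s2)
# Requirements: the text after "任职资格" (qualifications)
require = re.findall(re.compile(r'<div class="bmsg job_msg inbox">.*?任职资格(.*?)<div class="mt10">', re.S), req1.text)[0]
s = re.sub(r1, '', require)
print('任职资格： ', s)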

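Result:

2.1.3.1 Opening up the loop to scrape about 10 records and check the result

The text turns out to contain a large number of HTML tags, so build a regular expression to filter the tags out: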

r1 = re.compile(r'<[^<]+?>')
s = re.sub(r1, '', describe).replace('&nbsp;', '').strip()
print('职位信息： ', s)

Reference material
https://blog.csdn.net/jcl314159/article/details/86030734

After filtering, the output looks fine.
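
The tag filter is easy to test on its own. The helper name strip_tags and the sample fragment are assumptions for illustration; the regular expression is the one used above.

```python
import re

TAG_RE = re.compile(r'<[^<]+?>')

def strip_tags(fragment: str) -> str:
    """Remove HTML tags and &nbsp; entities, then trim surrounding whitespace."""
    return TAG_RE.sub('', fragment).replace('&nbsp;', '').strip()

sample = '<p>1.&nbsp;Build reports</p><p><span>2. Clean data</span></p>'
cleaned = strip_tags(sample)
print(cleaned)
```

Unlike the earlier character-class pattern, this expression removes whole tags and leaves letters such as p, n, b, s and r in the text untouched.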

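2.1.4 Next, storing the scraped data

I use the pandas library here.
Collect the fields into a list, convert the list into a pd.Series, and return it from the function. The DataFrame append method can then add the data row by row.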

datalist = [str(link), str(job), str(companyname), str(string), str(education), str(salary),
            str(area), str(companytype), str(companyscale), str(workyear), str(s)]
series = pd.Series(datalist, index=['link', 'job', 'companyname', 'welfare', 'education', 'salary',
                                    'area', 'companytype', 'companyscale', 'workyear', 'describe'])
return series

Test:

if __name__ == '__main__':
    page_number = 1
    while page_number != 1001:
        print('正在爬取第：', str(page_number), '页...')
        datasets = pd.DataFrame()
        links = get_position_links(page_number)
        i = 0
        for link in links:
            print(link)
            state, series = get_content(link)
            # print(type(series))
            if state == 1:
                # print(series)
                # DataFrame.append was removed in pandas 2.0; use pd.concat on newer versions
                datasets = datasets.append(series, ignore_index=True)
                i = i + 1
                print("------------------" + str(page_number) + "--" + str(i) + "----------------------")
            # if i > 10:        # for testing: fetch 11 records and check the result
            #     print(datasets)
            #     exit()
        print('第 ', str(page_number), ' 爬取完成')
        page_number = page_number + 1
        print(datasets)
        # Append this page's rows to the CSV file
        datasets.to_csv('./analyze/datasets.csv', sep='#', index=False, index_label=False, encoding='utf-8', mode='a+')

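Result:

2.1.5 Next, trying multithreading to speed up the scraping

I will leave this for later and get the data downloaded first.

2.2 Starting the scrape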

if __name__ == '__main__':
    page_number = 1
    datasets = pd.DataFrame(columns=['link', 'job', 'companyname', 'welfare', 'education', 'salary',
                                     'area', 'companytype', 'companyscale', 'workyear', 'describe'])
    while page_number != 1001:
        print('正在爬取第：', str(page_number), '页...')
        links = get_position_links(page_number)
        i = 0
        for link in links:
            print(link)
            series = get_content(link)
            # print(series)
            # DataFrame.append was removed in pandas 2.0; use pd.concat on newer versions
            datasets = datasets.append(series, ignore_index=True)
            i = i + 1
            print("------------------" + str(page_number) + "--" + str(i) + "----------------------")
            # if i > 10:
            #     print(datasets)
            #     exit()
        print('第 ', str(page_number), ' 爬取完成')
        page_number = page_number + 1
    print(datasets)
    datasets.to_csv('./analyze/datasets.csv', sep='#', index=False, index_label=False, encoding='utf-8', mode='a+')

Result:

2.2.1 Exception handling

Timeouts during page requests

These are handled by sending the request again:

max_retries = 0
while max_retries < 4:
    try:
        req1 = requests.get(link, headers=get_header(), timeout=10)
        break
    except requests.exceptions.ReadTimeout:
        print("读取超时，第" + str(max_retries) + "次请求,准备进行下一次尝试")
        max_retries = max_retries + 1
    except requests.exceptions.ConnectTimeout:
        print("连接超时，第" + str(max_retries) + "次请求,准备进行下一次尝试")
        max_retries = max_retries + 1
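
The retry loop can also be factored into a reusable helper. The sketch below is an illustration under assumptions: call_with_retries and flaky are hypothetical names, and a simulated TimeoutError stands in for the requests timeout exceptions so that the example runs without network access.

```python
import time

def call_with_retries(func, exceptions, max_retries=4, delay=0.0):
    """Call func() until it succeeds or max_retries attempts have failed."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except exceptions as exc:
            print('attempt', attempt, 'failed:', exc)
            time.sleep(delay)
    return None  # every attempt failed; the caller must handle this case

# Stand-in for a flaky request: fails twice, then succeeds
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise TimeoutError('simulated timeout')
    return 'page content'

result = call_with_retries(flaky, (TimeoutError,))
print(result)
```

In the scraper, func could be a lambda wrapping requests.get(link, headers=get_header(), timeout=10), with (requests.exceptions.ReadTimeout, requests.exceptions.ConnectTimeout) as the exceptions. Returning None when all attempts fail makes the failure explicit instead of leaving the response variable undefined.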

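The page returned is not a 51job page

In this case the program cannot scrape anything, because the HTML structure is different.
The XPath query for the first field returns a list. If nothing matched, indexing the empty list raises IndexError, so I added an exception handler: when no information is found, series is set to None.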
job = html1.xpath('//div[@class="tHeader tHjob"]//h1/text()')[0].strip()

https://m.jb51.net/article/152963.htm

    try:
        # Job title
        job = html1.xpath('//div[@class="tHeader tHjob"]//h1/text()')[0].strip()
        # print('工作名称：', job)
        .............
        datalist = [str(link), str(job), str(companyname), str(string), str(education), str(salary),
                    str(area), str(companytype), str(companyscale), str(workyear), str(s)]
        series = pd.Series(datalist, index=['link', 'job', 'companyname', 'welfare', 'education', 'salary',
                                            'area', 'companytype', 'companyscale', 'workyear', 'describe'])
        return (1, series)
    except IndexError:
        print('error,未定位到有效信息导致索引越界')
        series = None
        return (-1, series)
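
An alternative to catching IndexError is to check for an empty result before indexing. This is a sketch of that idea, not the approach used above; first_text is a hypothetical helper, and results stands for the list returned by a call such as html1.xpath(...).

```python
def first_text(results, default=None):
    """Return the first query result stripped of whitespace, or default if the list is empty."""
    return results[0].strip() if results else default

found = first_text(['  数据分析专员  '])
missing = first_text([])
print(found)
print(missing)
```

With a helper like this, a page that is not a 51job detail page yields None for the job title, and the caller can skip the record without relying on an exception.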

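Checking the result

2.3 The program can now run more or less continuously

Drawback: slow, slow, slow.
Time: scraping one page takes roughly 40 s.
The goal of scraping the data is basically achieved.
As of 2019-06-03 13:28:55:

the program has scraped 17 pages so far.

As of 2019-06-03 18:20:52, 500 pages have been scraped, 24288 records in total.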

3. Data processing

3.1 Inspecting the scraped data

import pandas as pd

data_path = "./analyze/test_datasets.csv"
test_data = pd.read_csv(data_path, delimiter='#', encoding='utf-8')
print(test_data.columns)
>>>Index(['area', 'companyname', 'companyscale', 'companytype', 'describe',
       'education', 'job', 'link', 'salary', 'welfare', 'workyear'],
      dtype='object')
print(test_data.shape)
>>>(24288, 11)
print(test_data.describe())
>>>          area  companyname companyscale  ...  salary  welfare workyear
count    24288        24288        24288  ...   24133    21214    24288
unique     569        18483           10  ...     514    14247        8
top     广州-天河区  companyname      50-150人  ...  6-8千/月  welfare    无工作经验
freq      1136          499         7781  ...    3024      499     7267
[4 rows x 11 columns]
print(test_data.iloc[956:960,:])  # look at a few arbitrary rows
>>>        area   companyname  ...                                   welfare workyear
956   上海-长宁区  广州淘通科技股份有限公司  ...            绩效奖金,年终奖金,弹性工作,专业培训,五险一金,定期体检,    无工作经验
957  南京-雨花台区  南京科源信息技术有限公司  ...                 过节福利,加班补贴,员工旅游,周末双休,五险一金,     2年经验
958   上海-长宁区      携程旅行网业务区  ...        加班补贴,餐饮补贴,五险一金,专业培训,全勤奖,绩效奖金,做五休二,     1年经验
959  深圳-龙华新区    深圳领脉科技有限公司  ...  定期体检,弹性工作,股票期权,绩效奖金,年终奖金,专业培训,餐饮补贴,五险一金,    无工作经验
[4 rows x 11 columns]
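
In the summary above, the most frequent value in the companyname column is the string "companyname" itself, 499 times. That suggests the CSV header was written again each time a page was appended, so header rows ended up inside the data. A minimal sketch of appending with the header written only once follows; it uses the standard csv module instead of pandas so that it runs on its own, and append_rows and the shortened column list are my own illustrative choices.

```python
import csv
import os
import tempfile

FIELDS = ['link', 'job', 'salary']  # shortened column list for illustration

def append_rows(path, rows, fieldnames=FIELDS, sep='#'):
    """Append rows to a CSV file, writing the header only for a new or empty file."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=sep)
        if need_header:
            writer.writeheader()
        writer.writerows(rows)

# Simulate two pages being saved one after the other
path = os.path.join(tempfile.mkdtemp(), 'datasets.csv')
append_rows(path, [{'link': 'a', 'job': 'j1', 'salary': '6-8千/月'}])
append_rows(path, [{'link': 'b', 'job': 'j2', 'salary': '3.5-5千/月'}])
with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

With pandas, the same effect is available by passing header=False to to_csv for every page after the first.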

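3.2 Checking for duplicates

Use pandas.DataFrame.duplicated to check for duplicates.
It returns a boolean Series that marks duplicate rows, optionally considering only certain columns.

Checking by company name shows that the company name cannot serve as a unique identifier, since the same company may be hiring for several positions.
Checking by request link shows that the link can serve as a unique identifier; no duplicates exist.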

print(test_data['companyname'].duplicated())
print(test_data['link'].duplicated())
print(test_data.iloc[24281:24282,:])
print(test_data.loc[test_data["companyname"]=="重庆江小白品牌管理有限公司"]["link"])
>>>         area    companyname companyscale  ...    salary welfare workyear
15262  成都-武侯区  重庆江小白品牌管理有限公司   1000-5000人  ...  3.5-5千/月   年终奖金,    无工作经验
15480  成都-武侯区  重庆江小白品牌管理有限公司   1000-5000人  ...  3.5-5千/月   年终奖金,    无工作经验
24281      镇江  重庆江小白品牌管理有限公司   1000-5000人  ...  4.5-6千/月     NaN    无工作经验
[3 rows x 11 columns]
15262    https://jobs.51job.com/chengdu-whq/98113296.ht...
15480    https://jobs.51job.com/chengdu-whq/98113296.ht...
24281    https://jobs.51job.com/zhenjiang/99364955.html...
Name: link, dtype: object
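
To show what the boolean result of duplicated means, here is a pure-Python sketch of the same idea. It mirrors the default keep='first' behavior, where the first occurrence is not flagged and later repeats are; the function is a re-implementation for illustration, not the pandas code, and the links are made up.

```python
def duplicated(values):
    """Return a list of booleans: True where a value has already been seen earlier."""
    seen = set()
    flags = []
    for value in values:
        flags.append(value in seen)
        seen.add(value)
    return flags

links = ['/job/1.html', '/job/2.html', '/job/1.html', '/job/3.html']
flags = duplicated(links)
# Keep only the rows that are not flagged as duplicates
unique_links = [link for link, dup in zip(links, flags) if not dup]
print(flags)
print(unique_links)
```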

True and False in Python

True equals 1
False equals 0

numpy.logical_xor  # Compute the truth value of x1 XOR x2, element-wise.

XOR on sets (symmetric difference)
https://www.cnblogs.com/organic/archive/2015/12/06/5023038.html
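
These facts can be verified with built-in types alone. The sketch uses plain Python booleans and sets, not numpy.logical_xor, so it only illustrates the underlying idea.

```python
# bool is a subclass of int: True behaves as 1 and False as 0
print(True == 1, False == 0)
flags = [True, False, True, True]
count_true = sum(flags)  # counts the True entries

# XOR on booleans is true when exactly one operand is true
xor_result = True ^ False

# XOR on sets is the symmetric difference: elements found in exactly one of the sets
only_one = {1, 2, 3} ^ {2, 3, 4}
print(count_true, xor_result, only_one)
```

Because True counts as 1, summing a boolean mask is a quick way to count how many rows were flagged, for example as duplicates.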

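Result after deduplication

A test of the numpy.logical_xor() function:
https://blog.csdn.net/az9996/article/details/90771491

18790 records remain after deduplication.

Of these, 2630 are positions containing the keywords 数据 (data) / 分析 (analysis).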

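Adjusting the data format

The meaning of r'', b'', u'', f'' in Python
https://blog.csdn.net/gymaisyl/article/details/85109627

if u'万' in salary and u'年' in salary:  # convert everything to the form 千/月 (thousand per month)

Ways to control floating-point precision
https://blog.csdn.net/larykaiy/article/details/83095042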

low_salary = '%.2f'%(float(low_salary) / 12 * 10)
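
Putting the two fragments together, a salary normalizer might look like the sketch below. It assumes ranges of the form "low-high" followed by 千 or 万 and then /月 or /年, as in the sample rows shown earlier (for example 6-8千/月); anything else is returned as None. The function name is hypothetical.

```python
def normalize_salary(salary: str):
    """Convert a salary range to (low, high) strings in 千/月, or None if not recognized."""
    if '/' not in salary or '-' not in salary:
        return None
    amount, period = salary.split('/')
    unit = amount[-1]                       # '千' or '万'
    if unit not in ('千', '万') or period not in ('月', '年'):
        return None
    low, high = amount[:-1].split('-')
    factor = 10.0 if unit == '万' else 1.0   # 1万 = 10千
    if period == '年':
        factor /= 12                         # yearly amount -> monthly amount
    return '%.2f' % (float(low) * factor), '%.2f' % (float(high) * factor)

print(normalize_salary('12-24万/年'))
print(normalize_salary('6-8千/月'))
```

For a yearly amount in 万, the factor 10 / 12 is the same conversion as float(low_salary) / 12 * 10 in the line above.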

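Function annotations and parameter annotations

Function annotations are completely optional metadata information about the types used by user-defined functions.

Annotations are stored in the __annotations__ attribute of the function as a dictionary and have no effect on any other part of the function. Parameter annotations are defined by a colon after the parameter name, followed by an expression evaluating to the value of the annotation. Return annotations are defined by a literal ->, followed by an expression, between the parameter list and the colon denoting the end of the def statement.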

https://docs.python.org/3/tutorial/controlflow.html#function-annotations
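
A short example in the spirit of the one in the Python tutorial shows where the annotations end up:

```python
def f(ham: str, eggs: str = 'eggs') -> str:
    # Annotations are metadata only; they are not enforced at call time
    return ham + ' and ' + eggs

annotations = f.__annotations__
print(annotations)
print(f('spam'))
```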

4. Data analysis

4.1 Collecting references

This one is the most useful:
https://pyecharts.org/#/zh-cn/

This one is better:
https://blog.csdn.net/u010700066/article/details/80835018

For later use:
Drawing geographic maps
https://segmentfault.com/a/1190000010871928
Administrative divisions of the world's countries, file downloads
https://gadm.org/download_country_v3.html

Extracting the index of a pandas Series and converting it to a list
http://blog.sina.com.cn/s/blog_9e103b930102x9ic.html

Converting string elements of a list to int or float
https://blog.csdn.net/az9996/article/details/91408120

Summary of ways to iterate over pandas data
https://www.jb51.net/article/134753.htm

Drawing word clouds with wordcloud
https://www.jianshu.com/p/daa54db9045d

jieba word segmentation on GitHub
https://github.com/fxsjy/jieba

4.2 Drawing a pie chart

import os
import numpy as np
import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Pie

def education_operation():
    path = os.path.join(data_path, file_name[1])
    test_data = pd.read_csv(path, delimiter="#", encoding='utf-8', dtype={'salary': str})
    edu_values_count = test_data['education'].value_counts()  # returns a Series
    # Drop values that are headcounts rather than education levels
    drop_value = edu_values_count.drop(labels=['招1人', '招2人', '招3人', '招若干人', '招4人', '招5人', '招6人', '招20人', '招25人', '招10人'])
    print(drop_value)
    print(type(drop_value))
    print(drop_value.index)
    print(type(drop_value.index))
    print(drop_value.values)
    print(type(drop_value.values))
    np.set_printoptions(precision=3)
    # Share of each education level, converted to plain Python floats
    ratios = (drop_value.values / drop_value.values.sum()).round(decimals=3).tolist()
    c = (
        Pie()
        .add(
            "",
            [list(z) for z in zip(drop_value.index, ratios)],
            radius=["40%", "75%"],
        )
        .set_global_opts(
            title_opts=opts.TitleOpts(title="学历要求"),
            legend_opts=opts.LegendOpts(orient="vertical", pos_top="15%", pos_left="2%"),
        )
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    c.render('./view/education_pie.html')

4.3 Drawing a bar chart

    c = (
        Bar()
        .add_xaxis(low_ndarray_to_list)
        # Note: a second add_xaxis call replaces the categories set by the first
        .add_xaxis(high_ndarry_to_list)
        .add_yaxis("最低薪资分布", low_salary_count.values.tolist())
        .add_yaxis("最高薪资分布", high_salary_count.values.tolist())
        .set_global_opts(title_opts=opts.TitleOpts(title="薪资", subtitle=""))
    )
    c.render('./view/salary_bar.html')

4.4 Drawing a word cloud

words = []
# (word, frequency) pairs; Series.iteritems() was removed in pandas 2.0, items() is equivalent
for word in count.items():
    print(word)
    words.append(word)
c = (
    WordCloud()
    .add("", words, word_size_range=[20, 100])
    .set_global_opts(title_opts=opts.TitleOpts(title="WordCloud"))
)
c.render('./view/job_information_wordcloud.html')

4.5 Drawing a geographic map

World map, China province-level map, China city-level map, China district/county-level map, China regional map:

pip install echarts-countries-pypkg
pip install echarts-china-provinces-pypkg
pip install echarts-china-cities-pypkg
pip install echarts-china-counties-pypkg
pip install echarts-china-misc-pypkg

Map data extension package:

pip install echarts-cities-pypkg

Reference material
https://blog.csdn.net/weixin_41563274/article/details/82904106

This still needs tuning; the result is not ideal.

Troubleshooting

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "D:\Program Files\JetBrains\PyCharm 2019.1\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "D:\Program Files\JetBrains\PyCharm 2019.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:/programming/python/PycharmProjects/Final_course_design/transform_format.py", line 235, in <module>
    area_operation()
  File "D:/programming/python/PycharmProjects/Final_course_design/transform_format.py", line 219, in area_operation
    type_=ChartType.HEATMAP,
  File "D:\programming\python\PycharmProjects\Final_course_design\venv\lib\site-packages\pyecharts\charts\basic_charts\geo.py", line 54, in add
    data = self._feed_data(data_pair, type_)
  File "D:\programming\python\PycharmProjects\Final_course_design\venv\lib\site-packages\pyecharts\charts\basic_charts\geo.py", line 144, in _feed_data
    lng, lat = self.get_coordinate(n)
TypeError: cannot unpack non-iterable NoneType object

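At first the error TypeError: cannot unpack non-iterable NoneType object puzzled me. While searching I came across the analysis in the article linked below (I have to say, Google's search really does work better!), which made me realize that some function in the library was probably returning None, since a NoneType object certainly cannot be iterated or unpacked.

Looking more closely at the line that raised the error, lng, lat = self.get_coordinate(n), I suspected that this call was where things went wrong. The enclosing function in the library source is defined as follows: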

    def _feed_data(self, data_pair: Sequence, type_: str) -> Sequence:
        result = []
        for n, v in data_pair:
            if type_ == ChartType.LINES:
                f, t = self.get_coordinate(n), self.get_coordinate(v)
                result.append({"name": "{}->{}".format(n, v), "coords": [f, t]})
            else:
                lng, lat = self.get_coordinate(n)
                result.append({"name": n, "value": [lng, lat, v]})
        return result

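get_coordinate takes the name of a province or city and returns its geographic coordinates as a list. If the name is wrong (it does not match any name known to the library), no coordinates are found and the return value is None.

So I changed the province and city names to the format the library uses; the program then ran without errors and the problem was solved.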

https://www.crifan.com/python_typeerror_nonetype_object_is_not_iterable/
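
The failure is easy to reproduce without pyecharts. In the sketch below, COORDS is a hypothetical lookup table with rough illustrative coordinates, and this get_coordinate is a stand-in function written for the example, not the library's implementation.

```python
# Hypothetical coordinate table standing in for the library's built-in data
COORDS = {'北京': [116.4, 39.9], '武汉市': [114.3, 30.6]}

def get_coordinate(name):
    # dict.get returns None for a name that is not in the table
    return COORDS.get(name)

data = [('北京', 10), ('武汉', 5)]  # '武汉' lacks the '市' suffix used in the table

error = None
try:
    for name, value in data:
        lng, lat = get_coordinate(name)  # unpacking None raises TypeError
except TypeError as exc:
    error = str(exc)
    print(error)

# Filter out names the table does not know before plotting
known = [(n, v) for n, v in data if get_coordinate(n) is not None]
print(known)
```

Checking names against the table first also reveals which entries need their format adjusted, instead of stopping at the first unknown name.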

This time the result looks good!

def area_operation():
    path = os.path.join(data_path, file_name[1])
    # print(path)
    test_data = pd.read_csv(path, delimiter="#", encoding='utf-8', dtype={'salary': str})
    area = test_data['area']
    # print(area)
    final = []
    # Series.iteritems() was removed in pandas 2.0; items() is equivalent
    for item in area.items():
        # Keep only the city part of values such as "上海-长宁区"
        kkk = re.split(r'-', item[1])
        final.append(kkk[0])
    series = pd.Series(final)
    area_count = series.value_counts()
    data_sheng = []
    data_shi = []
    for k in area_count.items():
        print(k[0])
        name = k[0]
        print(type(k[0]))
        if name in ['北京', '上海', '天津', '重庆', '陕西省', '湖北省', '江苏省', '广东省']:
            # Province-level names: drop the "省" suffix
            name = re.sub(r'[省]', '', name)
            tmp = [name, k[1]]
            data_sheng.append(tmp)
        else:
            # City names: add the "市" suffix expected by the library
            name = name + '市'
            tmp = [name, k[1]]
            data_shi.append(tmp)
    print(data_sheng)
    print(data_shi)
    c = (
        Geo()
        .add_schema(maptype="china")
        .add("省", data_sheng)
        .add('市', data_shi)
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        .set_global_opts(
            visualmap_opts=opts.VisualMapOpts(),
            title_opts=opts.TitleOpts(title="工作分布"),
        )
    )
    c.render('./view/area_heatMap.html')

Converting a NumPy array to a Python list

NumPy arrays (9): converting a NumPy array to a Python list
https://blog.csdn.net/zhubao124/article/details/80719306

Some other good approaches I came across

https://www.jianshu.com/p/af61c53c0705
