Overview

  • A Python scraper that parses with bs4 and fetches with the requests library, no other framework. It scrapes Lianjia new-home listings, saves the data to a MySQL database (image URLs included), stores the image files themselves locally, and also draws a simple GUI pie-chart visualization.

Screenshots

  • Console output

  • Pie chart

  • Database rows

  • Images saved locally

Implementation

  • Pick the target URL. Mine is this one: Lianjia New Homes – Guangzhou.
  • Then the dependencies. I installed bs4, pymysql, and matplotlib; install whatever you are missing. The code is broken into sections below.
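For reference, the three packages (plus requests, which the code also uses) can be installed from PyPI; note that bs4 is published under the name beautifulsoup4:

```shell
pip install beautifulsoup4 pymysql requests matplotlib
```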
  • Configure the headers and initialize the project:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'Accept': 'image/webp,image/*,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'http://www.baidu.com/link?url=_andhfsjjjKRgEWkj7i9cFmYYGsisrnm2A-TN3XZDQXxvGsM9k9ZZSnikW2Yds4s&wd=&eqid=c3435a7d00006bd600000003582bfd1f',
    'Connection': 'keep-alive',
}
page = ('pg')
hlist = []
  • Parse the listing page:
def listinfo(listhtml):
    areasoup = BeautifulSoup(listhtml, 'html.parser')
    ljhouse = areasoup.find_all('div', attrs={'class': 'resblock-desc-wrapper'})
    loupanimg = areasoup.find_all("img", attrs={"class": "lj-lazy"})
    i = 0
    for house in ljhouse:
        loupantitle = house.find("div", attrs={"class": "resblock-name"})
        loupanname = loupantitle.a.get_text()
        loupantag = loupantitle.find_all("span")
        wuye = loupantag[0].get_text()
        xiaoshouzhuangtai = loupantag[1].get_text()
        location = house.find("div", attrs={"class": "resblock-location"}).get_text()
        jishi = house.find("a", attrs={"class": "resblock-room"}).get_text()
        area = house.find("div", attrs={"class": "resblock-area"})
        sarea = area.find("span").get_text()
        r_area = '暂无'
        if sarea != '':
            r_area = house.find("div", attrs={"class": "resblock-area"}).get_text().split()[1]
        tag = house.find("div", attrs={"class": "resblock-tag"}).get_text()
        jiage = house.find("div", attrs={"class": "resblock-price"})
        price = jiage.find("div", attrs={"class": "main-price"}).get_text().split()[0]  # extract the number
        if price.replace('\n', '').find('-') != -1:
            price = price.split('-')[1]
        total = jiage.find("div", attrs={"class": "second"})
        totalprice = "暂无"
        if total is not None:
            totalprice = total.get_text()
        h = {'title': loupanname, 'wuye': wuye, 'states': xiaoshouzhuangtai,
             'location': location.replace("\n", ""), 'jishi': jishi.replace("\n", ""),
             'area': r_area.replace('\n', ''), 'tag': tag.replace('\n', ''),
             'price': price.replace('\n', ''), 'totalprice': totalprice,
             'loupanimg': loupanimg[i].get('data-original')}
        i = i + 1
        hlist.append(h)
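One caveat with the chained calls above: bs4's find() returns None when a listing lacks a block, which would crash the following .get_text(). A small guard helper (my own addition, not part of the original code; the default mirrors the '暂无' fallback used above) could be dropped in:

```python
def text_or_default(node, default='暂无'):
    # Return the stripped text of a bs4 tag, or a default when find() returned None
    return node.get_text().strip() if node is not None else default
```

It works with any object exposing get_text(), so each `house.find(...)` result can be wrapped in it instead of being dereferenced directly.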
  • Connect to the database:
conn = pymysql.connect(
    host='localhost',
    user='username',          # your MySQL user
    password='password',      # your MySQL password
    database='database_name',
    charset='utf8',
    # autocommit=True,  # auto-commit each insert; equivalent to calling conn.commit()
)
# Create a cursor object with cursor()
cursor = conn.cursor()
for i in range(len(hlist)):
    try:
        insert_sql = "insert into gzdata(title, wuye, states, location, jishi, area, tag, price, totalprice, imgurl) values ("
        for key in hlist[i]:
            insert_sql = insert_sql + "'" + hlist[i][key] + "',"
        insert_sql = insert_sql[:-1] + ")"
        print(insert_sql)
        cursor.execute(insert_sql)
        conn.commit()
    except:
        pass
print("Data saved to the database")
# Close the database connection
conn.close()
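The string-concatenated INSERT above silently drops any row whose title contains a quote (the bare except swallows the error) and is open to SQL injection. pymysql's execute() accepts a parameter tuple and escapes every value itself. A sketch with the same gzdata columns (the record dict here is illustrative, shaped like one hlist entry):

```python
record = {'title': "demo", 'wuye': 'x', 'states': 'x', 'location': 'x',
          'jishi': 'x', 'area': 'x', 'tag': 'x', 'price': '30000',
          'totalprice': '暂无', 'loupanimg': 'http://example.com/a.jpg'}
columns = "title, wuye, states, location, jishi, area, tag, price, totalprice, imgurl"
placeholders = ", ".join(["%s"] * len(record))  # one %s per column
insert_sql = f"insert into gzdata({columns}) values ({placeholders})"
params = tuple(record.values())
# cursor.execute(insert_sql, params)  # pymysql quotes/escapes each value
```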
  • Save the images locally:
def downloadPic(hlist):
    for i in range(len(hlist)):
        imgUrl = requests.get(hlist[i]['loupanimg']).content
        f = open('C:\\Users\\Lenovo\\Desktop\\study\\py\\catchLJ\\imgFile\\' + hlist[i]['title'] + '.jpg', 'wb')
        f.write(imgUrl)
        print(hlist[i]['title'], "image downloading")
        f.close()
    print("Image download complete")
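Since the listing title is used directly as the filename, a title containing a character Windows forbids in filenames (such as / or ?) would make open() fail. A small sanitizer (my own addition, not in the original) sidesteps that:

```python
import re

def safe_filename(title):
    # Replace characters Windows disallows in filenames with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()
```

Wrapping the title, e.g. `safe_filename(hlist[i]['title']) + '.jpg'`, keeps the download loop otherwise unchanged.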
  • Visualize with a pie chart:
def printPic(list):
    plt.rcParams["font.family"] = "kaiti"
    priceData = []
    lowNum = 0
    midNum = 0
    heighNum = 0
    # Prepare the data
    for i in range(len(list)):
        priceData.append(int(list[i]['price']))
    # print(priceData)
    for num in priceData:
        if num <= 20000:
            lowNum = lowNum + 1
        elif num <= 45000:
            midNum = midNum + 1
        else:
            heighNum = heighNum + 1
    p_low = lowNum / len(priceData)
    p_mid = midNum / len(priceData)
    p_height = heighNum / len(priceData)
    nums = [p_low, p_mid, p_height]
    labels = ['0-20000', '20001-45000', '>45000']
    # Draw the pie chart with Matplotlib
    plt.pie(x=nums, labels=labels, autopct="%.1f%%", shadow=True)
    plt.title('Price range share (per m²)', size=20)
    plt.show()
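The three-counter loop above can be condensed. A sketch of the same bucketing (identical thresholds) using collections.Counter, returning the three fractions fed to plt.pie:

```python
from collections import Counter

def price_shares(prices, bounds=(20000, 45000)):
    # Fraction of prices in each band: <=20000, 20001-45000, >45000
    counts = Counter(0 if p <= bounds[0] else 1 if p <= bounds[1] else 2
                     for p in prices)
    total = len(prices)
    return [counts[i] / total for i in range(3)]
```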
  • Main entry point:
if __name__ == '__main__':
    user_in_city = 'gz'
    url = generate_cityurl(user_in_city)
    print(url)
    areahtml = areainfo(url)
    listinfo(areahtml)
    downloadPic(hlist)
    printPic(hlist)
    print(hlist)

Full code

  • Note: I wrote this after only two or three days of Python, so the data handling is redundant and unoptimized, and the anti-scraping countermeasure is just a crude sleep. It is only a quick homework exercise. With the packages and the MySQL database configured it runs as-is. For learning purposes only.
from bs4 import BeautifulSoup
import requests
import time
import pymysql
import matplotlib.pyplot as plt
# Configure headers; the Referer defeats hotlink protection so the images can be downloaded
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'Accept': 'image/webp,image/*,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'http://www.baidu.com/link?url=_andhfsjjjKRgEWkj7i9cFmYYGsisrnm2A-TN3XZDQXxvGsM9k9ZZSnikW2Yds4s&wd=&eqid=c3435a7d00006bd600000003582bfd1f',
    'Connection': 'keep-alive',
}
page = ('pg')

def generate_cityurl(user_in_city):  # Build the city URL
    cityurl = 'https://' + user_in_city + '.lianjia.com/loupan/'
    return cityurl

def areainfo(url):
    page = ('pg')
    for i in range(1, 2):  # Fetch pages 1 to n
        if i == 1:
            i = str(i)
            a = (url + page + i + '/')
            r = requests.get(url=a, headers=headers)
            print(a)
            htmlinfo = r.content
        else:
            i = str(i)
            a = (url + page + i + '/')
            print(a)
            r = requests.get(url=a, headers=headers)
            html2 = r.content
            htmlinfo = htmlinfo + html2
        time.sleep(0.2)
    return htmlinfo

hlist = []
def listinfo(listhtml):
    areasoup = BeautifulSoup(listhtml, 'html.parser')
    ljhouse = areasoup.find_all('div', attrs={'class': 'resblock-desc-wrapper'})
    loupanimg = areasoup.find_all("img", attrs={"class": "lj-lazy"})
    i = 0
    for house in ljhouse:
        loupantitle = house.find("div", attrs={"class": "resblock-name"})
        loupanname = loupantitle.a.get_text()
        loupantag = loupantitle.find_all("span")
        wuye = loupantag[0].get_text()
        xiaoshouzhuangtai = loupantag[1].get_text()
        location = house.find("div", attrs={"class": "resblock-location"}).get_text()
        jishi = house.find("a", attrs={"class": "resblock-room"}).get_text()
        area = house.find("div", attrs={"class": "resblock-area"})
        sarea = area.find("span").get_text()
        r_area = '暂无'
        if sarea != '':
            r_area = house.find("div", attrs={"class": "resblock-area"}).get_text().split()[1]
        tag = house.find("div", attrs={"class": "resblock-tag"}).get_text()
        jiage = house.find("div", attrs={"class": "resblock-price"})
        price = jiage.find("div", attrs={"class": "main-price"}).get_text().split()[0]  # extract the number
        if price.replace('\n', '').find('-') != -1:
            price = price.split('-')[1]
        total = jiage.find("div", attrs={"class": "second"})
        totalprice = "暂无"
        if total is not None:
            totalprice = total.get_text()
        h = {'title': loupanname, 'wuye': wuye, 'states': xiaoshouzhuangtai,
             'location': location.replace("\n", ""), 'jishi': jishi.replace("\n", ""),
             'area': r_area.replace('\n', ''), 'tag': tag.replace('\n', ''),
             'price': price.replace('\n', ''), 'totalprice': totalprice,
             'loupanimg': loupanimg[i].get('data-original')}
        i = i + 1
        hlist.append(h)

# Save images to local disk
def downloadPic(hlist):
    for i in range(len(hlist)):
        imgUrl = requests.get(hlist[i]['loupanimg']).content
        f = open('C:\\Users\\Lenovo\\Desktop\\study\\py\\catchLJ\\imgFile\\' + hlist[i]['title'] + '.jpg', 'wb')
        f.write(imgUrl)
        print(hlist[i]['title'], "image downloading")
        f.close()
    print("Image download complete")

# Visualization
def printPic(list):
    plt.rcParams["font.family"] = "kaiti"
    priceData = []
    lowNum = 0
    midNum = 0
    heighNum = 0
    # Prepare the data
    for i in range(len(list)):
        priceData.append(int(list[i]['price']))
    # print(priceData)
    for num in priceData:
        if num <= 20000:
            lowNum = lowNum + 1
        elif num <= 45000:
            midNum = midNum + 1
        else:
            heighNum = heighNum + 1
    p_low = lowNum / len(priceData)
    p_mid = midNum / len(priceData)
    p_height = heighNum / len(priceData)
    nums = [p_low, p_mid, p_height]
    labels = ['0-20000', '20001-45000', '>45000']
    # Draw the pie chart with Matplotlib
    plt.pie(x=nums, labels=labels, autopct="%.1f%%", shadow=True)
    plt.title('Price range share (per m²)', size=20)
    plt.show()

if __name__ == '__main__':
    # user_in_city = input('Enter the city to scrape: ')
    user_in_city = 'gz'
    url = generate_cityurl(user_in_city)
    print(url)
    areahtml = areainfo(url)
    listinfo(areahtml)
    downloadPic(hlist)
    printPic(hlist)
    print(hlist)

# Connect to the database
conn = pymysql.connect(
    host='localhost',
    user='root',
    password='123456',
    database='catchdata',
    charset='utf8',
    # autocommit=True,  # auto-commit each insert; equivalent to calling conn.commit()
)
# Create a cursor object with cursor()
cursor = conn.cursor()
for i in range(len(hlist)):
    try:
        insert_sql = "insert into gzdata(title, wuye, states, location, jishi, area, tag, price, totalprice, imgurl) values ("
        for key in hlist[i]:
            insert_sql = insert_sql + "'" + hlist[i][key] + "',"
        insert_sql = insert_sql[:-1] + ")"
        print(insert_sql)
        cursor.execute(insert_sql)
        conn.commit()
    except:
        pass
print("Data saved to the database")
# Close the database connection
conn.close()
