A bs4-based Python scraper for Lianjia new homes (Guangzhou pages)
Overview
- A Python scraper built on bs4 parsing with no extra framework: pages are fetched with the requests library. It scrapes Lianjia new-home listings, saves the data to a MySQL database (image URLs included), downloads the images themselves to local disk, and also renders a small GUI pie-chart visualization.
Screenshots
- Console output
- Pie chart
- Database
- Images saved locally
Implementation
- Pick the target URL. Mine is the Lianjia new-homes page for Guangzhou (`https://gz.lianjia.com/loupan/`).
- Next, the dependencies. I installed bs4, pymysql, and matplotlib (requests is also used); install whatever you are missing. The code is broken down section by section below.
- Header configuration and project initialization:
```python
# Request headers; the Referer is set to defeat hotlink protection so the images download
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'Accept': 'image/webp,image/*,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'http://www.baidu.com/link?url=_andhfsjjjKRgEWkj7i9cFmYYGsisrnm2A-TN3XZDQXxvGsM9k9ZZSnikW2Yds4s&wd=&eqid=c3435a7d00006bd600000003582bfd1f',
    'Connection': 'keep-alive'
}
page = 'pg'   # pagination path segment
hlist = []    # accumulates one dict per listing
```
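The `page` constant is just the path segment Lianjia uses for pagination: page n of a city lives at `<city url>pg<n>/`. A minimal sketch of how the URLs are assembled (`generate_cityurl` is from the full code below; the `page_url` helper is my own, added for illustration):

```python
def generate_cityurl(user_in_city):
    # Each city gets its own subdomain, e.g. https://gz.lianjia.com/loupan/
    return 'https://' + user_in_city + '.lianjia.com/loupan/'

def page_url(cityurl, n):
    # Page n is reached by appending the 'pg' segment and the page number
    return cityurl + 'pg' + str(n) + '/'

print(page_url(generate_cityurl('gz'), 2))
# https://gz.lianjia.com/loupan/pg2/
```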
- Processing the listing HTML:

```python
def listinfo(listhtml):
    areasoup = BeautifulSoup(listhtml, 'html.parser')
    ljhouse = areasoup.find_all('div', attrs={'class': 'resblock-desc-wrapper'})
    loupanimg = areasoup.find_all('img', attrs={'class': 'lj-lazy'})
    i = 0
    for house in ljhouse:
        loupantitle = house.find('div', attrs={'class': 'resblock-name'})
        loupanname = loupantitle.a.get_text()
        loupantag = loupantitle.find_all('span')
        wuye = loupantag[0].get_text()               # property type
        xiaoshouzhuangtai = loupantag[1].get_text()  # sales status
        location = house.find('div', attrs={'class': 'resblock-location'}).get_text()
        jishi = house.find('a', attrs={'class': 'resblock-room'}).get_text()  # room layout
        area = house.find('div', attrs={'class': 'resblock-area'})
        sarea = area.find('span').get_text()
        r_area = '暂无'  # "not available"
        if sarea != '':
            r_area = house.find('div', attrs={'class': 'resblock-area'}).get_text().split()[1]
        tag = house.find('div', attrs={'class': 'resblock-tag'}).get_text()
        jiage = house.find('div', attrs={'class': 'resblock-price'})
        price = jiage.find('div', attrs={'class': 'main-price'}).get_text().split()[0]  # keep the number only
        if price.replace('\n', '').find('-') != -1:  # "a-b" range: keep the upper bound
            price = price.split('-')[1]
        total = jiage.find('div', attrs={'class': 'second'})
        totalprice = '暂无'
        if total is not None:
            totalprice = total.get_text()
        h = {'title': loupanname, 'wuye': wuye, 'states': xiaoshouzhuangtai,
             'location': location.replace('\n', ''),
             'jishi': jishi.replace('\n', ''), 'area': r_area.replace('\n', ''),
             'tag': tag.replace('\n', ''), 'price': price.replace('\n', ''),
             'totalprice': totalprice,
             'loupanimg': loupanimg[i].get('data-original')}  # lazy-loaded image URL
        i = i + 1
        hlist.append(h)
```
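The find/find_all pattern used here can be tried on a tiny handcrafted snippet. This is a sketch with made-up HTML that only mimics the class names the scraper targets, not real Lianjia markup:

```python
from bs4 import BeautifulSoup

# Minimal fake markup using the same class names listinfo() looks for
html = '''
<div class="resblock-desc-wrapper">
  <div class="resblock-name"><a>Demo Court</a>
    <span>residential</span><span>on sale</span>
  </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
house = soup.find('div', attrs={'class': 'resblock-desc-wrapper'})
title = house.find('div', attrs={'class': 'resblock-name'})
name = title.a.get_text()                            # text of the <a> child
tags = [s.get_text() for s in title.find_all('span')]  # all <span> texts
print(name, tags)  # Demo Court ['residential', 'on sale']
```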
- Connecting to the database:

```python
conn = pymysql.connect(
    host='localhost',
    user='username',      # your MySQL user
    password='password',  # your MySQL password
    database='dbname',    # your database name
    charset='utf8',
    # autocommit=True,  # auto-commit each insert; equivalent to calling conn.commit()
)
# Create a cursor object with cursor()
cursor = conn.cursor()
for i in range(len(hlist)):
    try:
        insert_sql = "insert into gzdata(title, wuye, states, location, jishi, area, tag, price, totalprice, imgurl) values ("
        for key in hlist[i]:
            insert_sql = insert_sql + "'" + hlist[i][key] + "',"
        insert_sql = insert_sql[:-1] + ")"
        print(insert_sql)
        cursor.execute(insert_sql)
        conn.commit()
    except Exception:
        pass  # skip rows that fail to insert
print('Data insertion complete')
# Close the database connection
conn.close()
```
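Concatenating values into the SQL string works for clean data but breaks on any title containing a quote, and is injection-prone. A safer pattern is a parameterized query; pymysql accepts the same shape with `%s` placeholders, but the sketch below uses the stdlib sqlite3 (with `?` placeholders) so it runs without a MySQL server:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('create table gzdata (title text, price text)')

hlist = [{'title': "O'Neill Garden", 'price': '35000'}]  # the quote would break string concatenation
for h in hlist:
    # Placeholders let the driver handle quoting and escaping
    cursor.execute('insert into gzdata(title, price) values (?, ?)',
                   (h['title'], h['price']))
conn.commit()
print(cursor.execute('select title from gzdata').fetchall())  # [("O'Neill Garden",)]
```

With pymysql the equivalent would be `cursor.execute("insert into gzdata(title, price) values (%s, %s)", (...))`.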
- Saving the images to disk:

```python
def downloadPic(hlist):
    for i in range(len(hlist)):
        imgUrl = requests.get(hlist[i]['loupanimg']).content
        # NOTE: adjust this path to your own machine
        with open('C:\\Users\\Lenovo\\Desktop\\study\\py\\catchLJ\\imgFile\\' + hlist[i]['title'] + '.jpg', 'wb') as f:
            f.write(imgUrl)
        print(hlist[i]['title'], 'downloading image')
    print('Image download complete')
```
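One gotcha in `downloadPic`: the listing title becomes the filename, and Windows forbids characters such as `\ / : * ? " < > |` in filenames, so a title containing any of them would crash the `open()`. A hedged sketch of a sanitizer (the `safe_filename` helper is my own, not part of the original code):

```python
import re

def safe_filename(title):
    # Replace characters Windows forbids in filenames with underscores
    return re.sub(r'[\\/:*?"<>|]', '_', title)

print(safe_filename('A/B:C*Tower'))  # A_B_C_Tower
```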
- Pie-chart visualization:

```python
def printPic(list):
    plt.rcParams['font.family'] = 'kaiti'  # a font with CJK glyphs, for the Chinese title
    priceData = []
    lowNum = 0
    midNum = 0
    heighNum = 0
    # Prepare the data
    for i in range(len(list)):
        priceData.append(int(list[i]['price']))
    for num in priceData:
        if num <= 20000:
            lowNum = lowNum + 1
        elif num <= 45000:
            midNum = midNum + 1
        else:
            heighNum = heighNum + 1
    p_low = lowNum / len(priceData)
    p_mid = midNum / len(priceData)
    p_height = heighNum / len(priceData)
    nums = [p_low, p_mid, p_height]
    labels = ['0-20000', '20001-45000', '>45000']
    # Draw the pie chart with Matplotlib
    plt.pie(x=nums, labels=labels, autopct='%.1f%%', shadow=True)
    plt.title('楼价区间比例(每平)', size=20)  # "share of price bands (per m²)"
    plt.show()
```
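The bucketing logic in `printPic` is easy to check separately from matplotlib. A minimal sketch (the `bucket_shares` helper is mine, extracted for illustration):

```python
def bucket_shares(prices):
    # Count prices per band, then convert counts to fractions of the total
    low = sum(1 for p in prices if p <= 20000)
    mid = sum(1 for p in prices if 20000 < p <= 45000)
    high = len(prices) - low - mid
    n = len(prices)
    return [low / n, mid / n, high / n]

print(bucket_shares([15000, 30000, 30000, 60000]))  # [0.25, 0.5, 0.25]
```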
- Main entry point:

```python
if __name__ == '__main__':
    user_in_city = 'gz'
    url = generate_cityurl(user_in_city)
    print(url)
    areahtml = areainfo(url)
    listinfo(areahtml)
    downloadPic(hlist)
    printPic(hlist)
    print(hlist)
```
Full code
- Note: I wrote this after only two or three days of learning Python, so the data handling is redundant and unoptimized, and the only anti-scraping countermeasure is a blunt sleep(). It is just a quick homework write-up: configure the dependencies and the MySQL database and it runs as-is. For learning purposes only.
```python
from bs4 import BeautifulSoup
import requests
import time
import pymysql
import matplotlib.pyplot as plt

# Request headers; the Referer is set to defeat hotlink protection so the images download
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36',
    'Accept': 'image/webp,image/*,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Referer': 'http://www.baidu.com/link?url=_andhfsjjjKRgEWkj7i9cFmYYGsisrnm2A-TN3XZDQXxvGsM9k9ZZSnikW2Yds4s&wd=&eqid=c3435a7d00006bd600000003582bfd1f',
    'Connection': 'keep-alive'
}

page = 'pg'


def generate_cityurl(user_in_city):  # build the city URL
    cityurl = 'https://' + user_in_city + '.lianjia.com/loupan/'
    return cityurl


def areainfo(url):
    page = 'pg'
    for i in range(1, 2):  # fetch pages 1..n
        if i == 1:
            i = str(i)
            a = (url + page + i + '/')
            r = requests.get(url=a, headers=headers)
            print(a)
            htmlinfo = r.content
        else:
            i = str(i)
            a = (url + page + i + '/')
            print(a)
            r = requests.get(url=a, headers=headers)
            html2 = r.content
            htmlinfo = htmlinfo + html2
        time.sleep(0.2)  # crude rate limiting
    return htmlinfo


hlist = []


def listinfo(listhtml):
    areasoup = BeautifulSoup(listhtml, 'html.parser')
    ljhouse = areasoup.find_all('div', attrs={'class': 'resblock-desc-wrapper'})
    loupanimg = areasoup.find_all('img', attrs={'class': 'lj-lazy'})
    i = 0
    for house in ljhouse:
        loupantitle = house.find('div', attrs={'class': 'resblock-name'})
        loupanname = loupantitle.a.get_text()
        loupantag = loupantitle.find_all('span')
        wuye = loupantag[0].get_text()               # property type
        xiaoshouzhuangtai = loupantag[1].get_text()  # sales status
        location = house.find('div', attrs={'class': 'resblock-location'}).get_text()
        jishi = house.find('a', attrs={'class': 'resblock-room'}).get_text()  # room layout
        area = house.find('div', attrs={'class': 'resblock-area'})
        sarea = area.find('span').get_text()
        r_area = '暂无'  # "not available"
        if sarea != '':
            r_area = house.find('div', attrs={'class': 'resblock-area'}).get_text().split()[1]
        tag = house.find('div', attrs={'class': 'resblock-tag'}).get_text()
        jiage = house.find('div', attrs={'class': 'resblock-price'})
        price = jiage.find('div', attrs={'class': 'main-price'}).get_text().split()[0]  # keep the number only
        if price.replace('\n', '').find('-') != -1:  # "a-b" range: keep the upper bound
            price = price.split('-')[1]
        total = jiage.find('div', attrs={'class': 'second'})
        totalprice = '暂无'
        if total is not None:
            totalprice = total.get_text()
        h = {'title': loupanname, 'wuye': wuye, 'states': xiaoshouzhuangtai,
             'location': location.replace('\n', ''),
             'jishi': jishi.replace('\n', ''), 'area': r_area.replace('\n', ''),
             'tag': tag.replace('\n', ''), 'price': price.replace('\n', ''),
             'totalprice': totalprice,
             'loupanimg': loupanimg[i].get('data-original')}  # lazy-loaded image URL
        i = i + 1
        hlist.append(h)


# Download images to disk
def downloadPic(hlist):
    for i in range(len(hlist)):
        imgUrl = requests.get(hlist[i]['loupanimg']).content
        # NOTE: adjust this path to your own machine
        with open('C:\\Users\\Lenovo\\Desktop\\study\\py\\catchLJ\\imgFile\\' + hlist[i]['title'] + '.jpg', 'wb') as f:
            f.write(imgUrl)
        print(hlist[i]['title'], 'downloading image')
    print('Image download complete')


# Visualization
def printPic(list):
    plt.rcParams['font.family'] = 'kaiti'  # a font with CJK glyphs, for the Chinese title
    priceData = []
    lowNum = 0
    midNum = 0
    heighNum = 0
    # Prepare the data
    for i in range(len(list)):
        priceData.append(int(list[i]['price']))
    for num in priceData:
        if num <= 20000:
            lowNum = lowNum + 1
        elif num <= 45000:
            midNum = midNum + 1
        else:
            heighNum = heighNum + 1
    p_low = lowNum / len(priceData)
    p_mid = midNum / len(priceData)
    p_height = heighNum / len(priceData)
    nums = [p_low, p_mid, p_height]
    labels = ['0-20000', '20001-45000', '>45000']
    # Draw the pie chart with Matplotlib
    plt.pie(x=nums, labels=labels, autopct='%.1f%%', shadow=True)
    plt.title('楼价区间比例(每平)', size=20)  # "share of price bands (per m²)"
    plt.show()


if __name__ == '__main__':
    # user_in_city = input('City to scrape: ')
    user_in_city = 'gz'
    url = generate_cityurl(user_in_city)
    print(url)
    areahtml = areainfo(url)
    listinfo(areahtml)
    downloadPic(hlist)
    printPic(hlist)
    print(hlist)

# Connect to the database
conn = pymysql.connect(
    host='localhost',
    user='root',
    password='123456',
    database='catchdata',
    charset='utf8',
    # autocommit=True,  # auto-commit each insert; equivalent to calling conn.commit()
)
# Create a cursor object with cursor()
cursor = conn.cursor()
for i in range(len(hlist)):
    try:
        insert_sql = "insert into gzdata(title, wuye, states, location, jishi, area, tag, price, totalprice, imgurl) values ("
        for key in hlist[i]:
            insert_sql = insert_sql + "'" + hlist[i][key] + "',"
        insert_sql = insert_sql[:-1] + ")"
        print(insert_sql)
        cursor.execute(insert_sql)
        conn.commit()
    except Exception:
        pass  # skip rows that fail to insert
print('Data insertion complete')
# Close the database connection
conn.close()
```