A Python Crawler Based on bs4 + MongoDB

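This is a small lab assignment from this semester. After some self-study I wrote a simple version on my own. I didn't run into any real difficulties while writing it, only a few small puzzles, and I hope readers can offer some suggestions here.
Question 1: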
from fake_useragent import UserAgent
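I originally wanted to use the UserAgent library because I was worried about anti-crawling measures and wanted the User-Agent to change randomly. pip could download the UserAgent package, but I could not get the fake_useragent module inside it to download, so the import never worked. In the end I gave up and simply went without it.
Thinking about it now, though, I still can't figure out why.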
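One way to keep the script working whether or not the import succeeds is to fall back to a fixed pool of User-Agent strings. The following is only a sketch, not a diagnosis of the failure: `pick_user_agent`, `build_headers`, and `FALLBACK_USER_AGENTS` are hypothetical names, and the second string in the pool is an illustrative value that is not from the original script. For reference, the package is published on PyPI as `fake-useragent` (with a hyphen), while the module is imported as `fake_useragent`.

```python
import random

# Fallback pool. The first entry is the User-Agent used in this post's script;
# the second is an illustrative assumption.
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:107.0) "
    "Gecko/20100101 Firefox/107.0",
]

try:
    # Installed with: pip install fake-useragent
    from fake_useragent import UserAgent
except ImportError:
    UserAgent = None  # the library is not available in this interpreter


def pick_user_agent():
    """Return a random User-Agent string, using fake_useragent when possible."""
    if UserAgent is not None:
        try:
            return UserAgent().random
        except Exception:
            # The library is installed but could not produce a value.
            pass
    return random.choice(FALLBACK_USER_AGENTS)


def build_headers():
    """Build request headers with a freshly chosen User-Agent."""
    return {
        "User-Agent": pick_user_agent(),
        "Referer": "https://cd.lianjia.com/",
    }
```

Calling `build_headers()` before each request gives a different User-Agent per request instead of one value fixed at start-up.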
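Let me first describe my lab environment.
Lab environment:
Ubuntu + MongoDB + Python crawler with BeautifulSoup
The lab requires the database to be set up on Ubuntu, so storing the data means connecting to the server. I configured the SSH connection in PyCharm and used the Python interpreter that ships with Ubuntu (which I had upgraded), so my guess is that the lab environment may be the cause.
On Windows, the Python interpreter can import the UserAgent package without any problem.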
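Since the import works under one interpreter and not the other, a quick check is to ask the interpreter itself where it lives and whether it can see the module. This is a minimal sketch; `describe_module` is a hypothetical helper name. A common cause of this symptom is that `pip` installed the package for a different Python than the one PyCharm runs over SSH, but that is only a guess about this setup.

```python
import importlib.util
import sys


def describe_module(name):
    """Report which interpreter is running and whether it can find `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,             # path of the running Python
        "module": name,
        "found": spec is not None,                 # True if the import would resolve
        "location": getattr(spec, "origin", None)  # file the module loads from
    }


if __name__ == "__main__":
    print(describe_module("fake_useragent"))
```

If `found` is false, installing with `python3 -m pip install fake-useragent`, using the same interpreter path that is printed, ties the install to that exact interpreter.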
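Summary:
As for the code, it looks cumbersome, and I honestly feel it is cumbersome; after all, I am self-taught.
I don't know whether code written this way is of much use, but it does reflect my coding style at this early stage.
Writing and optimizing the code took a little under half a day. I learned a lot from it, but I am still a beginner. If you find anything worthwhile here, don't forget to like, favorite, and share. Many thanks!
I can't think of anything else to write, so I'll stop here for now.
Data screenshot: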

# -*- coding: utf-8 -*-
# @Time 2022/11/30 11:22
# @Author : 小肖蚊子
# @File : bs4_ershoufang.py
# @Software : PyCharm
"""
Lab requirement: to analyze second-hand housing listings, collect for each listing
the district (area), community name (title), layout (type), floor area (square),
sub-district (detail_area), specific location (detail_place), and price (price).
Storage format:
{"_id" : ObjectId("5c1aebe52ca2902f3cd00c4a"),"area" : "xx","title" : "xxx","type" : "xxx","square" : "int","detail_area" : "xx","detail_place" : "xxx","price" : "int"
}
"""
import requests
import time
from bs4 import BeautifulSoup
import random
import pymongo

# Connect to the database (default host and port) and clear old data
client = pymongo.MongoClient()
db = client.lianjia
collection = db.ershoufang
collection.delete_many({})

# URLs
url = "https://cd.lianjia.com/ershoufang/jinjiang/pg{}/"
base_url = "https://cd.lianjia.com/ershoufang"

# Build the request headers
# useragent = UserAgent()
headers = {
    # The header name is "User-Agent"; the value must not repeat the name
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
    "Accept": "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8",
    "Referer": "https://cd.lianjia.com/",
}

# Proxy IPs
proxies_list = [
    'https://190.216.234.242:8080',
    'https://103.51.45.9:4145',
]
# Note: requests picks a proxy by the scheme of the target URL. With only an
# "http" key, the https:// requests below do not go through the proxy.
proxies = {"http": random.choice(proxies_list)}


# Send a request
def load_page(url):
    """Return the page HTML, or None if the request fails or is not 200."""
    try:
        resp = requests.get(url, headers=headers, proxies=proxies, timeout=10)
        html = resp.text
        # print(html)
        if resp.status_code == 200:
            print("Page responded successfully! " + url)
            return html
    except requests.RequestException:
        print("Network request failed, or all 100 pages have already been requested!")
    return None


def page(html):
    main_page = BeautifulSoup(html, "html.parser")
    # print(main_page)
    # Get the block of HTML that holds the district links
    page_area_url = main_page.find("div", class_="position").find("div").find_all("a")
    # print(page_area_url)
    # Loop over the link of every district in Chengdu
    for i in page_area_url:
        area_url = i.get("href").split("/")
        area_url_base = "/" + area_url[2] + "/"
        # Build the district URL
        area_url_all = base_url + area_url_base
        print(area_url_all)
        # Request the district page
        resp_area = requests.get(area_url_all, headers=headers, proxies=proxies, timeout=10)
        second_page = BeautifulSoup(resp_area.text, "html.parser")
        # Listing block: the <a href> tag under <div class="title">.
        # find() returns only the first match, so one listing per district is fetched.
        page_content = second_page.find("div", class_="title").find("a")
        # Get the link to the listing's detail page
        page_content_url = page_content.get("href")
        print(page_content_url)
        resp_house_resource = requests.get(page_content_url, headers=headers, proxies=proxies, timeout=10)
        content = BeautifulSoup(resp_house_resource.text, "html.parser")
        # <div class="areaName"><span class="info">
        area_div_span = content.find("div", class_="areaName").find("span", class_="info")
        # List holding the specific location
        detail_place = []
        detail_place_value = area_div_span.text.split("\xa0")
        if len(detail_place_value[2]) == 0:
            detail_place_value_a = content.find("div", class_="areaName").find("a", class_="supplement").text
            # print("detail_place_value_a", detail_place_value_a)
            if len(detail_place_value_a) == 0:
                detail_place.append("specific location unknown")
            else:
                detail_place.append(detail_place_value_a)
        else:
            detail_place.append(detail_place_value[2])
        # List holding the district
        area = [detail_place_value[0]]
        # List holding the sub-district
        detail_area = [detail_place_value[1]]
        # List holding the community name
        title = []
        title_div_h1 = content.find("div", class_="title").find("h1", class_="main")
        title_value = title_div_h1.text.split(" ")[0]
        if len(title_value) == 0:
            title.append("unknown community name")
        else:
            title.append(title_value)
        # List holding the floor area
        square = []
        square_div_span = content.find("div", class_="introContent").find("div", class_="content").find_all("li")
        # "建筑面积" is the on-page label for floor area; keep the text after it
        square_value = square_div_span[2].text.split("建筑面积")[-1]
        if len(square_value) == 0:
            square.append("area unknown")
            print("area unknown")
        else:
            square.append(square_value)
        # List holding the price
        price = []
        price_div_span = content.find("div", class_="price").find("span", class_="unitPriceValue")
        price_value = price_div_span.text
        if len(price_value) == 0:
            price.append("price to be determined")
            print("price to be determined")
        else:
            price.append(price_value)
        # List holding the layout (named house_type to avoid shadowing the built-in type)
        house_type = []
        type_div_span = content.find("div", class_="transaction").find("div", class_="content").find_all("span")
        type_value = type_div_span[7].text
        if len(type_value) == 0:
            house_type.append("unknown layout")
        else:
            house_type.append(type_value)
        # print(house_type)
        # Lists can be concatenated with +; Python is a strongly typed language
        data = area + title + house_type + square + detail_area + detail_place + price
        print(data)
        insert_mongo(data)
        time.sleep(1)


def insert_mongo(data):
    """
    :param data: list of seven values in the order used below
    :return: None
    """
    collection.insert_one({
        'area': data[0],
        'title': data[1],
        'type': data[2],
        'square': data[3],
        'detail_area': data[4],
        'detail_place': data[5],
        'price': data[6],
    })


# Download data from multiple pages
if __name__ == '__main__':
    for h in range(1, 100):
        html = load_page(url.format(h))
        # load_page returns None on failure; skip instead of crashing in page()
        if html is None:
            continue
        page(html)
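The storage format in the script's docstring lists `square` and `price` as "int", while the scraped values are strings. Below is a minimal sketch of one way to normalize them before inserting. `FIELDS`, `to_number`, and `build_document` are hypothetical names, and the sample inputs (such as "89.5㎡") are assumptions about what the page text looks like, not values taken from the site.

```python
import re

# Field order matches the list that page() builds in the script above.
FIELDS = ("area", "title", "type", "square", "detail_area", "detail_place", "price")


def to_number(text):
    """Return the first number in `text` as int or float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        return None
    value = float(match.group())
    return int(value) if value.is_integer() else value


def build_document(data):
    """Pair the seven scraped values with field names and make numbers numeric."""
    if len(data) != len(FIELDS):
        raise ValueError("expected %d values, got %d" % (len(FIELDS), len(data)))
    doc = dict(zip(FIELDS, data))
    doc["square"] = to_number(doc["square"])
    doc["price"] = to_number(doc["price"])
    return doc
```

The resulting dictionary could be passed to `collection.insert_one(...)` in place of the hand-built one. Values with no digits become `None`, which keeps the field numeric-or-missing instead of mixing numbers with placeholder text.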

Don't forget to like, favorite, and share. Many thanks!
Corrections for anything I got wrong are welcome too!
