This post pieces together a simple website based on the book 《Django企业开发实战》 and assorted online material, roughly wiring the front end, back end and database up to the crawler code. It has to be said that Django and MongoDB are a genuinely poor match, and I strongly recommend against this combination.

Getting started: the conda environment

The work was done on Ubuntu 18.04 LTS inside an Anaconda virtual environment with Python 3.6.10. The main package versions are django (1.11.4), mongoengine (0.20.0) and pymongo (3.10.1), plus requests (2.24.0) and beautifulsoup4 (4.9.1) for the crawler. The crawler itself was covered in the previous post and is reused here with minor changes.
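If you want to double-check that the active conda environment actually provides these versions, a throwaway snippet like the one below prints them (this is just my own sanity check, not part of the project code):

import django, mongoengine, pymongo, requests, bs4

print("django", django.get_version())
print("mongoengine", mongoengine.__version__)
print("pymongo", pymongo.version)
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)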

Part 1: Making Django work with MongoDB

Creating the Django project itself is skipped here; Chapter 3 of 《Django企业开发实战》 covers it directly, and the details may be filled in later.

The final file structure of the finished project is shown below (there is one more search_project folder at the outermost level, the forbidden triple nesting):

With MongoDB installed locally and the Django project created, the first step is to comment out the default sqlite backend in settings.py (which works perfectly well out of the box) and add the MongoDB configuration instead, which needs third-party packages and has very little documentation. The main references were these posts: Django 数据库操作(MongoDB+Django), Django+MongoDB, Scrapy爬取数据存储到Mongodb数据库, python+django从mongo读取数据和图片展示在html, and Django 框架之 使用MangoDB数据库.

My settings.py after the changes looks like this:

"""
Django settings for search_project project.Generated by 'django-admin startproject' using Django 1.11.4.For more information on this file, see
https://docs.djangoproject.com/en/1.11/topics/settings/For the full list of settings and their values, see
https://docs.djangoproject.com/en/1.11/ref/settings/
"""import os
import mongoengine# Build paths inside the project like this: os.path.join(BASE_DIR, ...)
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/1.11/howto/deployment/checklist/# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = '**^!7vv0ylwqme#^a6zs*$x)zzwgff-(u_6va%q@u-_g5wz8q#'#项目创建生成的key,每个都不一样# SECURITY WARNING: don't run with debug turned on in production!
DEBUG = FalseALLOWED_HOSTS = ['*']# Application definition
#show是我的app名,而mongoengine也作为第三方组件需要被当做app
INSTALLED_APPS = ['show','mongoengine','django.contrib.admin','django.contrib.auth','rest_framework.authtoken','django.contrib.contenttypes','django.contrib.sessions','django.contrib.messages','django.contrib.staticfiles',
]
#mongodb数据库的名字,地址,以及??
MONGODB_DATABASES = {"default": {"name": "search_record","host": '127.0.0.1','tz_aware': True,}
}MIDDLEWARE = ['django.middleware.security.SecurityMiddleware','django.contrib.sessions.middleware.SessionMiddleware','django.middleware.common.CommonMiddleware','django.middleware.csrf.CsrfViewMiddleware','django.contrib.auth.middleware.AuthenticationMiddleware','django.contrib.messages.middleware.MessageMiddleware','django.middleware.clickjacking.XFrameOptionsMiddleware',
]ROOT_URLCONF = 'search_project.urls'TEMPLATES = [{'BACKEND': 'django.template.backends.django.DjangoTemplates','DIRS': [],'APP_DIRS': True,'OPTIONS': {'context_processors': ['django.template.context_processors.debug','django.template.context_processors.request','django.contrib.auth.context_processors.auth','django.contrib.messages.context_processors.messages',],},},
]WSGI_APPLICATION = 'search_project.wsgi.application'# Database
# https://docs.djangoproject.com/en/1.11/ref/settings/#databases
#注释掉sqlite部分
DATABASES = {'default': {'ENGINE': 'django.db.backends.dummy',#'ENGINE': 'django.db.backends.sqlite3',#'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),}
}
#连接数据库
mongoengine.connect('search_record', host='127.0.0.1', port=27017)# Password validation
# https://docs.djangoproject.com/en/1.11/ref/settings/#auth-password-validatorsAUTH_PASSWORD_VALIDATORS = [{'NAME': 'django.contrib.auth.password_validation.UserAttributeSimilarityValidator',},{'NAME': 'django.contrib.auth.password_validation.MinimumLengthValidator',},{'NAME': 'django.contrib.auth.password_validation.CommonPasswordValidator',},{'NAME': 'django.contrib.auth.password_validation.NumericPasswordValidator',},
]# Internationalization
# https://docs.djangoproject.com/en/1.11/topics/i18n/LANGUAGE_CODE = 'zh-hans'TIME_ZONE = 'Asia/Shanghai'USE_I18N = TrueUSE_L10N = TrueUSE_TZ = True# Static files (CSS, JavaScript, Images)
# https://docs.djangoproject.com/en/1.11/howto/static-files/STATIC_URL = '/static/'
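Before moving on, it is worth confirming that the mongoengine.connect() call at the bottom of settings.py can actually reach the local MongoDB instance. A minimal standalone check, assuming mongod is listening on 127.0.0.1:27017 as configured above, looks like this (my own snippet, not part of the project):

import mongoengine

client = mongoengine.connect('search_record', host='127.0.0.1', port=27017)  # returns a pymongo MongoClient
print(client.server_info()['version'])                   # MongoDB server version
print(client['search_record'].list_collection_names())   # collections currently in the target database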

Part 2: Creating the models

The requirements are simple and there is no admin or user system (Django plus MongoDB is very unfriendly to beginners in that respect), so only two models are needed, one for each crawled site. In the show app's models.py, create two classes matching the Jiangsu and nationwide crawler data formats.

from django.db import models
import datetime
from mongoengine import *

# Create your models here.
"""
To keep date/time storage simple, declare these attributes as DateTimeField;
datetime objects can then be stored directly without any conversion.
In meta, 'ordering' is the sort key, and the leading "-" sorts in descending order.
Make sure the attribute names here match the field names actually stored in the
database, otherwise the pages will fail with a 500 error.
"""

class JiangSu_Record(Document):
    title = StringField(max_length=2048, required=True)
    url = StringField(max_length=2048, required=True)
    start_time = DateTimeField(required=True)
    deadline = DateTimeField(required=True)
    issue_time = DateTimeField(required=True)
    project_type = StringField(max_length=32, required=True)
    # pass the callable (no parentheses) so the default is evaluated at save time
    created_time = DateTimeField(default=datetime.datetime.now, required=True)
    meta = {'collection': 'JiangSu', 'ordering': ['-created_time'], }
# A Jiangsu record holds title, url, start time, deadline, issue time, project type
# and created time, and is stored in the "JiangSu" collection.

class Global_Record(Document):
    title = StringField(max_length=2048, required=True)
    url = StringField(max_length=2048, required=True)
    area_city = StringField(max_length=64, required=True)
    issue_date = DateTimeField(required=True)
    # pass the callable (no parentheses) so the default is evaluated at save time
    created_time = DateTimeField(default=datetime.datetime.now, required=True)
    meta = {'collection': 'Global', 'ordering': ['-created_time'], }
# A nationwide record holds title, url, area, issue date and created time,
# and is stored in the "Global" collection.

Because Django's admin does not work on top of MongoDB, admin.py is left alone; for the same reason, creating migration files, creating tables and creating a superuser are all skipped.
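There is also no makemigrations/migrate step: mongoengine simply creates the collection the first time a document is saved. A rough illustration with made-up values, run from python manage.py shell so that settings.py has already called mongoengine.connect():

import datetime
from show.models import JiangSu_Record

record = JiangSu_Record(
    title="某工程招标公告",                       # hypothetical example values
    url="http://example.com/notice/1",
    start_time=datetime.datetime(2020, 8, 1, 9, 0),
    deadline=datetime.datetime(2020, 8, 20, 17, 0),
    issue_time=datetime.datetime(2020, 7, 30, 10, 0),
    project_type="施工",
)
record.save()  # the "JiangSu" collection is created on this first save if it does not exist yet

# the query syntax mirrors Django's ORM
print(JiangSu_Record.objects.filter(title__contains="招标").count())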

Part 3: Views, templates and URL configuration

In the app's views.py, create the view functions for the three pages: the home page, the Jiangsu search page and the nationwide search page. The QuerySet-style interface used for the MongoDB queries is provided by mongoengine rather than by Django itself, but it closely mirrors Django's ORM, so the QuerySet API reference and various blog posts are still useful.

from django.shortcuts import render
import datetime
from .models import JiangSu_Record, Global_Record
import os, django
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "search_project.settings")  # "search_project" is the project name
django.setup()
from .crawler_jiangsu import is_date, visit_jiangsu_home
from .crawler_global_user import update_or_not

# the pages are rendered directly with render()
def index(request):
    return render(request, 'index.html')

def jiangsu_result(request):
    visit_jiangsu_home()  # crawl first on every visit; a POST additionally reads the form data and filters the results
    if request.method == "POST":
        jiangsu_key_word = request.POST.get('jiangsu_key_word', None)
        jiangsu_date = request.POST.get('jiangsu_date', None)
        jiangsu_date = is_date(jiangsu_date)
        target_results = JiangSu_Record.objects.all()
        if jiangsu_key_word is not None:
            # match several keywords separated by spaces
            keys = jiangsu_key_word.split(" ")
            for key in keys:
                target_results = target_results.filter(title__contains=key)
        else:
            pass
        if jiangsu_date is not None:
            target_results = target_results.filter(deadline__gte=jiangsu_date)
        else:
            pass
        context = {'results': target_results}
        return render(request, 'jiangsu_result.html', context=context)
    else:
        jiangsu_results = JiangSu_Record.objects.all()
        context = {'results': jiangsu_results}
        return render(request, 'jiangsu_result.html', context=context)

def global_result(request):
    update_or_not()
    if request.method == "POST":
        global_key_word = request.POST.get('global_key_word', None)
        global_date = request.POST.get('global_date', None)
        global_date = is_date(global_date)
        target_results = Global_Record.objects.all()
        if global_key_word is not None:
            keys = global_key_word.split(" ")
            for key in keys:
                target_results = target_results.filter(title__contains=key)
        else:
            pass
        if global_date is not None:
            target_results = target_results.filter(issue_date__gte=global_date)
        else:
            pass
        target_results = target_results[:100]  # show only the first 100 rows (the 100 most recently created)
        context = {'results': target_results}
        return render(request, 'global_result.html', context=context)
    else:
        global_results = Global_Record.objects.all()[:100]
        context = {'results': global_results}
        return render(request, 'global_result.html', context=context)
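The multi-keyword search in both views works by chaining filter() calls on the mongoengine QuerySet; each chained call narrows the previous result, so the keywords are combined with a logical AND. Conceptually (illustrative snippet only, with a made-up search string):

qs = JiangSu_Record.objects.all()
for key in "江苏 高速 公路".split(" "):      # hypothetical search input
    qs = qs.filter(title__contains=key)      # every chained filter() adds another AND condition
print(qs.count())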

With views.py written, create a templates folder inside the app directory (note the trailing "s") and put the page templates in it, so the view functions can load them directly when rendering.

Three pages are needed. First the home page, index.html:

<!DOCTYPE html>
<html>
<head>
    <title>首页</title>
</head>
<body>
    <hr/>
    <a target="_blank" href="/jiangsu_result/">江苏搜索</a>
    <hr/>
    <a target="_blank" href="/global_result/">全国搜索</a>
    <hr/>
</body>
</html>

This page only provides simple navigation links.

The Jiangsu search results page, jiangsu_result.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>江苏-搜索结果</title>
</head>
<body>
    <hr/>
    <form action="/jiangsu_result/" method="post" align="center">
        {% csrf_token %}
        关键字:<input type="text" name="jiangsu_key_word" />
        公告截止日期:<input type="text" name="jiangsu_date" />
        <input type="submit" value="江苏搜索" />
    </form>
    <hr/>
    <table border="1" align="center">
        <tbody>
        <tr>
            <th>公告名称</th>
            <th>公告截止日期</th>
            <th>公告发布日期</th>
            <th>项目类型</th>
            <th>公告链接</th>
        </tr>
        {% for item in results %}
        <tr>
            <td width="1000" align="left">{{ item.title }}</td>
            <td width="250" align="center">{{ item.deadline }}</td>
            <td width="250" align="center">{{ item.issue_time }}</td>
            <td width="200" align="center">{{ item.project_type }}</td>
            <td width="100" align="center"><a href="{{ item.url }}">链接</a></td>
        </tr>
        {% endfor %}
        </tbody>
    </table>
</body>
</html>

The results variable here is exactly what was passed through the context argument of render() in views.py.

The nationwide search results page, global_result.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>全国-搜索结果</title>
</head>
<body>
    <hr/>
    <form action="/global_result/" method="post" align="center">
        {% csrf_token %}
        关键字:<input type="text" name="global_key_word" />
        起始公布日期:<input type="text" name="global_date" />
        <input type="submit" value="全国搜索" />
    </form>
    <hr/>
    <table border="1" align="center">
        <tbody>
        <tr>
            <th>公告名称</th>
            <th>地区</th>
            <th>公告发布日期</th>
            <th>公告链接</th>
        </tr>
        {% for item in results %}
        <tr>
            <td width="1000" align="left">{{ item.title }}</td>
            <td width="150" align="center">{{ item.area_city }}</td>
            <td width="250" align="center">{{ item.issue_date }}</td>
            <td width="100" align="center"><a href="{{ item.url }}">链接</a></td>
        </tr>
        {% endfor %}
        </tbody>
    </table>
</body>
</html>

Finally, configure the URLs by editing urls.py, the file that sits next to settings.py:

from django.conf.urls import url
from django.contrib import admin
from django.contrib.staticfiles.urls import staticfiles_urlpatterns
from show import views
# three URL mappings, one per page
urlpatterns = [
    url(r'^$', views.index, name='index'),
    url(r'^jiangsu_result', views.jiangsu_result, name='jiangsu_result'),
    url(r'^global_result', views.global_result, name='global_result'),
    url(r'^admin/', admin.site.urls),
]
urlpatterns += staticfiles_urlpatterns()  # adds routes so the development server can serve static files (only has an effect when DEBUG=True)

Part 4: Wrapping up

Back in the project directory, run python manage.py runserver on the command line to start the project; the home page is then reachable at http://127.0.0.1:8000.

Other notes:

The crawler from the previous post was modified a little, adding the corresponding database operations and a background auto-update of the database; the code follows.

The Jiangsu site produces few updates, so it is enough to refresh it right before each query instead of updating continuously in the background:

# -*- coding: utf-8 -*-
import requests, json
from bs4 import BeautifulSoup
import time
import datetime
import sys
from pymongo import MongoClient, InsertOne
import re

def visit_jiangsu_home(x=10):
    # crawl the first 10 pages by default
    # request headers
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
        'Connection': 'keep-alive',
        'Content-Length': '14',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'JSESSIONID=6938F495DAA5F25B2E458C7AB108BEDF',
        'Host': '218.2.208.144:8094',
        'Origin': 'http://218.2.208.144:8094',
        'Referer': 'http://218.2.208.144:8094/EBTS/publish/announcement/paglist1?type=1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest'
    }
    # form data
    form_data = {'page': '1', 'rows': '10'}
    # empty dict to hold the info (unused)
    # home_dict = {}
    # start from page 1
    times = 1
    repeat_num = 0
    while times <= x:
        form_data['page'] = times
        # note: the url is the request link shown under Name in the browser devtools, not the page address
        url = 'http://218.2.208.144:8094/EBTS/publish/announcement/getList?placard_type=1'
        response = requests.post(url, data=form_data, headers=headers, stream=True)
        # parse the response, then turn it into a string for brute-force slicing
        soup = BeautifulSoup(response.content, "html5lib")
        soup = str(soup)
        for i in range(1, 11):
            # each page has 10 rows, so splitting on the key gives 11 segments; the useful ones are 2 to 11
            id = soup.split('placard_id":"')[i].split('","project_id')[0]  # announcement id
            mark = soup.split('is_new_tuisong":"')[i].split('","remark_id')[0]  # whether this is a correction notice
            project_type = soup.split('tender_project_type":"')[i].split('","publish_remark_by_zzw')[0]  # tender type
            station = visit_single_website(id, mark, project_type)  # visit the single notice and insert it into the database
            if station == False:
                repeat_num += 1
                if repeat_num > 10:
                    print("重复过多")
                    return
        times = times + 1
        time.sleep(0.5)  # avoid hammering the site
    # return home_dict
    return

def visit_single_website(id, mark, project_type):
    data_address_first1 = 'http://218.2.208.148:9092/api/BiddingNotice/GetByKeys?BNID='  # json prefix for normal notices
    data_address_first2 = 'http://218.2.208.148:9092/api/BiddingNoticeAdditional/GetByKeys?TNAID='  # json prefix for correction notices
    page_address_first1 = 'http://218.2.208.148:9092/OP/Bidding/BiddingNoticeAdditionalView.aspx?bnid='  # page prefix for normal notices
    page_address_first2 = 'http://218.2.208.148:9092/OP/Bidding/BiddingNoticeAdditionalView2.aspx?action=view&TNAID='  # page prefix for correction notices
    if int(mark) == 2:  # correction notice
        page_url = page_address_first2 + id
        address_url = data_address_first2 + id
        try:
            return_data = requests.get(address_url, stream=True)  # fetch the whole json payload
            all_content = json.loads(return_data.content)
            try:
                title = str(all_content.get('TNANAME'))  # title of a correction notice; normal notices have no TNANAME field
            except:
                title = "无公告内容"
            try:
                start_date = str(all_content.get('APPLYBEGINTIME')).replace("T", " ")  # start time
                start_date = datetime.datetime.strptime(start_date, '%Y-%m-%d %H:%M:%S')
            except:
                start_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            try:
                deadline = str(all_content.get('APPLYENDTIME')).replace("T", " ")  # deadline
                deadline = datetime.datetime.strptime(deadline, '%Y-%m-%d %H:%M:%S')
            except:
                deadline = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            try:
                issue_date = str(all_content.get('CREATETIME')).replace("T", " ")  # publication time
                issue_date = datetime.datetime.strptime(issue_date, '%Y-%m-%d %H:%M:%S')
            except:
                issue_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
        except:
            title = "无公告内容"
            start_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            deadline = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            issue_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
    else:
        page_url = page_address_first1 + id
        address_url = data_address_first1 + id
        try:
            return_data = requests.get(address_url, stream=True)
            all_content = json.loads(return_data.content)
            try:
                title = str(all_content.get('BNNAME'))  # title of a normal notice
            except:
                title = "无公告内容"
            try:
                start_date = str(all_content.get('APPLYBEGINTIME')).replace("T", " ")  # start time
                start_date = datetime.datetime.strptime(start_date, '%Y-%m-%d %H:%M:%S')
            except:
                start_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            try:
                deadline = str(all_content.get('APPLYENDTIME')).replace("T", " ")  # deadline
                deadline = datetime.datetime.strptime(deadline, '%Y-%m-%d %H:%M:%S')
            except:
                deadline = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            try:
                issue_date = str(all_content.get('CREATETIME')).replace("T", " ")  # publication time
                issue_date = datetime.datetime.strptime(issue_date, '%Y-%m-%d %H:%M:%S')
            except:
                issue_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
        except:
            title = "无公告内容"
            start_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            deadline = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
            issue_date = datetime.datetime.strptime("1993-01-03", '%Y-%m-%d')
    state = insert_single_data(title, page_url, start_date, deadline, issue_date, project_type)  # True/False for the duplicate counter
    return state

def insert_single_data(title, page_url, start_date, deadline, issue_date, project_type):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    try:
        my_db['JiangSu'].insert_one({"title": title,
                                     "url": page_url,
                                     "start_time": start_date,
                                     "deadline": deadline,
                                     "issue_time": issue_date,
                                     "project_type": project_type,
                                     "created_time": datetime.datetime.now()})
        conn.close()
        print(title, page_url, start_date, deadline, issue_date, project_type)
        print("插入成功")
        return True
    except:
        print("重复,插入失败")
        conn.close()
        return False

def search_jiangsu_record(key_word=None, deadline=None):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    my_col = my_db['JiangSu']
    search_result = []
    if deadline == None:
        deadline = datetime.datetime.now()
    if key_word != None:
        key_word = make_key_word(key_word)
        for x in my_col.find({"title": re.compile(key_word), "deadline": {"$gte": deadline}}):
            search_result.append(x)
        conn.close()
    else:
        for x in my_col.find({"deadline": {"$gte": deadline}}):
            search_result.append(x)
        conn.close()
    if len(search_result) == 0:
        search_result = "No Result"
    return search_result

def make_key_word(str_input):
    str_list = str_input.split(" ")
    print(str_list)
    str_more = ""
    for i in str_list:
        str_more = str_more + "(?=.*?" + i + ")"
    str_more = "^" + str_more + ".+"
    print(str_more)
    return str_more

def crawler_and_search(jiangsu_key_word, jiangsu_date):
    visit_jiangsu_home(x=10)
    str_more = make_key_word(jiangsu_key_word)
    deadline = is_date(jiangsu_date)
    return search_jiangsu_record(key_word=str_more, deadline=deadline)

def is_date(date):
    try:
        deadline = datetime.datetime.strptime(date, '%Y-%m-%d')
    except:
        deadline = None
    return deadline

if __name__ == "__main__":
    jiangsu_key_word = input("关键字:")
    jiangsu_date = input("日期:")
    print(crawler_and_search(jiangsu_key_word, jiangsu_date))

The nationwide data changes quickly, so a background job crawls it every fifteen minutes and updates the database:

# -*- coding: utf-8 -*-
import requests
import os
from bs4 import BeautifulSoup
import time
import datetime
import sys,json
from pymongo import MongoClient, InsertOne
import re

# create the marker file
def create_mark_file(mark_file_path):
    mark_file = open(mark_file_path, "w")
    mark_file.close()

# delete the marker file
def delete_mark_file(mark_file_path):
    if os.path.exists(mark_file_path):
        os.remove(mark_file_path)
    else:
        pass

def visit_global_home(key_word=None, max_repeat=200):
    # request headers
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Content-Length': '181',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'JSESSIONID=585E887A70EA60F7982104CE88AE9B7E; UM_distinctid=1735f8daa6a299-0f79f479f2fe33-b7a1334-144000-1735f8daa6b447; CNZZdate2114438=cnzz_eid%3D1721879153-1595032889-http%253A%252F%252Fwww.infobidding.com%252F%26ntime%3D1595032889',
        'Host': 'www.infobidding.com',
        'Origin': 'http://www.infobidding.com',
        'Referer': 'http://www.infobidding.com/listAction.do?count=25&type=2&freeflg=&tradeid=',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    }
    # form data
    form_data = {
        'cdiqu': '-1',
        'pdiqu': '-1',
        'term': None,
        'areaid': '-1',
        'townid': '-1',
        'pageNum': '1',
        '#request.state': '1',
        '#request.tname': None,
        '#request.type': '2',
        '#request.flg': None,
        '#request.tradeid': None,
    }
    form_data['term'] = key_word
    url = 'http://www.infobidding.com/listAction.do?count=25&type=2&freeflg=&tradeid='
    response = requests.post(url, data=form_data, headers=headers)
    # parse the response
    soup = BeautifulSoup(response.content, "html5lib")
    if soup.find_all(colspan="5") != []:
        # notice_list.append("没有查到相关数据!")
        # nothing found: return without touching the database
        print("无内容")
        return
    else:
        all_repeat_num = 0  # total duplicate count
        max_repeat_num = max_repeat  # maximum allowed duplicates
        all_repeat_num = all_repeat_num + get_name_date_area(soup)  # insert page 1 and add its duplicate count to the total
        # print("完成第1页")
        try:  # if there is more than one page, read the total page count and continue from page 2
            page_num_is = \
                str(soup.find_all(style="text-align: center;", colspan="4")[0]).split("</font>/")[1].split("页")[0]
            # print("有多页内容")
            time.sleep(1)
            for i in range(2, int(page_num_is) + 1):
                form_data['pageNum'] = str(i)
                response = requests.post(url, data=form_data, headers=headers)
                soup = BeautifulSoup(response.content, "html5lib")
                all_repeat_num = all_repeat_num + get_name_date_area(soup)  # add this page's duplicate count to the total
                # print("完成第" + str(i) + "页")
                if all_repeat_num > max_repeat_num:  # too many duplicates: stop
                    return
                time.sleep(1)
        except:
            print("无多页内容")
            return

def get_name_date_area(soup):  # soup holds the full page content
    repeat_num = 0  # duplicates on this page
    for x in soup.find_all(align="left", height="28px"):
        title = str(x.find_all(width="68%")).split('hand;">')[1].split('</span>')[0].strip()
        url_ads = "http://www.infobidding.com" + str(x.find_all(width="68%")).split('href="')[1].split('" target=')[0].strip()
        area = str(x.find_all(width="11%")).split('color="red">')[1].split('</font>')[0].strip()
        date = str(x.find_all(width="10%")).split('color="red">')[1].split('</font>')[0].strip()
        date = datetime.datetime.strptime(date, "%Y-%m-%d")
        insertion = insert_global_data(title, url_ads, area, date)  # each record is inserted as its own atomic operation
        if insertion == False:
            repeat_num += 1
    return repeat_num

# insert a single record
def insert_global_data(title, url_ads, area, date):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    try:
        my_db['Global'].insert_one({"title": title,
                                    "url": url_ads,
                                    "area_city": area,
                                    "issue_date": date,
                                    "created_time": datetime.datetime.now()})
        conn.close()
        # print("插入成功")
        return True  # inserted
    except:
        conn.close()
        # print("重复,插入失败")
        return False  # duplicate, not inserted

def is_date(notice_list, date=''):
    if notice_list[0] == "没有查到相关数据!":
        lan = "没有查到该日期的数据,请重新输入关键字或日期!\n"
        print(lan)
        return lan
    elif date == '':
        text_content = as_text(notice_list)
        return text_content
    else:
        date_notice = []
        for i in range(len(notice_list)):
            if notice_list[i][3].strip() == date:
                date_notice.append(notice_list[i])
        if len(date_notice) == 0:
            lan = "没有查到该日期的数据,请重新输入关键字或日期!\n"
            print(lan)
            return lan
        text_content = as_text(date_notice)
        return text_content

def as_text(list_text):
    text = ''
    for i in range(len(list_text)):
        text = text + "No." + str(i + 1) + "\n" + "日期:" + str(list_text[i][3]) + "\t" + "地区:" + str(list_text[i][2]) \
               + "\n" + "公告名称:" + str(list_text[i][0]) + "\n" + "链接:" + str(list_text[i][1]) + "\n"
    return text

def search_global_record(key_word=None, issue_date=None):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    my_col = my_db['Global']
    search_result = []
    for x in my_col.find({"title": re.compile(key_word), "issue_date": re.compile(issue_date)}):
        print(x)
        search_result.append(x)
    conn.close()
    return search_result

def make_key_word(str_input):
    str_list = str_input.split(" ")
    print(str_list)
    str_more = ""
    for i in str_list:
        str_more = str_more + "(?=.*?" + i + ")"
    str_more = "^" + str_more + ".+"
    print(str_more)
    return str_more

def is_the_time():
    the_time = str(datetime.datetime.now()).split(" ")[1].split(".")[0].split(":")
    # run the background crawl only on the quarter hours between 06:00 and 22:00
    if (int(the_time[0]) >= 6 and int(the_time[0]) <= 22 and
            (int(the_time[1]) == 0 or int(the_time[1]) == 30 or int(the_time[1]) == 15 or int(the_time[1]) == 45)):
        return True
    else:
        return False

def backend_crawler():
    mark_file_path = os.getcwd() + "/inserting.txt"
    print("启动后端爬虫更新")
    while (True):
        if is_the_time():
            print(datetime.datetime.now(), "到点爬虫开始")
            if os.path.exists(mark_file_path):
                print("数据库正在手动更新,自动更新暂停,休眠5分钟")
                time.sleep(300)  # sleep for 5 minutes
                while (os.path.exists(mark_file_path)):  # still being updated manually after 5 minutes: keep waiting
                    print("数据库正在手动更新,自动更新暂停,再次休眠5分钟")
                    time.sleep(300)  # sleep for 5 minutes
                # once the manual update has finished, start the automatic one
                create_mark_file(mark_file_path)  # create the marker file
                visit_global_home(max_repeat=200)  # crawl and insert
                delete_mark_file(mark_file_path)  # delete the marker file
                print(datetime.datetime.now(), "数据库自动更新完成")
            else:
                # no manual update in progress: start the automatic update immediately
                create_mark_file(mark_file_path)  # create the marker file
                visit_global_home(max_repeat=500)  # crawl and insert
                delete_mark_file(mark_file_path)  # delete the marker file
                # update finished; sleep until the next check
                print(datetime.datetime.now(), "数据库自动更新完成")
        else:
            # print(datetime.datetime.now(), "没到点")
            pass
        # check once a minute whether it is time to run
        time.sleep(60)

if __name__ == "__main__":
    backend_crawler()
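In the article this module is started as a standalone script through its if __name__ == "__main__" block. If you would rather launch it through Django, one possible alternative (my own sketch, not used in the project) is a small custom management command that just calls backend_crawler(); the path show/management/commands/run_global_crawler.py and the module name crawler_global_backend below are assumptions on my part:

# show/management/commands/run_global_crawler.py  (hypothetical wrapper)
from django.core.management.base import BaseCommand

# assumes the background crawler above is saved as show/crawler_global_backend.py
from show.crawler_global_backend import backend_crawler


class Command(BaseCommand):
    help = "Run the background crawler for the nationwide bidding site"

    def handle(self, *args, **options):
        backend_crawler()  # blocks forever; run in its own terminal: python manage.py run_global_crawler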

For manual use, where the data is updated first and then searched, the following module is used:

# -*- coding: utf-8 -*-
import requests
import os
from bs4 import BeautifulSoup
import time
import datetime
from pymongo import MongoClient
import re

def search_first_or_later(max_repeat=100, key_word=None, issue_date=None):
    mark_file_path = os.getcwd() + "/inserting.txt"
    key_word = make_key_word(key_word)
    if os.path.exists(mark_file_path):
        # the marker file exists, so the data has just been updated
        print("刚更新过,直接搜索")
        search_global_record(key_word=key_word, issue_date=issue_date)  # search directly
    else:
        print("先更新再搜索")
        create_mark_file(mark_file_path)
        visit_global_home(max_repeat=max_repeat)  # crawl and insert
        search_global_record(key_word=key_word, issue_date=issue_date)  # then search
        delete_mark_file(mark_file_path)

def update_or_not(max_repeat=100):
    mark_file_path = os.getcwd() + "/inserting.txt"
    if os.path.exists(mark_file_path):
        # the marker file exists, so the data has just been updated
        print("刚更新过")
        pass
    else:
        print("先更新")
        create_mark_file(mark_file_path)
        visit_global_home(max_repeat=max_repeat)  # crawl and insert
        delete_mark_file(mark_file_path)

# create the marker file
def create_mark_file(mark_file_path):
    mark_file = open(mark_file_path, "w")
    mark_file.close()

# delete the marker file
def delete_mark_file(mark_file_path):
    if os.path.exists(mark_file_path):
        os.remove(mark_file_path)
    else:
        pass

def visit_global_home(key_word=None, max_repeat=200):
    # request headers
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh,zh-CN;q=0.9,en-US;q=0.8,en;q=0.7',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Content-Length': '181',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Cookie': 'JSESSIONID=585E887A70EA60F7982104CE88AE9B7E; UM_distinctid=1735f8daa6a299-0f79f479f2fe33-b7a1334-144000-1735f8daa6b447; CNZZdate2114438=cnzz_eid%3D1721879153-1595032889-http%253A%252F%252Fwww.infobidding.com%252F%26ntime%3D1595032889',
        'Host': 'www.infobidding.com',
        'Origin': 'http://www.infobidding.com',
        'Referer': 'http://www.infobidding.com/listAction.do?count=25&type=2&freeflg=&tradeid=',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36',
    }
    # form data
    form_data = {
        'cdiqu': '-1',
        'pdiqu': '-1',
        'term': None,
        'areaid': '-1',
        'townid': '-1',
        'pageNum': '1',
        '#request.state': '1',
        '#request.tname': None,
        '#request.type': '2',
        '#request.flg': None,
        '#request.tradeid': None,
    }
    form_data['term'] = key_word
    url = 'http://www.infobidding.com/listAction.do?count=25&type=2&freeflg=&tradeid='
    response = requests.post(url, data=form_data, headers=headers)
    # parse the response
    soup = BeautifulSoup(response.content, "html5lib")
    if soup.find_all(colspan="5") != []:
        # notice_list.append("没有查到相关数据!")
        # nothing found: return without touching the database
        print("无内容")
        return
    else:
        all_repeat_num = 0  # total duplicate count
        max_repeat_num = max_repeat  # maximum allowed duplicates
        all_repeat_num = all_repeat_num + get_name_date_area(soup)  # insert page 1 and add its duplicate count to the total
        print("完成第1页")
        try:  # if there is more than one page, read the total page count and continue from page 2
            page_num_is = str(soup.find_all(style="text-align: center;", colspan="4")[0]).split("</font>/")[1].split("页")[0]
            print("有多页内容")
            time.sleep(1)
            for i in range(2, int(page_num_is) + 1):
                form_data['pageNum'] = str(i)
                response = requests.post(url, data=form_data, headers=headers)
                soup = BeautifulSoup(response.content, "html5lib")
                all_repeat_num = all_repeat_num + get_name_date_area(soup)  # add this page's duplicate count to the total
                print("完成第" + str(i) + "页")
                if all_repeat_num > max_repeat_num:  # too many duplicates: stop
                    return
                time.sleep(1)
        except:
            print("无多页内容")
            return

def get_name_date_area(soup):  # soup holds the full page content
    repeat_num = 0  # duplicates on this page
    for x in soup.find_all(align="left", height="28px"):
        title = str(x.find_all(width="68%")).split('hand;">')[1].split('</span>')[0].strip()
        url_ads = "http://www.infobidding.com" + str(x.find_all(width="68%")).split('href="')[1].split('" target=')[0].strip()
        area = str(x.find_all(width="11%")).split('color="red">')[1].split('</font>')[0].strip()
        date = str(x.find_all(width="10%")).split('color="red">')[1].split('</font>')[0].strip()
        date = datetime.datetime.strptime(date, "%Y-%m-%d")
        insertion = insert_global_data(title, url_ads, area, date)  # each record is inserted as its own atomic operation
        if insertion == False:
            repeat_num += 1
    return repeat_num

# insert a single record
def insert_global_data(title, url_ads, area, date):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    try:
        my_db['Global'].insert_one({"title": title,
                                    "url": url_ads,
                                    "area_city": area,
                                    "issue_date": date,
                                    "created_time": datetime.datetime.now()})
        conn.close()
        print("插入成功")
        return True  # inserted
    except:
        conn.close()
        print("重复,插入失败")
        return False  # duplicate, not inserted

def is_date(notice_list, date=''):
    if notice_list[0] == "没有查到相关数据!":
        lan = "没有查到该日期的数据,请重新输入关键字或日期!\n"
        print(lan)
        return lan
    elif date == '':
        text_content = as_text(notice_list)
        return text_content
    else:
        date_notice = []
        for i in range(len(notice_list)):
            if notice_list[i][3].strip() == date:
                date_notice.append(notice_list[i])
        if len(date_notice) == 0:
            lan = "没有查到该日期的数据,请重新输入关键字或日期!\n"
            print(lan)
            return lan
        text_content = as_text(date_notice)
        return text_content

def as_text(list_text):
    text = ''
    for i in range(len(list_text)):
        text = text + "No." + str(i + 1) + "\n" + "日期:" + str(list_text[i][3]) + "\t" + "地区:" + str(list_text[i][2]) \
               + "\n" + "公告名称:" + str(list_text[i][0]) + "\n" + "链接:" + str(list_text[i][1]) + "\n"
    return text

def search_global_record(key_word=None, issue_date=None):
    conn = MongoClient("127.0.0.1:27017", maxPoolSize=None)
    my_db = conn['search_record']
    my_col = my_db['Global']
    search_result = []
    if issue_date == None:
        deadline = datetime.datetime.now()
    for x in my_col.find({"title": re.compile(key_word), "issue_date": re.compile(issue_date)}):
        print(x)
        search_result.append(x)
    conn.close()
    return search_result

def make_key_word(str_input):
    str_list = str_input.split(" ")
    print(str_list)
    str_more = ""
    for i in str_list:
        str_more = str_more + "(?=.*?" + i + ")"
    str_more = "^" + str_more + ".+"
    print(str_more)
    return str_more

def is_date(date):
    try:
        deadline = datetime.datetime.strptime(date, '%Y-%m-%d')
    except:
        deadline = None
    return deadline

if __name__ == "__main__":
    search_first_or_later(key_word="江苏 有限公司 高速", issue_date="2020-08-20")

Done.
