Automatically downloading website data with Python request modules

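Getting the request headers
Manually:
Right-click the page, choose Inspect, open the Network tab, refresh (Ctrl+R), pick any one of the listed requests, and copy the value that follows User-Agent.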
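As a minimal sketch of where the copied value goes: the standard library's `urllib.request.Request` accepts a `headers` dictionary. The `build_request` helper and the example.com URL are placeholders for illustration, and nothing is sent over the network here.

```python
import urllib.request

# Paste the User-Agent string copied from the browser's Network tab here.
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://example.com/")  # placeholder URL
# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

The request would only be sent when it is passed to `urllib.request.urlopen(req)`.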


import urllib.request  # URL requests
import re              # regular expressions
import os              # paths and directories
import time            # pause between downloads


def BatchDownload(url, pattern, Directory):
    '''
    url        page that lists the files to download
    pattern    regular expression that matches the file names
    Directory  local download directory
    '''
    # Send the request with a browser User-Agent so that the site is less
    # likely to reject it as a crawler.
    # opener.addheaders expects a list of (name, value) tuples.
    headers = ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36')
    opener = urllib.request.build_opener()  # custom opener whose headers can be changed
    opener.addheaders = [headers]           # add the header
    content = opener.open(url).read().decode('utf8')         # fetch the page content
    raw_hrefs = re.findall(pattern, content, re.IGNORECASE)  # find every match of pattern in content
    hset = set(raw_hrefs)                                    # set() removes duplicates

    # urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)
    #   url:        remote or local URL.
    #   filename:   local path to save to (if omitted, urllib saves the data to a temporary file).
    #   reporthook: callback invoked when the connection is established and after each
    #               block is transferred; it can be used to display download progress.
    #   data:       data to POST to the server.
    #   Returns a tuple (filename, headers): the local path and the server's response headers.
    #
    # urllib.request.Request(url, data=None, headers={}, origin_req_host=None,
    #                        unverifiable=False, method=None)
    #   url:             the URL.
    #   data:            extra data sent to the server, used for POST requests; default None.
    #   headers:         HTTP request headers as a dictionary (User-Agent, Cookie and
    #                    Referer are the important ones).
    #   origin_req_host: host (IP or domain name) of the original request.
    #   unverifiable:    rarely used; marks whether the request is unverifiable; default False.
    #   method:          request method, such as GET, POST, DELETE or PUT.

    # Download each link
    for href in hset:
        # re.findall returns tuples when the pattern contains more than one group
        # and plain strings otherwise; element 0 of a tuple is the whole file name.
        name = href[0] if isinstance(href, tuple) else href
        link = url + name
        filename = os.path.join(Directory, name)
        print("Downloading", filename)
        urllib.request.urlretrieve(link, filename)
        print("Download complete!")
        # Without a pause between requests the site may treat the traffic as an attack
        time.sleep(1)


BatchDownload('http://download.alleninstitute.org/informatics-archive/current-release/mouse_ccf/annotation/ccf_2017/structure_masks/structure_masks_10/',
              r'(structure_(\d+)\.nrrd)',
              r'C:\Users\戚世兴\Desktop\Request_Module')
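The link-extraction step can be tried offline. The sketch below assumes a made-up HTML snippet (`SAMPLE_HTML`), a placeholder base URL and a hypothetical helper `extract_file_names`. It shows that `re.findall` yields tuples when the pattern contains several groups and plain strings when it contains none, so a download loop has to handle both shapes.

```python
import re
from urllib.parse import urljoin

# Made-up page content standing in for the downloaded directory listing.
SAMPLE_HTML = '''
<a href="structure_1.nrrd">structure_1.nrrd</a>
<a href="structure_22.nrrd">structure_22.nrrd</a>
'''


def extract_file_names(pattern, content):
    """Return the sorted, de-duplicated file names matched by pattern."""
    names = set()
    for match in re.findall(pattern, content, re.IGNORECASE):
        # Several groups -> tuple (group 1 first); otherwise a plain string.
        names.add(match[0] if isinstance(match, tuple) else match)
    return sorted(names)


two_groups = extract_file_names(r'(structure_(\d+)\.nrrd)', SAMPLE_HTML)
no_groups = extract_file_names(r'structure_\d+\.nrrd', SAMPLE_HTML)

base = 'http://example.com/masks/'  # placeholder base URL
links = [urljoin(base, name) for name in two_groups]
print(two_groups)
print(links)
```

Both patterns give the same list of names here, because the helper normalizes the tuple case before building the links.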

import requests


def BatchDownload(url, pattern, Directory):
    # A username and password can be placed in front of the proxy address:
    # proxies = 'username:password@127.0.0.1:9743'
    proxies = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Cookie': 'JSESSIONID=B851AE4A99FB9222C81FE446688D4CBC; fxbdLocal=zh; islogin=true; username=F1243792; sign=sw0fLXElJRVJd2fmpL04UA==; zh_choose=n',
    }
    # verify=False turns off TLS certificate verification.
    # pattern and Directory are not used in this version.
    response = requests.request("GET", url, headers=headers, verify=False)
    print(response.text)
    print(requests.request("GET", "https://iedu.foxconn.com/public/user/userInfo",
                           proxies=proxies, headers=headers, verify=False).text)


BatchDownload('https://iedu.foxconn.com/public/user/studyTask?',
              r'(structure_(\d+)\.nrrd)',
              r'C:\Users\戚世兴\Desktop\Request_Module')

Deleting and moving files

import os.path
import zipfile
import shutil

zip_file = zipfile.ZipFile(r'C:\Users\戚世兴\Desktop\新建文件夹.zip')
f_content = zip_file.namelist()
print(f_content)
f_size = zip_file.getinfo(r"新建文件夹/S3.xlsx").file_size
print(f_size)
zip_file.extractall(r"C:\Users\戚世兴\Desktop\11111")
zip_file.close()

size = 0
for root, dirs, files in os.walk(r"C:\Users\戚世兴\Desktop\11111"):
    for file in files:
        path = os.path.join(root, file)
        if os.path.isfile(path):
            size += os.path.getsize(path)
            # print(path)
            # os.remove(path)
            if not os.path.exists(os.path.join(r"C:\Users\戚世兴\Desktop\11111", file)):
                shutil.move(path, r"C:\Users\戚世兴\Desktop\11111")
print(size)
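The same extract, measure and flatten workflow can be sketched without hard-coded Windows paths. This version builds a throwaway archive in a temporary directory so that it runs anywhere. The helper name `extract_and_flatten` and the sample file names are illustrative assumptions, and `pathlib` is swapped in for string concatenation.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_and_flatten(zip_path, target_dir):
    """Extract zip_path into target_dir, move nested files to the top level,
    and return the total size in bytes of all extracted files."""
    target = Path(target_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target)
    total = 0
    # Build the list first because files are moved while looping.
    for path in [p for p in target.rglob("*") if p.is_file()]:
        total += path.stat().st_size
        destination = target / path.name
        # Skip files already at the top level and never overwrite a name clash.
        if path.parent != target and not destination.exists():
            shutil.move(str(path), str(destination))
    return total


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "sample.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("folder/a.txt", "hello")   # 5 bytes
        zf.writestr("folder/sub/b.txt", "hi")  # 2 bytes
    out_dir = Path(tmp) / "out"
    total_size = extract_and_flatten(archive, out_dir)
    top_level = sorted(p.name for p in out_dir.iterdir() if p.is_file())
    print(total_size, top_level)
```

Opening the archive in a `with` block closes it even if extraction fails, which the explicit `close()` call does not guarantee.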

tell application "System Events"
    try
        tell window 1 of process "loginwindow"
            repeat until not (value of static text 4 is equal to "")
                set value of static text 4 to "this is a test"
                delay 0.5
            end repeat
        end tell
    end try
end tell

myPythonVariable = 10
cmd = """osascript -e 'tell application "System Events"
    set activeApp to name of first application process whose frontmost is true
    if "MyApp" is in activeApp then
        set stepCount to {0}
        repeat with i from 1 to stepCount
            -- do something
        end repeat
    end if
end tell'""".format(myPythonVariable)
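The snippet above only builds the command string. One way to run such a script, sketched here as an assumption rather than the author's method, is to hand it to `osascript` through `subprocess.run` as an argument list, which avoids shell-quoting problems. `build_osascript_args` is a hypothetical helper, and the actual call is left commented out because it only works on macOS.

```python
import subprocess

SCRIPT_TEMPLATE = '''tell application "System Events"
    set activeApp to name of first application process whose frontmost is true
    if "MyApp" is in activeApp then
        set stepCount to {0}
        repeat with i from 1 to stepCount
            -- do something
        end repeat
    end if
end tell'''


def build_osascript_args(step_count):
    """Return the argument list that runs the AppleScript with step_count filled in."""
    return ["osascript", "-e", SCRIPT_TEMPLATE.format(int(step_count))]


args = build_osascript_args(10)
print(args[2])

# On macOS only:
# result = subprocess.run(args, capture_output=True, text=True, check=True)
# print(result.stdout)
```

Converting `step_count` with `int()` keeps arbitrary text from being pasted into the script.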
