A 13-Line MATLAB Web Crawler for Downloading Space Images from the NASA Gallery

Uploaded 2021/04/18

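Updated 2021/04/21: changed how N is entered, added support for downloading PNG images, added code that automatically handles several error cases, and made it possible to save the download progress and error messages to a log.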

Source code

N = input('Input the number you want to download:');
URL = 'https://www.nasa.gov/api/2/ubernode/_search';
mainURL = 'https://www.nasa.gov/sites/default/files/';
opt = weboptions('Timeout',10);   % give up on a request after 10 seconds
for i=1:N
    data = webread(URL,'size',num2str(N),'from','0','sort','promo-date-time:desc','q','((ubernode-type:image) AND (routes:1446))','_source_include','promo-date-time,master-image,nid,title,topics,missions,collections,other-tags,ubernode-type,primary-tag,secondary-tag,cardfeed-title,type,collection-asset-link,link-or-attachment,pr-leader-sentence,image-feature-caption,attachments,uri',opt);
    imgURL = append(mainURL,data.hits.hits(i).x_source.master_image.uri(10:end));   % skip the first 9 characters of uri
    img = webread(imgURL,opt);
    filename = append('Img_',num2str(i),'_',data.hits.hits(i).x_source.master_image.title,'.jpg');
    imwrite(img,filename);
    disp(append('FINISHED:',num2str(i),'/',num2str(N)));
end
disp('Completed!');

Usage

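Add the folder containing the .m script to the MATLAB path and run the script. The command line prompts Input the number you want to download:. Enter the number of images you want, and the crawler starts automatically and shows its progress. When the progress reaches the end, it prints Completed!. The images are saved in MATLAB's current folder, which is the script's folder if you run the script from there.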

Explanation

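This crawler only works for NASA's Image of the Day gallery, but once you have the image links, the same approach can be used on many other websites.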

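On https://www.nasa.gov/multimedia/imagegallery/iotd.html, the Network tab of the browser's developer tools (F12) reveals the API URL that the page calls to fetch image information. Its query string has several parameters; size is the number of images returned by one request, so changing size changes how many images you get.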
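
As a rough sketch of the request described above, the call below changes only size to ask for a different number of entries. The name nImages is hypothetical, and cutting _source_include down to master-image is an assumption about what the API accepts; the original script sends a much longer field list. Whether the endpoint still answers is also an assumption, so the call sits inside try/catch.

```matlab
URL = 'https://www.nasa.gov/api/2/ubernode/_search';
opt = weboptions('Timeout',10);
nImages = 5;   % hypothetical: how many gallery entries to request

try
    % webread appends the name-value pairs to URL as a query string.
    data = webread(URL, ...
        'size', num2str(nImages), ...
        'from', '0', ...
        'sort', 'promo-date-time:desc', ...
        'q', '((ubernode-type:image) AND (routes:1446))', ...
        '_source_include', 'master-image', ...   % assumption: shortened field list
        opt);
    disp(numel(data.hits.hits));   % number of entries actually returned
catch err
    % Reached when the site is unreachable or the response has another shape.
    disp(append('[ERROR]', err.message));
end
```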

The response from this URL contains part of the image link we need, namely the uri field.
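
The script reads this field as data.hits.hits(i).x_source.master_image.uri, which may look odd next to the query parameter master-image. A minimal offline sketch of why: MATLAB's JSON decoding renames keys that are not valid identifiers. The JSON text below is a hypothetical, heavily trimmed stand-in for a real response, and the uri value in it is made up for illustration.

```matlab
% Hypothetical, trimmed response body; a real response carries many more fields.
sample = ['{"hits":{"hits":[{"_source":{"master-image":' ...
          '{"uri":"public://thumbnails/image/example.jpg","title":"Example"}}}]}}'];

data = jsondecode(sample);

% Keys that are not valid MATLAB identifiers are renamed:
% "_source" becomes x_source and "master-image" becomes master_image.
uri = data.hits.hits(1).x_source.master_image.uri;
disp(uri);
```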

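Joining mainURL with uri(10:end), which drops the first nine characters of uri, gives the link for each image. The webread() function can then read the image from that link.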
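
To make the indexing concrete, here is a small sketch with a made-up uri. It assumes the field starts with the nine-character prefix public://, which is what uri(10:end) would be skipping; the file name example.jpg is hypothetical.

```matlab
mainURL = 'https://www.nasa.gov/sites/default/files/';
uri = 'public://thumbnails/image/example.jpg';   % hypothetical value of the uri field

relPath = uri(10:end);              % drop the 9-character 'public://' prefix
imgURL = append(mainURL, relPath);  % full download link
disp(imgURL);
```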

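The weboptions() function sets the request options, here the response timeout Timeout in seconds; webread() itself sends the request with the GET method.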
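
A short sketch of inspecting and adjusting these options. The 30-second value is only an illustration for slow connections, not something the original script uses.

```matlab
opt = weboptions('Timeout',10);
disp(opt.Timeout);         % timeout in seconds
disp(opt.RequestMethod);   % request method setting that webread will honor

% Options can be changed after creation, e.g. for a slow connection.
slowOpt = opt;
slowOpt.Timeout = 30;      % hypothetical value
disp(slowOpt.Timeout);
```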

append() concatenates strings, and imwrite() writes the image to a file with the given name.
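
The naming and saving step can be tried without any network access. In the sketch below a random array stands in for a downloaded image, imgTitle is a hypothetical title, and the file goes to tempdir rather than the current folder so the experiment leaves no clutter.

```matlab
img = uint8(randi([0 255], 8, 8, 3));   % stand-in for a downloaded RGB image
imgTitle = 'Example Title';             % hypothetical image title
k = 1;                                  % index of the image in the result list

% Same naming scheme as the script: Img_<index>_<title>.jpg
filename = append('Img_', num2str(k), '_', imgTitle, '.jpg');
outFile = fullfile(tempdir, filename);

imwrite(img, outFile);   % the format is inferred from the .jpg extension
disp(outFile);
```

Titles taken from a real response may contain characters that are not allowed in file names, which is one plausible reason for imwrite() to fail on some images.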

via nasa.gov

Source code (2021/04/21):

disp('Input the number you want to download:[N1-N2]');
N1 = input('N1:');
N2 = input('N2:');
disp(append('From ',num2str(min(N1,N2)),' to ',num2str(max(N1,N2)),' There are ',num2str(max(N1,N2)-min(N1,N2)+1),' pictures.'));
URL = 'https://www.nasa.gov/api/2/ubernode/_search';
mainURL = 'https://www.nasa.gov/sites/default/files/';
opt = weboptions('Timeout',10);
path = 'F:\PictureDownload\PictureDownload';   % output folder; change it to suit your machine
for i=min(N1,N2):max(N1,N2)
    ispng = 1;   % reset for every picture so one failure does not affect later ones
    % Fetch the list of the i most recent images; stop entirely if the site is unreachable.
    try
        data = webread(URL,'size',num2str(i),'from','0','sort','promo-date-time:desc','q','((ubernode-type:image) AND (routes:1446))','_source_include','promo-date-time,master-image,nid,title,topics,missions,collections,other-tags,ubernode-type,primary-tag,secondary-tag,cardfeed-title,type,collection-asset-link,link-or-attachment,pr-leader-sentence,image-feature-caption,attachments,uri',opt);
    catch
        disp('[ERROR]Failed to connect to the website. Check your web connection.');
        break
    end
    imgURL = append(mainURL,data.hits.hits(i).x_source.master_image.uri(10:end));
    % Download the image; skip it if the download fails.
    % (Changing i here would have no effect: a for loop resets its variable on each pass.)
    try
        img = webread(imgURL,opt);
    catch
        disp(append('[WARN]Failed to download the ',num2str(i),'th picture. It has been skipped.'));
        disp(append('[LINK]',imgURL));
        continue
    end
    % fullfile inserts the file separator between the folder and the file name.
    filename = fullfile(path,append('Img_',num2str(i),'_',data.hits.hits(i).x_source.master_image.title,'.jpg'));
    % Try to save as jpg first, then fall back to png.
    try
        imwrite(img,filename);
        disp(append('[',num2str(i),']FINISHED:',num2str(i-min(N1,N2)+1),'/',num2str(max(N1,N2)-min(N1,N2)+1)));
    catch
        filename = fullfile(path,append('Img_',num2str(i),'_',data.hits.hits(i).x_source.master_image.title,'.png'));
        try
            imwrite(img,filename);
        catch
            ispng = 0;
        end
        if ispng==1
            disp(append('[WARN]The ',num2str(i),'th picture is in png format; it has been downloaded successfully.'))
        else
            disp(append('[WARN]Failed to write the image file. The No.',num2str(i),' picture has been skipped.'));
            disp(append('[LINK]:',imgURL));
        end
    end
end
disp('Completed!');