Fetching a Web Page and Saving Its Static Assets

File name: index.php

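<?php
if (isset($_GET['action']) && $_GET['action'] === 'curl') {
    /**
     * Fetch a URL and return the response body.
     * @param string $url
     * @return bool|string Response body, or false on failure
     */
    function getUrl($url)
    {
        $headerArray = array("Content-type:application/json;", "Accept:application/json");
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        //curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
        //curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        //curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (iPhone; CPU iPhone OS 13_2_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.3 Mobile/15E148 Safari/604.1');
        //curl_setopt($ch, CURLOPT_HTTPHEADER, $headerArray);
        $output = curl_exec($ch);
        //echo 'Error: ', curl_error($ch), $output, PHP_EOL;
        curl_close($ch);
        return $output;
    }

    /**
     * Save content under a site-relative path such as "/css/main.css".
     * @param string $file
     * @param string|false $content
     */
    function save($file, $content)
    {
        // Absolute URLs have no local path: print them and skip.
        if (0 === strpos($file, 'http')) {
            echo $file, PHP_EOL;
            return;
        }
        $file = ltrim($file, '/');
        // Skip empty names, parent-directory references and failed downloads.
        if (!$file || strpos($file, '..') !== false || $content === false) {
            return;
        }
        $dir = dirname($file);
        if ($dir !== '.' && !file_exists($dir)) {
            mkdir($dir, 0777, true);
            chmod($dir, 0777);
        }
        if (!file_exists($file)) {
            file_put_contents($file, $content);
        }
        //echo $file, PHP_EOL;
    }

    /**
     * Download the images referenced by url(...) inside a CSS file.
     * @param string $file   Local path of the CSS file (currently unused)
     * @param string $url    URL of the CSS file
     * @param string $static Base URL of the static resources
     */
    function saveCss($file, $url, $static)
    {
        $content = getUrl($url);
        if ($content === false) {
            return;
        }
        preg_match_all('/url\((.*)\)/U', $content, $arr4);
        //var_dump($arr4);
        foreach ($arr4[1] as $item) {
            $item = trim($item, " \t\"'"); // url("...") values may be quoted
            // Only site-relative paths ("/img/bg.png") are handled here.
            if (strpos($item, '/') === 0 && strpos($item, '//') !== 0) {
                //echo $item, PHP_EOL;
                if (!is_file(ltrim($item, '/'))) {
                    save($item, file_get_contents($static . $item));
                }
            }
        }
    }

    $url = $_POST['static'];     // base URL of the static resources
    $curlUrl = $_POST['target']; // page to fetch
    $fileName = $_POST['name'];  // file name to save the page as
    if (is_file($fileName)) {
        echo 'The file already exists.<script>history.back();</script>';
        exit;
    }
    $html = getUrl($curlUrl); // fetch the page
    if ($html === false) {
        echo 'Failed to fetch the page.<script>history.back();</script>';
        exit;
    }
    // 1. Match src attributes (images or JS)
    preg_match_all('/src="(.*)"/U', $html, $arr1); // $arr1[0]: full matches; $arr1[1]: captured values only
    // 2. Match href attributes (CSS)
    preg_match_all('/href="(.*)"/U', $html, $arr2);
    // 3. Match background images in the page's inline CSS
    preg_match_all('/url\((.*)\)/U', $html, $arr3);
    // 4. Merge the captured values
    //$contents = array_merge($arr1[0], $arr2[0], $arr3[0]);
    $contents = array_merge($arr1[1], $arr2[1], $arr3[1]);
    // 5. Save the page HTML, stripping these hosts so the links point at the local copies
    file_put_contents($fileName, preg_replace('/(http:\/\/img\.yigouf\.com)|(http:\/\/pfghouse\.pinfangw\.com)|(\/\/script\.crazyegg\.com)/i', '', $html));
    // 6. Download the static resources
    foreach ($contents as $item) {
        if (strpos($item, 'script.crazyegg.com') !== false) {
            continue;
        }
        if (preg_match('/(\.js)|(\.css)|(\.jpg)|(\.png)|(\.gif)|(\.ico)/i', $item)) {
            if (strpos($item, 'baidu') !== false) {
                continue; // skip Baidu snippets
            }
            $newFile = preg_replace('/\?.*/', '', $item); // drop "?" and everything after it
            // Local name for cross-origin resources: strip the scheme and a host ending in ".com"
            $newFile = preg_replace('#^(?:https?:)?//[^/]*\.com#i', '', $newFile);
            if (strpos($item, '//') === 0) {
                $staticUrl = 'http:' . $item; // protocol-relative URL; http is assumed
            } elseif (strpos($item, '//') !== false) {
                $staticUrl = $item;
            } else {
                $staticUrl = $url . $item;
            }
            // Save the static resource
            if (!is_file(ltrim($newFile, '/'))) {
                save($newFile, file_get_contents($staticUrl));
            }
            // Save the background images referenced by CSS files
            if (preg_match('/\.css/i', $item)) {
                saveCss($newFile, $staticUrl, $url);
            }
        }
    }
    echo "Done! <button onclick='history.back();'>Back</button>";
} else { ?>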
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Web Page Scraper</title>
    <style>
        .container {width: 60%;margin: 10% auto 0;background-color: #f0f0f0;padding: 2% 5%;border-radius: 10px}
        ul {padding-left: 20px;}
        ul li {line-height: 2.3}
        a {color: #20a53a}
        input{width: 80%;}
    </style>
</head>
<body>
<div class="container">
    <h1>Form</h1>
    <h3>Enter the details of the page to fetch</h3>
    <form action="/index.php?action=curl" method="post">
        <ul>
            <li><label>Static resource base URL:<input type="text" value="http://yunnan.yigouf.com" name="static"></label></li>
            <li><label>Target URL:<input type="text" value="http://yunnan.yigouf.com/index/" name="target"></label></li>
            <li><label>Save page as:<input type="text" value="index.html" name="name"></label></li>
            <li><label><input type="submit" value="Submit"></label></li>
        </ul>
    </form>
</div>
</body>
</html>
<?php } ?>
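
The script decides, for each matched reference, which URL to download and which local path to write. That decision can be sketched as one small helper. This is only an illustration under stated assumptions: the function name `resolveAsset` and the `example.com` hosts are hypothetical, `http:` is assumed for protocol-relative references, and page-relative references such as `img/a.png` are skipped rather than resolved.

```php
<?php
/**
 * Hypothetical helper (not part of the original script): map a reference
 * found in src/href/url(...) to [download URL, local relative path].
 * Returns null for references this sketch does not handle.
 */
function resolveAsset(string $ref, string $base): ?array
{
    // url(...) values may be wrapped in quotes or padded with whitespace.
    $ref = trim($ref, " \t\n\r\"'");
    if ($ref === '' || strpos($ref, 'data:') === 0) {
        return null; // nothing to download
    }

    if (strpos($ref, '//') === 0) {
        $url = 'http:' . $ref;               // protocol-relative; http is an assumption
    } elseif (preg_match('#^https?://#i', $ref)) {
        $url = $ref;                         // already absolute
    } elseif ($ref[0] === '/') {
        $url = rtrim($base, '/') . $ref;     // site-relative
    } else {
        return null;                         // page-relative: not handled here
    }

    // parse_url() drops the query string and fragment from the path.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '' || $path === '/') {
        return null;
    }
    if (strpos($path, '..') !== false) {
        return null;                         // refuse to write outside the target directory
    }

    return [$url, ltrim($path, '/')];
}

// Usage: a versioned stylesheet on the same site.
list($downloadUrl, $localPath) = resolveAsset('/css/main.css?v=3', 'http://example.com');
echo $downloadUrl, ' -> ', $localPath, PHP_EOL;
```

Using `parse_url()` instead of a host-stripping regular expression keeps the local path independent of the host's top-level domain, at the cost of files from different hosts sharing one directory tree.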
