php_curl扩展的使用

http://www.runoob.com/php/php-ref-curl.html

https://www.cnblogs.com/manongxiaobing/p/4698990.html

1.curl_init函数用法

https://www.cnblogs.com/bwteacher/p/5685086.html

2.iconv的使用

https://www.cnblogs.com/aademeng/articles/6233218.html

3.乱码问题解决：

https://www.cnblogs.com/joshua317/articles/3958588.html

4.注意：

如果是抓取批量网页，出现不稳定情况（内容缺失），可以设置适量的延时

这是由于目标页面发生了跳转，而php_curl默认是不跳转的，所以要在curl对象设置自动跳转：

curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, 1 );

5.模拟get和post

https://aibk.cnblogs.com/p/7215462.html

5.动态页面如何爬取？

①这是个很蠢的方法，但是应该可以实现：

首先，curl获取到页面。添加js代码到页面内，实现动态跳转连接内要有附加的一些信息，而跳转目标的php代码需要通过我们附加在链接里的信息继续爬取新的页面。（简而言之就是爬到原页面，到浏览器运行一遍，再把需要的内容发送回来，不断循环）

②结合phantomjs（无显示浏览器）

https://www.cnblogs.com/miqi1992/p/8093958.html

使用：

https://www.jianshu.com/p/0254391918f7

https://blog.csdn.net/jiedao_liyk/article/details/78850684

http://www.mafutian.net/267.html

API:

http://phantomjs.org/api/webpage/

4.基本操作：

注意：下面的抓取方法是无法获取js处理后的内容的（如果有需要请看上面phantomJS的内容，简单看一下在PHP里使用的情况就好了。当然，phantomJS是个挺好用的东西，可以学一下）

// 爬取页面全部数据
function curlGetData($url){$ch = curl_init();curl_setopt($ch, CURLOPT_URL, $url);curl_setopt ($ch,  CURLOPT_HEADER,  false);curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, 1 );//自动跳转$response = curl_exec($ch);curl_close($ch);//判断是什么编码并转码// $response = str_replace('gb2312','utf-8',$response);// $response =  iconv("gb2312","utf-8//IGNORE",$response);return $response;
}

5.抓取的内容一般有以下几点需要修正：①编码方式转换（iconv）②链接地址（绝对与相对）

6.有时候发现拿不到内容，不要急，分析一下情况：

①内容是经过js动态生成的=>用phantom来获取，放弃curl

②页面显示“对象已移动，请点击。。。。”=>这是发生了重定向，检查curl设置里的CURLOPT_FOLLOWLOCATION是否设置1

③爬取贫乏的话，如果出现内容缺失，可能是由于目标服务器有限制=>设置延时看看

④如果提示https什么的=>检查curl设置里的CURLOPT_SSL_VERIFYPEER是否设置为false（这是控制是否进行SSL证书验证）

7.以下是获取页面数据的几个API（更新）：

①输出函数(实例https://www.hkdm688.com/rhdm/13405):以‘好看动漫网’的爬取为例

function yzm(){$res1=get_page_data('https://www.hkdm688.com/rhdm/13405',['title'=>[['tab'=>'div','attr'=>'class="wrap mb"','op'=>'into'],['tab'=>'ul','attr'=>'class="ipic"','op'=>'into'],['tab'=>'a','op'=>'into'],['eq'=>'2','get_attr'=>'title','trim'=>'"']],'title2'=>[['tab'=>'div','attr'=>'class="wrap mb"','op'=>'into'],['tab'=>'ul','attr'=>'class="ipic"','op'=>'into'],['tab'=>'a','op'=>'into'],],'intro'=>[['tab'=>'div','attr'=>'class="wrap mb"','op'=>'into'],['tab'=>'div','attr'=>'class="Content_des pslpx"','op'=>'into'],['tab'=>'div','attr'=>false,'op'=>'remove'],['remove_tab_without'=>''],],'img'=>[['tab'=>'li']]]);var_dump($res1);return ;
}

②获取页面内容的函数（下面的几个函数为基础）：

//=====================================//
//页面内容获取
/*$url:页面网址$req_array:关联数组(所求项目名=>对应的条件)操作优先级：tab>attr>op>eq>get_attr=get_content>remove>remove_tab_without>trim参数用法：如下例子中$req_array=['title'=>[['tab'=>'div','attr'=>'class="ipc"','op'=>'into'],['tab'=>'li','attr'=>'id="xxx"','op'=>'into'],['tab'=>'span','attr'=>'id="xxx"','op'=>'remove'],['remove'=>'简介：','remove_tab_without'=>'<a><p>']]]；是一个数组它的索引：'title'代表你要查找的属性，输出时会作为结果数组的索引它值是个二维数组；注意：这个二维数组按照顺序一行一行执行。开始：以页面所有内容为基础，经过二维数组内筛选后输出结果每一个行指的是一次完全操作，即“谁 干了 什么”，安装上面的优先级关系：1.'tab'=>标签名，'attr'=>属性：找到标签2.'op'=>'into'：进入1中找到的目标,'op'=>'into'：进入1中找到的目标3.'eq'=>数字:选择找到的第几个目标4.'remove'=>'字符串'：删除字符串5.'remove_tab_without'=>标签（'<a><p>'），去掉标签，除了xxxxx6.'trim'=>字符串:去掉两边的特定字符串最后：这个函数输出的是个二维数组，比如title项，结果是所有符合的字符串如果赛选精确，可以直接[0]取到$ret:返回值(所求项目名=>对应的条件得到的内容)，格式出错返回false，没有找到返回空数组例子：get_page_data('baidu.com',['title'=>[['tab'=>'div','attr'=>'class="ipc"','op'=>'into'],['tab'=>'li','attr'=>'id="xxx"','op'=>'into'],['tab'=>'span','attr'=>'id="xxx"','op'=>'remove'],['remove'=>'简介：','remove_tab_without'=>'<a><p>']],'intro'=>[['tab'=>'div','attr'=>'class="ipc"','op'=>'into'],['tab'=>'li','attr'=>'id="xxx"','op'=>'into'],['tab'=>'span','attr'=>'id="xxx"','op'=>'remove'],['remove'=>'简介：','remove_tab_without'=>'<a><p>']]]);*/
function get_page_data($url,$req_array){//效率测试$t1=microtime(true);$m1=memory_get_usage();//结果$ret=[];//获取页面内容$page_data1=curlGetData($url);//去掉注释$pattern='/<!--[\s\S]*?-->/';$page_data2=preg_replace($pattern,'', $page_data1);//找到目标内容并返回foreach($req_array as $item1 => $value1){//记录时时匹配结果$final_str=[$page_data2];//判断是否是数组if(!is_array($value1)){return false;}//按要求匹配foreach($value1 as $item2 => $value2){//判断是否是数组if(!is_array($value2)){return false;}//判断查找的对象是否是数组if(!is_array($final_str)){//构造成数组$final_str=[$final_str];}//查找新标签if(isset($value2['tab'])){$tab=$value2['tab'];//属性$attr='';if(isset($value2['attr'])){$attr=$value2['attr'];}//判断操作$op='into';//默认是：进入标签if(isset($value2['op'])){$op=$value2['op'];}//寻找标签$find_res=[];foreach($final_str as $item3 => $value3){$find_res=array_merge($find_res,find_all($value3,$tab,$attr));}//操作switch($op){case 'into'://进入$final_str=$find_res;break;case 'remove'://删除foreach($find_res as $item3 => $value3){$final_str=str_replace($value3,'',$final_str);}break;default :break;}}//对查找完标签的目标进行操作====//取标签组中的一个if(isset($value2['eq'])){$final_str=[$final_str[$value2['eq']]];}//获取标签属性操作if(isset($value2['get_attr'])){foreach($final_str as $item3 => $value3){$final_str[$item3]=get_attr_m($value3)[$value2['get_attr']];}}//删除内容if(isset($value2['remove'])){foreach($final_str as $item3 => $value3){$final_str[$item3]=str_replace($value2['remove'],'',$value3);}}//去除标签if(isset($value2['remove_tab_without'])){foreach($final_str as $item3 => $value3){$final_str[$item3]=strip_tags($value3,$value2['remove_tab_without']);}}//trim操作if(isset($value2['trim'])){foreach($final_str as $item3 => $value3){$final_str[$item3]=trim($value3,$value2['trim']);}}}//赋值$ret[$item1]=$final_str;}//效率测试$t2=microtime(true);$m2=memory_get_usage();// var_dump($t2-$t1);// var_dump($m2-$m1);return $ret;
}

③php_curl扩展的使用函数(代码里注释部分可以自己看情况打开)

function curlGetData($url,$sleepT=2){$ch = curl_init();curl_setopt($ch, CURLOPT_URL, $url);curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);curl_setopt($ch, CURLOPT_ENCODING ,'gzip');// curl_setopt($ch, CURLOPT_USERAGENT,$user_agent);curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, 1 );//自动跳转// curl_setopt($ch, CURLOPT_COOKIE,$cookie);// curl_setopt($ch,CURLOPT_CAINFO,'F://Web/web/baixiu/admin/static/cacert.pem');// var_dump(curl_getinfo($ch));// curl_setopt($ch, CURLOPT_MAXCONNECTS ,100);// curl_setopt($ch, CURLOPT_CONNECTTIMEOUT ,0);// curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS ,0);// curl_setopt($ch, CURLOPT_DNS_CACHE_TIMEOUT ,2);// curl_setopt($ch, CURLOPT_LOW_SPEED_LIMIT ,0);// curl_setopt($ch, CURLOPT_MAXREDIRS ,100);// curl_setopt($ch, CURLOPT_TIMEOUT ,1000);// curl_setopt($ch,CURLOPT_IPRESOLVE,CURL_IPRESOLVE_V4);$response = curl_exec($ch);curl_close($ch);//判断是什么编码并转码// $response = str_replace('gb2312','utf-8',$response);// $response =  iconv("gb2312","utf-8//IGNORE",$response);//输出错误到本地文件// if(!$response){//     file_put_contents($GLOBALS['hkdm']['error_file'],PHP_EOL,FILE_APPEND);//     file_put_contents($GLOBALS['hkdm']['error_file'],$response,FILE_APPEND);//     file_put_contents($GLOBALS['hkdm']['error_file'],'ttt',FILE_APPEND);// }//设置请求延迟sleep($sleepT);return $response;
}

④查找html标签函数：

总共三个函数：

find函数：找到第一个符合的标签

find_all函数：找到所有符合条件的标签（在find的基础上）

find_deep函数：可以提供一系列条件进行多次查找，返回所有匹配的结果（在find_all的基础上）

//单次匹配(第一个)
function find($str,$tab_name,$tab_attr=''){//属性内容柔性化处理，提高容错率===================//等号两两边允许空格存在$tab_attr=preg_replace('/=/','[\s]*=[\s]*',$tab_attr);//无视单双引号的区别$tab_attr=preg_replace('/[\'|\"]/','[\s]*[\\"\\\'][\s]*',$tab_attr);//单空格可匹配多空格$tab_attr=preg_replace('/[\s]/','[\s]+',$tab_attr);//判断标签属性是否是空(没给参数代表任意，不代表没有)$filling='';//用于填充目标标签内标签名外的内容if($tab_attr===false){$filling='[\s]*';}else if($tab_attr===''){$filling='([\s]+[^<>]*)?';}else{$filling='[\s]+([^<>]*)'.'('.$tab_attr.')'.'([^<>]*)';}//填充标签（不一定是目标）除标签名外的内容$arbfill='([\s]+[^<>]*)?';//判断是否是单边标签if($tab_name=='img'){//暂时只有img标签//直接匹配$result=[];$pattern='/<'.$tab_name.$filling.'>/i';if(!preg_match($pattern,$str,$result)){return false;}return $result[0];}//一次筛选找到目标开始的字符串$matches1=[];$attr_str='';$pattern='/(<'.$tab_name.')'.$filling.'>/i';if(!preg_match($pattern,$str,$matches1)){//没找到return false;}//以上面的开头分割原字符串并获取后半段$str=strchr($str,$matches1[0]);//目标开始字符串//判断最终筛选目标的标签深度(需要匹配多少</*>)$matches2=[];$pattern='/(<'.$tab_name.$arbfill.'>)|(<\/'.$tab_name.')>/i';if(!preg_match_all($pattern,$str,$matches2)){return false;}$deep=0;$num=0;//所需个数foreach($matches2[0] as $value){//碰到<div>+1,</div>-1,直到$deep=0;if(preg_match('/(<'.$tab_name.$arbfill.'>)/i',$value)){$deep++;$num++;}if(preg_match('/(<\/'.$tab_name.'>)/i',$value)){$deep--;}if($deep==0){break;}}//最终匹配$result=[];$pattern='/<'.$tab_name.$filling.'>([\s\S]*?<\/'.$tab_name.'>){'.$num.'}/';if(!preg_match($pattern,$str,$result)){return false;}return $result[0];
}
//单次匹配所有html标签
function find_all($str,$tab_name,$tab_attr=''){//结果$ret=[];$res='';//产生随机数用于缩小范围时分割字符串用$uniqid=uniqid();//依次找到所有for($i=0;;$i++){//找到第一个$res=find($str,$tab_name,$tab_attr);if(!$res){break;}array_push($ret,$res);//缩小范围$replace_str='$'.$uniqid.'$';$str=str_replace($res,$replace_str,$str);$str=preg_replace('/[\s\S]*?'.$replace_str.'/','',$str);}//输出return $ret;
}
//多层匹配所有html标签
function find_deep($str,$array){//结果$ret=[];//用于存放每次被查找的目标字符串数组,直到成为结果$str_array=[$str];//初始化用用户提供的字符串//一次次往里找,直到条件用尽foreach($array as $array_item => $array_val){//临时变量,存放下一次循环被查找的数组$temp_array=[];//查找foreach($str_array as $str_array_val){$result=find_all($str_array_val,$array_item,$array_val);$temp_array=array_merge($temp_array,$result);}//修改字符串数组$str_array=$temp_array;}$ret=$temp_array;return $ret;
}

⑤获取a=b格式的内容（例如属性等）

//获取:a=b形式
/*$source:被查找字符串$div:分隔符,默认空格例:$source="<a href='baidu.com' title='abc'>我是文科生</a>$div=' ';结果：['href'=>'baidu.com','title'=>'abc']
*/
function get_attr_m($source,$div=' '){//结果$ret=[];//消除=两边的空格$source=preg_replace('/[\s]*=[\s]*/','=',$source);//分割$array=explode($div,$source);//匹配目标foreach($array as $val){if(preg_match('/[^=]*?=[\s\S]*/',$val,$match)){$item=trim(strchr($match[0],'=',true),"\" ");$value=trim(strchr($match[0],'=',false),"=\" ");$ret=array_merge($ret,[$item=>$value]);}}return $ret;
}