Crawling the Links on a Page with Node

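This is a small exercise I wrote while learning Node: a simple tool that crawls the links on a page, built from a few of the modules I have been studying. If you spot any shortcomings, your feedback is welcome.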

// superagent is a very convenient HTTP client module for Node.js (similar to ajax);
// reach for it whenever you need to make GET, POST, PUT, DELETE or HEAD requests.
const superAgent = require('superagent')
// cheerio is a fast, flexible, lean implementation of core jQuery designed for the server.
const cheerio = require('cheerio')
const fs = require('fs')
const path = require('path')

const testUrl = 'http://ah.10086.cn/m' // test URL
const dataDir = path.join(__dirname, 'data') // output folder, resolved next to this script

// Fetch the page at `url` and collect its title and every <a> link.
function getLinkByUrl(url) {
  return new Promise(function (resolve, reject) {
    // Request `url` (the original always requested testUrl and ignored the argument)
    superAgent.get(url).end((err, res) => {
      if (err) {
        console.log('Invalid URL')
        return reject('Invalid URL')
      }
      console.log('=========html=============\n', res.text)
      const $ = cheerio.load(res.text)
      const obj = { title: '', linkArry: [], count: 0 }
      obj.title = $('title').text()
      $('a').each(function (ind, element) {
        const href = $(element).attr('href') || ''
        const name = $(element).text().trim()
        obj.linkArry.push({ name, href })
        obj.count++
      })
      resolve(obj)
    })
  })
}

// Write `data` to data/<title>.json, creating the data folder if needed.
// An existing file with the same name is left untouched.
function writeJsonFile(data) {
  return new Promise(function (resolve, reject) {
    const file = path.join(dataDir, `${data.title}.json`)
    // fs.exists is deprecated; a recursive mkdir (Node 10.12+) succeeds
    // whether or not the folder already exists.
    fs.mkdir(dataDir, { recursive: true }, function (err) {
      // Reject here so the promise always settles (the original only logged the error)
      if (err) return reject('Failed to create the data folder')
      // The 'wx' flag makes writeFile fail with EEXIST instead of overwriting
      fs.writeFile(file, JSON.stringify(data), { flag: 'wx' }, function (err) {
        if (!err) return resolve('JSON file created')
        if (err.code === 'EEXIST') return resolve('JSON file already exists')
        reject('Failed to create the JSON file')
      })
    })
  })
}

getLinkByUrl(testUrl)
  .then(function (obj) {
    console.log('=============resolve1=', obj)
    return writeJsonFile(obj)
  })
  .then(function (message) {
    console.log('===========resolve2=', message)
  })
  // A single catch handles a failure from either step, so a failed request
  // no longer falls through to the write step's success log.
  .catch(function (err) {
    console.log('=========err=', err)
  })
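Because the script uses the page title as the file name, a title containing characters such as `/` or `:` could make the write fail or land in an unexpected path. Below is a minimal sketch of one way to guard against this. It assumes a hypothetical helper `safeFileName` (not part of the original code) and that replacing such characters with `_` is acceptable:

```javascript
// Hypothetical helper, not part of the original script.
// Replaces characters that are not allowed in file names on common platforms
// and falls back to a default name when the title is empty.
function safeFileName(title, fallback = 'untitled') {
  const cleaned = String(title || '')
    .replace(/[\\\/:*?"<>|\x00-\x1f]/g, '_') // reserved on Windows; '/' is also reserved on POSIX
    .trim()
  return cleaned || fallback
}
```

In `writeJsonFile`, the path could then be built from `safeFileName(data.title)` rather than `data.title` directly.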

After running the script, the generated file contains:

{
  "title": "安徽移动个人触屏版网厅",
  "linkArry": [
    {"name": "", "href": "javascript:void\n\n(window.location.href='http://ah.10086.cn/mpad/pad/num/number_list.html');"},
    {"name": "", "href": "javascript:void\n\n(window.location.href='http://ah.10086.cn/mpad/pad/num/number_list.html');"},
    {"name": "+充话费", "href": ""},
    {"name": "+充流量", "href": ""},
    {"name": "业务办理", "href": "http://ah.10086.cn/m/pages/pad/operate/openBusiIndex.html"},
    {"name": "手机卖场", "href": "http://ah.10086.cn/mpad/pad/index.html"},
    {"name": "宽带专区", "href": "http://ah.10086.cn/m/pages/pad/kdzq/index.html"},
    {"name": "选号入网", "href": "http://ah.10086.cn/mpad/pad/num/number_list.html"},
    {"name": "流量专区", "href": "http://ah.10086.cn/m/pages/pad/operate/flowZQ/index.html"},
    {"name": "流量红包", "href": "http://ah.10086.cn/m/pages/draw/downloadkhd/downloadkhd.html?code=4&&WT.mc_ev=GXHXZY4"},
    {"name": "4G特惠", "href": "http://ah.10086.cn/dt/khd"},
    {"name": "下载手厅", "href": "http://ah.10086.cn/dt/khd"},
    {"name": "", "href": "http://ah.10086.cn/dt/khd"},
    {"name": "", "href": "http://ah.10086.cn/mpad/pad/act/haokarwy/index2.html"},
    {"name": "", "href": "http://ah.10086.cn/mpad/hhg"},
    {"name": "", "href": "http://ah.10086.cn/m/pages/draw/broadpromotion/index.html"},
    {"name": "", "href": "http://ah.10086.cn/zsyyt/ahmobile/download/mobileDownLoadApk.do"},
    {"name": "", "href": ""},
    {"name": "马上下载", "href": "http://ah.10086.cn/zsyyt/ahmobile/download/mobileDownLoadApk.do"}
  ],
  "count": 19
}
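The output above includes entries with an empty `href` and `javascript:` pseudo-links, neither of which is usable as a crawl target. Below is a minimal sketch of how they might be filtered out. The helper name `cleanLinks` is hypothetical and not part of the original script. It uses Node's built-in `URL` class to resolve each `href` against the page URL and keeps only unique http(s) links:

```javascript
// Hypothetical helper, not part of the original script.
// Resolves each href against baseUrl and keeps only unique http(s) links.
function cleanLinks(linkArry, baseUrl) {
  const seen = new Set()
  const result = []
  for (const { name, href } of linkArry) {
    if (!href) continue // skip empty hrefs
    let resolved
    try {
      resolved = new URL(href, baseUrl) // handles both relative and absolute hrefs
    } catch (e) {
      continue // skip hrefs that cannot be parsed as a URL
    }
    // Drop javascript:, mailto:, tel: and other non-web schemes
    if (resolved.protocol !== 'http:' && resolved.protocol !== 'https:') continue
    if (seen.has(resolved.href)) continue // keep only the first entry per URL
    seen.add(resolved.href)
    result.push({ name, href: resolved.href })
  }
  return result
}
```

Called as `cleanLinks(obj.linkArry, testUrl)`, this keeps the first entry for each URL, so when two links with different names point to the same address, only the first name survives. Whether that trade-off is acceptable depends on what the link list is used for.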

Reposted from: https://www.cnblogs.com/shichangchun/p/9700009.html
