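Getting started with node crawler

Requirement: get the name and price of each product listed on a website.

The example below is tested against a JD.com product listing URL.

1. Set up a Node environment. Installing Node is not covered here.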

2. Install the crawler tool, together with lodash, which the code below also requires: npm install crawler lodash

3. Create index.js with the following code.

/**
 * Fetches PATH_URL, walks the returned HTML by tag name and class to
 * collect each product's name and price, and writes them to a JSON file.
 */
const fs = require('fs');
const Crawler = require('crawler');
const _ = require('lodash');

// Page to crawl (a search-results page on https://search.jd.com)
const PATH_URL = 'https://search.jd.com/Search?keyword=%E5%8D%8E%E4%B8%BA%E6%89%8B%E6%9C%BA&enc=utf-8&suggest=5.def.0.base&wq=huawie%E6%89%8B%E6%9C%BA&pvid=b314d64bbf02446187feba4eed246377';
// Counter used only for console progress output
let cnt = 0;
// Products collected from the page
const listDataRes = [];

// Crawler instance
const c = new Crawler({
  maxConnections: 10,
  retries: 3, // retry a failed request up to 3 times
  callback: function (error, res, done) {
    if (error) {
      console.log(error);
    } else {
      // Note: the extraction rules depend on the page's markup and must be defined per site
      const $ = res.$;
      console.log(' ------------>title: ', $('title').text());
      const dataList = $('.goods-list-v2 li');
      dataList.each((index, dataItem) => {
        const dataRes = {};
        // Child nodes of the li element
        const firstChildren = dataItem.children;
        firstChildren.forEach(twoItem => {
          // Keep only div elements
          if (twoItem.type === 'tag' && twoItem.name === 'div') {
            const twoChildren = twoItem.children;
            twoChildren.forEach(threeItem => {
              // -------> Product price: div.p-price > strong > i
              if (threeItem.type === 'tag' && threeItem.name === 'div' && threeItem.attribs.class === 'p-price') {
                const threeChildren = threeItem.children;
                threeChildren.forEach(fourItem1 => {
                  if (fourItem1.type === 'tag' && fourItem1.name === 'strong') {
                    const fourItem1Children = fourItem1.children;
                    fourItem1Children.forEach(fiveItem1 => {
                      if (fiveItem1.type === 'tag' && fiveItem1.name === 'i') {
                        // The price is the first child (a text node) of the i element
                        const priceNode = fiveItem1.children[0];
                        if (priceNode) {
                          dataRes.price = priceNode.data;
                        }
                      }
                    });
                  }
                });
              }
              // -------> Product name: div.p-name.p-name-type-2 > a > em
              if (threeItem.type === 'tag' && threeItem.name === 'div' && threeItem.attribs.class === 'p-name p-name-type-2') {
                const threeChildren2 = threeItem.children;
                threeChildren2.forEach(fourItem2 => {
                  if (fourItem2.type === 'tag' && fourItem2.name === 'a') {
                    const fourItem2Children = fourItem2.children;
                    fourItem2Children.forEach(fiveItem3 => {
                      if (fiveItem3.type === 'tag' && fiveItem3.name === 'em') {
                        // The name is the first text node inside the em element
                        const fiveItem3ChildrenObj = fiveItem3.children.find(f => f.type === 'text');
                        dataRes.name = fiveItem3ChildrenObj ? fiveItem3ChildrenObj.data : '';
                      }
                    });
                  }
                });
              }
            });
          }
        });
        if (!_.isEmpty(dataRes)) {
          listDataRes.push(dataRes);
        }
      });
      console.log(`${cnt++}`); // Progress output only; serves no other purpose
    }
    done(); // Must be called once the callback has finished its work
  }
});

// Write the collected products to a JSON file
const writeListJson = () => {
  console.log(' =================> queue is empty, data processing finished');
  // fs.writeFile creates the file if it is missing, but not the directory, so create that first
  fs.mkdirSync('./jd_data', { recursive: true });
  fs.writeFile('./jd_data/jd_goods_list.json', JSON.stringify(listDataRes), function (err) {
    if (err) {
      throw err;
    }
    console.log('all requests done and json saved!');
  });
};

// Add the URL to crawl to the queue.
// Most sites have anti-crawling measures; only small, niche sites tend to lack them.
// A page that loads in a browser may be refused to a server-side crawler, because the site
// checks whether the request came from a browser. The workaround is to send request headers
// that make the crawler look like a browser.
c.queue({
  url: PATH_URL,
  // Custom User-Agent; swap in a real browser's User-Agent string if the site rejects this one
  headers: { 'User-Agent': 'requests' }
});

// Called when the queue is empty
c.on('drain', writeListJson);

4. In cmd, change to the directory containing index.js and run node index.js. The results are written to ./jd_data/jd_goods_list.json.

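Note: scraping data from a simple page is easy to implement and works well. The main work is writing extraction rules that match the page's HTML tags.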
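As a rough illustration of tag-based extraction rules, the sketch below replaces the nested loops with one recursive search over the node tree. It is not part of the original script. The helper names (findAll, isTag, textOf, extractProduct) and the sample node are hypothetical, and the nodes are assumed to have the same shape the crawler callback sees (type, name, attribs, children, data). Unlike the original loops, it matches elements at any depth rather than at one fixed level.

```javascript
// Depth-first search: collect every node in the tree that satisfies `predicate`.
function findAll(node, predicate, found = []) {
  if (predicate(node)) found.push(node);
  (node.children || []).forEach(child => findAll(child, predicate, found));
  return found;
}

// Build a predicate matching a tag name and, optionally, an exact class attribute.
const isTag = (name, className) => node =>
  node.type === 'tag' &&
  node.name === name &&
  (className === undefined || (node.attribs && node.attribs.class === className));

// Return the data of the first direct text child, or '' if there is none.
function textOf(node) {
  const textNode = (node.children || []).find(child => child.type === 'text');
  return textNode ? textNode.data : '';
}

// Apply the same rules as the script: price in div.p-price ... i, name in div.p-name ... em.
function extractProduct(li) {
  const product = {};
  const priceDiv = findAll(li, isTag('div', 'p-price'))[0];
  const priceTag = priceDiv && findAll(priceDiv, isTag('i'))[0];
  if (priceTag) product.price = textOf(priceTag);
  const nameDiv = findAll(li, isTag('div', 'p-name p-name-type-2'))[0];
  const nameTag = nameDiv && findAll(nameDiv, isTag('em'))[0];
  if (nameTag) product.name = textOf(nameTag);
  return product;
}

// Hand-built sample node with placeholder values (not real JD data).
const text = data => ({ type: 'text', data });
const tag = (name, attribs, children) => ({ type: 'tag', name, attribs, children });
const sampleLi = tag('li', {}, [
  tag('div', { class: 'gl-i-wrap' }, [
    tag('div', { class: 'p-price' }, [tag('strong', {}, [tag('i', {}, [text('1999.00')])])]),
    tag('div', { class: 'p-name p-name-type-2' }, [tag('a', {}, [tag('em', {}, [text('Sample phone')])])]),
  ]),
]);

console.log(extractProduct(sampleLi));
```

Inside the crawler callback, the same function could be called as extractProduct(dataItem) for each li, assuming the page markup still matches these class names.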

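One problem remains.

For example, when the JD product URL is first opened, the page source contains a list of 30 products. Scrolling down appends more products automatically, and by the time the bottom of the page is reached the source contains 60.

(I have only just started looking into scraping with Node.) In the example above, only the first screen of data is captured. How can content that loads dynamically on scroll be crawled? If anyone knows, please leave a comment. Thanks in advance.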
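One common approach, offered here only as a hedged sketch, is to drive a real browser so the page's own scripts load the extra items. The code below assumes Puppeteer, a separate package (npm install puppeteer) that the original script does not use. The function names scrollUntilStable and crawlWithBrowser are hypothetical, the selectors are taken from the script above, and none of this has been verified against JD's current page.

```javascript
// Scroll to the bottom repeatedly until the number of matched items stops growing.
// `page` only needs an evaluate(fn, ...args) method returning a Promise, as Puppeteer's Page has.
async function scrollUntilStable(page, itemSelector, { maxRounds = 10, waitMs = 1000 } = {}) {
  let previousCount = -1;
  for (let round = 0; round < maxRounds; round++) {
    // Runs in the browser: scroll down, then count the items currently in the DOM.
    const count = await page.evaluate(selector => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.querySelectorAll(selector).length;
    }, itemSelector);
    if (count === previousCount) return count; // nothing new was appended since the last round
    previousCount = count;
    // Give the page time to fetch and render the next batch.
    await new Promise(resolve => setTimeout(resolve, waitMs));
  }
  return previousCount;
}

// Assumption: Puppeteer is installed. It is required lazily so the helper above has no dependency.
async function crawlWithBrowser(url) {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    await scrollUntilStable(page, '.goods-list-v2 li');
    // Read name and price from every product that is now in the DOM.
    return await page.$$eval('.goods-list-v2 li', items => items.map(li => ({
      name: (li.querySelector('.p-name em') || {}).textContent || '',
      price: (li.querySelector('.p-price i') || {}).textContent || '',
    })));
  } finally {
    await browser.close();
  }
}
```

Calling crawlWithBrowser(PATH_URL) would then resolve to an array of name and price objects. The waitMs delay is a guess: if the site takes longer than that to append items, the loop may stop too early, so the value may need tuning.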
