Introduction to goquery, a Go package for parsing web pages

Installation

Loading the page

Getting the Document object

Selecting elements

Methods provided by the Selection type

goquery GitHub repository: https://github.com/PuerkitoBio/goquery

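Installation

goquery depends on Go's net/html package (golang.org/x/net/html) and on the CSS selector library cascadia. With a working network connection, go get fetches both dependencies automatically; installing net/html by hand should only be necessary if that download fails.
Run: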

go get github.com/PuerkitoBio/goquery

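After that you may see errors about failing to fetch packages under golang.org/x. That is probably because golang.org is blocked on your network (although I'm not sure that was really the cause). Search for the error message if you run into it; I didn't record the exact error at the time.

After that, the goquery package should be ready to use.

I won't go into the syntax in much detail here; let's go straight to usage.

First, import the package: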

import "github.com/PuerkitoBio/goquery"

Loading the page

I'll just use the official example here, since I'm lazy.

  // Request the HTML page
  res, err := http.Get("http://metalsucks.net")
  if err != nil {
      // Handle the error
      log.Fatal(err)
  }
  defer res.Body.Close()
  if res.StatusCode != 200 {
      log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
  }

Getting the Document object

There are several ways to obtain a Document object; this is one of the more common ones.

  // Load the HTML document
  doc, err := goquery.NewDocumentFromReader(res.Body)
  if err != nil {
      log.Fatal(err)
  }

Selecting elements

The selector syntax is simply CSS selector syntax, similar to what jsoup uses.

  // Find the review items
  doc.Find(".sidebar-reviews article .content-block").Each(func(i int, s *goquery.Selection) {
      // For each item found, get the band and title
      band := s.Find("a").Text()
      title := s.Find("i").Text()
      fmt.Printf("Review %d: %s - %s\n", i, band, title)
  })

The content being scraped is what is shown in the figure.

Output

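Methods provided by the Selection type

These are the most important, core methods for parsing a page.

1) Array-like positional operations

- Eq(index int) *Selection //reduces the selection to the element at the given index

- First() *Selection //reduces the selection to its first element

- Last() *Selection //reduces the selection to its last element

- Next() *Selection //gets the immediately following sibling of each element

- NextAll() *Selection //gets all following siblings of each element

- Prev() *Selection //gets the immediately preceding sibling of each element

- Get(index int) *html.Node //returns the underlying node at the given index

- Index() int //returns the position of the selection's first element relative to its siblings

- Slice(start, end int) *Selection //reduces the selection to the elements in the given index range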

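2) Expanding the Selection's set (adding nodes to the selection)

- Add(selector string) *Selection //adds the nodes matched by the selector to the current set

- AndSelf() *Selection //adds the previous set of elements on the stack to the current set (deprecated alias of AddBack())

- Union() *Selection //which is an alias for AddSelection()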

3) Filtering methods, which reduce the set of nodes

- End() *Selection

- Filter…() //filters the set

- Has…()

- Intersection() //which is an alias of FilterSelection()

- Not…()

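4) Iterating over the selected nodes

- Each(f func(int, *Selection)) *Selection //iterates over every element

- EachWithBreak(f func(int, *Selection) bool) *Selection //iterates, stopping when the callback returns false

- Map(f func(int, *Selection) string) (result []string) //returns a slice of strings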

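5) Modifying the document

- After…()            //inserts content after each matched element
- Append…()         //appends the specified content to the end of each element in the matched set
- Before…()          //inserts content before each matched element
- Clone()             //creates a copy of the matched nodes
- Empty()            //removes all child nodes
- Prepend…()
- Remove…()
- ReplaceWith…()
- Unwrap()
- Wrap…()
- WrapAll…()
- WrapInner…()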

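6) Inspecting or getting node property values

- Attr(), RemoveAttr(), SetAttr()  //get, remove, and set an attribute's value
- AddClass(), HasClass(), RemoveClass(), ToggleClass()
- Html()  //gets the HTML contents of the first element in the selection
- Length() //returns the number of elements in the Selection
- Size(), which is an alias for Length()
- Text()  //gets the combined text of the selected elements, including their descendants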

7) Querying or reflecting a node's identity

- Contains() //reports whether the given node is inside one of the selection's nodes
- Is…()

8) Moving around the document tree (the commonly used node-finding methods)

- Children…()
- Contents()
- Find…()
- Next…()
- Parent[s]…()
- Prev…()
- Siblings…()