Crux

Crux parses Web pages to identify the crux of an article (its essential points) and discards the fluff. The library consists of multiple independent APIs, so you can pick and choose the ones you want to use. The APIs are kept independent so that, in an Android app, ProGuard or other minification tools can strip out the parts you don’t use.

Article Extraction API

Rich formatted content available, not just plain text.

Support for more sites & better parsing overall.

Support for more metadata formats: OpenGraph, Twitter Cards, Schema.org.

Small footprint and code size: JSoup is the only required dependency.

Fewer setters/getters, to keep the method count low (this is important for Android).

The ability to use HTTP libraries other than the default HttpURLConnection, such as OkHttp, under the hood.

Cleaner, leaner code (compared to other libraries not optimized for Android).

First-class support for importing into Android Studio projects via Gradle.

Continuous integration with unit tests and golden file tests.

Sample Code

In a background thread, make a network request and obtain the rawHTML of the page you would like to analyze.

String url = "https://example.com/article.html";
String rawHTML = "<html><body><h1>This is an article</h1></body></html>";
Article article = ArticleExtractor.with(url, rawHTML)
    .extractMetadata()
    .extractContent()  // If you only need metadata, you can skip `.extractContent()`
    .article();

On the UI thread:

// Use article.document, article.title, etc.
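The background-thread step above can be sketched with the JDK alone. This is a minimal sketch, not part of Crux: the class `HtmlFetcher` and its methods are hypothetical names, the page is assumed to be UTF-8, and HTTP status handling is omitted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical helper for illustration; not part of the Crux API.
final class HtmlFetcher {
  private HtmlFetcher() {}

  /** Reads the whole stream into a String, assuming UTF-8 content. */
  static String readAll(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
      out.write(buffer, 0, n);
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  /** Runs the download on the given executor so the calling thread is not blocked. */
  static Future<String> fetchInBackground(ExecutorService executor, URL url) {
    return executor.submit(() -> {
      try (InputStream in = url.openConnection().getInputStream()) {
        return readAll(in);
      }
    });
  }
}
```

The returned `Future` yields the `rawHTML` string to pass to `ArticleExtractor.with(url, rawHTML)`. On Android, the extraction would also run on the background thread, with only the final `Article` handed to the UI thread.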

Image URL Extractor API

From a single DOM Element root, the Image URL API inspects the sub-tree and returns the best image URL candidate available within it. It does this by scanning the DOM tree for interesting src and style attributes.

All URLs are resolved as absolute URLs, even if the HTML contained relative URLs.

ImageUrlExtractor.with(url, domElement).findImage().imageUrl();
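The two ideas described above, pulling a URL out of a style attribute and resolving it against the page URL, might look roughly like this. It is an illustrative sketch using only the JDK, not Crux's actual implementation; `UrlResolutionSketch`, `firstCssUrl` and `toAbsolute` are hypothetical names.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; Crux's real logic may differ.
final class UrlResolutionSketch {
  private UrlResolutionSketch() {}

  // Matches url(...) inside an inline style, with optional single or double quotes.
  private static final Pattern CSS_URL =
      Pattern.compile("url\\(\\s*['\"]?([^'\")]+?)['\"]?\\s*\\)");

  /** Returns the first url(...) value found in a style attribute, or null if there is none. */
  static String firstCssUrl(String style) {
    Matcher m = CSS_URL.matcher(style);
    return m.find() ? m.group(1) : null;
  }

  /**
   * Resolves a possibly relative URL against the page URL.
   * URI.create throws IllegalArgumentException for malformed input (e.g. unescaped spaces).
   */
  static String toAbsolute(String pageUrl, String candidate) {
    return URI.create(pageUrl).resolve(candidate).toString();
  }
}
```

For instance, a style of `background-image: url('/img/hero.jpg');` on the page `https://example.com/news/article.html` would resolve to `https://example.com/img/hero.jpg`, while a candidate that is already absolute is returned unchanged.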

Anchor Links Extractor API

From a single DOM Element root, the Anchor Links API inspects the sub-tree and returns the best link URL candidate available within it. It does this by scanning the DOM tree for interesting href attributes.

All URLs are resolved as absolute URLs, even if the HTML contained relative URLs.

LinkUrlExtractor.with(url, domElement).findLink().linkUrl();

URL Heuristics API

This API examines a given URL (without connecting to the server), and returns heuristically-determined answers to questions such as:

Is this URL likely a video URL?

Is this URL likely an image URL?

Is this URL likely an audio URL?

Is this URL likely an executable URL?

Is this URL likely an archive URL?
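One plausible way to answer such questions without any network access is to look at the file extension in the URL path. The sketch below assumes that approach; the class name, method names and extension lists are illustrative and are not taken from Crux.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; Crux's actual rules and extension lists may differ.
final class UrlHeuristicsSketch {
  private UrlHeuristicsSketch() {}

  private static final Set<String> IMAGE =
      new HashSet<>(Arrays.asList("jpg", "jpeg", "png", "gif", "webp"));
  private static final Set<String> VIDEO =
      new HashSet<>(Arrays.asList("mp4", "webm", "mkv", "avi", "mov"));
  private static final Set<String> ARCHIVE =
      new HashSet<>(Arrays.asList("zip", "gz", "tar", "rar", "7z"));

  /** Lower-cased extension of the last path segment, or "" when there is none. */
  static String extension(String url) {
    String path = URI.create(url).getPath(); // excludes query string and fragment
    if (path == null) {
      return "";
    }
    int slash = path.lastIndexOf('/');
    int dot = path.lastIndexOf('.');
    // A dot before the last slash belongs to a directory name, not a file extension.
    return dot > slash ? path.substring(dot + 1).toLowerCase(Locale.ROOT) : "";
  }

  static boolean isLikelyImage(String url) {
    return IMAGE.contains(extension(url));
  }

  static boolean isLikelyVideo(String url) {
    return VIDEO.contains(extension(url));
  }

  static boolean isLikelyArchive(String url) {
    return ARCHIVE.contains(extension(url));
  }
}
```

Because only the URL string is inspected, the answers are guesses: a URL ending in `.jpg` may still serve an HTML page, which is why the questions are phrased as "likely".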

This API also supports resolving redirects for certain well-known redirectors, with the precondition that the target URL be available as part of the candidate URL. In other words, this API cannot resolve redirectors that perform an HTTP 301 redirect, because following those requires connecting to the server.

CruxURL cruxUrl = CruxURL.parse("https://example.com/article.html");
cruxUrl.resolveRedirects();
cruxUrl.isLikelyArticle();  // Returns true.
cruxUrl.isLikelyImage();  // Returns false.
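To make the redirect precondition concrete, here is a sketch of unwrapping a redirector whose target is embedded in a query parameter. The host `redirect.example.com`, the path `/out` and the parameter name `url` are made up for illustration; the actual redirectors that Crux recognizes are not listed here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Hypothetical sketch; not Crux's implementation.
final class RedirectSketch {
  private RedirectSketch() {}

  /** Returns the decoded value of the named query parameter, or null if absent. */
  static String queryParam(URI uri, String name) {
    String query = uri.getRawQuery();
    if (query == null) {
      return null;
    }
    for (String pair : query.split("&")) {
      int eq = pair.indexOf('=');
      if (eq > 0 && pair.substring(0, eq).equals(name)) {
        try {
          return URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
        } catch (UnsupportedEncodingException e) {
          throw new IllegalStateException(e); // UTF-8 is always supported
        }
      }
    }
    return null;
  }

  /** Unwraps the (assumed) redirector format; any other URL is returned unchanged. */
  static String resolveRedirect(String url) {
    URI uri = URI.create(url);
    if ("redirect.example.com".equals(uri.getHost()) && "/out".equals(uri.getPath())) {
      String target = queryParam(uri, "url");
      if (target != null) {
        return target;
      }
    }
    return url;
  }
}
```

A short link that carries no target in its own text gives this approach nothing to extract, which matches the HTTP 301 limitation described above.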

Usage

Include Crux in your project, then see sample code for each API provided above.

Crux uses semantic versioning. If the API changes, then the major version will be incremented. Upgrading from one minor version to the next minor version within the same major version should not require any client code to be modified.

Import Crux via Gradle

Project/build.gradle:

allprojects {
  repositories {
    maven { url "https://jitpack.io" }
  }
}

Module/build.gradle:

From the Releases page, copy the latest version number & use it below:

dependencies {
  compile 'com.github.chimbori:crux:'
}

History

Crux began as a fork of Snacktory with the goal of making it more performant on Android devices, but it has quickly gained several new features that are not available in Snacktory.

Snacktory (and thus Crux) borrows ideas and test cases from Goose and JReadability.

License

Copyright 2016, Chimbori, makers of Hermit, the Lite Apps Browser.

Licensed under the Apache License, Version 2.0 (the "License");

you may not use this file except in compliance with the License.

You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software

distributed under the License is distributed on an "AS IS" BASIS,

WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and

limitations under the License.
