Continuing with Programming Collective Intelligence (PCI), the next exercise was to use distance scores to group a list of blogs based on the words used in each blog.
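To make "distance score" concrete, here is a minimal sketch that assumes Euclidean distance over per-blog word counts (the measure k-means clustering is usually built on). The class name `WordDistanceSketch` and the sample counts are made up for illustration and are not part of the code in this post.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class WordDistanceSketch {

    /** Euclidean distance between two word-count maps; a missing word counts as 0. */
    static double euclidean(Map<String, Double> a, Map<String, Double> b) {
        // Compare over the union of both vocabularies.
        Set<String> words = new TreeSet<>(a.keySet());
        words.addAll(b.keySet());
        double sum = 0.0;
        for (String word : words) {
            double diff = a.getOrDefault(word, 0.0) - b.getOrDefault(word, 0.0);
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Double> blogA = new TreeMap<>();
        blogA.put("java", 3.0);
        blogA.put("agile", 1.0);

        Map<String, Double> blogB = new TreeMap<>();
        blogB.put("agile", 5.0); // "java" does not appear in blogB

        System.out.println(euclidean(blogA, blogB)); // prints 5.0
    }
}
```

The smaller the distance, the more similar the vocabulary of the two blogs.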

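I had already settled on Encog as the framework for the AI / machine learning algorithms; for this exercise I also needed an RSS reader and an HTML parser.

The 2 libraries I ended up using were:

  1. ROME
  2. JSoup

For other general utilities and collection manipulation I used:

  • Google Guava

I kept the list of blogs short and included some of the software blogs I follow, just to keep testing quick. I had to change the implementation from PCI a little, but still got the desired result.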

Blogs used:

  • http://blog.guykawasaki.com/index.rdf
  • http://blog.outer-court.com/rss.xml
  • http://flagrantdisregard.com/index.php/feed/
  • http://gizmodo.com/index.xml
  • http://googleblog.blogspot.com/rss.xml
  • http://radar.oreilly.com/index.rdf
  • http://www.wired.com/rss/index.xml
  • http://feeds.feedburner.com/codinghorror
  • http://feeds.feedburner.com/joelonsoftware
  • http://martinfowler.com/feed.atom
  • http://www.briandupreez.net/feeds/posts/default

For the implementation I just used a main class and a reader class:

package net.briandupreez.pci.data;

import com.google.common.base.Predicates;
import com.google.common.collect.Collections2;
import com.sun.syndication.feed.synd.SyndCategoryImpl;
import com.sun.syndication.feed.synd.SyndContent;
import com.sun.syndication.feed.synd.SyndEntryImpl;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URL;
import java.util.*;

public class FeedReader {

    // Adds every word found in the feed at the given URL to blogWordList and returns it.
    @SuppressWarnings("unchecked")
    public static Set<String> determineAllUniqueWords(final String url, final Set<String> blogWordList) {
        boolean ok = false;
        try {
            URL feedUrl = new URL(url);
            SyndFeedInput input = new SyndFeedInput();
            SyndFeed feed = input.build(new XmlReader(feedUrl));
            final List<SyndEntryImpl> entries = feed.getEntries();
            for (final SyndEntryImpl entry : entries) {
                blogWordList.addAll(cleanAndSplitString(entry.getTitle()));
                blogWordList.addAll(doCategories(entry));
                blogWordList.addAll(doDescription(entry));
                blogWordList.addAll(doContent(entry));
            }
            ok = true;
        } catch (Exception ex) {
            ex.printStackTrace();
            System.out.println("ERROR: " + url + "\n" + ex.getMessage());
        }
        if (!ok) {
            System.out.println("FeedReader reads and prints any RSS/Atom feed type.");
            System.out.println("The first parameter must be the URL of the feed to read.");
        }
        return blogWordList;
    }

    // Collects the words from all content sections of an entry.
    @SuppressWarnings("unchecked")
    private static List<String> doContent(final SyndEntryImpl entry) {
        List<String> blogWordList = new ArrayList<>();
        final List<SyndContent> contents = entry.getContents();
        if (contents != null) {
            for (final SyndContent syndContent : contents) {
                // Check the content type, as doDescription does. The original called
                // getMode(), which holds the Atom 0.3 mode rather than a MIME type.
                if ("text/html".equals(syndContent.getType())) {
                    blogWordList.addAll(stripHtmlAndAddText(syndContent));
                } else {
                    blogWordList.addAll(cleanAndSplitString(syndContent.getValue()));
                }
            }
        }
        return blogWordList;
    }

    // Collects the words from the entry description, stripping HTML if needed.
    private static List<String> doDescription(final SyndEntryImpl entry) {
        final List<String> blogWordList = new ArrayList<>();
        final SyndContent description = entry.getDescription();
        if (description != null) {
            if ("text/html".equals(description.getType())) {
                blogWordList.addAll(stripHtmlAndAddText(description));
            } else {
                blogWordList.addAll(cleanAndSplitString(description.getValue()));
            }
        }
        return blogWordList;
    }

    // Category names are used as words too.
    @SuppressWarnings("unchecked")
    private static List<String> doCategories(final SyndEntryImpl entry) {
        final List<String> blogWordList = new ArrayList<>();
        final List<SyndCategoryImpl> categories = entry.getCategories();
        for (final SyndCategoryImpl category : categories) {
            blogWordList.add(category.getName().toLowerCase());
        }
        return blogWordList;
    }

    // Parses the HTML with JSoup and collects the text of every element.
    private static List<String> stripHtmlAndAddText(final SyndContent description) {
        String html = description.getValue();
        Document document = Jsoup.parse(html);
        Elements elements = document.getAllElements();
        final List<String> allWords = new ArrayList<>();
        for (final Element element : elements) {
            allWords.addAll(cleanAndSplitString(element.text()));
        }
        return allWords;
    }

    // Lower-cases the input, removes punctuation and digits, and splits on whitespace.
    private static List<String> cleanAndSplitString(final String input) {
        if (input != null) {
            final String[] dic = input.toLowerCase()
                    .replaceAll("\\p{Punct}", "")
                    .replaceAll("\\p{Digit}", "")
                    .split("\\s+");
            return Arrays.asList(dic);
        }
        return new ArrayList<>();
    }

    // Counts how often each word in blogWords occurs in the feed at the given URL.
    @SuppressWarnings("unchecked")
    public static Map<String, Double> countWords(final String url, final Set<String> blogWords) {
        final Map<String, Double> resultMap = new TreeMap<>();
        try {
            URL feedUrl = new URL(url);
            SyndFeedInput input = new SyndFeedInput();
            SyndFeed feed = input.build(new XmlReader(feedUrl));
            final List<SyndEntryImpl> entries = feed.getEntries();
            final List<String> allBlogWords = new ArrayList<>();
            for (final SyndEntryImpl entry : entries) {
                allBlogWords.addAll(cleanAndSplitString(entry.getTitle()));
                allBlogWords.addAll(doCategories(entry));
                allBlogWords.addAll(doDescription(entry));
                allBlogWords.addAll(doContent(entry));
            }
            for (final String word : blogWords) {
                resultMap.put(word, (double) Collections2.filter(allBlogWords, Predicates.equalTo(word)).size());
            }
        } catch (Exception ex) {
            ex.printStackTrace();
            System.out.println("ERROR: " + url + "\n" + ex.getMessage());
        }
        return resultMap;
    }
}
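The cleaning and counting steps in the reader can be tried without ROME, JSoup or Guava. The following is a minimal sketch using only the JDK: `WordCountSketch`, `cleanAndSplit` and `count` are illustrative names, and the single-pass `Map.merge` count is a swapped-in alternative to the Guava filter used in `countWords`, not the author's code. It also trims the cleaned text first, because `split("\\s+")` on a string with leading whitespace yields a leading empty token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {

    /** Lower-cases the text, drops punctuation and digits, and splits on whitespace. */
    static List<String> cleanAndSplit(String input) {
        if (input == null) {
            return new ArrayList<>();
        }
        String cleaned = input.toLowerCase()
                .replaceAll("\\p{Punct}", "")
                .replaceAll("\\p{Digit}", "")
                .trim();
        if (cleaned.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(cleaned.split("\\s+")));
    }

    /** Counts every word in one pass; the TreeMap keeps the words in sorted order. */
    static Map<String, Double> count(List<String> words) {
        Map<String, Double> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1.0, Double::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = cleanAndSplit("Hello, World! Hello Java 8.");
        System.out.println(words);        // prints [hello, world, hello, java]
        System.out.println(count(words)); // prints {hello=2.0, java=1.0, world=1.0}
    }
}
```

Keeping the counts in a sorted map matters for the clustering step: every blog's values then line up in the same word order when they are turned into a vector.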

Main:

package net.briandupreez.pci.data;

import com.google.common.base.Predicates;
import com.google.common.collect.Maps;
import com.google.common.io.Resources;
import com.google.common.primitives.Doubles;
import org.encog.ml.MLCluster;
import org.encog.ml.data.MLDataPair;
import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLData;
import org.encog.ml.data.basic.BasicMLDataPair;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.ml.kmeans.KMeansClustering;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

public class FeedReaderMain {

    public static void main(String[] args) {
        final FeedReaderMain feedReaderMain = new FeedReaderMain();
        try {
            feedReaderMain.run();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void run() throws IOException {
        final String file = Resources.getResource("short-feedlist.txt").getFile();
        final Set<String> blogWords = determineWordCompleteList(file);
        final Map<String, Map<String, Double>> blogWordCount = countWordsPerBlog(file, blogWords);
        // strip out the outlying words
        stripOutlyingWords(blogWords, blogWordCount);
        performCusteringAndDisplay(blogWordCount);
    }

    private void performCusteringAndDisplay(final Map<String, Map<String, Double>> blogWordCount) {
        final BasicMLDataSet set = new BasicMLDataSet();
        // Remember which word-count vector belongs to which blog URL.
        final Map<String, List<Double>> inputMap = new HashMap<>();
        for (final Map.Entry<String, Map<String, Double>> entry : blogWordCount.entrySet()) {
            final Map<String, Double> mainValues = entry.getValue();
            final double[] elements = Doubles.toArray(mainValues.values());
            List<Double> listInput = Doubles.asList(elements);
            inputMap.put(entry.getKey(), listInput);
            set.add(new BasicMLData(elements));
        }

        final KMeansClustering kmeans = new KMeansClustering(3, set);
        kmeans.iteration(150);

        // Display the cluster
        int i = 1;
        for (final MLCluster cluster : kmeans.getClusters()) {
            System.out.println("*** Cluster " + (i++) + " ***");
            final MLDataSet ds = cluster.createDataSet();
            final MLDataPair pair = BasicMLDataPair.createPair(ds.getInputSize(), ds.getIdealSize());
            for (int j = 0; j < ds.getRecordCount(); j++) {
                ds.getRecord(j, pair);
                // Look the blog URL up again by its word-count vector.
                List<Double> listInput = Doubles.asList(pair.getInputArray());
                System.out.println(Maps.filterValues(inputMap, Predicates.equalTo(listInput)).keySet().toString());
            }
        }
    }

    // Reads one feed URL per line and counts the known words for each blog.
    private Map<String, Map<String, Double>> countWordsPerBlog(String file, Set<String> blogWords) throws IOException {
        final Map<String, Map<String, Double>> blogWordCount = new HashMap<>();
        // try-with-resources closes the reader, which the original left open
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = reader.readLine()) != null) {
                final Map<String, Double> wordCounts = FeedReader.countWords(line, blogWords);
                blogWordCount.put(line, wordCounts);
            }
        }
        return blogWordCount;
    }

    // Builds the complete set of unique words over all feeds in the file.
    private Set<String> determineWordCompleteList(final String file) throws IOException {
        Set<String> blogWords = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = reader.readLine()) != null) {
                blogWords = FeedReader.determineAllUniqueWords(line, blogWords);
                System.out.println("Size: " + blogWords.size());
            }
        }
        return blogWords;
    }

    // Removes words that are too rare, too common or too short to be useful.
    private void stripOutlyingWords(final Set<String> blogWords, final Map<String, Map<String, Double>> blogWordCount) {
        final Iterator<String> wordIter = blogWords.iterator();
        final double listSize = blogWords.size();
        while (wordIter.hasNext()) {
            final String word = wordIter.next();
            double wordCount = 0;
            for (final Map<String, Double> values : blogWordCount.values()) {
                wordCount += values.get(word) != null ? values.get(word) : 0;
            }
            double percentage = (wordCount / listSize) * 100;
            if (percentage < 0.1 || percentage > 20 || word.length() < 3) {
                wordIter.remove();
                for (final Map<String, Double> values : blogWordCount.values()) {
                    values.remove(word);
                }
            } else {
                System.out.println("\t keeping: " + word + " Percentage:" + percentage);
            }
        }
    }
}
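The word filter in `stripOutlyingWords` is easier to follow on a toy data set. The sketch below mirrors its rule (total count of a word across all blogs, divided by the vocabulary size, times 100) but takes the thresholds as parameters. `OutlierWordFilterSketch`, its parameter names, the sample counts and the thresholds 10 / 100 / 3 are assumptions chosen so the tiny example produces a visible effect; the code above uses 0.1, 20 and 3.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class OutlierWordFilterSketch {

    /** Removes words that are too rare, too common or too short, from the set and from every blog. */
    static void strip(Set<String> words, Map<String, Map<String, Double>> countsPerBlog,
                      double minPercent, double maxPercent, int minLength) {
        final double vocabularySize = words.size();
        for (Iterator<String> it = words.iterator(); it.hasNext(); ) {
            String word = it.next();
            double total = 0;
            for (Map<String, Double> counts : countsPerBlog.values()) {
                total += counts.getOrDefault(word, 0.0);
            }
            // Relative to the vocabulary size, so the value can exceed 100.
            double percentage = total / vocabularySize * 100;
            if (percentage < minPercent || percentage > maxPercent || word.length() < minLength) {
                it.remove();
                for (Map<String, Double> counts : countsPerBlog.values()) {
                    counts.remove(word);
                }
            }
        }
    }

    /** Helper that builds the toy counts for one blog. */
    static Map<String, Double> counts(double the, double java, double of, double refactoring) {
        Map<String, Double> m = new HashMap<>();
        m.put("the", the);
        m.put("java", java);
        m.put("of", of);
        m.put("refactoring", refactoring);
        return m;
    }

    public static void main(String[] args) {
        Set<String> words = new TreeSet<>(Arrays.asList("the", "java", "of", "refactoring"));
        Map<String, Map<String, Double>> countsPerBlog = new HashMap<>();
        countsPerBlog.put("blogA", counts(10, 2, 1, 1));
        countsPerBlog.put("blogB", counts(8, 1, 3, 0));

        strip(words, countsPerBlog, 10, 100, 3);
        System.out.println(words); // prints [java, refactoring]
    }
}
```

Here "the" is dropped for being too frequent and "of" for being too short, which is the kind of noise the filter is meant to remove before clustering.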

Results:

*** Cluster 1 ***
[http://www.briandupreez.net/feeds/posts/default]
*** Cluster 2 ***
[http://blog.guykawasaki.com/index.rdf]
[http://radar.oreilly.com/index.rdf]
[http://googleblog.blogspot.com/rss.xml]
[http://blog.outer-court.com/rss.xml]
[http://gizmodo.com/index.xml]
[http://flagrantdisregard.com/index.php/feed/]
[http://www.wired.com/rss/index.xml]
*** Cluster 3 ***
[http://feeds.feedburner.com/joelonsoftware]
[http://feeds.feedburner.com/codinghorror]
[http://martinfowler.com/feed.atom]
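The three groups above come from Encog's `KMeansClustering`. As a rough illustration of what k-means does with the word-count vectors, here is a plain-Java sketch of the assign/update loop. It is not Encog's implementation: `KMeansSketch`, the fixed starting centroids and the two-dimensional sample points are assumptions made to keep the example small and deterministic.

```java
import java.util.Arrays;

class KMeansSketch {

    /** Returns the index of the centroid closest to the point (squared Euclidean distance). */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Runs a fixed number of assign/update rounds and returns each point's cluster index. */
    static int[] cluster(double[][] points, double[][] initialCentroids, int iterations) {
        double[][] centroids = new double[initialCentroids.length][];
        for (int c = 0; c < centroids.length; c++) {
            centroids[c] = initialCentroids[c].clone();
        }
        int[] assignment = new int[points.length];
        for (int round = 0; round < iterations; round++) {
            // Assignment step: attach every point to its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                assignment[p] = nearest(points[p], centroids);
            }
            // Update step: move every centroid to the mean of its members.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[points[0].length];
                int members = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] != c) {
                        continue;
                    }
                    members++;
                    for (int d = 0; d < sum.length; d++) {
                        sum[d] += points[p][d];
                    }
                }
                if (members == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                for (int d = 0; d < sum.length; d++) {
                    centroids[c][d] = sum[d] / members;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 0}, {2, 1}, {9, 8}, {10, 10}};
        double[][] start = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(points, start, 10))); // prints [0, 0, 1, 1]
    }
}
```

In the blog example each point would have one dimension per kept word rather than two, and k is 3.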
Reference: Blog categorisation using Encog, ROME, JSoup and Google Guava from our JCG partner Brian Du Preez at the Zen in the art of IT blog.

Original article: https://www.javacodegeeks.com/2013/06/blog-categorisation-using-encog-rome-jsoup-and-google-guava.html
