Problem link: Web Crawler Multithreaded - LeetCode


Given a url startUrl and an interface HtmlParser, implement a Multi-threaded web crawler to crawl all links that are under the same hostname as startUrl.

Return all urls obtained by your web crawler in any order.

Your crawler should:

  • Start from the page: startUrl
  • Call HtmlParser.getUrls(url) to get all urls from a webpage of given url.
  • Do not crawl the same link twice.
  • Explore only the links that are under the same hostname as startUrl.

As shown in the example url above, the hostname is example.org. For simplicity's sake, you may assume all urls use the http protocol without any port specified. For example, the urls http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while the urls http://example.org/test and http://example.com/abc are not under the same hostname.
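The hostname rule above can be sketched as a small helper. This is only an illustration under the problem's stated assumptions (every url starts with "http://" and has no port); the class and method names HostnameDemo, hostnameOf and sameHost are made up for this example.

```java
class HostnameDemo {
    // Returns the hostname of an "http://" URL: the text between the scheme and the next '/'.
    static String hostnameOf(String url) {
        int start = "http://".length();      // 7; assumes the http scheme and no port
        int end = url.indexOf('/', start);   // first '/' after the scheme, or -1 if there is no path
        return end == -1 ? url.substring(start) : url.substring(start, end);
    }

    // Two URLs are "under the same hostname" when their hostnames are equal.
    static boolean sameHost(String a, String b) {
        return hostnameOf(a).equals(hostnameOf(b));
    }
}
```

For instance, sameHost("http://leetcode.com/problems", "http://leetcode.com/contest") is true, while sameHost("http://example.org/test", "http://example.com/abc") is false.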

The HtmlParser interface is defined as such:

interface HtmlParser {
    // Return a list of all urls from a webpage of given url.
    // This is a blocking call, that means it will do HTTP request and return when this request is finished.
    public List<String> getUrls(String url);
}

Note that getUrls(String url) simulates performing an HTTP request. You can treat it as a blocking function call which waits for an HTTP request to finish. It is guaranteed that getUrls(String url) will return the urls within 15ms. Single-threaded solutions will exceed the time limit, so can your multi-threaded web crawler do better?

Below are two examples explaining the functionality of the problem, for custom testing purposes you’ll have three variables urls, edges and startUrl. Notice that you will only have access to startUrl in your code, while urls and edges are not directly accessible to you in code.

Follow up:

  • Assume we have 10,000 nodes and 1 billion URLs to crawl. We will deploy the same software onto each node. The software can know about all the nodes. We have to minimize communication between machines and make sure each node does equal amount of work. How would your web crawler design change?
  • What if one node fails or does not work?
  • How do you know when the crawler is done?

Example 1:

Input:
urls = ["http://news.yahoo.com","http://news.yahoo.com/news","http://news.yahoo.com/news/topics/","http://news.google.com","http://news.yahoo.com/us"]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"
Output: ["http://news.yahoo.com","http://news.yahoo.com/news","http://news.yahoo.com/news/topics/","http://news.yahoo.com/us"]

Example 2:

Input:
urls = ["http://news.yahoo.com","http://news.yahoo.com/news","http://news.yahoo.com/news/topics/","http://news.google.com"]
edges = [[0,2],[2,1],[3,2],[3,1],[3,0]]
startUrl = "http://news.google.com"
Output: ["http://news.google.com"]
Explanation: The startUrl links to all other pages that do not share the same hostname.
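For local testing, the urls and edges variables from the examples can be turned into a stand-in parser. The sketch below is not the judge's implementation: MockHtmlParser is a hypothetical name, and it assumes that each edge is a pair [from, to] meaning the page urls[from] links to urls[to], which matches the outputs of both examples. The 15 ms sleep only imitates the blocking call described in the statement.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Same shape as the interface in the problem statement.
interface HtmlParser {
    List<String> getUrls(String url);
}

// Hypothetical local stand-in for the judge's parser, built from urls and edges.
class MockHtmlParser implements HtmlParser {
    // Read-only after construction, so concurrent reads from crawler threads are safe.
    private final Map<String, List<String>> links = new HashMap<>();

    MockHtmlParser(String[] urls, int[][] edges) {
        for (String url : urls) {
            links.put(url, new ArrayList<>());
        }
        for (int[] edge : edges) {
            // Assumption: edge = [from, to], i.e. urls[from] contains a link to urls[to].
            links.get(urls[edge[0]]).add(urls[edge[1]]);
        }
    }

    @Override
    public List<String> getUrls(String url) {
        try {
            Thread.sleep(15); // imitate the blocking HTTP request
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Return a copy so callers cannot modify the link table.
        return new ArrayList<>(links.getOrDefault(url, new ArrayList<>()));
    }
}
```

With the data of Example 1, getUrls("http://news.yahoo.com/news/topics/") returns the two pages reached by the edges [2,0] and [2,1].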

Constraints:

  • 1 <= urls.length <= 1000
  • 1 <= urls[i].length <= 300
  • startUrl is one of the urls.
  • Hostname label must be from 1 to 63 characters long, including the dots, may contain only the ASCII letters from ‘a’ to ‘z’, digits from ‘0’ to ‘9’ and the hyphen-minus character (’-’).
  • The hostname may not start or end with the hyphen-minus character (’-’).
  • See: https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_hostnames
  • You may assume there’re no duplicates in url library.

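This is an unusual problem: you write a simulated crawler that starts from one page and crawls every page under the same hostname, using the crawler interface that LeetCode provides.

The problem requires a multi-threaded crawler; a single-threaded one will exceed the time limit.

1. A set stores the pages that have already been crawled. It must support concurrent modification from multiple threads, so I use a set backed by ConcurrentHashMap.

2. The List that stores the result also needs to support concurrent access, so I use Collections.synchronizedList.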
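One detail worth illustrating for the visited set: on a concurrent set, add() both checks and inserts in one atomic step and returns true for exactly one caller, whereas a separate contains() followed by add() leaves a window in which two threads can both claim the same URL. The sketch below uses ConcurrentHashMap.newKeySet() (Java 8+), which behaves like the Collections.newSetFromMap form; the class and method names VisitedSetDemo and raceToClaim are made up for this example.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class VisitedSetDemo {
    // Lets `threads` threads race to claim the same URL and returns how many of them succeeded.
    static int raceToClaim(String url, int threads) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        AtomicInteger winners = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                // Atomic check-and-insert: only the first add() of a given URL returns true.
                if (visited.add(url)) {
                    winners.incrementAndGet();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return winners.get();
    }
}
```

However many threads race, only one of them gets true back from add(), so each URL is crawled by a single thread.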

The Java solution is as follows:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class Solution {
    // URLs already claimed by some thread.
    private final Set<String> set = Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
    private final List<String> result = Collections.synchronizedList(new ArrayList<String>());
    // "http://" + hostname of startUrl; written once before any thread starts.
    private String HOSTNAME = null;

    public boolean judgeHostname(String url) {
        int idx = url.indexOf('/', 7); // first '/' after "http://"
        String hostName = (idx != -1) ? url.substring(0, idx) : url;
        return hostName.equals(HOSTNAME);
    }

    private void initHostName(String url) {
        int idx = url.indexOf('/', 7);
        HOSTNAME = (idx != -1) ? url.substring(0, idx) : url;
    }

    public void getUrl(String startUrl, HtmlParser htmlParser) {
        result.add(startUrl);
        List<String> res = htmlParser.getUrls(startUrl);
        List<Thread> threads = new ArrayList<>();
        for (String url : res) {
            // add() returns false if the URL is already present, so check and insert are one
            // atomic step; contains() followed by add() could let two threads crawl the same URL.
            if (judgeHostname(url) && set.add(url)) {
                threads.add(new Thread(() -> getUrl(url, htmlParser)));
            }
        }
        for (Thread thread : threads) {
            thread.start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            e.printStackTrace();
        }
    }

    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        initHostName(startUrl);
        set.add(startUrl);
        getUrl(startUrl, htmlParser);
        return result;
    }
}

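The drawback of this solution is that it may create a large number of short-lived threads while running. A thread pool could run the tasks concurrently instead and reduce the cost of creating threads.

The hardest part of using a thread pool is determining when all tasks have finished, so that the pool can be shut down.

I tried Future and CountDownLatch, but neither worked well. Most importantly, they ran slower than simply creating new threads.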
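For reference, one possible way to detect completion with a thread pool is a java.util.concurrent.Phaser: every task registers itself before it is queued and deregisters when it finishes, and the calling thread waits until no registered task is left. This is only a sketch of that idea, not the approach I submitted, and I make no claim about its speed on the judge. The names PooledCrawler, submit and hostOf and the pool size of 16 are assumptions made for this example; HtmlParser is redeclared so the block compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

// Same shape as the interface in the problem statement.
interface HtmlParser {
    List<String> getUrls(String url);
}

class PooledCrawler {
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final ExecutorService pool = Executors.newFixedThreadPool(16); // arbitrary size (assumption)
    private final Phaser pending = new Phaser(1); // the single initial party is the calling thread
    private String host; // written before the first task is submitted

    private static String hostOf(String url) {
        int idx = url.indexOf('/', 7); // first '/' after "http://"
        return idx == -1 ? url : url.substring(0, idx);
    }

    private void submit(String url, HtmlParser parser) {
        pending.register(); // one party per outstanding task, registered before it is queued
        pool.execute(() -> {
            try {
                for (String next : parser.getUrls(url)) {
                    // add() claims the URL atomically, so each page is submitted once.
                    if (hostOf(next).equals(host) && visited.add(next)) {
                        submit(next, parser);
                    }
                }
            } finally {
                // Child tasks were registered above, so the phase cannot advance too early.
                pending.arriveAndDeregister();
            }
        });
    }

    public List<String> crawl(String startUrl, HtmlParser parser) {
        host = hostOf(startUrl);
        visited.add(startUrl);
        submit(startUrl, parser);
        pending.arriveAndAwaitAdvance(); // returns once every registered task has finished
        pool.shutdown();
        return new ArrayList<>(visited);
    }
}
```

Because a task registers its children before it deregisters itself, the number of unarrived parties only reaches zero when the whole crawl is finished, which is the point at which the pool can be shut down safely.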
