Extracting HTTP Link Addresses from File Content with Regular Expressions

Reposted from: https://www.cnblogs.com/akiradunn/p/5855073.html

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract data from a web page or a local file with regular expressions
public class HtmlAddressCatch {

    public static void main(String[] args) {
        String webaddress = "https://www.zhihu.com/people/Akira_Dunn";
        HtmlAddressCatch.getWebTextContent(webaddress);
        /*String localaddress = "D:\\test\\test.html";
        String targetaddress = "D:\\test\\http.txt";
        HtmlAddressCatch.getLocalTextContent(localaddress, targetaddress);*/
    }

    // Fetch the page at the given HTTP address and write every link found in it to a text file
    public static void getWebTextContent(String webaddress) {
        // Backslashes are doubled because the pattern is written as a Java string literal
        String regex = "https?://\\w+\\.\\w+\\.\\w+";
        Pattern p = Pattern.compile(regex);
        try {
            URL url = new URL(webaddress);
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            // InputStreamReader converts the byte stream into a character stream;
            // BufferedReader buffers those characters and provides readLine().
            // UTF-8 is an assumption here; the original relied on the platform default charset.
            // try-with-resources closes the reader and the output file even if an exception is thrown.
            try (BufferedReader packetreader = new BufferedReader(
                         new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8));
                 FileOutputStream file = new FileOutputStream("D:\\text.txt")) {
                String str;
                // readLine() returns null at end of stream. The original loop called read()
                // before every readLine(), which discarded the first character of each line.
                while ((str = packetreader.readLine()) != null) {
                    Matcher m = p.matcher(str);
                    while (m.find()) {
                        file.write((m.group() + "\r\n").getBytes(StandardCharsets.UTF_8));
                    }
                }
            } finally {
                con.disconnect();
            }
        } catch (IOException e) {
            // MalformedURLException and FileNotFoundException are both subclasses of IOException
            e.printStackTrace();
        }
    }

    // Extract HTTP links or email addresses from a local file such as test.html
    public static void getLocalTextContent(String localaddress, String targetaddress) {
        // String regex = "https?://\\w+\\.\\w+\\.\\w+"; // matches HTTP links
        String regex = "\\w+@\\w+\\.\\w+"; // matches email addresses
        Pattern p = Pattern.compile(regex);
        // Reading line by line replaces the original fixed 200-byte buffer, which could carry
        // stale bytes from a previous read and could split a match across two reads.
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new FileInputStream(localaddress), StandardCharsets.UTF_8));
             FileOutputStream writer = new FileOutputStream(targetaddress)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = p.matcher(line);
                while (m.find()) {
                    writer.write((m.group() + "\r\n").getBytes(StandardCharsets.UTF_8));
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
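
To see what the two patterns in the listing above actually capture, it can help to run them against a string in memory rather than against a network connection or a file. The following is a minimal sketch, not part of the original post: the class name `LinkRegexDemo`, the helper `findAll`, and the sample HTML string are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the two regular expressions come from the listing.
public class LinkRegexDemo {

    // Backslashes are doubled because the patterns are Java string literals.
    static final Pattern HTTP_LINK = Pattern.compile("https?://\\w+\\.\\w+\\.\\w+");
    static final Pattern EMAIL = Pattern.compile("\\w+@\\w+\\.\\w+");

    // Collects every non-overlapping match of the pattern in the text, in order.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up sample input for illustration
        String html = "<a href=\"https://www.zhihu.com/people/Akira_Dunn\">profile</a> "
                + "<a href=\"http://example.com/\">no match</a> "
                + "contact: someone@example.com";
        System.out.println(findAll(HTTP_LINK, html)); // prints [https://www.zhihu.com]
        System.out.println(findAll(EMAIL, html));     // prints [someone@example.com]
    }
}
```

Two limits of the link pattern show up here: it stops at the host name, so the path `/people/Akira_Dunn` is dropped, and it requires three dot-separated parts, so `http://example.com/` is not matched at all.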

Reposted at: https://www.cnblogs.com/sharpest/p/10390026.html
