Parsing HTML Documents Downloaded over HTTP with C++

A friend and I recently wrote BlueCatTools, a batch website query tool. One part of it has to parse, in C++, the HTML documents it downloads over HTTP.

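If you already know this stuff, you don't need my explanation; if you don't, I'm not able to explain it well enough anyway. So just read the code.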

BlueCatTools: a batch query tool for Baidu indexing status

//--caller.cpp--
// To run the program, make sure there is a file named "NIKE新浪竞技风暴_新浪网.htm" in your working directory.
// The run time can be cut by about half if you give a better implementation of the "ofile <<" statement.
#include "HtmlParser.h"
#include <ctime>
#include <functional>
#include <iomanip>
using namespace std;

int main()
{
    clock_t start = clock();
    map<string, link_info> LinkInfo;
    multimap<float, link_info, greater<float> > Sorted;
    string FileName = "NIKE新浪竞技风暴_新浪网.htm";
    if (!HtmlParser(FileName, LinkInfo)) {
        cerr << "cannot parse " << FileName << endl;
        return 1;
    }

    // Re-key the links by Value so they come out in descending order.
    for (map<string, link_info>::iterator miter = LinkInfo.begin(); miter != LinkInfo.end(); ++miter) {
        Sorted.insert(make_pair(miter->second.Value, miter->second));
    }

    ofstream ofile;
    ofile.open("a.txt");
    for (multimap<float, link_info, greater<float> >::iterator miter = Sorted.begin(); miter != Sorted.end(); ++miter) {
        ofile << miter->first << "\t" << setw(50) << left << miter->second.Title << "\t" << miter->second.Link << endl;
    }
    ofile.close();

    cout << clock() - start << endl;    // elapsed clock ticks
    return 0;
}

//--HtmlParser.h--
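#pragma once
#include <cstdio>
#include <cstring>
#include <iostream>
#include <fstream>
#include <string>
#include <map>
using namespace std;

struct link_info
{
    float  Value;
    string Link;
    string Title;
};

const int BUFFERSIZE = 10000;   // bytes read per chunk
const int LOOKUP = 100;         // overlap between consecutive chunks
const int ASIZE = 300;          // assumed maximum length of an <a> tag

// Collapse every run of whitespace in Title into a single '_'.
inline string RepairTitle(const string& Title)
{
    string Result;
    for (string::const_iterator siter = Title.begin(); siter != Title.end(); ++siter) {
        unsigned char ch = *siter;
        if (ch == 0x0d || ch == 0x0a || ch == ' ' || ch == '\t') {
            // Check empty() first: dereferencing rbegin() of an empty string is undefined.
            if (Result.empty() || *Result.rbegin() != '_')
                Result.push_back('_');
        }
        else
            Result.push_back(ch);
    }
    return Result;
}

// Scan FileName for <a ...>...</a> tags and store each link in LinkInfo, keyed by URL.
// Returns false if the file cannot be opened.
inline bool HtmlParser(const string& FileName, map<string, link_info>& LinkInfo)
{
    int i = 2000;                   // earlier links get a higher Value
    char Dst[ASIZE];
    char buffer[BUFFERSIZE + 1];
    string Modified_Line;           // lower-cased copy of buffer, used for searching

    FILE *fp = fopen(FileName.c_str(), "rb");
    if (!fp)
        return false;

    for (;;) {
        size_t ReadIn = fread(buffer, 1, BUFFERSIZE, fp);
        if (ReadIn == 0)
            break;
        buffer[ReadIn] = 0;

        Modified_Line.clear();
        for (char *p = buffer; *p; ++p) {
            unsigned char ch = *p;
            Modified_Line.push_back(static_cast<char>((ch >= 'A' && ch <= 'Z') ? ch + 32 : ch));
        }

        string::size_type pos0;
        string::size_type pos1 = 0;
        while ((pos0 = Modified_Line.find("<a", pos1)) != string::npos) {
            pos1 = Modified_Line.find("</a", pos0);
            if (pos1 == string::npos)
                break;              // no closing tag in this chunk
            if (pos1 - pos0 + 4 >= ASIZE)
                continue;           // make sure that Atag.size() < ASIZE

            memset(Dst, 0, ASIZE);
            string Atag = strncpy(Dst, buffer + pos0, pos1 - pos0 + 4);   // original casing
            string LAtag = Modified_Line.substr(pos0, pos1 - pos0 + 4);   // lower case
            link_info tmpLink;

            {   // Title: the text just before </a>, skipping closing tags nested in the link.
                string::size_type tpos1 = LAtag.find("</a");
                while (tpos1 > 0 && LAtag[tpos1 - 1] == '>') {
                    tpos1 = LAtag.find_last_of("<", tpos1 - 1);
                    if (tpos1 == 0) break;
                }
                string::size_type tpos0 = LAtag.find_last_of(">", tpos1);
                string tmpstr = Atag.substr(tpos0 + 1, tpos1 - tpos0 - 1);
                tmpLink.Title = RepairTitle(tmpstr);
            }

            {   // Link: the href value, delimited by a quote, a space or '>'.
                string::size_type hpos0 = LAtag.find("href");
                if (hpos0 != string::npos)
                    hpos0 = LAtag.find_first_not_of("=\"' ", hpos0 + 4);
                if (hpos0 != string::npos) {
                    string::size_type hpos1 = LAtag.find_first_of("\"' >", hpos0 + 1);
                    tmpLink.Link = Atag.substr(hpos0, hpos1 - hpos0);
                }
                // A tag without href leaves Link empty and is dropped by the filter below.
            }

            tmpLink.Value = (i--) * 0.0005f;
            // Filter: title and link must be longer than 3 characters.
            // map::insert keeps only the first entry per link, so links stay unique.
            if (tmpLink.Title.size() > 3 && tmpLink.Link.size() > 3)
                LinkInfo.insert(make_pair(tmpLink.Link, tmpLink));
        }

        if (ReadIn < static_cast<size_t>(BUFFERSIZE))
            break;                          // short read: end of file reached
        fseek(fp, -LOOKUP, SEEK_CUR);       // step back so the next chunk overlaps this one
    }

    fclose(fp);
    return true;
}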
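
The href and title extraction buried inside HtmlParser is easier to follow when tried on a single tag. Below is a minimal sketch under a few assumptions: `AnchorParts`, `ToLowerAscii` and `ExtractLink` are hypothetical names that are not part of BlueCatTools, the input is one complete `<a ...>...</a>` tag, and the title is simply everything between the opening tag and `</a` (nested markup is not skipped, unlike the parser above).

```cpp
#include <string>

// Hypothetical helper type (not from BlueCatTools): what one <a> tag yields.
struct AnchorParts {
    std::string Link;   // value of the href attribute
    std::string Title;  // text between the opening tag and </a
    bool Ok;            // false if no href or no closing tag was found
};

// ASCII-only lower-casing, so the searches below ignore case while the
// extracted values keep the casing of the original text.
static std::string ToLowerAscii(const std::string& s) {
    std::string out(s);
    for (std::string::size_type i = 0; i < out.size(); ++i) {
        if (out[i] >= 'A' && out[i] <= 'Z')
            out[i] = static_cast<char>(out[i] + 32);
    }
    return out;
}

// Extract href and title from one complete <a ...>...</a> tag.
AnchorParts ExtractLink(const std::string& tag) {
    AnchorParts r;
    r.Ok = false;
    const std::string lower = ToLowerAscii(tag);
    const std::string::size_type npos = std::string::npos;

    // href value: skip '=', quotes and spaces, then read up to the next delimiter.
    std::string::size_type h = lower.find("href");
    if (h == npos) return r;
    std::string::size_type b = lower.find_first_not_of("=\"' ", h + 4);
    if (b == npos) return r;
    std::string::size_type e = lower.find_first_of("\"' >", b);
    if (e == npos) return r;
    r.Link = tag.substr(b, e - b);

    // Title: everything between the end of the opening tag and the last "</a".
    std::string::size_type open_end = lower.find('>', e);
    std::string::size_type close = lower.rfind("</a");
    if (open_end == npos || close == npos || close < open_end) return r;
    r.Title = tag.substr(open_end + 1, close - open_end - 1);
    r.Ok = true;
    return r;
}
```

Calling `ExtractLink` on a tag such as `<A HREF="http://example.com/x.html">Hello World</A>` fills `Link` and `Title` from the original text. Searching in a lower-cased copy while cutting substrings from the original is the same trick the parser uses with `LAtag` and `Atag`.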
