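1. Introduction
- Overall architecture of the search engine
- Tech stack and project environment
- Forward index and inverted index
2. Tag removal and data cleaning module: Parser
- Tag removal: parser.cc
- Code structure of parser.cc
  - EnumFile(): enumerate and filter html files
  - ParseHtml(): parse the html files
  - SaveHtml(): save the tag-stripped documents
  - Testing
3. Index module: Index
- Forward-index lookup
- Inverted-index lookup
- Building the index
  - Building the forward index
  - Building the inverted index
4. Search module: Searcher
- Initializing the searcher: InitSearcher
- Search function: Search
  - Installing the json library and a usage example
  - Complete code of Search
- Testing
5. Server setup: the http_server module
- Basic usage test of cpp-httplib
- Writing the HttpServer module
6. Frontend module
- HTML page skeleton
- CSS page styling
- JavaScript for requests and navigation
- Overall result
7. Backend optimization
- Search result deduplication
- Stop-word removal
Adding logging
Deploying the service
Possible extensions:
Project code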

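1. Introduction

Well-known search engines include Baidu, Google and Bing, and many everyday apps ship with a search feature of their own.

Building a general-purpose, web-wide search engine single-handedly is not realistic, but a simple engine that performs site search is well within reach.

For example, the cplusplus site commonly used when learning C++ has built-in site search. Site search is more vertical (a narrower scope with more closely related content) and works on a much smaller data set.

The Boost documentation has no site search, so we can build one ourselves.

For every hit, the finished search engine displays the page title, an excerpt of the page content, and the url.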

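Overall architecture of the search engine

Tech stack and project environment

Tech stack:
- Backend: C/C++, C++11, STL, Boost, Jsoncpp, the cppjieba word-segmentation library, the open-source cpp-httplib library.
- Frontend: html5, css, js, jQuery, Ajax
Project environment: CentOS 7 cloud server, vim/gcc(g++)/Makefile, vs2019/vs code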

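Forward index and inverted index

Forward index: looking up an entity by its key.

For example, finding a document's content by its name:

Document name	Document content
XXX Company 2021 financial report	XXX's total revenue in 2021…
XXX Company 2021 product sales	Sales volume of product A in 2021…

For example, a user table:

t_user(uid,name,passwd,age,gender)

Looking up a whole row by uid is a forward-index query.
For example, a web page store:

t_web_page(url, page_content)

Looking up a whole page by url is also a forward-index query.

Word segmentation: once an entity's content has been segmented, it corresponds to a list of words. A simple forward index can therefore be thought of as a Map from a key to its word list (keys are unique).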

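As an example, suppose there are 3 web pages:

url1 -> "我爱北京" (I love Beijing)

url2 -> "我爱宏伟的天安门" (I love the magnificent Tiananmen)

url3 -> "长城真宏伟啊" (The Great Wall is truly magnificent)

This is a forward-index Map.

After word segmentation:

url1 -> {我, 爱, 北京}

url2 -> {我, 爱, 宏伟, 天安门}

url3 -> {长城, 宏伟}

This is the forward-index Map after segmentation.

Stop words: words such as 了, 的, 吗, 啊, a and the carry little meaning and can usually be ignored during segmentation.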
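Filtering stop words can be sketched as a simple set lookup. This is only an illustration: the function name RemoveStopWords and the tiny stop-word set in the usage note are assumptions, not part of the project code.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Return the words that are not in the stop-word set, preserving their order.
// RemoveStopWords is a hypothetical helper used only for illustration.
std::vector<std::string> RemoveStopWords(
        const std::vector<std::string>& words,
        const std::unordered_set<std::string>& stop_words) {
    std::vector<std::string> kept;
    for (const auto& w : words) {
        if (stop_words.count(w) == 0) {
            kept.push_back(w);
        }
    }
    return kept;
}
```

Called with the words {我, 爱, 宏伟, 的, 天安门} and a stop-word set containing 的, it returns the same list without 的.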

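Inverted index: looking up keys by an item (a word) that their content contains.

For example, in the web page store:

given a query word, quickly find the pages that contain it.

Inverted index after segmentation:

我 -> {url1,url2}

爱 -> {url1,url2}

北京 -> {url1}

宏伟 -> {url2,url3}

天安门 -> {url2}

长城 -> {url3}

A Map from a query word (item) to the list of pages that contain it is an inverted index.

Simulating one lookup:

The user enters the keyword 宏伟 -> inverted index -> pages {url2,url3} -> forward index -> fetch the content of each page -> build a title + content + url result for each -> order the results by weight when presenting them to the user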
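The relationship between the two indexes can be sketched in a few lines. The code below is a toy model of the three-page example, not the project's index.hpp; the names ForwardIndex, InvertedIndex, BuildInverted and ExamplePages are assumptions made for illustration, and std::map is used only to keep iteration order predictable.

```cpp
#include <map>
#include <string>
#include <vector>

// Forward index: url -> words on that page (already segmented).
using ForwardIndex = std::map<std::string, std::vector<std::string>>;
// Inverted index: word -> urls of the pages containing that word.
using InvertedIndex = std::map<std::string, std::vector<std::string>>;

// Walk every page of the forward index and append its url to the
// list of each word the page contains.
InvertedIndex BuildInverted(const ForwardIndex& forward) {
    InvertedIndex inverted;
    for (const auto& page : forward) {
        for (const auto& word : page.second) {
            std::vector<std::string>& urls = inverted[word];
            // Do not list the same page twice for a repeated word.
            if (urls.empty() || urls.back() != page.first) {
                urls.push_back(page.first);
            }
        }
    }
    return inverted;
}

// The three example pages from the text.
ForwardIndex ExamplePages() {
    return {
        {"url1", {"我", "爱", "北京"}},
        {"url2", {"我", "爱", "宏伟", "天安门"}},
        {"url3", {"长城", "宏伟"}},
    };
}
```

Looking up 宏伟 in BuildInverted(ExamplePages()) yields {url2, url3}; each url can then be used as a key into the forward index to fetch the page content.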

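2. Tag removal and data cleaning module: Parser

The data source is downloaded directly from the Boost website.

Open the cloud server, create the project folder, and use the rz command to upload the previously downloaded package to the server:

Extract it with the tar command:

For now we only need the html files under boost_1_79_0/doc/html, which are what we will index.

So create the data/input directory and copy the Boost doc/html/* files into it.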

[sjl@VM-16-6-centos boost_searcher]$ cp -rf boost_1_79_0/doc/html/* data/input/

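Tag removal: parser.cc

Create the tag-removal program:

[sjl@VM-16-6-centos boost_searcher]$ touch parser.cc
//raw data --> tag-stripped data

In an html file, anything enclosed in <> is a tag. Tags are of no value for searching, so they need to be removed.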

<td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>

The tag-stripped html data will be stored in the raw_html directory.

[sjl@VM-16-6-centos data]$ mkdir raw_html
[sjl@VM-16-6-centos data]$ ll
total 16
drwxrwxr-x 58 sjl sjl 16384 Jul 19 16:37 input      //original html documents
drwxrwxr-x  2 sjl sjl  4096 Jul 19 20:37 raw_html   //tag-stripped html documents

We can check how many html files the data directory currently contains:

[sjl@VM-16-6-centos data]$ ls -Rl|grep -E *.html|wc -l
8172

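grep: text search command; -E enables extended regular expressions. The pattern here is unquoted and unanchored, so it can also match lines that merely contain "html" (such as the raw_html directory entry), which may be why this count is slightly higher than the 8171 files found later.

wc: counts lines, words and bytes; -l counts lines

Goal

Strip the tags from every html file and write all of them into a single file. To make reading easy, each document occupies one line: the fields of a document are separated by \3 and documents are separated by \n.

For example:

title\3content\3url \n title\3content\3url \n title\3content\3url \n

Since getline reads one line at a time, a single call retrieves the full record of one document: title\3content\3url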
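Reading such a record back can be sketched with the standard library alone. The helper name SplitRecord is an assumption for illustration; the project itself later uses boost::split for this step.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one line of raw.txt into its fields, using '\3' as the separator.
// SplitRecord is a hypothetical helper, not part of the project code.
std::vector<std::string> SplitRecord(const std::string& line) {
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    // std::getline with a custom delimiter reads up to the next '\3'.
    while (std::getline(in, field, '\3')) {
        fields.push_back(field);
    }
    return fields;
}
```

For a well-formed line the result has exactly three elements: title, content and url, in that order.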

Code structure of parser.cc

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "tool.hpp" //needed later by ParseHtml (ns_tool::FileTool)

const std::string src_path="data/input";
const std::string output="data/raw_html/raw.txt";

typedef struct DocInfo
{
    std::string title;    //document title
    std::string content;  //document content
    std::string url;      //url of this document on the official site
}DocInfo_t;

//const & : input
//* : output
//& : input and output
bool EnumFile(const std::string& src_path,std::vector<std::string>* files_list);
bool ParseHtml(const std::vector<std::string>& files_list,std::vector<DocInfo_t>* results);
bool SaveHtml(const std::vector<DocInfo_t>& results,const std::string& output);

int main()
{
    //list of file names
    std::vector<std::string> files_list;

    //Step 1: recursively collect the name (with path) of every html file into files_list, for reading later
    if(!EnumFile(src_path,&files_list))
    {
        std::cerr<<"enum file name error"<<std::endl;
        return 1;
    }

    //Step 2: read every file named in files_list and parse it into title + content + url
    std::vector<DocInfo_t> results; //tag-stripped results of all files in files_list
    if(!ParseHtml(files_list,&results))
    {
        std::cerr<<"parse html error"<<std::endl;
        return 2;
    }

    //Step 3: write the parsed documents to output, with \3 between fields and \n between documents
    if(!SaveHtml(results,output))
    {
        std::cerr<<"Save html error"<<std::endl;
        return 3;
    }
    return 0;
}

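EnumFile(): enumerate and filter html files

The C++11 standard library has no filesystem API (std::filesystem only arrived in C++17), so we use the filesystem module of the Boost library here.

Installing the Boost development package

[sjl@VM-16-6-centos boost_searcher]$ sudo yum install -y boost-devel

Also include the header in parser.cc

#include <boost/filesystem.hpp>

The code is as follows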

bool EnumFile(const std::string& src_path,std::vector<std::string>* files_list)
{
    namespace fs=boost::filesystem;
    fs::path root_path(src_path);
    //if the path does not exist there is nothing more to do
    if(!fs::exists(root_path))
    {
        std::cerr<<src_path<<" not exists"<<std::endl;
        return false;
    }
    //a default-constructed iterator marks the end of the recursion
    fs::recursive_directory_iterator end;
    for(fs::recursive_directory_iterator iter(root_path);iter!=end;iter++)
    {
        //keep regular files only (skip directories); html files are regular files
        if(!fs::is_regular_file(*iter))
        {
            continue;
        }
        //skip files whose extension is not ".html"
        if(iter->path().extension()!=".html")
        {
            continue;
        }
        //debug print
        std::cout<<"debug: "<<iter->path().string()<<std::endl;
        //at this point the path is a regular file ending in ".html"
        files_list->push_back(iter->path().string());//store the path as a string in files_list
    }
    return true;
}

The Makefile is as follows (note that it links against boost_system and boost_filesystem):

cc=g++
parser:parser.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
.PHONY:clean
clean:
	rm -rf parser

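After make, inspect the libraries that parser links against

Run the parser executable (with the other two functions stubbed to return true for now) and check the output:

The html files have now been picked out: 8171 in total.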

ParseHtml(): parse the html files

After the filtering above, files_list holds only the paths of html files.

The overall skeleton of ParseHtml() is as follows:

Function skeleton

bool ParseHtml(const std::vector<std::string>& files_list,std::vector<DocInfo_t>* results)
{
    for(const std::string &file: files_list)
    {
        //1. read the file with ReadFile
        std::string result;
        if(!ns_tool::FileTool::ReadFile(file,&result))
        {
            continue;
        }
        DocInfo_t doc;
        //2. parse the file and extract the title
        if(!ParseTitle(result,&doc.title))
        {
            continue;
        }
        //3. parse the file and extract the content, i.e. strip the tags
        if(!ParseContent(result,&doc.content))
        {
            continue;
        }
        //4. turn the file path into the official url
        if(!ParseUrl(file,&doc.url))
        {
            continue;
        }
        //done: parsing succeeded and the results for this document are stored in doc
        //move doc into results; std::move avoids copying the potentially large strings
        results->push_back(std::move(doc));
    }
    return true;
}

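Explanation:

This function does 4 things: it reads each file by its path, extracts the title, extracts the content, and builds the url.

Reading the file

Iterate over the file names stored in files_list and read each file's content into result. This is done by the function ReadFile().

The function is defined in the class FileTool in the header tool.hpp.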

//tool.hpp
#pragma once
#include <iostream>
#include <string>
#include <fstream>
namespace ns_tool
{
    class FileTool
    {
    public:
        //read the content of the file file_path into out
        static bool ReadFile(const std::string& file_path,std::string *out)
        {
            std::ifstream in(file_path,std::ifstream::in);
            //check that the file opened
            if(!in.is_open())
            {
                std::cerr<<"open file: "<<file_path<<" error"<<std::endl;
                return false;
            }
            //read the file line by line (getline drops the '\n' at the end of each line)
            std::string line;
            //getline returns the istream, which converts to bool; at end of file the stream turns false and the loop ends
            while(std::getline(in,line))
            {
                *out+=line;
            }
            in.close();
            return true;
        }
    };
}

Extracting the title: ParseTitle()

Open any html file and you can see that the title we want is the text enclosed by the title tag, as shown below:

This relies on the function bool ParseTitle(const std::string& result,std::string* title), defined in parser.cc.

//parse the title
static bool ParseTitle(const std::string& result,std::string* title)
{
    std::size_t begin=result.find("<title>");
    if(begin==std::string::npos)
    {
        return false;
    }
    //search for the full closing tag so that end points at its '<'
    std::size_t end=result.find("</title>");
    if(end==std::string::npos)
    {
        return false;
    }
    begin+=std::string("<title>").size();
    if(begin>end)
    {
        return false;
    }
    *title = result.substr(begin,end-begin);
    return true;
}

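Extracting the content, which really means stripping the tags: ParseContent()

That is, remove every angle bracket together with everything it encloses.

While scanning, reaching > means the current tag is finished, and reaching < means a new tag has started.

This relies on the function bool ParseContent(const std::string& result,std::string* content), defined in parser.cc.

//strip the tags
static bool ParseContent(const std::string& result,std::string* content)
{
    //a simple state machine
    enum status
    {
        LABEL,
        CONTENT
    };
    enum status s=LABEL; //must be initialized; an html file starts with a tag
    for(char c:result)
    {
        switch(s)
        {
            case LABEL:
                if(c=='>') s=CONTENT;
                break;
            case CONTENT:
                if(c=='<') s=LABEL;
                else
                {
                    //do not keep '\n', since it separates documents in raw.txt
                    if(c=='\n') c=' ';
                    content->push_back(c);
                }
                break;
            default:
                break;
        }
    }
    return true;
}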
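As a worked example of the same idea, here is a minimal standalone sketch that returns the stripped text directly. The name StripTags is an assumption for illustration; unlike ParseContent() it starts in the text state, so that fragments which do not begin with a tag are handled as well.

```cpp
#include <string>

// Drop every '<' ... '>' span (brackets included) and keep the rest.
// StripTags is a hypothetical helper, not part of parser.cc.
std::string StripTags(const std::string& html) {
    enum class State { kTag, kText };
    State state = State::kText;  // assumption: input may start with plain text
    std::string text;
    for (char c : html) {
        if (state == State::kTag) {
            if (c == '>') state = State::kText;   // tag finished
        } else if (c == '<') {
            state = State::kTag;                  // new tag starts
        } else {
            text.push_back(c == '\n' ? ' ' : c);  // newlines become spaces
        }
    }
    return text;
}
```

Applied to the table cell shown earlier, <td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>, only the word Libraries remains.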

Building the official url

The url of a Boost page on the website corresponds to the path of the downloaded document:

For example:

When we look up Accumulators on the official site, its url is:

https://www.boost.org/doc/libs/1_79_0/doc/html/accumulators.html

In the downloaded documentation, the path of that page file is:

All data sources in our project were copied into data/input, so the path of the page file inside the project is:

data/input/accumulators.html

The url can therefore be assembled as:

url_head = https://www.boost.org/doc/libs/1_79_0/doc/html

url_tail = /accumulators.html //the local path data/input/accumulators.html with the data/input prefix removed

url=url_head + url_tail //yields a link to the official site

This relies on the function bool ParseUrl(const std::string& file_path,std::string* url), defined in parser.cc.

//build the official url: url_head + url_tail
static bool ParseUrl(const std::string& file_path,std::string* url)
{
    std::string url_head="https://www.boost.org/doc/libs/1_79_0/doc/html";
    std::string url_tail=file_path.substr(src_path.size());
    *url=url_head+url_tail;
    return true;
}
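The prefix swap can be checked in isolation with a small sketch. MakeUrl and its parameters are assumptions for illustration; unlike ParseUrl() it also verifies that the path really starts with the source directory.

```cpp
#include <string>

// Map a local file path under src_root to the matching URL under url_head.
// Returns an empty string if file_path does not start with src_root.
// MakeUrl is a hypothetical helper, not part of parser.cc.
std::string MakeUrl(const std::string& src_root,
                    const std::string& url_head,
                    const std::string& file_path) {
    if (file_path.compare(0, src_root.size(), src_root) != 0) {
        return "";
    }
    // Everything after the src_root prefix, leading '/' included.
    return url_head + file_path.substr(src_root.size());
}
```

With src_root = data/input and url_head = https://www.boost.org/doc/libs/1_79_0/doc/html, the path data/input/accumulators.html maps to the official url shown above.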

SaveHtml(): save the tag-stripped documents

bool SaveHtml(const std::vector<DocInfo_t>& results,const std::string& output)
{
#define SEP '\3'
    std::ofstream out(output,std::ios::out|std::ios::binary);
    if(!out.is_open())
    {
        std::cerr<<"open "<<output<<" error"<<std::endl;
        return false;
    }
    //write the documents to disk
    for(auto& item:results)
    {
        std::string out_string;
        out_string = item.title;
        out_string += SEP;
        out_string += item.content;
        out_string += SEP;
        out_string += item.url;
        out_string += '\n';
        out.write(out_string.c_str(),out_string.size());
    }
    out.close();
    return true;
}

Testing

Run make to compile parser.cc into the parser executable, then run ./parser. If it succeeds, raw.txt under data/raw_html will contain all the processed html documents.

[sjl@VM-16-6-centos boost_searcher]$ make
g++ -o parser parser.cc -std=c++11 -lboost_system -lboost_filesystem
[sjl@VM-16-6-centos boost_searcher]$ ll
total 136
drwxr-xr-x 8 sjl sjl   4096 Apr  7 05:33 boost_1_79_0
drwxrwxr-x 4 sjl sjl   4096 Jul 19 20:37 data
-rw-rw-r-- 1 sjl sjl    124 Jul 20 20:03 Makefile
-rwxrwxr-x 1 sjl sjl 112408 Jul 22 12:36 parser
-rw-rw-r-- 1 sjl sjl   6088 Jul 22 12:31 parser.cc
-rw-rw-r-- 1 sjl sjl    889 Jul 21 21:27 tool.hpp
[sjl@VM-16-6-centos boost_searcher]$ cat data/raw_html/raw.txt | wc -l
8171

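Each html document occupies one line, and the line count matches the number of html files found earlier.

The ASCII control character '\3' (ETX) is displayed as ^C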

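3. Index module: Index

[sjl@VM-16-6-centos boost_searcher]$ touch index.hpp

This header is responsible for three things: 1. building the index 2. forward-index lookup 3. inverted-index lookup

Outline of the design: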

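#pragma once
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
#include <fstream>
#include <mutex>
#include <cstdint>
#include "tool.hpp"

namespace ns_index
{
    struct DocInfo
    {
        std::string title;    //document title
        std::string content;  //tag-stripped document content
        std::string url;      //official url of the document
        uint64_t doc_id;      //document ID
    };
    //element of the inverted index
    struct InvertedElem
    {
        uint64_t doc_id;   //document ID
        std::string word;  //keyword found in the document
        int weight;        //weight of the document for this keyword
    };
    //inverted list (posting list)
    typedef std::vector<InvertedElem> InvertedList;

    class Index
    {
    private:
        //the forward index is an array whose subscript is the document ID
        std::vector<DocInfo> forward_index; //forward index: document ID -> document content
        //inverted index: a keyword maps to a group of InvertedElem (keyword -> inverted list)
        std::unordered_map<std::string,InvertedList> inverted_index;
    private:
        //Index is a singleton
        Index(){}
        Index(const Index& )=delete;
        Index& operator=(const Index& )=delete;
        static Index* instance;
        static std::mutex mtx;
    public:
        //create the singleton
        static Index* Getinstance()
        {
            if(nullptr==instance)
            {
                //instance is a shared resource and must be protected by the mutex
                mtx.lock();
                if(nullptr==instance)
                {
                    instance=new Index();
                }
                mtx.unlock();
            }
            return instance;
        }
        ~Index(){}
    public:
        //forward-index lookup: get the document by its doc_id
        DocInfo* GetForwardIndex(uint64_t doc_id)
        {
            return nullptr;
        }
        //inverted-index lookup: get the inverted list of the keyword word
        InvertedList* GetInvertedList(const std::string& word)
        {
            return nullptr;
        }
        //build the forward and inverted indexes from the documents produced by Parser
        //the Parser output is stored at: data/raw_html/raw.txt
        bool BuildIndex(const std::string& parsed_path)
        {
            return true;
        }
    };
    //definitions of the static members declared in the class
    Index* Index::instance=nullptr;
    std::mutex Index::mtx;
}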

With the basic design in place we can start writing the functions.

Forward-index lookup

Once forward_index has been built, the lookup function is easy to write.

//get the document by its doc_id
DocInfo* GetForwardIndex(uint64_t doc_id)
{
    if(doc_id>=forward_index.size())
    {
        std::cerr<<"doc_id out of range!"<<std::endl;
        return nullptr;
    }
    return &forward_index[doc_id];
}

Inverted-index lookup

//get the inverted list of the keyword word
InvertedList* GetInvertedList(const std::string& word)
{
    std::unordered_map<std::string,InvertedList>::iterator iter=inverted_index.find(word);
    if(iter==inverted_index.end())
    {
        //no index entry for this word
        std::cerr<<word<<" has no InvertedList"<<std::endl;
        return nullptr;
    }
    return &(iter->second);
}

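Building the index

The hard part of this module is clearly building the index, and the order in which it is built is exactly the reverse of what happens during a user's search.

Approach: walk through the documents one at a time and, for each, build the forward-index entry first and then the inverted-index entries.

The code is as follows:

//build the forward and inverted indexes from the documents produced by Parser
//the Parser output is stored at: data/raw_html/raw.txt
bool BuildIndex(const std::string& parsed_path)
{
    //open the Parser output file
    std::ifstream in(parsed_path,std::ios::in|std::ios::binary);
    if(!in.is_open())
    {
        std::cerr<<parsed_path<<" open failed"<<std::endl;
        return false;
    }
    std::string line;
    int count=0;//number of entries indexed so far
    while(std::getline(in,line))
    {
        //forward index: load the parsed document into the forward index
        DocInfo* doc=BuildForwardIndex(line);
        if(nullptr==doc)
        {
            std::cerr<<"build "<<line<<" error"<<std::endl;//for debug
            continue;
        }
        //inverted index
        BuildInvertedIndex(*doc);
        //print live progress
        count++;
        printf("indexed %d entries: %d%%\r",count,count*100/8171);//8171 is the number of parsed files
        fflush(stdout);
    }
    in.close();
    return true;
}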

Building the forward index

private:
    DocInfo* BuildForwardIndex(const std::string& line)
    {
        //1. split line into its fields
        //line -> title + content + url
        std::vector<std::string> results;
        const std::string sep="\3";
        ns_tool::StringTool::CutString(line,&results,sep);
        if(results.size()!=3)
        {
            return nullptr;
        }
        //2. fill a DocInfo with the fields
        DocInfo doc;
        doc.title=results[0];
        doc.content=results[1];
        doc.url=results[2];
        doc.doc_id=forward_index.size();
        //3. insert the DocInfo into forward_index
        forward_index.push_back(std::move(doc));
        return &forward_index.back();
    }

CutString is defined in tool.hpp.

The split function of the Boost library makes it easy to cut the string at the \3 characters that were placed between title, content and url earlier.

//tool.hpp
#pragma once
#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <boost/algorithm/string.hpp>
#include "cppjieba/Jieba.hpp"
namespace ns_tool
{
    //...
    class StringTool
    {
    public:
        static void CutString(const std::string& src,std::vector<std::string>* dst,const std::string& sep)
        {
            //boost split
            //token_compress_on: consecutive separators are treated as a single separator
            boost::split(*dst,src,boost::is_any_of(sep),boost::token_compress_on);
        }
    };
}

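Building the inverted index

Building the inverted index is the hardest part of building the index.

How it works:

We start from a DocInfo:

struct DocInfo
{
    std::string title;    //document title
    std::string content;  //tag-stripped document content
    std::string url;      //official url of the document
    uint64_t doc_id;      //document ID
};

For example:

title: 吃葡萄
content: 吃葡萄不吐葡萄皮
url: http://xxxx
doc_id: 123

From the content covered by the DocInfo we produce one or more InvertedElem:

//element of the inverted index
struct InvertedElem
{
    uint64_t doc_id;   //document ID
    std::string word;  //keyword found in the document
    int weight;        //weight of the document for this keyword
};
//inverted list (posting list)
typedef std::vector<InvertedElem> InvertedList;

Documents are processed one at a time and each document contains many words, so all of those words map to the current doc_id.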

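2.1 First segment the title and the content with jieba (a third-party library)

title: 吃/葡萄/吃葡萄 (title_word)

content: 吃/葡萄/不吐/葡萄皮 (content_word)

2.2 Word-frequency statistics

Relevance of a word to a document: the more often a word occurs, or if it occurs in the title, the more relevant the document can be considered.

Pseudocode:

//after segmentation, count how often each word occurs in the title and in the content
struct word_cnt
{
    title_cnt;
    content_cnt;
};
//each word and its counts are stored in a map
unordered_map<std::string , word_cnt> word_stat;
//walk the title_word array and count each word's frequency in the title
for(auto& word:title_word)
{
    word_stat[word].title_cnt++;//吃(1)/葡萄(1)/吃葡萄(1)
}
//walk the content_word array and count each word's frequency in the content
for(auto& word:content_word)
{
    word_stat[word].content_cnt++;//吃(1)/葡萄(1)/不吐(1)/葡萄皮(1)
}

At this point we know the frequency of every word in the title and in the content.

2.3 Custom relevance

Pseudocode

for(auto& word:word_stat)
{
    //relation between one specific word and the document (ID: 123)
    struct InvertedElem elem;
    elem.doc_id=123;
    elem.word=word.first;
    //when one word points to several document IDs, relevance decides which is shown first
    //relevance (weighting) is a hard problem in general; it is simplified here
    elem.weight=10*word.second.title_cnt + word.second.content_cnt;
    //append to the word's inverted list; one word can map to many documents
    inverted_index[word.first].push_back(std::move(elem));
}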
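The weighting rule can be tried out on the 吃葡萄 example with a compilable sketch. ComputeWeights and the two constant names are assumptions for illustration; the factors 10 and 1 are the ones used in the pseudocode above.

```cpp
#include <map>
#include <string>
#include <vector>

// Weight of each word within one document:
// every title occurrence counts 10, every content occurrence counts 1.
// ComputeWeights is a hypothetical helper, not part of index.hpp.
std::map<std::string, int> ComputeWeights(
        const std::vector<std::string>& title_words,
        const std::vector<std::string>& content_words) {
    const int kTitleFactor = 10;
    const int kContentFactor = 1;
    std::map<std::string, int> weight;
    for (const auto& w : title_words) weight[w] += kTitleFactor;
    for (const auto& w : content_words) weight[w] += kContentFactor;
    return weight;
}
```

For the title words {吃, 葡萄, 吃葡萄} and the content words {吃, 葡萄, 不吐, 葡萄皮}, the weights are 吃 = 11, 葡萄 = 11, 吃葡萄 = 10, 不吐 = 1 and 葡萄皮 = 1.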

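Using jieba word segmentation: cppjieba

Downloading the cppjieba library

Clone it with:

git clone https://github.com/yanyiwu/cppjieba

After downloading cppjieba there is one more detail: manually copy cppjieba/deps/limonp/ into the cppjieba/include/cppjieba/ directory, otherwise compilation fails.

Let's try out this third-party library, mainly its CutForSearch() function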

[sjl@VM-16-6-centos test]$ ll
total 372
-rwxrwxr-x 1 sjl sjl 366424 Jul 23 20:02 a.out
drwxrwxr-x 8 sjl sjl   4096 Jul 23 16:11 cppjieba
-rw-rw-r-- 1 sjl sjl    857 Jul 23 20:07 demo.cpp
lrwxrwxrwx 1 sjl sjl     14 Jul 23 16:23 dict -> cppjieba/dict/
lrwxrwxrwx 1 sjl sjl     17 Jul 23 16:26 inc -> cppjieba/include/
-rw-rw-r-- 1 sjl sjl    424 Jul 23 00:34 test.cc
[sjl@VM-16-6-centos test]$ cat demo.cpp
#include "inc/cppjieba/Jieba.hpp"
#include <iostream>
#include <vector>
#include <string>

using namespace std;

const char* const DICT_PATH = "./dict/jieba.dict.utf8";
const char* const HMM_PATH = "./dict/hmm_model.utf8";
const char* const USER_DICT_PATH = "./dict/user.dict.utf8";
const char* const IDF_PATH = "./dict/idf.utf8";
const char* const STOP_WORD_PATH = "./dict/stop_words.utf8";

int main(int argc, char** argv)
{
    cppjieba::Jieba jieba(DICT_PATH,HMM_PATH,USER_DICT_PATH,IDF_PATH,STOP_WORD_PATH);
    vector<string> words;
    string s;
    s = "小明硕士毕业于中国科学院计算所，后在日本京都大学深造";
    cout << s << endl;
    cout << "[demo] CutForSearch" << endl;
    jieba.CutForSearch(s, words);
    cout << limonp::Join(words.begin(), words.end(), "/") << endl;
    return EXIT_SUCCESS;
}
[sjl@VM-16-6-centos test]$ ./a.out
小明硕士毕业于中国科学院计算所，后在日本京都大学深造
[demo] CutForSearch
小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所/，/后/在/日本/京都/大学/日本京都大学/深造

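The sentence is segmented well.

Next we bring in the jieba library to write the inverted-index code.

Place the cppjieba library in the third-party directory thirdpart under the home directory, then create symbolic links in this project directory to the library's headers and dictionaries: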

[sjl@VM-16-6-centos boost_searcher]$ ll
total 148
drwxr-xr-x 8 sjl sjl   4096 Apr  7 05:33 boost_1_79_0
drwxrwxr-x 4 sjl sjl   4096 Jul 19 20:37 data
-rw-rw-r-- 1 sjl sjl   4399 Jul 23 00:44 index.hpp
-rw-rw-r-- 1 sjl sjl    124 Jul 20 20:03 Makefile
-rwxrwxr-x 1 sjl sjl 112408 Jul 22 12:36 parser
-rw-rw-r-- 1 sjl sjl   6088 Jul 22 12:31 parser.cc
drwxrwxr-x 3 sjl sjl   4096 Jul 23 20:02 test
-rw-rw-r-- 1 sjl sjl   1244 Jul 23 00:44 tool.hpp
[sjl@VM-16-6-centos boost_searcher]$ ln -s ~/thirdpart/cppjieba/include/cppjieba/ cppjieba
[sjl@VM-16-6-centos boost_searcher]$ ln -s ~/thirdpart/cppjieba/dict/ dict
[sjl@VM-16-6-centos boost_searcher]$ ll
total 148
drwxr-xr-x 8 sjl sjl   4096 Apr  7 05:33 boost_1_79_0
lrwxrwxrwx 1 sjl sjl     46 Jul 23 20:46 cppjieba -> /home/sjl/thirdpart/cppjieba/include/cppjieba/
drwxrwxr-x 4 sjl sjl   4096 Jul 19 20:37 data
lrwxrwxrwx 1 sjl sjl     34 Jul 23 20:47 dict -> /home/sjl/thirdpart/cppjieba/dict/
-rw-rw-r-- 1 sjl sjl   4399 Jul 23 00:44 index.hpp
-rw-rw-r-- 1 sjl sjl    124 Jul 20 20:03 Makefile
-rwxrwxr-x 1 sjl sjl 112408 Jul 22 12:36 parser
-rw-rw-r-- 1 sjl sjl   6088 Jul 22 12:31 parser.cc
drwxrwxr-x 3 sjl sjl   4096 Jul 23 20:02 test
-rw-rw-r-- 1 sjl sjl   1244 Jul 23 00:44 tool.hpp
[sjl@VM-16-6-centos boost_searcher]$ ls cppjieba/
DictTrie.hpp     HMMModel.hpp    Jieba.hpp             limonp          MPSegment.hpp  PreFilter.hpp     SegmentBase.hpp    TextRankExtractor.hpp  Unicode.hpp
FullSegment.hpp  HMMSegment.hpp  KeywordExtractor.hpp  MixSegment.hpp  PosTagger.hpp  QuerySegment.hpp  SegmentTagged.hpp  Trie.hpp
[sjl@VM-16-6-centos boost_searcher]$ ls dict/
hmm_model.utf8  idf.utf8  jieba.dict.utf8  pos_dict  README.md  stop_words.utf8  user.dict.utf8

Segmentation is a commonly used utility, so we put its code in the header tool.hpp. The segmentation function is as follows

//tool.hpp
#pragma once
#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <boost/algorithm/string.hpp>
#include "cppjieba/Jieba.hpp"
namespace ns_tool
{
    //...
    //segmentation tool
    const char* const DICT_PATH = "./dict/jieba.dict.utf8";
    const char* const HMM_PATH = "./dict/hmm_model.utf8";
    const char* const USER_DICT_PATH = "./dict/user.dict.utf8";
    const char* const IDF_PATH = "./dict/idf.utf8";
    const char* const STOP_WORD_PATH = "./dict/stop_words.utf8";

    class JiebaTool
    {
    private:
        static cppjieba::Jieba jieba;
    public:
        static void SplitToWord(const std::string &src,std::vector<std::string>* out)
        {
            //segment src with the jieba library and store the words in out
            jieba.CutForSearch(src,*out);
        }
    };
    cppjieba::Jieba JiebaTool::jieba(DICT_PATH,HMM_PATH,USER_DICT_PATH,IDF_PATH,STOP_WORD_PATH);
}

The complete code for building the inverted index is therefore:

private:
    bool BuildInvertedIndex(const DocInfo &doc)
    {
        //the forward entry is already built, so doc holds [title,content,url,doc_id]
        //word -> inverted list
        //per-word frequency counts within the document
        struct word_cnt
        {
            int title_cnt;
            int content_cnt;
            word_cnt():title_cnt(0),content_cnt(0){}
        };
        std::unordered_map<std::string,word_cnt> word_stat;//temporary table: keyword -> counts
        //segment the title
        std::vector<std::string> title_word;
        ns_tool::JiebaTool::SplitToWord(doc.title,&title_word);
        //count word frequencies in the title
        for(auto s:title_word)
        {
            //lower-case every title keyword before counting (s is a copy, the original is untouched)
            boost::to_lower(s);
            word_stat[s].title_cnt++;
        }
        //segment the content
        std::vector<std::string> content_word;
        ns_tool::JiebaTool::SplitToWord(doc.content,&content_word);
        //count word frequencies in the content
        for(auto s:content_word)
        {
            //lower-case every content keyword before counting (s is a copy, the original is untouched)
            boost::to_lower(s);
            word_stat[s].content_cnt++;
        }
#define X 10
#define Y 1
        //create the inverted-list entry for every keyword of this doc
        for(auto& word_pair:word_stat)
        {
            InvertedElem elem;
            elem.doc_id=doc.doc_id;
            elem.word=word_pair.first;
            //custom relevance
            elem.weight=word_pair.second.title_cnt*X+word_pair.second.content_cnt*Y;
            //push the element onto the inverted list of its keyword
            //(keywords were lower-cased for counting, so the user's query must be lower-cased when searching)
            InvertedList &inverted_list=inverted_index[word_pair.first];
            inverted_list.push_back(std::move(elem));
        }
        return true;
    }

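4. Search module: Searcher

Basic idea

//searcher.hpp
#include "index.hpp"

namespace ns_searcher
{
    class Searcher
    {
    private:
        ns_index::Index *index;
    public:
        void InitSearcher(const std::string &input)
        {
            //1. create the index object (singleton)
            //2. build the index through the index object
        }
        //search
        //json_string: the search result returned to the user's browser
        void Search(const std::string& query,std::string* json_string)
        {
            //1. [Segment]: the query must also be segmented on the server before looking it up in index
            //2. [Trigger]: look up each resulting word in index
            //3. [Merge and sort]: gather the results and sort them by relevance (weight) in descending order
            //4. [Build]: serialize the sorted results into a json string with jsoncpp
        }
    };
}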

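Initializing the searcher: InitSearcher

This function does two things: it obtains the index object and builds the index.

Index is a singleton; the object is created by calling Getinstance:

The index is built by calling BuildIndex

void InitSearcher(const std::string &input)
{
    //1. create the index object (singleton)
    index=ns_index::Index::Getinstance();
    std::cout<<"index singleton created..."<<std::endl;
    //2. build the index through the index object (input is the path of the tag-stripped file)
    index->BuildIndex(input);
    std::cout<<"index built..."<<std::endl;
}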

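Search function: Search

[Segment]

Reuse SplitToWord, defined in the jieba tool, to segment the query entered by the user
[Trigger]

Call the inverted-index lookup GetInvertedList() to fetch the inverted list of every keyword
[Merge and sort]

Gather all elements of those inverted lists and sort them by weight in descending order (merging elements that share a document ID is left to the backend optimization section)
[Build]
Use each inverted element to look up the document in the forward index and take an excerpt of its content. Once all documents are collected, serialize them into a string with the json library for transmission over the network.

How much of content to excerpt is a rule we define ourselves: find pos, the first occurrence of the keyword in content, then cut from 50 bytes before it (or from the beginning if fewer than 50 are available) to 100 bytes after it (or to the end if fewer are available)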
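The excerpt rule can be sketched on its own as follows. Snippet and its parameter names are assumptions for illustration. The sketch uses a plain case-sensitive find and cuts by bytes, so it can split a multi-byte UTF-8 character at either end.

```cpp
#include <cstddef>
#include <string>

// Cut an excerpt around the first occurrence of word:
// up to `before` bytes ahead of it and `after` bytes from its start.
// Returns an empty string if word does not occur in content.
// Snippet is a hypothetical helper, not the project's GetAbstract().
std::string Snippet(const std::string& content, const std::string& word,
                    std::size_t before = 50, std::size_t after = 100) {
    const std::size_t pos = content.find(word);
    if (pos == std::string::npos) {
        return "";
    }
    // Start `before` bytes earlier, or at the beginning if there are fewer.
    const std::size_t start = pos > before ? pos - before : 0;
    // substr clamps the length to the end of the string.
    return content.substr(start, (pos - start) + after);
}
```

For a keyword at byte 200 of a long text the excerpt covers bytes 150 to 299; for a keyword near the beginning or end it simply stops at the boundary of the text.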

Installing the json library and a usage example

sudo yum install -y jsoncpp-devel

Using json

#include <iostream>
#include <string>
#include <jsoncpp/json/json.h>

//Value, Reader (deserialization), Writer (serialization)
int main()
{
    Json::Value root;
    Json::Value item1;
    item1["key1"]="value11";
    item1["key2"]="value12";
    Json::Value item2;
    item2["key1"]="value21";
    item2["key2"]="value22";
    root.append(item1);
    root.append(item2);
    Json::StyledWriter writer;
    //Json::FastWriter writer;
    std::string s=writer.write(root);
    std::cout<<s<<std::endl;
    return 0;
}

Complete code of Search

//searcher.hpp additionally needs #include <algorithm> and #include <jsoncpp/json/json.h>
public:
    //search
    //json_string: the search result returned to the user's browser
    void Search(const std::string& query,std::string* json_string)
    {
        //1. [Segment]: the query must also be segmented on the server before looking it up in index
        std::vector<std::string> words;
        ns_tool::JiebaTool::SplitToWord(query,&words);
        //2. [Trigger]: look up each word in index; the lookup ignores case, so each word is lower-cased first
        ns_index::InvertedList inverted_list_all;
        for(std::string word:words)
        {
            boost::to_lower(word);
            //fetch the inverted list
            ns_index::InvertedList *inverted_list=index->GetInvertedList(word);
            //skip words that have no inverted list
            if(nullptr==inverted_list)
            {
                continue;
            }
            //gather the elements of this keyword's inverted list
            //limitation: if several keywords occur in the same document, that document ID appears more than once
            inverted_list_all.insert(inverted_list_all.end(),inverted_list->begin(),inverted_list->end());
        }
        //3. [Merge and sort]: sort the gathered results by relevance (weight) in descending order
        std::sort(inverted_list_all.begin(),inverted_list_all.end(),
                  [](const ns_index::InvertedElem& e1,const ns_index::InvertedElem& e2)->bool{
                      return e1.weight>e2.weight;
                  });
        //4. [Build]: serialize the results into a json string with jsoncpp
        Json::Value root;
        for(auto& item:inverted_list_all)
        {
            //fetch the document through the forward index
            ns_index::DocInfo* doc=index->GetForwardIndex(item.doc_id);
            if(nullptr==doc)
            {
                continue;
            }
            Json::Value elem;
            elem["title"]=doc->title;
            //content is the whole tag-stripped document, which is too long, so GetAbstract extracts an excerpt
            elem["abstract"]=GetAbstract(doc->content,item.word);
            elem["url"]=doc->url;
            //for debug: check that results are sorted by descending weight
            elem["doc_id"]=(int)item.doc_id;
            elem["weight"]=item.weight;
            root.append(elem);
        }
        Json::StyledWriter writer;
        *json_string=writer.write(root);
    }

Extracting the abstract

public:
    std::string GetAbstract(const std::string& html_content,const std::string& word)
    {
        //find the first occurrence of word in html_content,
        //then cut from 50 bytes before it (or from the beginning) to 100 bytes after it (or to the end)
        const int prev_step=50;
        const int post_step=100;
        //1. find the first occurrence pos with std::search, ignoring case
        auto iter=std::search(html_content.begin(),html_content.end(),word.begin(),word.end(),
                              [](int a,int b){
                                  return (std::tolower(a)==std::tolower(b));
                              });
        if(iter==html_content.end())
        {
            return "Not Found";
        }
        int pos=std::distance(html_content.begin(),iter);
        //2. compute the start and last positions
        int start=0;
        int last=html_content.size()-1;
        //if more than 50 bytes precede pos, move start forward
        if(pos>start+prev_step)
        {
            start=pos-prev_step;
        }
        //if more than 100 bytes follow pos, move last back
        if(pos+post_step<last)
        {
            last=pos+post_step;
        }
        //3. cut the substring and return it
        if(start>=last) return "None";
        return html_content.substr(start,last-start);
    }

Testing

Before writing the network module, we can test locally whether searching for a keyword returns the expected results:

//debug.cc
#include "searcher.hpp"
#include <iostream>
#include <cstdio>
#include <string>
#include <cstring>
const std::string input="data/raw_html/raw.txt";
int main()
{
    //for test
    ns_searcher::Searcher *search=new ns_searcher::Searcher;
    search->InitSearcher(input);
    std::string query;
    char buffer[1024];
    while(true)
    {
        std::cout<<"please enter the query"<<std::endl;
        fgets(buffer,sizeof(buffer)-1,stdin);
        buffer[strlen(buffer)-1]=0;//strip the trailing newline
        query=buffer;
        std::string ans;
        search->Search(query,&ans);
        std::cout<<ans<<std::endl;
    }
    return 0;
}

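5. Server setup: the http_server module

cpp-httplib library: https://gitee.com/sumert/cpp-httplib/tree/v0.7.15

(If the link no longer works, search for cpp-httplib on gitee.)

Note: cpp-httplib needs a reasonably recent gcc, otherwise compilation fails.

The default gcc on our cloud server is gcc 4.8.5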

[sjl@VM-16-6-centos ~]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

So we need to upgrade gcc:

Upgrading/installing gcc on CentOS 7

//install scl
[sjl@VM-16-6-centos ~]$ sudo yum install centos-release-scl scl-utils-build
//install the newer gcc
[sjl@VM-16-6-centos ~]$ sudo yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++
//list the installed tool sets
[sjl@VM-16-6-centos ~]$ ls /opt/rh
devtoolset-7

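Because this does not replace the system default gcc, the new one has to be enabled manually.

Enabling it on the command line only lasts for the current session.

[sjl@VM-16-6-centos ~]$ scl enable devtoolset-7 bash
[sjl@VM-16-6-centos ~]$ gcc -v

To make it permanent, the command has to run automatically at login: add the following line to ~/.bash_profile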

scl enable devtoolset-7 bash

[sjl@VM-16-6-centos ~]$ vim ~/.bash_profile
[sjl@VM-16-6-centos ~]$ cat ~/.bash_profile
# .bash_profile

# Get the aliases and functions
if [ -f ~/.bashrc ]; then
	. ~/.bashrc
fi

# User specific environment and startup programs

PATH=$PATH:$HOME/.local/bin:$HOME/bin

export PATH

#run this scl command at every login
scl enable devtoolset-7 bash

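Installing cpp-httplib

If gcc is not particularly recent, runtime errors may occur.

So the recommended version is cpp-httplib 0.7.15

Download it via the link,

then put the archive into the thirdpart folder and extract it (unzip):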

[sjl@VM-16-6-centos thirdpart]$ ll
total 8
drwxrwxr-x 6 sjl sjl 4096 Jul 28 15:50 cpp-httplib-v0.7.15
drwxrwxr-x 8 sjl sjl 4096 Jul 23 20:45 cppjieba
[sjl@VM-16-6-centos thirdpart]$

Create a symbolic link in the project folder:

[sjl@VM-16-6-centos boost_searcher]$ ln -s ~/thirdpart/cpp-httplib-v0.7.15/ cpp-httplib
[sjl@VM-16-6-centos boost_searcher]$ ll
total 1532
drwxr-xr-x 8 sjl sjl   4096 Apr  7 05:33 boost_1_79_0
lrwxrwxrwx 1 sjl sjl     40 Jul 28 15:54 cpp-httplib -> /home/sjl/thirdpart/cpp-httplib-v0.7.15/
lrwxrwxrwx 1 sjl sjl     46 Jul 23 20:46 cppjieba -> /home/sjl/thirdpart/cppjieba/include/cppjieba/
drwxrwxr-x 4 sjl sjl   4096 Jul 19 20:37 data
-rwxrwxr-x 1 sjl sjl 608144 Jul 28 12:44 debug
-rw-rw-r-- 1 sjl sjl    640 Jul 28 01:05 debug.cc
lrwxrwxrwx 1 sjl sjl     34 Jul 23 20:47 dict -> /home/sjl/thirdpart/cppjieba/dict/
-rwxrwxr-x 1 sjl sjl 409408 Jul 28 12:44 http_server
-rw-rw-r-- 1 sjl sjl     58 Jul 28 12:44 http_server.cc
-rw-rw-r-- 1 sjl sjl   7489 Jul 27 16:08 index.hpp
-rw-rw-r-- 1 sjl sjl    360 Jul 28 12:44 Makefile
-rwxrwxr-x 1 sjl sjl 492840 Jul 28 12:44 parser
-rw-rw-r-- 1 sjl sjl   6088 Jul 22 12:31 parser.cc
-rw-rw-r-- 1 sjl sjl   4654 Jul 28 00:17 searcher.hpp
drwxrwxr-x 3 sjl sjl   4096 Jul 28 15:47 test
-rw-rw-r-- 1 sjl sjl   2047 Jul 27 00:43 tool.hpp
[sjl@VM-16-6-centos boost_searcher]$

Create the web root directory (it will later hold the home page and other resources) and write an html file in WWWROOT

[sjl@VM-16-6-centos boost_searcher]$ mkdir WWWROOT
[sjl@VM-16-6-centos WWWROOT]$ touch index.html

Basic usage test of cpp-httplib

//http_server.cc
#include "searcher.hpp"
#include "cpp-httplib/httplib.h"

const std::string root_path="./WWWROOT";
int main()
{
    httplib::Server svr;
    //set the web root, which serves the home page
    svr.set_base_dir(root_path.c_str());
    svr.Get("/hi",[](const httplib::Request &req,httplib::Response &rsp){
        rsp.set_content("gogogogogo","text/plain; charset=utf-8");
    });
    svr.listen("0.0.0.0",8081);
    return 0;
}

<!-- index.html -->
<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title> for test </title>
</head>
<body>
    <h1>Hello World!</h1>
    <p>This is an httplib test</p>
</body>
</html>

Compile and run:

[sjl@VM-16-6-centos boost_searcher]$ g++ -o http_server http_server.cc -std=c++11 -ljsoncpp -lpthread
[sjl@VM-16-6-centos boost_searcher]$ ./http_server

Writing the HttpServer module

#include "searcher.hpp"
#include "cpp-httplib/httplib.h"

const std::string root_path="./WWWROOT";
const std::string input="data/raw_html/raw.txt";

int main()
{
    //create and initialize the searcher
    ns_searcher::Searcher search;
    search.InitSearcher(input);
    httplib::Server svr;
    //set the web root, which serves the home page
    svr.set_base_dir(root_path.c_str());
    svr.Get("/s",[&search](const httplib::Request &req,httplib::Response &rsp){
        if(!req.has_param("word"))//the request carries no search word
        {
            rsp.set_content("Please enter a search word!","text/plain; charset=utf-8");//Content-Type is plain text
            return;
        }
        std::string word=req.get_param_value("word");
        std::cout<<"user query: "<<word<<std::endl;
        //run the search
        std::string json_string;
        search.Search(word,&json_string);
        rsp.set_content(json_string,"application/json");
    });
    svr.listen("0.0.0.0",8081);
    return 0;
}

With that the backend is largely complete; the frontend comes next.

6. Frontend module

HTML page skeleton

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>BOOST Search Engine</title>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" value="Enter search keywords">
            <button>Search</button>
        </div>
        <div class="result">
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the abstract. This is the abstract. This is the abstract.</p>
                <i>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the abstract. This is the abstract. This is the abstract.</p>
                <i>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the abstract. This is the abstract. This is the abstract.</p>
                <i>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the abstract. This is the abstract. This is the abstract.</p>
                <i>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the abstract. This is the abstract. This is the abstract.</p>
                <i>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html</i>
            </div>
            <div class="item">
                <a href="#">This is the title</a>
                <p>This is the abstract. This is the abstract. This is the abstract.</p>
                <i>https://www.boost.org/doc/libs/1_79_0/doc/html/boost/algorithm/make_split_iterator.html</i>
            </div>
        </div>
    </div>
</body>
</html>

CSS page styling

Styling boils down to selecting tags and setting their properties (edit directly after the title in the html code).

Selecting specific tags: class selectors, tag selectors, compound selectors
Setting the properties of the selected tags

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>BOOST Search Engine</title>
    <!-- css -->
    <style>
        /* remove all default margins and paddings (html box model) */
        *{
            margin:0;   /* outer margin */
            padding:0;  /* inner padding */
        }
        /* make the body fill the page */
        html,body{
            height:100%;
        }
        /* a leading . denotes a class selector */
        .container{
            width:800px;       /* width of the div */
            margin:0px auto;   /* center horizontally */
            margin-top: 15px;  /* 15px from the top of the page */
        }
        /* compound selector: class search inside class container */
        .container .search{
            width:100%;   /* same width as the parent */
            height:52px;
        }
        /* input: tag selector */
        /* width and height of input exclude the border, so the border thickness must be taken into account */
        .container .search input{
            float:left;
            width:600px;
            height:50px;
            border: 1px solid black;  /* border width, style, color */
            border-right: none;       /* no right border on the input box */
            padding-left: 10px;       /* keep the text away from the border */
            color:#ccc;
            font-size: 20px;
            font-family:'Times New Roman', Times, serif;
        }
        /* button: tag selector */
        .container .search button{
            float:left;
            width:150px;
            height:52px;
            background-color: #4e6ef2;  /* button background color */
            color: #fff;                /* button text color */
            font-size: 20px;
            font-family: "幼圆";
        }
        .container .result{
            width: 100%;
        }
        .container .result .item{
            margin-top: 15px;
        }
        .container .result .item a{
            display:block;          /* block element on its own line */
            text-decoration: none;  /* no underline on the link */
            font-size: 20px;
            color:#2440b3;
        }
        /* underline the title when the cursor hovers over it */
        .container .result .item a:hover{
            text-decoration:underline;
        }
        .container .result .item p{
            margin-top: 5px;
            font-size: 16px;
            font-family:Arial, Helvetica, sans-serif;
            margin-bottom: 5px;
        }
        .container .result .item i{
            display:block;      /* block element on its own line */
            font-style:normal;  /* no italics */
            color: green;
        }
    </style>
</head>
<!-- ... -->

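JavaScript: Implementing the Search Request and Result Links

Working with native JS (XMLHttpRequest) takes more effort, so jQuery is used here.

Add an external link in the HTML to load the jQuery library: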

<script src="http://code.jquery.com/jquery-2.1.1.min.js"></script>

Insert the following code into the HTML file:

<!-- ... -->
</div>
<script>
    function Search(){
        // alert() opens a browser pop-up
        // alert("hello js!");

        // 1. Extract the data; $ can be read as an alias for jQuery
        let query = $(".container .search input").val();
        // console is the browser's developer console, useful for inspecting JS data
        console.log("query = " + query);

        // 2. Send the HTTP request (upload the keyword to the server).
        //    $.ajax is jQuery's function for exchanging data with a server.
        $.ajax({
            type: "GET",
            url: "/s?word=" + query,
            // On success, print the data returned by the server
            // (the server keeps running in the background)
            success: function(data){
                console.log(data);
                // Build the results into page content
                BuildHtml(data);
            }
        });
    }

    function BuildHtml(data){
        if(data == "" || data == null){
            document.write("No results found for this search");
            return;
        }
        // Get the result container
        let result_label = $(".container .result");
        // Clear the previous search results
        result_label.empty();
        for(let elem of data){
            console.log(elem.title);
            console.log(elem.url);
            let a_label = $("<a>", {
                text: elem.title,
                // Link target
                href: elem.url,
                // Open the link in a new tab
                target: "_blank"
            });
            let p_label = $("<p>", {
                text: elem.abstract
            });
            let i_label = $("<i>", {
                text: elem.url,
            });
            let div_label = $("<div>", {
                class: "item"
            });
            a_label.appendTo(div_label);
            p_label.appendTo(div_label);
            i_label.appendTo(div_label);
            div_label.appendTo(result_label);
        }
    }
</script>
</body>
</html>

This completes all of the front-end code.

Overall Result

All of the project's files are as follows:

The makefile is as follows:

PARSER=parser
DUG=debug
HTTP_SERVER=http_server
cc=g++

.PHONY:all
all:$(PARSER) $(DUG) $(HTTP_SERVER)

$(PARSER):parser.cc
	$(cc) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
$(DUG):debug.cc
	$(cc) -o $@ $^ -std=c++11 -ljsoncpp
$(HTTP_SERVER):http_server.cc
	$(cc) -o $@ $^ -std=c++11 -ljsoncpp -lpthread

.PHONY:clean
clean:
	rm -rf $(PARSER) $(DUG) $(HTTP_SERVER)

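After running make, ./parser processes all of the HTML files and stores the result in raw.txt.

Then start the server program: ./http_server

Finally, open a browser and enter your server's IP address: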

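7. Backend Optimization

Deduplicating Search Results

As discussed in the search module, the inverted lists fetched for a query can overlap: different keywords may come from the same document, so the same document can show up more than once in the search results.

To test this, we create our own test.html file and try to search for its content.

test.html (the title 测试用例 means "test case"; the body text 今天是一个晴天 means "today is a sunny day")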

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<!-- Copyright (C) 2002 Douglas Gregor <doug.gregor -at- gmail.com>
     Distributed under the Boost Software License, Version 1.0.
     (See accompanying file LICENSE_1_0.txt or copy at
     http://www.boost.org/LICENSE_1_0.txt) -->
<title>测试用例</title>
<meta http-equiv="refresh" content="0; URL=http://www.boost.org/doc/libs/master/doc/html/hash.html">
</head>
<body>
今天是一个晴天
<a href="http://www.boost.org/doc/libs/master/doc/html/hash.html">http://www.boost.org/doc/libs/master/doc/html/hash.html</a>
</body>
</html>

We put test.html under the input path, then rebuild and rerun:

[sjl@VM-16-6-centos boost_searcher]$ make
g++ -o parser parser.cc -std=c++11 -lboost_system -lboost_filesystem
g++ -o debug debug.cc -std=c++11 -ljsoncpp
g++ -o http_server http_server.cc -std=c++11 -ljsoncpp -lpthread
[sjl@VM-16-6-centos boost_searcher]$ ./parser
[sjl@VM-16-6-centos boost_searcher]$ ./http_server
创建index单例完成...
构建索引完成....: 100%

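The server output above reports that the index singleton was created and that index construction finished. Searching for the content of test.html now returns the same document more than once: the results are duplicated!

We need to prevent this.

The fix is a change to searcher.hpp; for details, see the project code link at the end of this article.

After the change, each matching document appears only once in the results.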
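The actual change lives in the linked project code and is not reproduced here. As a rough sketch of the idea only, hits can be merged by document ID before the results are built, so that several keywords pointing at the same document collapse into one entry. The names `InvertedElem`, `MergedElem`, and `MergeByDocId` below are hypothetical stand-ins, and summing the weights is an assumption about how merged hits might be ranked, not necessarily what the project does.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one entry of an inverted list:
// "keyword `word` occurs in document `doc_id` with this weight".
struct InvertedElem {
    uint64_t doc_id;
    std::string word;
    int weight;
};

// One search hit after all keywords matching the same document are merged.
struct MergedElem {
    uint64_t doc_id = 0;
    int weight = 0;                  // assumption: weights of merged hits are summed
    std::vector<std::string> words;  // every keyword that matched this document
};

// Collapse hits that share a doc_id, then sort by descending weight
// (ties broken by ascending doc_id so the order is deterministic).
std::vector<MergedElem> MergeByDocId(const std::vector<InvertedElem>& hits) {
    std::unordered_map<uint64_t, MergedElem> merged;
    for (const auto& hit : hits) {
        MergedElem& m = merged[hit.doc_id];  // created on first sight of this doc_id
        m.doc_id = hit.doc_id;
        m.weight += hit.weight;
        m.words.push_back(hit.word);
    }

    std::vector<MergedElem> result;
    result.reserve(merged.size());
    for (auto& kv : merged) {
        result.push_back(std::move(kv.second));
    }
    std::sort(result.begin(), result.end(),
              [](const MergedElem& a, const MergedElem& b) {
                  if (a.weight != b.weight) return a.weight > b.weight;
                  return a.doc_id < b.doc_id;
              });
    return result;
}
```

With this shape, the search function would build one title/abstract/url entry per `MergedElem` instead of one per inverted-list hit, so a document matched by two keywords is listed once.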

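Removing Stop Words

The jieba segmentation library ships with a stop-word dictionary.

Modify tool.hpp

Load the stop-word dictionary into memory; after jieba finishes segmenting, filter the resulting keywords against the stop-word set and drop every stop word.

For the details, see tool.hpp in the project code at the end of this article.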
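The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from tool.hpp: `LoadStopWords` and `RemoveStopWords` are hypothetical names, the dictionary is assumed to hold one stop word per line, and the segmentation itself (done by cppjieba in the project) is left out so that the block depends only on the standard library.

```cpp
#include <algorithm>
#include <istream>
#include <string>
#include <unordered_set>
#include <vector>

// Read a stop-word dictionary, assumed to contain one word per line,
// into a hash set so that membership checks are O(1) on average.
std::unordered_set<std::string> LoadStopWords(std::istream& in) {
    std::unordered_set<std::string> stop_words;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) {
            stop_words.insert(line);
        }
    }
    return stop_words;
}

// Drop every token that appears in the stop-word set. The relative order
// of the remaining tokens is preserved (erase-remove idiom).
void RemoveStopWords(std::vector<std::string>* words,
                     const std::unordered_set<std::string>& stop_words) {
    words->erase(
        std::remove_if(words->begin(), words->end(),
                       [&stop_words](const std::string& w) {
                           return stop_words.count(w) > 0;
                       }),
        words->end());
}
```

In use, the set would be loaded once at start-up (for example from a `std::ifstream` opened on the dictionary file) and `RemoveStopWords` would be called on the token list right after segmentation, both when building the index and when handling a query.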

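Result:

Searching for a stop word now returns no results.

Building the index becomes slower, because every document's tokens must be filtered against the stop-word list. Once the index is built, however, query time drops considerably, because stop words are no longer indexed and never need to be looked up.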

Adding Logging

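// log.hpp
#pragma once

#include <iostream>
#include <string>
#include <ctime>

// Log levels
#define NORMAL  1
#define WARNING 2
#define DEBUG   3
#define FATAL   4

// #LEVEL turns the level name into a string; __FILE__ and __LINE__ record the call site
#define LOG(LEVEL, MESSAGE) log(#LEVEL, MESSAGE, __FILE__, __LINE__)

// inline, so the header can safely be included from more than one source file
inline void log(std::string level, std::string message, std::string file, int line)
{
    // Output format: [level][Unix timestamp][message][file : line]
    std::cout << "[" << level << "]"
              << "[" << time(nullptr) << "]"
              << "[" << message << "]"
              << "[" << file << " : " << line << "]" << std::endl;
}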

At every error-handling site and every informational message, call the LOG macro with an appropriate level and message.
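A call site might look like the sketch below. The macro and the log function are repeated from log.hpp only so that the block compiles on its own; `ReadFile` is a hypothetical helper written for this illustration, and the message texts are assumptions rather than the project's actual log messages.

```cpp
#include <ctime>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// Repeated from log.hpp so that this sketch is self-contained.
#define LOG(LEVEL, MESSAGE) log(#LEVEL, MESSAGE, __FILE__, __LINE__)

inline void log(std::string level, std::string message, std::string file, int line)
{
    std::cout << "[" << level << "]"
              << "[" << time(nullptr) << "]"
              << "[" << message << "]"
              << "[" << file << " : " << line << "]" << std::endl;
}

// Hypothetical helper: read a whole file into *out.
// The failure path logs a WARNING, the success path logs a NORMAL message.
bool ReadFile(const std::string& path, std::string* out)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in.is_open()) {
        LOG(WARNING, "open file failed: " + path);
        return false;
    }
    out->assign(std::istreambuf_iterator<char>(in),
                std::istreambuf_iterator<char>());
    LOG(NORMAL, "read file: " + path);
    return true;
}
```

Because the level is passed as a bare name and stringized by the macro, the log line shows the level's name (for example WARNING) rather than its numeric value.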

Deploying the Service

Run the server in the background and write its log output to log.txt (2>&1 redirects standard error to the same file):

[sjl@VM-16-6-centos boost_searcher]$ nohup ./http_server > log.txt 2>&1 &

After a few search terms are entered, the log records each query (用户搜索词 means "user search term"):

[sjl@VM-16-6-centos boost_searcher]$ cat log.txt
nohup: ignoring input
创建index单例完成...
[NORMAL][1659167339][创建index单例完成...][searcher.hpp : 24]
构建索引完成....: 100%
[NORMAL][1659167389][构建索引完成...][searcher.hpp : 28]
用户搜索词: vector
[NORMAL][1659168113][用户搜索词: vector][http_server.cc : 25]
用户搜索词: split
[NORMAL][1659168141][用户搜索词: split][http_server.cc : 25]
用户搜索词: filestream
[NORMAL][1659168148][用户搜索词: filestream][http_server.cc : 25]

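Possible Extensions:

The data source of this project is only the HTML files under boost_1_79_0/doc/html/, so the index could be extended to cover the entire site.
The data source could be refreshed by crawling the pages periodically with a crawler, or by having the site send a signal when it is updated to trigger a re-crawl. Design an online update scheme (multithreading, multiprocessing).
Replace the third-party components with your own implementations.
Add paid (bid-based) ranking.
Hot-word statistics and smart display of suggested search keywords (trie, priority queue).
Add login and registration.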
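As one illustration of the hot-word direction listed above, the sketch below counts queries and extracts the k most frequent ones with a priority queue. It is not part of the project: the class name `HotWords` and its methods are hypothetical, and a real deployment would also need locking and persistence, which are left out here.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical hot-word counter: records every query and reports the top k.
class HotWords {
public:
    // Count one occurrence of a search term.
    void Record(const std::string& query) { ++counts_[query]; }

    // Return up to k terms, most frequent first.
    // A min-heap of size k keeps only the k largest (count, word) pairs,
    // so the cost is O(n log k) for n distinct terms.
    std::vector<std::string> TopK(std::size_t k) const {
        typedef std::pair<int, std::string> Entry;  // (count, word)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
        for (const auto& kv : counts_) {
            heap.push(Entry(kv.second, kv.first));
            if (heap.size() > k) {
                heap.pop();  // evict the currently least frequent term
            }
        }
        // The heap pops in ascending order, so fill the result back to front.
        std::vector<std::string> result(heap.size());
        for (std::size_t i = result.size(); i > 0; --i) {
            result[i - 1] = heap.top().second;
            heap.pop();
        }
        return result;
    }

private:
    std::unordered_map<std::string, int> counts_;
};
```

The server could call `Record` wherever it currently logs the user's search term and expose `TopK` through an extra endpoint. Prefix-based suggestions would need a trie on top of this, which the sketch does not cover.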

Project Code

Uploaded to Gitee: project code link

