Learning the Xapian full-text search engine (1) --- Indexing

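Xapian's documentation is not extensive, but it is enough to work with. In particular Omega, the companion project to Xapian, is a treasure trove for anyone using or learning Xapian.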

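First, two important concepts: the term list and the posting list.

A term list holds the terms that index one document; every document has its own term list.

A posting list holds the ids of the documents that a term indexes; every term has its own posting list.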
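The two lists are the two directions of one relationship. The following is a minimal sketch in plain C++, not Xapian's storage format; `ToyIndex` and `make_example_index` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using DocId = unsigned;

// Toy index that keeps both directions of the document/term relationship.
struct ToyIndex {
    std::map<DocId, std::set<std::string>> term_lists;     // document -> its terms
    std::map<std::string, std::set<DocId>> posting_lists;  // term -> documents containing it

    void add_document(DocId id, const std::vector<std::string>& terms) {
        for (const std::string& t : terms) {
            term_lists[id].insert(t);
            posting_lists[t].insert(id);
        }
    }
};

// Build a two-document index to look at both kinds of list.
inline ToyIndex make_example_index() {
    ToyIndex idx;
    idx.add_document(1, {"xapian", "index"});
    idx.add_document(2, {"xapian", "query"});
    return idx;
}
```

In this toy index, the term list of document 1 contains "index" and "xapian", while the posting list of "xapian" contains documents 1 and 2.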

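To use Xapian on Windows, I suggest downloading the MSVC makefiles from the official site and placing them in the VC directory; after fixing a few errors the build goes through.

The official site does not provide a Windows makefile for Omega, however. I tried to compile it under VC without success. It did compile on Ubuntu, where xapian-core and the libraries it depends on must be installed first.

Porting Omega is not worth much effort, so I decided to study Omega's code instead and see how Xapian is actually meant to be used.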

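The two main tools Omega provides are omindex and query; the corresponding sources are omindex.cc and query.cc.

omindex supports a wide range of formats, including HTML, PDF, XML, Excel and CSV.

The core indexing work in omindex breaks down roughly into the following steps:

1. Store the document data: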

// Put the data in the document
Xapian::Document newdocument;
string record = "url=";
record += url;
record += "\nsample=";
record += sample;
if (!title.empty()) {
    record += "\ncaption=";
    record += generate_sample(title, TITLE_SIZE);
}
if (!author.empty()) {
    record += "\nauthor=";
    record += author;
}
record += "\ntype=";
record += mimetype;
if (last_mod != (time_t)-1) {
    record += "\nmodtime=";
    record += str(last_mod);
}
record += "\nsize=";
record += str(d.get_size());
newdocument.set_data(record);

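The data holds a lot of information: the type, size, URL and so on are all stored together in a single string.

Note that the data is not suited to frequent access, since each read or write is relatively expensive. For data that has to be accessed frequently, Xapian recommends using values instead.

2. Next, index the title and the body text: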
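Reading the record back at display time means splitting it into fields again. A minimal sketch, assuming one `key=value` field per line and no newlines inside values; `parse_record` is a hypothetical helper, not part of Xapian or Omega.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Split a "key=value" record (one field per line) into a map.
// Only the first '=' on a line separates key from value;
// lines without '=' are ignored in this sketch.
inline std::map<std::string, std::string> parse_record(const std::string& record) {
    std::map<std::string, std::string> fields;
    std::istringstream in(record);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;
        fields[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return fields;
}
```

Called on the string returned by `get_data()`, this would give access to fields such as `url`, `type` and `size` by name.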

// Index the title, document text, and keywords.
indexer.set_document(newdocument);
if (!title.empty()) {
    indexer.index_text(title, 5);
    indexer.increase_termpos(100);
}
if (!dump.empty()) {
    indexer.index_text(dump);
}
if (!keywords.empty()) {
    indexer.increase_termpos(100);
    indexer.index_text(keywords);
}
// Index the leafname of the file.
{
    indexer.increase_termpos(100);
    string leaf = d.leafname();
    string::size_type dot = leaf.find_last_of('.');
    if (dot != string::npos)
        leaf.resize(dot);
    indexer.index_text(leaf);
}
if (!author.empty()) {
    indexer.increase_termpos(100);
    indexer.index_text(author, 1, "A");
}
// mimeType:
newdocument.add_boolean_term("T" + mimetype);

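indexer is a Xapian::TermGenerator. Terms can be added to a document without a TermGenerator, but using one is clearly more convenient, so it is the recommended route.

TermGenerator can only add probabilistic terms; boolean terms have to be added directly on the document.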

indexer.index_text(title, 5);

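In the statement above, title is the text to index and the trailing 5 is the wdf increment. The wdf (within-document frequency) is the number of times a term occurs in the document, so passing 5 makes each occurrence in the title count as five, which acts as a weight for the term.
Giving a term a larger weight is worthwhile because it influences how the search results are ranked.

Note that title must be UTF-8 encoded, otherwise it will not be recognised.

title may contain several terms, which have to be separated by spaces (or other non-word characters); otherwise the whole title is stored in the document as a single term.

One more point: index_text records the position of each term it adds. If positions are not needed, use index_text_without_positions instead, which reduces the size of the index files.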
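The effect of the wdf increment can be followed with a small model. This is a sketch in plain C++ rather than the real TermGenerator: `ToyWdf` and `make_example_wdf` are hypothetical names, and the text is split on whitespace only, with no stemming or case folding.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Toy model of wdf bookkeeping: every occurrence of a term adds wdf_inc
// to that term's wdf.
struct ToyWdf {
    std::map<std::string, unsigned> wdf;

    void index_text(const std::string& text, unsigned wdf_inc = 1) {
        std::istringstream in(text);
        std::string term;
        while (in >> term) wdf[term] += wdf_inc;
    }
};

// Index a boosted title followed by a body with the default increment.
inline ToyWdf make_example_wdf() {
    ToyWdf doc;
    doc.index_text("xapian guide", 5);             // title: each occurrence counts 5
    doc.index_text("xapian is a search library");  // body: each occurrence counts 1
    return doc;
}
```

Here "xapian" ends up with a wdf of 6 (5 from the title plus 1 from the body), while "search", which appears only in the body, has a wdf of 1.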

indexer.increase_termpos(100);

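This call advances the term position by 100. If the title contains two terms, at positions 1 and 2, the terms of the body text indexed next start at position 103.
This keeps phrase and NEAR searches from mistakenly matching a word in the title together with a word in the body.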
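The position arithmetic can be checked with a toy counter. A minimal sketch under the assumption that the counter starts at 0 and is incremented before each term is recorded; `ToyPositions` and `make_example_positions` are hypothetical names, not Xapian classes.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy model of term positions: the counter is bumped before each term is
// recorded, and increase_termpos() opens a gap between fields.
struct ToyPositions {
    unsigned termpos = 0;
    std::vector<std::pair<std::string, unsigned>> postings;  // (term, position)

    void index_text(const std::string& text) {
        std::istringstream in(text);
        std::string term;
        while (in >> term) postings.emplace_back(term, ++termpos);
    }

    void increase_termpos(unsigned delta = 100) { termpos += delta; }
};

// Index a two-word title, leave a gap, then index a two-word body.
inline ToyPositions make_example_positions() {
    ToyPositions p;
    p.index_text("xapian guide");    // positions 1 and 2
    p.increase_termpos(100);         // counter jumps from 2 to 102
    p.index_text("search library");  // positions 103 and 104
    return p;
}
```

Because "guide" sits at position 2 and "search" at 103, a phrase query for "guide search" finds no adjacent pair, even though the words follow each other in the indexing order.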

indexer.index_text(keywords);

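This indexes the keywords. Many word-segmentation algorithms can extract keywords, and keywords are very useful for clustering articles and finding similar content.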

indexer.index_text(leaf);

This indexes the file name, with the directory path removed and the extension cut off at the last dot.

indexer.index_text(author, 1, "A");

This indexes the author. The extra argument "A" is a prefix; prefixes come up all the time in Xapian and play an important role.
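A prefix keeps terms from different fields apart in the same index. The sketch below only lowercases a word and puts the prefix in front of it; the real TermGenerator does more (stemming, for example), and `prefixed_term` and `make_example_terms` are hypothetical helpers.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>

// Lowercase a word and put the field prefix in front of it.
// An empty prefix stands for unprefixed body text.
inline std::string prefixed_term(const std::string& prefix, std::string word) {
    for (char& ch : word)
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    return prefix + word;
}

// Terms of a document whose body mentions Smith and whose author is Jones.
inline std::set<std::string> make_example_terms() {
    std::set<std::string> terms;
    terms.insert(prefixed_term("", "Smith"));   // body text
    terms.insert(prefixed_term("A", "Jones"));  // author field
    return terms;
}
```

The document carries "smith" and "Ajones" but not "Asmith", so a search restricted to the author field does not match a name that merely appears in the body.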

newdocument.add_boolean_term("T" + mimetype);

The term added here is a boolean term, which is equivalent to adding a term with a wdf of 0.

// Add last_mod as a value to allow "sort by date".
newdocument.add_value(VALUE_LASTMOD, int_to_binary_string((uint32_t)last_mod));

This adds a value holding the document's last modification time. The value can be used to sort search results by date.
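Sorting on a value compares the stored strings, so the number has to be encoded in a way that preserves its order. A minimal sketch of a 4-byte big-endian encoding, which is the idea assumed to be behind int_to_binary_string; `to_big_endian` is a hypothetical stand-in, not Omega's function.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Pack a 32-bit value into 4 bytes, most significant byte first, so that
// comparing the strings byte by byte matches comparing the numbers.
inline std::string to_big_endian(std::uint32_t v) {
    std::string s(4, '\0');
    s[0] = static_cast<char>((v >> 24) & 0xff);
    s[1] = static_cast<char>((v >> 16) & 0xff);
    s[2] = static_cast<char>((v >> 8) & 0xff);
    s[3] = static_cast<char>(v & 0xff);
    return s;
}
```

With this encoding 255 sorts before 256, whereas plain decimal strings would not keep numeric order: "9" compares greater than "256".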

// Add MD5 as a value to allow duplicate documents to be collapsed together.
newdocument.add_value(VALUE_MD5, md5);

This adds another value holding the document's MD5 digest, which can be used to remove duplicates.
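Collapsing on a value means keeping only the best-ranked hit for each distinct value. The sketch below models that on a plain ranked list; `Hit` and `collapse_by_key` are hypothetical names, and Xapian performs the equivalent step inside the matcher rather than as a post-processing pass.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct Hit {
    unsigned docid;
    std::string md5;  // collapse key stored as a document value
};

// Keep the first (best-ranked) hit for each distinct key, drop the rest.
inline std::vector<Hit> collapse_by_key(const std::vector<Hit>& ranked) {
    std::set<std::string> seen;
    std::vector<Hit> out;
    for (const Hit& h : ranked) {
        if (seen.insert(h.md5).second) out.push_back(h);
    }
    return out;
}
```

Given hits 1, 2 and 3 where 1 and 3 share a digest, only 1 and 2 remain, in their original order.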

// Add the file size as a value to allow "sort by size" and size ranges.
newdocument.add_value(VALUE_SIZE, Xapian::sortable_serialise(d.get_size()));

This adds one more value holding the document's size, which can be used to sort by size or to restrict results to a size range.

bool inc_tag_added = false;
if (d.is_other_readable()) {
    inc_tag_added = true;
    newdocument.add_boolean_term("I*");
} else if (d.is_group_readable()) {
    const char * group = d.get_group();
    if (group) {
        newdocument.add_boolean_term(string("I#") + group);
    }
}
const char * owner = d.get_owner();
if (owner) {
    newdocument.add_boolean_term(string("O") + owner);
    if (!inc_tag_added && d.is_owner_readable())
        newdocument.add_boolean_term(string("I@") + owner);
}

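This adds access control. If the file is readable by others, the term "I*" is added. Otherwise, if it is readable by the owner's group, the term "I#" followed by the group name is added. If "I*" was not added and the file is readable by its owner, the term "I@" followed by the owner name is added. The term "O" followed by the owner name records who owns the file.

At search time these three kinds of "I" term decide which documents the current user is allowed to find.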
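On the search side, the check amounts to asking whether a document carries at least one term that the current user is entitled to. A minimal sketch with plain sets; `allowed_terms` and `can_see` are hypothetical helpers, and in a real query the same condition would be expressed as a boolean filter rather than a loop.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Terms that grant access to this user: world-readable, owned by the user,
// or readable by one of the user's groups.
inline std::vector<std::string> allowed_terms(const std::string& user,
                                              const std::vector<std::string>& groups) {
    std::vector<std::string> terms{"I*", "I@" + user};
    for (const std::string& g : groups) terms.push_back("I#" + g);
    return terms;
}

// A document is visible if it carries at least one of the allowed terms.
inline bool can_see(const std::set<std::string>& doc_terms,
                    const std::vector<std::string>& allowed) {
    for (const std::string& t : allowed) {
        if (doc_terms.count(t)) return true;
    }
    return false;
}
```

For a document tagged "I#staff", "I@alice" and "Oalice", alice can see it as the owner, a member of staff can see it through the group term, and anyone else cannot.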

string ext_term("E");
for (string::iterator i = ext.begin(); i != ext.end(); ++i) {
    char ch = *i;
    if (ch >= 'A' && ch <= 'Z')
        ch |= 32;
    ext_term += ch;
}
newdocument.add_boolean_term(ext_term);

This adds a term for the file extension: it starts with "E", followed by the extension converted to lowercase.
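The `ch |= 32` step works because, in ASCII, an uppercase letter and its lowercase form differ only in the bit with value 32. A small standalone sketch of the same conversion; `make_ext_term` is a hypothetical wrapper around the loop shown above.

```cpp
#include <cassert>
#include <string>

// Build the extension term: "E" followed by the extension with ASCII
// uppercase letters folded to lowercase. Setting bit 5 (value 32) turns
// 'A'..'Z' into 'a'..'z'; other characters are copied unchanged.
inline std::string make_ext_term(const std::string& ext) {
    std::string term("E");
    for (char ch : ext) {
        if (ch >= 'A' && ch <= 'Z') ch = static_cast<char>(ch | 32);
        term += ch;
    }
    return term;
}
```

So "PDF", "Pdf" and "pdf" all produce the term "Epdf", and a filter on that term matches files regardless of how the extension was capitalised.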

if (!skip_duplicates) {
    // If this document has already been indexed, update the existing
    // entry.
    if (did) {
        // We already found out the document id above.
        db.replace_document(did, newdocument);
    } else if (last_mod <= last_mod_max) {
        // We checked for the UID term and didn't find it.
        did = db.add_document(newdocument);
    } else {
        did = db.replace_document(urlterm, newdocument);
    }
    if (did < updated.size()) {
        if (usual(!updated[did])) {
            updated[did] = true;
            --old_docs_not_seen;
        }
    }
    if (verbose) {
        if (did <= old_lastdocid) {
            cout << "updated" << endl;
        } else {
            cout << "added" << endl;
        }
    }
} else {
    // If this were a duplicate, we'd have skipped it above.
    db.add_document(newdocument);
    if (verbose)
        cout << "added" << endl;
}

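This is where the document is written to the database. A duplicate document can either be skipped or be used to replace and update the existing one.

That covers the main part of the index_file function. Documents in other formats first go through a dump step that extracts their text content, which is then indexed in the same way.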
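Replacing by a unique term behaves as "update if present, add otherwise". The sketch below models that with two maps; `ToyDb` is a hypothetical stand-in for the database, and the "U"-prefixed strings in the usage note are made-up URL terms.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy database keyed by a unique term (for example a URL term).
struct ToyDb {
    unsigned last_docid = 0;
    std::map<std::string, unsigned> by_unique_term;  // unique term -> docid
    std::map<unsigned, std::string> data;            // docid -> document data

    // Replace the document carrying unique_term, or add a new document
    // with a fresh id if no document carries it yet.
    unsigned replace_document(const std::string& unique_term, const std::string& doc) {
        std::map<std::string, unsigned>::iterator it = by_unique_term.find(unique_term);
        unsigned did = (it != by_unique_term.end()) ? it->second : ++last_docid;
        by_unique_term[unique_term] = did;
        data[did] = doc;
        return did;
    }
};
```

Calling `replace_document("U/a.html", ...)` twice returns the same document id both times and leaves only the newer data, while a different term such as "U/b.html" gets a new id.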
