PostgreSQL Full-Text Search (Part 1)
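Before full-text search was available, text in a document column was usually matched with LIKE, ILIKE, or the regular-expression operator ~. That approach is fine for small amounts of text but does not scale to large data sets.
Drawbacks of ordinary pattern matching:
1. No real linguistic support, even for English. Derived word forms are not handled, so a whole-word match on friend will not find friends or friendly.
2. No useful way to rank the results by relevance.
3. Poor index support, so queries are slow. In particular, a pattern with % at both ends cannot use an ordinary B-tree index at all.
PostgreSQL has shipped built-in full-text search since version 8.3 (earlier releases offered it through the tsearch2 contrib module). Processing takes three main steps:
1. Parse the document into tokens (parsing documents into tokens).
2. Convert the tokens into lexemes (converting tokens into lexemes), for example by stripping plural suffixes such as s/es and by dropping stop words, such as the very common Chinese particle '的', so that they never appear in the result.
3. Store the preprocessed document in a form optimized for searching (storing preprocessed documents optimized for searching): documents are stored as tsvector and queried with tsquery.
Note: tokens are the raw pieces produced by the parser and may include common words that carry no meaning; lexemes are the normalized tokens that are worth indexing.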
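The three steps can be observed directly from psql. The following is a minimal sketch, assuming the built-in default parser and the english configuration of a standard installation; the sample sentence is made up for illustration.

```sql
-- Step 1: raw tokens from the parser (token type id plus token text)
SELECT * FROM ts_parse('default', 'My friends are friendly');

-- Step 2: how a dictionary turns a single token into a lexeme
SELECT ts_lexize('english_stem', 'friends');  -- stemmed form
SELECT ts_lexize('english_stem', 'the');      -- stop word: an empty array comes back

-- Steps 1 and 2 together, one row per token, with the dictionary that handled it
SELECT alias, token, dictionary, lexemes
FROM ts_debug('english', 'My friends are friendly');

-- Step 3: the stored form (tsvector) matched against a query (tsquery)
SELECT to_tsvector('english', 'My friends are friendly')
       @@ to_tsquery('english', 'friend') AS search;
```

Because both the document and the query go through the same configuration, friends and friend are expected to reduce to the same lexeme, which is exactly what LIKE cannot do.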
I. Environment and configuration for full-text search:
postgres=# show default_text_search_config ;
default_text_search_config
----------------------------
pg_catalog.english
(1 row)
-- List the text search configurations
postgres=# \dF
List of text search configurations
Schema | Name | Description
------------+------------+---------------------------------------
pg_catalog | danish | configuration for danish language
pg_catalog | dutch | configuration for dutch language
pg_catalog | english | configuration for english language
pg_catalog | finnish | configuration for finnish language
pg_catalog | french | configuration for french language
pg_catalog | german | configuration for german language
pg_catalog | hungarian | configuration for hungarian language
pg_catalog | italian | configuration for italian language
pg_catalog | norwegian | configuration for norwegian language
pg_catalog | portuguese | configuration for portuguese language
pg_catalog | romanian | configuration for romanian language
pg_catalog | russian | configuration for russian language
pg_catalog | simple | simple configuration
pg_catalog | spanish | configuration for spanish language
pg_catalog | swedish | configuration for swedish language
pg_catalog | turkish | configuration for turkish language
(16 rows)
-- Show the details of the russian configuration
postgres=# \dF+ russian
Text search configuration "pg_catalog.russian"
Parser: "pg_catalog.default"
Token | Dictionaries
-----------------+--------------
asciihword | english_stem
asciiword | english_stem
email | simple
file | simple
float | simple
host | simple
hword | russian_stem
hword_asciipart | english_stem
hword_numpart | simple
hword_part | russian_stem
int | simple
numhword | simple
numword | simple
sfloat | simple
uint | simple
url | simple
url_path | simple
version | simple
word | russian_stem
-- List the text search templates
postgres=# \dFt+
List of text search templates
Schema | Name | Init | Lexize | Description
------------+-----------+----------------+------------------+-----------------------------------------------------------
pg_catalog | ispell | dispell_init | dispell_lexize | ispell dictionary
pg_catalog | simple | dsimple_init | dsimple_lexize | simple dictionary: just lower case and check for stopword
pg_catalog | snowball | dsnowball_init | dsnowball_lexize | snowball stemmer
pg_catalog | synonym | dsynonym_init | dsynonym_lexize | synonym dictionary: replace word by its synonym
pg_catalog | thesaurus | thesaurus_init | thesaurus_lexize | thesaurus dictionary: phrase by phrase substitution
(5 rows)
-- List the text search dictionaries
postgres=# \dFd+
List of text search dictionaries
Schema | Name | Template | Init options | Description
------------+-----------------+---------------------+---------------------------------------------------+-----------------------------------------------------------
pg_catalog | danish_stem | pg_catalog.snowball | language = 'danish', stopwords = 'danish' | snowball stemmer for danish language
pg_catalog | dutch_stem | pg_catalog.snowball | language = 'dutch', stopwords = 'dutch' | snowball stemmer for dutch language
pg_catalog | english_stem | pg_catalog.snowball | language = 'english', stopwords = 'english' | snowball stemmer for english language
pg_catalog | finnish_stem | pg_catalog.snowball | language = 'finnish', stopwords = 'finnish' | snowball stemmer for finnish language
pg_catalog | french_stem | pg_catalog.snowball | language = 'french', stopwords = 'french' | snowball stemmer for french language
pg_catalog | german_stem | pg_catalog.snowball | language = 'german', stopwords = 'german' | snowball stemmer for german language
pg_catalog | hungarian_stem | pg_catalog.snowball | language = 'hungarian', stopwords = 'hungarian' | snowball stemmer for hungarian language
pg_catalog | italian_stem | pg_catalog.snowball | language = 'italian', stopwords = 'italian' | snowball stemmer for italian language
pg_catalog | norwegian_stem | pg_catalog.snowball | language = 'norwegian', stopwords = 'norwegian' | snowball stemmer for norwegian language
pg_catalog | portuguese_stem | pg_catalog.snowball | language = 'portuguese', stopwords = 'portuguese' | snowball stemmer for portuguese language
pg_catalog | romanian_stem | pg_catalog.snowball | language = 'romanian' | snowball stemmer for romanian language
pg_catalog | russian_stem | pg_catalog.snowball | language = 'russian', stopwords = 'russian' | snowball stemmer for russian language
pg_catalog | simple | pg_catalog.simple | | simple dictionary: just lower case and check for stopword
pg_catalog | spanish_stem | pg_catalog.snowball | language = 'spanish', stopwords = 'spanish' | snowball stemmer for spanish language
pg_catalog | swedish_stem | pg_catalog.snowball | language = 'swedish', stopwords = 'swedish' | snowball stemmer for swedish language
pg_catalog | turkish_stem | pg_catalog.snowball | language = 'turkish', stopwords = 'turkish' | snowball stemmer for turkish language
-- List the text search parsers; add a plus sign (\dFp+) to see the detailed configuration
postgres=# \dFp
List of text search parsers
Schema | Name | Description
------------+---------------+---------------------
pg_catalog | chineseparser |
pg_catalog | default | default word parser
(2 rows)
The parameter and configuration files usually live under $PGHOME/share; the stop-word files are in $PGHOME/share/tsearch_data.
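The pieces listed above (templates, dictionaries, configurations) can be combined into a configuration of your own. The sketch below is only an illustration: my_simple_dict and my_english are hypothetical names, and it assumes the stock english.stop file is present in tsearch_data.

```sql
-- Use the english configuration for the current session
SET default_text_search_config = 'pg_catalog.english';

-- A dictionary built on the simple template that also drops English stop words
-- (my_simple_dict is a hypothetical name)
CREATE TEXT SEARCH DICTIONARY my_simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english          -- refers to english.stop in tsearch_data
);

-- Start from a copy of the built-in english configuration
-- (my_english is a hypothetical name)
CREATE TEXT SEARCH CONFIGURATION my_english (COPY = pg_catalog.english);

-- Route plain ASCII words through the new dictionary instead of english_stem
ALTER TEXT SEARCH CONFIGURATION my_english
    ALTER MAPPING FOR asciiword WITH my_simple_dict;

-- Words are lower-cased and stop words removed, but nothing is stemmed
SELECT to_tsvector('my_english', 'The Fat Rats');
```

Running \dF+ my_english afterwards should show asciiword mapped to my_simple_dict while the other token types keep the mappings copied from english.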
II. Worked examples, using English text
postgres=# SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat & rat'::tsquery as search;
search
--------
t
(1 row)
postgres=# SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector as search;
search
--------
f
(1 row)
postgres=# SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat') as search;
search
--------
t
(1 row)
postgres=# SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat & rat') as search;
search
--------
f
(1 row)
-- With the english configuration named explicitly. to_tsvector differs from a ::tsvector cast: the function normalizes the words into lexemes, while the cast assumes the text is already normalized. That is why the previous query returned f: the cast kept 'rats' as is, and the query lexeme 'rat' does not match it.
postgres=# SELECT to_tsvector('english','fat cats ate fat rats') @@ to_tsquery('english','fat & rat') as search;
search
--------
t
(1 row)
-- plainto_tsquery does not recognize operators or weight labels; it treats them as ordinary text
postgres=# SELECT plainto_tsquery('english', 'The Fat & Rats:C');
plainto_tsquery
---------------------
'fat' & 'rat' & 'c'
(1 row)
-- plainto_tsquery ignores separators and inserts & between the words; ::tsquery and to_tsquery require the operators to be written out
postgres=# SELECT plainto_tsquery('english', 'The Fat Rats');
plainto_tsquery
-----------------
'fat' & 'rat'
(1 row)
postgres=# SELECT 'The & Fat & Rats'::tsquery;
tsquery
------------------------
'The' & 'Fat' & 'Rats'
(1 row)
postgres=# SELECT to_tsquery('english', 'The & Fat & Rats');
to_tsquery
---------------
'fat' & 'rat'
(1 row)
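The examples above use only the & operator. As a hedged sketch of what else to_tsquery accepts, the queries below show a weight label, prefix matching, OR/NOT, and relevance ranking with ts_rank; the sample strings are reused from the examples above, and the outputs are left for the reader to check.

```sql
-- to_tsquery understands weight labels, unlike plainto_tsquery
SELECT to_tsquery('english', 'Fat & Rats:C');

-- Prefix matching: rat:* matches any lexeme that starts with 'rat'
SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'fat & rat:*') AS search;

-- | is OR and ! is NOT
SELECT to_tsvector('english', 'fat cats ate fat rats')
       @@ to_tsquery('english', 'cow | !dog') AS search;

-- Relevance ranking, which LIKE cannot offer
SELECT ts_rank(to_tsvector('english', 'fat cats ate fat rats'),
               to_tsquery('english', 'fat & rat')) AS rank;
```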
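III. Building indexes for full-text search
There are two approaches. One is to index an expression that applies the built-in conversion function to the existing document column. The other is to add a separate tsvector column, fill it from the original text (kept current by a trigger), and index that column. The second approach is recommended.
Method 1: expression index on the original column (the three statements are alternatives that share one index name; create only one of them)
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', body));
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector(config_name, body)); -- configuration read per row: config_name is a column of pgweb
CREATE INDEX pgweb_idx ON pgweb USING gin(to_tsvector('english', title || ' ' || body));
Method 2: add a converted column and index it
ALTER TABLE pgweb ADD COLUMN textsearchable_index_col tsvector; -- the new column has type tsvector
UPDATE pgweb SET textsearchable_index_col = to_tsvector('english', coalesce(title,'') || ' ' || coalesce(body,''));
CREATE INDEX textsearch_idx ON pgweb USING gin(textsearchable_index_col);
SELECT title FROM pgweb WHERE textsearchable_index_col @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10;
Notes:
a. With the separate column, a trigger is also needed to keep the column current whenever a row changes.
b. The expression index is simple and takes less space. Its drawbacks are that a query must repeat the same to_tsvector expression, with the same configuration, in order to use the index, and that to_tsvector has to be evaluated again whenever index matches are verified.
c. The separate column makes queries faster because to_tsvector does not have to be called again, which matters most with a GiST index. Its drawback is the extra storage consumed by the additional column.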
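To make note b concrete, the sketch below shows a query shaped to match the first expression index of method 1. It also shows one alternative to the trigger from note a, assuming PostgreSQL 12 or later: a stored generated column. The names textsearchable_gen_col and textsearch_gen_idx are hypothetical.

```sql
-- Method 1: the WHERE clause must repeat the indexed expression exactly,
-- including the configuration name, or the index is not considered
EXPLAIN
SELECT title FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

-- Alternative to method 2 on PostgreSQL 12 or later: a stored generated column,
-- which the server keeps current without a trigger
-- (textsearchable_gen_col and textsearch_gen_idx are hypothetical names)
ALTER TABLE pgweb
    ADD COLUMN textsearchable_gen_col tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED;

CREATE INDEX textsearch_gen_idx ON pgweb USING gin(textsearchable_gen_col);
```

Whether the planner actually picks the index in the EXPLAIN output depends on table size and statistics, so a small test table may still show a sequential scan.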
IV. Examples of built-in utility functions
These include to_tsvector, to_tsquery, tsvector_update_trigger, tsvector_update_trigger_column, ts_stat, and others.
-- tsvector_update_trigger example
CREATE TABLE messages (
title text,
body text,
tsv tsvector
);
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
INSERT INTO messages VALUES('title here','the body text is here');
postgres=# select * from messages;
title | body | tsv
------------+-----------------------+----------------------------
title here | the body text is here | 'bodi':4 'text':5 'titl':1
(1 row)
postgres=# SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body');
title | body
------------+-----------------------
title here | the body text is here
(1 row)
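The built-in trigger treats every input column alike. When the title should count for more than the body, a hand-written trigger function can assign weights. This is a sketch under that assumption, not part of the original example: messages_weighted_trigger is a hypothetical name, and the weights A and D are an arbitrary choice.

```sql
-- Hypothetical replacement for the built-in trigger:
-- title lexemes get weight A, body lexemes get weight D
CREATE FUNCTION messages_weighted_trigger() RETURNS trigger AS $$
BEGIN
    NEW.tsv :=
        setweight(to_tsvector('pg_catalog.english', coalesce(NEW.title, '')), 'A') ||
        setweight(to_tsvector('pg_catalog.english', coalesce(NEW.body, '')), 'D');
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

-- Swap the trigger on the messages table over to the new function
DROP TRIGGER IF EXISTS tsvectorupdate ON messages;
CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON messages FOR EACH ROW EXECUTE PROCEDURE messages_weighted_trigger();

-- Rank matches; hits in the title carry more weight than hits in the body
SELECT title, ts_rank(tsv, query) AS rank
FROM messages, to_tsquery('english', 'title | body') AS query
WHERE tsv @@ query
ORDER BY rank DESC;
```

Rows inserted before the swap keep their unweighted tsv until they are updated again.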
-- Using ts_stat
-- Rank the words that occur in the documents
-- nentry is the total number of occurrences of the word
-- ndoc is the number of documents (tsvectors) that contain the word; repeats within one document count once
postgres=# select * from messages;
title | body | tsv
----------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------
title here | the body text is here | 'bodi':4 'text':5 'titl':1
kenyon | a chinese boy | 'boy':4 'chines':3 'kenyon':1
Andy Roddick retired | Andy Roddick retired,a former rank number 1 player in tennis | '1':11 'andi':1,4 'former':8 'number':10 'player':12 'rank':9 'retir':3,6 'roddick':2,5 'tenni':14
kenyon retired | kenyon retired,a open-source lover,inserting in this area | 'area':13 'insert':10 'kenyon':1,3 'lover':9 'open':7 'open-sourc':6 'retir':2,4 'sourc':8
Michael Jordan | MJ is an American former professional basketball player | 'american':6 'basketbal':9 'former':7 'jordan':2 'michael':1 'mj':3 'player':10 'profession':8
(5 rows)
postgres=# SELECT * FROM ts_stat('SELECT tsv FROM messages') ORDER BY nentry DESC, ndoc DESC, word LIMIT 10;
word | ndoc | nentry
-----------+------+--------
retir | 2 | 4
kenyon | 2 | 3
former | 2 | 2
player | 2 | 2
andi | 1 | 2
roddick | 1 | 2
1 | 1 | 1
american | 1 | 1
area | 1 | 1
basketbal | 1 | 1
(10 rows)
V. Limitations of full-text search
1. The length of each lexeme must be less than 2K bytes
2. The length of a tsvector (lexemes + positions) must be less than 1 megabyte
3. The number of lexemes must be less than 2^64
4. Position values in tsvector must be greater than 0 and no more than 16,383
5. No more than 256 positions per lexeme
6. The number of nodes (lexemes + operators) in a tsquery must be less than 32,768
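To see how close stored documents come to these limits, the lexeme count and the stored size can be inspected. A rough sketch, assuming a table like the messages table from section IV with a tsvector column named tsv:

```sql
-- Largest tsvector values first
SELECT title,
       length(tsv)         AS lexemes,       -- number of distinct lexemes
       pg_column_size(tsv) AS stored_bytes   -- size as stored, possibly compressed
FROM messages
ORDER BY stored_bytes DESC
LIMIT 5;
```

Because pg_column_size reports the stored (possibly compressed) size, it is only an approximate guide to the 1 megabyte limit.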
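VI. Summary:
This post covered the environment for PostgreSQL's built-in full-text search and some practical examples. The built-in configurations do not support Chinese, but good third-party tools can be combined with PostgreSQL to fill the gap. The next post continues with setting up and using Chinese full-text search in PostgreSQL.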
Reposted from: https://my.oschina.net/Kenyon/blog/80904