KingbaseES 内置的缺省的分词解析器采用空格分词,因为中文的词语之间没有空格分割,所以这种方法并不适用于中文。要支持中文的全文检索需要额外的中文分词插件:zhparser and sys_jieba,其中zhparser 支持 GBK 和 UTF8 字符集,sys_jieba 支持 UTF8 字符集。
test=# SELECT to_tsvector('English','Try not to become a man of success, but rather try to become a man of value');
'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)
test=# SELECT to_tsvector('simple','Try not to become a man of success, but rather try to become a man of value');
'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');
'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)
这里可以看到,如果词干分析器是english ,会采取词干标准化的过程;而simple 只是转换成小写。默认是 simple。
test=# show default_text_search_config;
pg_catalog.simple
(1 row)
标准化过程会完成以下操作:
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'become';
t
(1 row)
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ 'becom';
f
(1 row)
test=# select 'become'::tsquery,to_tsquery('become'),to_tsquery('english','become');
tsquery | to_tsquery | to_tsquery
----------+------------+------------
'become' | 'become' | 'becom'
(1 row)
to_tsquery 也会进行标准化转换,在搜索时必须用 to_tsquery,确保数据不会因为标准化转换而搜索不到。
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('become');
t
(1 row)
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('!become');
f
(1 row)
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('tri & become');
t
(1 row)
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try & !becom');
f
(1 row)
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('Try | !become');
t
(1 row)
test=# SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value') @@ to_tsquery('bec:*');
t
(1 row)
test=# SELECT to_tsvector('simple','Try not to become a man of success, but rather try to become a man of value');
'a':5,14 'become':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rather':10 'success':8 'to':3,12 'try':1,11 'value':17
(1 row)
test=# SELECT to_tsvector('english','Try not to become a man of success, but rather try to become a man of value') ;
'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)
^
test=# SELECT to_tsvector('french','Try not to become a man of success, but rather try to become a man of value') ;
'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17
(1 row)
^
test=# SELECT to_tsvector('french'::regconfig,'Try not to become a man of success, but rather try to become a man of value') ;
'a':5,14 'becom':4,13 'but':9 'man':6,15 'not':2 'of':7,16 'rath':10 'success':8 'to':3,12 'try':1,11 'valu':17
(1 row)
simple并不忽略禁用词表,它也不会试着去查找单词的词根。使用simple时,空格分割的每一组字符都是一个语义;simple 只做了小写转换;对于数据来说,simple文本搜索配置项很实用。
在开始介绍中文检索前,我们先来看个例子:
test=# select to_tsvector('人大金仓致力于提供高可靠的数据库产品');
'人大金仓致力于提供高可靠的数据库产品':1
因为内置的分词器是按空格分割的,而中文间没有空格,因此,整句话就被看做一个分词。
create extension zhparser;
create text search configuration zhongwen_parser (parser = zhparser);
alter text search configuration zhongwen_parser add mapping for n,v,a,i,e,l,j with simple;
上面 for 后面的字母表示分词的token,上面的token映射只映射了名词(n),动词(v),形容词(a),成语(i),叹词(e),缩写(j) 和习用语(l)6种,这6种以外的token全部被屏蔽。词典使用的是内置的simple词典。具体的token 如下:
test=# select ts_token_type('zhparser');
(97,a,adjective)
(98,b,differentiation)
(99,c,conjunction)
(100,d,adverb)
(101,e,exclamation)
(102,f,position)
(103,g,root)
(104,h,head)
(105,i,idiom)
(106,j,abbreviation)
(107,k,tail)
(108,l,tmp)
(109,m,numeral)
(110,n,noun)
(111,o,onomatopoeia)
(112,p,prepositional)
(113,q,quantity)
(114,r,pronoun)
(115,s,space)
(116,t,time)
(117,u,auxiliary)
(118,v,verb)
(119,w,punctuation)
(120,x,unknown)
(121,y,modal)
(122,z,status)
(26 rows)
创建text search configuration 后,可以在视图pg_ts_config 看到如下信息:
test=# select * from pg_ts_config;
oid | cfgname | cfgnamespace | cfgowner | cfgparser
-------+-----------------+--------------+----------+-----------
3748 | simple | 11 | 10 | 3722
13265 | arabic | 11 | 10 | 3722
13267 | danish | 11 | 10 | 3722
13269 | dutch | 11 | 10 | 3722
13271 | english | 11 | 10 | 3722
13273 | finnish | 11 | 10 | 3722
13275 | french | 11 | 10 | 3722
13277 | german | 11 | 10 | 3722
13279 | hungarian | 11 | 10 | 3722
13281 | indonesian | 11 | 10 | 3722
13283 | irish | 11 | 10 | 3722
13285 | italian | 11 | 10 | 3722
13287 | lithuanian | 11 | 10 | 3722
13289 | nepali | 11 | 10 | 3722
13291 | norwegian | 11 | 10 | 3722
13293 | portuguese | 11 | 10 | 3722
13295 | romanian | 11 | 10 | 3722
13297 | russian | 11 | 10 | 3722
13299 | spanish | 11 | 10 | 3722
13301 | swedish | 11 | 10 | 3722
13303 | tamil | 11 | 10 | 3722
13305 | turkish | 11 | 10 | 3722
16390 | parser_name | 2200 | 10 | 16389
24587 | zhongwen_parser | 2200 | 10 | 16389
test=# select to_tsvector('zhongwen_parser','人大金仓致力于提供高可靠的数据库产品');
'产品':7 '人大':1 '可靠':5 '提供':3 '数据库':6 '致力于':2 '高':4
test=# \df+ contains
List of functions
Schema | Name | Result data type | Argument data types | Type | Volatility | Parallel | Owner | Security | Access privileges | Language | Source code
| Description
--------+----------+------------------+---------------------+------+------------+----------+--------+----------+-------------------+----------+------------------------------------------+-------------
sys | contains | boolean | text, text | func | immutable | safe | system | invoker | | sql | select to_tsvector($1) @@ to_tsquery($2) |
sys | contains | boolean | text, text, integer | func | immutable | safe | system | invoker | | sql | select to_tsvector($1) @@ to_tsquery($2) |
sys | contains | boolean | text, tsquery | func | immutable | safe | system | invoker | | sql | select $1::tsvector @@ $2 |
sys | contains | boolean | tsvector, text | func | immutable | safe | system | invoker | | sql | select $1 @@ $2::tsquery |
sys | contains | boolean | tsvector, tsquery | func | immutable | safe | system | invoker | | sql | select $1 @@ $2 |
默认contains 函数使用的是空格分词解析器,因此,无法使用contains 进行中文判断
test=# select contains('人大金仓致力于提供高可靠的数据库产品','产品');
f
手机扫一扫
移动阅读更方便
你可能感兴趣的文章