停用词是在索引和搜索过程中被忽略的词,通常由于其高频率且对搜索结果价值较低。
Manticore Search 默认对停用词应用词干提取,这可能导致不理想的结果,但可以通过使用stopwords_unstemmed来关闭此功能。
小型停用词文件存储在表头中,嵌入文件的大小有限制,该限制由embedded_limit选项定义。
停用词不被索引,但会影响关键词的位置。例如,如果“the”是停用词,文档1包含短语“in office”,而文档2包含短语“in the office”,搜索“in office”作为精确短语时只会返回第一个文档,尽管第二个文档中的“the”作为停用词被跳过。此行为可以通过stopword_step指令进行修改。
stopwords=path/to/stopwords/file[ path/to/another/file ...]
stopwords 设置是可选的,默认为空。它允许你指定一个或多个停用词文件的路径,路径之间用空格分隔。所有文件都会被加载。在实时模式下,只允许使用绝对路径。
停用词文件格式为简单的 UTF-8 编码纯文本。文件数据将根据charset_table设置进行分词,因此你可以使用与索引数据相同的分隔符。
停用词文件可以手动或半自动创建。indexer 提供了一个模式,可以创建按关键词频率排序的表频率字典。该字典中的热门关键词通常可以用作停用词。详情请参见--buildstops和--buildfreqs开关。该字典中的热门关键词通常可以用作停用词。
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt stopwords-ru.txt stopwords-en.txt'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'stopwords' => '/usr/local/manticore/data/stopwords.txt stopwords-ru.txt stopwords-en.txt'
]);utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'')await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt\'');utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'", true);utils_api.sql("CREATE TABLE products(title text, price float) stopwords = '/usr/local/manticore/data/stopwords.txt /usr/local/manticore/data/stopwords-ru.txt /usr/local/manticore/data/stopwords-en.txt'", Some(true)).await;table products {
stopwords = /usr/local/manticore/data/stopwords.txt
stopwords = stopwords-ru.txt stopwords-en.txt
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}或者你可以使用 Manticore 自带的默认停用词文件。目前提供了50种语言的停用词。以下是它们的完整别名列表:
- af - 南非荷兰语
- ar - 阿拉伯语
- bg - 保加利亚语
- bn - 孟加拉语
- ca - 加泰罗尼亚语
- ckb- 库尔德语
- cz - 捷克语
- da - 丹麦语
- de - 德语
- el - 希腊语
- en - 英语
- eo - 世界语
- es - 西班牙语
- et - 爱沙尼亚语
- eu - 巴斯克语
- fa - 波斯语
- fi - 芬兰语
- fr - 法语
- ga - 爱尔兰语
- gl - 加利西亚语
- hi - 印地语
- he - 希伯来语
- hr - 克罗地亚语
- hu - 匈牙利语
- hy - 亚美尼亚语
- id - 印度尼西亚语
- it - 意大利语
- ja - 日语
- ko - 韩语
- la - 拉丁语
- lt - 立陶宛语
- lv - 拉脱维亚语
- mr - 马拉地语
- nl - 荷兰语
- no - 挪威语
- pl - 波兰语
- pt - 葡萄牙语
- ro - 罗马尼亚语
- ru - 俄语
- sk - 斯洛伐克语
- sl - 斯洛文尼亚语
- so - 索马里语
- st - 索托语
- sv - 瑞典语
- sw - 斯瓦希里语
- th - 泰语
- tr - 土耳其语
- yo - 约鲁巴语
- zh - 中文
- zu - 祖鲁语
例如,要使用意大利语的停用词,只需在配置文件中添加以下行:
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) stopwords = 'it'POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'it'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'stopwords' => 'it'
]);utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'')await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'it\'');utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'it'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = 'it'", true);utils_api.sql("CREATE TABLE products(title text, price float) stopwords = 'it'", Some(true)).await;table products {
stopwords = it
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}如果你需要使用多种语言的停用词,应列出所有别名,RT模式下用逗号分隔,普通模式下用空格分隔:
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'stopwords' => 'en, it, ru'
]);utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'')await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en, it, ru\'');utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'", true);utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'", true);utils_api.sql("CREATE TABLE products(title text, price float) stopwords = 'en, it, ru'", Some(true)).await;table products {
stopwords = en it ru
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}stopword_step={0|1}
stopwords 的 position_increment 设置是可选的,允许的值为0和1,默认值为1。
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en' stopword_step = '1'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'stopwords' => 'en, it, ru',
'stopword_step' => '1'
]);utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'')await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'');utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'", true);utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'", true);utils_api.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopword_step = \'1\'", Some(true)).await;table products {
stopwords = en
stopword_step = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}stopwords_unstemmed={0|1}
Whether to apply stop words before or after stemming. Optional, default is 0 (apply stop word filter after stemming).
By default, stop words are stemmed themselves, and then applied to tokens after stemming (or any other morphology processing). This means that a token is stopped when stem(token) is equal to stem(stopword). This default behavior can lead to unexpected results when a token is erroneously stemmed to a stopped root. For example, "Andes" might get stemmed to "and", so when "and" is a stopword, "Andes" is also skipped.
However, you can change this behavior by enabling the stopwords_unstemmed directive. When this is enabled, stop words are applied before stemming (and therefore to the original word forms), and the tokens are skipped when the token is equal to the stopword.
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'POST /cli -d "
CREATE TABLE products(title text, price float) stopwords = 'en' stopwords_unstemmed = '1'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'stopwords' => 'en, it, ru',
'stopwords_unstemmed' => '1'
]);utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'')await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'');utilsApi.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'", true);utils_api.sql("CREATE TABLE products(title text, price float) stopwords = \'en\' stopwords_unstemmed = \'1\'", Some(true)).await;table products {
stopwords = en
stopwords_unstemmed = 1
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}词形变化在通过 charset_table 规则对输入文本进行分词后应用。它们本质上允许你用另一个词替换一个词。通常,这用于将不同的词形归一化为单一的标准形式(例如,将所有变体如 "walks"、"walked"、"walking" 规范化为标准形式 "walk")。它也可以用来实现 词干提取 的例外情况,因为词干提取不会应用于词形变化列表中的词。
wordforms = path/to/wordforms.txt
wordforms = path/to/alternateforms.txt
wordforms = path/to/dict*.txt
词形变化字典。可选,默认为空。
词形变化字典用于在索引和搜索过程中规范化输入词。因此,对于 plain table 来说,必须旋转表以应用词形变化文件中的更改。
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt /var/lib/manticore/dict*.txt'POST /cli -d "
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'wordforms' => [
'/var/lib/manticore/wordforms.txt',
'/var/lib/manticore/alternateforms.txt',
'/var/lib/manticore/dict*.txt'
]
]);utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')await utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float)wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'');utilsApi.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", true);utils_api.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", Some(true)).await;table products {
wordforms = /var/lib/manticore/wordforms.txt
wordforms = /var/lib/manticore/alternateforms.txt
wordforms = /var/lib/manticore/dict*.txt
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}Manticore 中的词形变化支持设计为能够良好处理大型字典。它们对索引速度有适度影响;例如,包含 100 万条目的字典会使全文索引速度降低约 1.5 倍。搜索速度完全不受影响。额外的内存占用大致等于字典文件大小,且字典在多个表之间共享。例如,如果同一个 50 MB 的词形变化文件被指定给 10 个不同的表,额外的 searchd 内存使用大约为 50 MB。
字典文件应为简单的纯文本格式。每行应包含源词形和目标词形,使用 UTF-8 编码,并以“大于号”分隔。加载文件时会应用 charset_table 中的规则。因此,如果你不修改 charset_table,你的词形变化将是不区分大小写的,类似于其他全文索引的数据。以下是文件内容的示例:
- Example
walks > walk
walked > walk
walking > walk有一个捆绑的工具叫做 Spelldump,它可以帮助你创建 Manticore 可读取格式的字典文件。该工具可以读取以 ispell 或 MySpell 格式的源 .dict 和 .aff 字典文件,这些文件随 OpenOffice 一起捆绑提供。
你可以将多个源词映射到一个目标词。该过程作用于分词后的词元,而非源文本,因此空白和标记的差异会被忽略。
你可以使用 => 符号代替 >。也允许注释(以 # 开头)。最后,如果一行以波浪号(~)开头,词形变化将在形态学处理之后应用,而非之前(注意此情况下只支持单个源词和目标词)。
- Example
core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'如果你需要将 >、= 或 ~ 作为普通字符使用,可以通过在它们前面加反斜杠(\)来转义。> 和 = 都应以此方式转义。示例如下:
- Example
a\> > abc
\>b > bcd
c\=\> => cde
\=\>d => def
\=\>a \> f \> => foo
\~g => bar你可以指定多个目标词元:
- Example
s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3你可以指定多个文件,而不仅仅是一个。可以使用通配符作为模式,所有匹配的文件将按简单升序处理:
在 RT 模式下,只允许使用绝对路径。
如果使用多字节编码页且文件名包含非拉丁字符,结果顺序可能不完全是字母顺序。如果在多个文件中发现相同的词形变化定义,后面的定义将覆盖之前的。
- SQL
- Config
create table tbl1 ... wordforms='/tmp/wf*'
create table tbl2 ... wordforms='/tmp/wf, /tmp/wf2'wordforms=/tmp/wf
wordforms=/tmp/wf2
wordforms=/tmp/wf_new*异常(也称为同义词)允许将一个或多个标记(包括通常会被排除的字符的标记)映射到单个关键词。它们类似于wordforms,因为它们也执行映射,但有一些重要的区别。
与wordforms的区别简要总结如下:
| 异常 | 词形 |
|---|---|
| 区分大小写 | 不区分大小写 |
| 可以使用不在charset_table中的特殊字符 | 完全遵守charset_table |
| 在庞大词典中表现较差 | 设计用于处理数百万条目 |
exceptions = path/to/exceptions.txt
分词异常文件。可选,默认为空。 在RT模式下,只允许使用绝对路径。
预期的文件格式是纯文本,每行一个异常。行格式如下:
map-from-tokens => map-to-token
示例文件:
at & t => at&t
AT&T => AT&T
Standarten Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus
\=\>abc\> => abc
这里的所有标记都是区分大小写的,并且不会被charset_table规则处理。因此,以上示例异常文件中,at&t文本将被分词为两个关键词at和t,因为小写字母的原因。另一方面,AT&T将完全匹配并生成单个AT&T关键词。
如果需要将>或=作为普通字符使用,可以通过在它们前面加反斜杠(\)来转义。>和=都应以这种方式转义。
请注意,映射到的关键词:
- 总是被解释为单个词
- 对大小写和空格都敏感
在上述示例中,ms windows查询将不会匹配包含MS Windows文本的文档。该查询将被解释为两个关键词ms和windows的查询。MS Windows的映射是单个关键词ms windows,中间有空格。另一方面,standartenfuhrer将检索包含Standarten Fuhrer或Standarten Fuehrer内容(大小写完全如示例所示),或关键词本身的任何大小写变体的文档,例如staNdarTenfUhreR。(但它不会匹配standarten fuhrer:该文本由于大小写敏感性不匹配任何列出的异常,因此被索引为两个独立的关键词。)
映射来源标记列表中的空白符很重要,但其数量无关紧要。映射来源列表中的任意数量空白符都将匹配索引文档或查询中的任意数量空白符。例如,AT & T映射来源标记将匹配AT & T文本,无论映射来源部分和索引文本中的空格数量如何。因此,这样的文本将被索引为特殊的AT&T关键词,这得益于示例中的第一条目。
异常还允许捕获特殊字符(这些字符是一般charset_table规则的例外;因此得名)。假设你通常不想将+视为有效字符,但仍想能够搜索一些此规则的例外,如C++。上述示例完全满足此需求,完全独立于表中包含哪些字符。
使用plain table时,必须旋转表以合并异常文件的更改。对于实时表,更改只会应用于新文档。
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'POST /cli -d "
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'exceptions' => '/usr/local/manticore/data/exceptions.txt'
]);utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');utilsApi.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", true);utils_api.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", Some(true)).await;table products {
exceptions = /usr/local/manticore/data/exceptions.txt
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}形态学预处理器可以在索引期间应用于单词,以规范同一单词的不同形式并改进分词。例如,英文词干提取器可以将 "dogs" 和 "dog" 规范为 "dog",从而使这两个关键词的搜索结果相同。
Manticore 有四种内置的形态学预处理器:
- 词形还原器:将单词还原为其词根或词元。例如,"running" 可以还原为 "run","octopi" 可以还原为 "octopus"。注意,有些单词可能有多个对应的词根形式。例如,"dove" 既可以是 "dive" 的过去式,也可以是表示鸟类的名词,如 "A white dove flew over the cuckoo's nest." 在这种情况下,词形还原器可以生成所有可能的词根形式。
- 词干提取器:通过移除或替换某些已知后缀,将单词还原为词干。得到的词干不一定是有效单词。例如,Porter 英文词干提取器将 "running" 还原为 "run",将 "business" 还原为 "busi"(不是有效单词),并且不会对 "octopi" 进行还原。
- 语音算法:用语音编码替换单词,即使单词不同但发音相近,编码也相同。
- 分词算法:将文本拆分成单词。目前仅支持中文。
morphology = morphology1[, morphology2, ...]
morphology 指令指定要应用于被索引单词的形态学预处理器列表。这是一个可选设置,默认不应用任何预处理器。
Manticore 内置了以下形态学预处理器:
- 英语、俄语和德语词形还原器
- 英语、俄语、阿拉伯语和捷克语词干提取器
- SoundEx 和 MetaPhone 语音算法
- 中文分词算法
- 还提供了 Snowball(libstemmer)词干提取器,支持超过 15 种其他语言。
词形还原器需要字典 .pak 文件,可以通过 manticore-language-packs 软件包安装,或从 Manticore 网站下载。在后者情况下,字典需要放置在由 lemmatizer_base 指定的目录中。
此外,lemmatizer_cache 设置可以通过使用更多内存缓存未压缩字典来加速词形还原。
中文分词可以使用 ICU 或 Jieba(需要 manticore-language-packs 包)。这两个库提供比 n-gram 更准确的分词,但速度稍慢。charset_table 必须包含所有中文字符,可以使用 cont、cjk 或 chinese 字符集来实现。当设置 morphology=icu_chinese 或 morphology=jieba_chinese 时,文档首先由 ICU 或 Jieba 预处理。然后,分词器根据 charset_table 处理结果,最后应用 morphology 选项中的其他形态学处理器。只有包含中文的文本部分会传递给 ICU/Jieba 进行分词,其他部分可以通过不同方式修改,如不同的形态学处理或 charset_table。
内置的英语和俄语词干提取器比 libstemmer 版本更快,但可能产生略有不同的结果。
Soundex 实现与 MySQL 一致。Metaphone 实现基于 Double Metaphone 算法,索引主编码。
要使用 morphology 选项,请指定一个或多个内置选项,包括:
- none:不执行任何形态学处理
- lemmatize_ru - 应用俄语词形还原器并选择单一词根形式
- lemmatize_uk - 应用乌克兰语词形还原器并选择单一词根形式(先在 Centos 或 Ubuntu/Debian 安装)。为确保词形还原器正常工作,请确保在
charset_table中保留特定的乌克兰字符,因为默认不包含。可通过如下方式覆盖:charset_table='non_cont,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491'。这里 有一个关于如何安装和使用乌克兰词形还原器的交互式课程。 - lemmatize_en - 应用英语词形还原器并选择单一词根形式
- lemmatize_de - 应用德语词形还原器并选择单一词根形式
- lemmatize_ru_all - 应用俄语词形还原器并索引所有可能的词根形式
- lemmatize_uk_all - 应用乌克兰语词形还原器并索引所有可能的词根形式。安装链接见上,并注意
charset_table设置。 - lemmatize_en_all - 应用英语词形还原器并索引所有可能的词根形式
- lemmatize_de_all - 应用德语词形还原器并索引所有可能的词根形式
- stem_en - 应用 Porter 英语词干提取器
- stem_ru - 应用 Porter 俄语词干提取器
- stem_enru - 同时应用 Porter 英语和俄语词干提取器
- stem_cz - 应用捷克语词干提取器
- stem_ar - 应用阿拉伯语词干提取器
- soundex - 用 SOUNDEX 代码替换关键词
- metaphone - 用 METAPHONE 代码替换关键词
- icu_chinese - 使用 ICU 进行中文分词
- jieba_chinese - 使用 Jieba 进行中文分词(需要
manticore-language-packs包) - libstemmer_* 。详情请参阅 支持语言列表
可以指定多个词干提取器,用逗号分隔。它们将按照列出的顺序应用于输入的单词,并且一旦其中一个词干提取器修改了单词,处理将停止。此外,当启用wordforms功能时,单词将首先在词形字典中查找。如果字典中有匹配的条目,则根本不会应用词干提取器。wordforms可用于实现词干提取的例外情况。
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'POST /cli -d "CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'morphology' => 'stem_en, libstemmer_sv'
]);utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'');utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'", true);utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'", Some(true)).await;table products {
morphology = stem_en, libstemmer_sv
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}morphology_skip_fields = field1[, field2, ...]
跳过形态学预处理的字段列表。可选,默认值为空(对所有字段应用预处理器)。
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'POST /cli -d "
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'morphology_skip_fields' => 'name',
'morphology' => 'stem_en'
]);utilsApi.sql('CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')await utilsApi.sql('CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'');utilsApi.sql("CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'", true);utils_api.sql("CREATE TABLE products(title text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'", Some(true)).await;table products {
morphology_skip_fields = name
morphology = stem_en
type = rt
path = tbl
rt_field = title
rt_field = name
rt_attr_uint = price
}min_stemming_len = length
启用词干提取的最小单词长度。可选,默认值为1(对所有单词进行词干提取)。
词干提取器并不完美,有时可能产生不理想的结果。例如,将“gps”关键词通过英语的Porter词干提取器处理,结果是“gp”,这并非真正的意图。min_stemming_len功能允许您根据源单词长度抑制词干提取,即避免对过短的单词进行词干提取。比给定阈值更短的关键词将不会被词干提取。注意,长度恰好等于指定值的关键词会被词干提取。因此,为了避免对3个字符的关键词进行词干提取,您应将值指定为4。有关更细粒度的控制,请参阅wordforms功能。
- SQL
- JSON
- PHP
- Python
- Python-asycnio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'POST /cli -d "
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'min_stemming_len' => '4',
'morphology' => 'stem_en'
]);utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'');utilsApi.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'", true);utils_api.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'", Some(true)).await;table products {
min_stemming_len = 4
morphology = stem_en
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}index_exact_words = {0|1}
此选项允许对原始关键词及其形态学修改版本进行索引。但是,被wordforms和exceptions重新映射的原始关键词无法被索引。默认值为0,表示默认情况下此功能被禁用。
这允许在查询语言中使用精确形式操作符。启用此功能将增加全文索引的大小和索引时间,但不会影响搜索性能。
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'POST /cli -d "
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'index_exact_words' => '1',
'morphology' => 'stem_en'
]);utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'');utilsApi.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'", true);utils_api.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'", Some(true)).await;table products {
index_exact_words = 1
morphology = stem_en
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}jieba_hmm = {0|1}
启用或禁用Jieba分词工具中的HMM。可选,默认值为1。
在Jieba中,HMM(隐马尔可夫模型)选项指的是一种用于分词的算法。具体来说,它允许Jieba通过识别未知词,尤其是词典中不存在的词,来执行中文分词。
Jieba主要使用基于词典的方法来分割已知词,但当启用HMM选项时,它会应用统计模型来识别词典中不存在的词或短语的可能词边界。这对于分割新词、罕见词、姓名和俚语特别有用。
总之,jieba_hmm选项有助于提高分词准确性,但会牺牲索引性能。它必须与morphology = jieba_chinese一起使用,详见中文、日文和韩文(CJK)及泰语。
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'POST /cli -d "
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'morphology' => 'jieba_chinese',
'jieba_hmm'='1'
]);utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_hmm = \'0\'')await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_hmm = \'0\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_hmm = \'0\'');utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'", true);utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'", Some(true)).await;table products {
morphology = jieba_chinese
jieba_hmm = 0
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}jieba_mode = {accurate|full|search}
Jieba segmentation mode. Optional; the default is accurate.
In accurate mode, Jieba splits the sentence into the most precise words using dictionary matching. This mode focuses on precision, ensuring that the segmentation is as accurate as possible.
In full mode, Jieba tries to split the sentence into every possible word combination, aiming to include all potential words. This mode focuses on maximizing recall, meaning it identifies as many words as possible, even if some of them overlap or are less commonly used. It returns all the words found in its dictionary.
In search mode, Jieba breaks the text into both whole words and smaller parts, combining precise segmentation with extra detail by providing overlapping word fragments. This mode balances precision and recall, making it useful for search engines.
jieba_mode should be used with morphology = jieba_chinese. See Chinese, Japanese, Korean (CJK) and Thai languages.
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'POST /cli -d "
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'morphology' => 'jieba_chinese',
'jieba_mode'='full'
]);utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_mode = \'full\'')await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_mode = \'full\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_mode = \'full\'');utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'", true);utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'", Some(true)).await;table products {
morphology = jieba_chinese
jieba_mode = full
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}jieba_user_dict_path = path/to/stopwords/file
Path to the Jieba user dictionary. Optional.
Jieba, a Chinese text segmentation library, uses dictionary files to assist with word segmentation. The format of these dictionary files is as follows: each line contains a word, split into three parts separated by spaces — the word itself, word frequency, and part of speech (POS) tag. The word frequency and POS tag are optional and can be omitted. The dictionary file must be UTF-8 encoded.
Example:
创新办 3 i
云计算 5
凱特琳 nz
台中
jieba_user_dict_path should be used with morphology = jieba_chinese. For more details, see Chinese, Japanese, Korean (CJK), and Thai languages.
- SQL
- JSON
- PHP
- Python
- Python-asyncio
- javascript
- Java
- C#
- Rust
- CONFIG
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'POST /cli -d "
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'"$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
'title'=>['type'=>'text'],
'price'=>['type'=>'float']
],[
'morphology' => 'jieba_chinese',
'jieba_user_dict_path' = '/usr/local/manticore/data/user-dict.txt'
]);utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_user_dict_path = \'/usr/local/manticore/data/user-dict.txt\'')await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_user_dict_path = \'/usr/local/manticore/data/user-dict.txt\'')res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_user_dict_path = \'/usr/local/manticore/data/user-dict.txt\'');utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'", true);utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'", true);utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'", Some(true)).await;table products {
morphology = jieba_chinese
jieba_user_dict_path = /usr/local/manticore/data/user-dict.txt
type = rt
path = tbl
rt_field = title
rt_attr_uint = price
}