NLP and tokenization > Morphology | Manticore Search Manual

Wordforms

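Wordforms are applied after the incoming text has been tokenized by the charset_table rules. They essentially let you replace one word with another. Normally, this is used to bring different word forms to a single normal form (for example, to normalize all variants such as "walks", "walked", "walking" to the normal form "walk"). It can also be used to implement stemming exceptions, because stemming is not applied to words found in the forms list.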

wordforms = path/to/wordforms.txt
wordforms = path/to/alternateforms.txt
wordforms = path/to/dict*.txt

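Wordforms dictionary. Optional, default is empty.

The wordforms dictionaries are used to normalize incoming words both during indexing and during searching. Therefore, for a plain table, you must rotate the table to pick up changes made to the wordforms file.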

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG

CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt /var/lib/manticore/dict*.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'wordforms' => [
                '/var/lib/manticore/wordforms.txt',
                '/var/lib/manticore/alternateforms.txt',
                '/var/lib/manticore/dict*.txt'
            ]
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) wordforms = \'/var/lib/manticore/wordforms.txt\' wordforms = \'/var/lib/manticore/alternateforms.txt\' wordforms = \'/var/lib/manticore/dict*.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) wordforms = '/var/lib/manticore/wordforms.txt' wordforms = '/var/lib/manticore/alternateforms.txt' wordforms = '/var/lib/manticore/dict*.txt'", Some(true)).await;

table products {
  wordforms = /var/lib/manticore/wordforms.txt
  wordforms = /var/lib/manticore/alternateforms.txt
  wordforms = /var/lib/manticore/dict*.txt
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

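Wordforms support in Manticore is designed to handle large dictionaries well. They moderately affect indexing speed; for example, a dictionary with 1 million entries slows down full-text indexing by about 1.5 times. Search speed is not affected at all. The additional RAM usage is roughly equal to the dictionary file size, and dictionaries are shared across tables. For instance, if the very same 50 MB wordforms file is specified for 10 different tables, the additional searchd RAM usage will be about 50 MB.

The dictionary file should be in a simple plain-text format. Each line should contain a source and a destination word form in UTF-8 encoding, separated by a "greater than" sign. The rules from charset_table are applied when the file is loaded. Therefore, if you do not modify charset_table, your wordforms will be case-insensitive, just like your other full-text indexed data. Below is a sample of the file contents: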


walks > walk
walked > walk
walking > walk

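A bundled utility called Spelldump helps you create a dictionary file in a format that Manticore can read. The utility can read source .dict and .aff dictionary files in the ispell or MySpell format, as bundled with OpenOffice.

You can map several source words to a single destination word. The process happens on tokens, not on the source text, so differences in whitespace and markup are ignored.

You can use the => symbol instead of >. Comments (starting with #) are also allowed. Finally, if a line starts with a tilde (~), the wordform is applied after morphology instead of before it (note that only a single source word and a single destination word are supported in this case).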

Example:

core 2 duo > c2d
e6600 > c2d
core 2duo => c2d # Some people write '2duo' together...
~run > walk # Along with stem_en morphology enabled replaces 'run', 'running', 'runs' (and any other words that stem to just 'run') to 'walk'
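
To make the line format concrete, here is a minimal Python sketch of a parser for one wordforms line. It is only an illustration of the rules described above (the `>` or `=>` separator, `#` comments, the leading `~`), not Manticore's own loader; the function name `parse_wordform_line` is hypothetical, and backslash escapes are deliberately left out.

```python
import re

def parse_wordform_line(line):
    """Parse one wordforms line into (after_morphology, source_tokens, dest_tokens).

    Returns None for blank or comment-only lines. Backslash escapes are
    not handled in this sketch.
    """
    text = line.split("#", 1)[0].strip()      # drop a trailing comment
    if not text:
        return None
    after_morphology = text.startswith("~")    # "~" means: apply after morphology
    if after_morphology:
        text = text[1:].lstrip()
    # Accept either "=>" or ">" as the separator.
    parts = re.split(r"\s*=?>\s*", text, maxsplit=1)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"malformed wordforms line: {line!r}")
    return after_morphology, parts[0].split(), parts[1].split()
```

Splitting the source side into tokens mirrors the statement that mapping happens on tokens rather than on raw text.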

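If you need to use >, = or ~ as normal characters, you can escape them by preceding each one with a backslash (\). Both > and = should be escaped in this manner. Here is an example: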


a\> > abc
\>b > bcd
c\=\> => cde
\=\>d => def
\=\>a \> f \> => foo
\~g => bar

You can specify multiple destination tokens:

s02e02 > season 2 episode 2
s3 e3 > season 3 episode 3

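You can specify multiple files rather than just one. Wildcard masks can be used as a pattern, and all matching files are processed in simple ascending order.

In RT mode, only absolute paths are allowed.

If multi-byte code pages are used and file names include non-Latin characters, the resulting order may not be exactly alphabetical. If the same wordform definition is found in multiple files, the later one is used and overrides the earlier definitions.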

Example (SQL):

create table tbl1 ... wordforms='/tmp/wf*'
create table tbl2 ... wordforms='/tmp/wf, /tmp/wf2'
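
The "ascending order, later definition wins" behavior can be sketched in a few lines of Python. This is an assumption-laden illustration rather than Manticore's implementation: `load_wordforms` is a hypothetical helper, Python's `sorted()` stands in for the "simple ascending order", and escapes and the `~` prefix are ignored.

```python
import glob
import re

def load_wordforms(pattern):
    """Merge every wordforms file matching `pattern` into one dict."""
    forms = {}
    # sorted() stands in for the "simple ascending order" of matching files.
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as fh:
            for raw in fh:
                line = raw.split("#", 1)[0].strip()   # drop comments
                if ">" not in line:
                    continue                           # blank or malformed line
                source, dest = re.split(r"\s*=?>\s*", line, maxsplit=1)
                # A definition from a later file overrides an earlier one.
                forms[source] = dest
    return forms
```

With two files `wf1.txt` and `wf2.txt` that both define `walks`, the mapping from `wf2.txt` is the one that survives.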

wordforms_list = 'source-form > destination-form; ...'

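The wordforms_list setting lets you specify wordforms directly in the CREATE TABLE statement. It is supported only in RT mode.

Values must be separated by semicolons (;). Because wordforms may contain > or => as separators, and possibly other special characters, make sure to escape a semicolon with a backslash (\;) if it is part of the form itself.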
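
As a small illustration of the separator and escaping rule, the following Python sketch builds a wordforms_list value from (source, destination) pairs. The helper name `build_wordforms_list` is hypothetical; it only escapes semicolons, and quoting the result for SQL is left to the caller.

```python
def build_wordforms_list(pairs):
    """Join (source, destination) pairs into a wordforms_list value."""
    def escape(form):
        # A semicolon that belongs to the form itself must be written as "\;".
        return form.replace(";", "\\;")
    return "; ".join(f"{escape(src)} > {escape(dst)}" for src, dst in pairs)
```

For example, `build_wordforms_list([("walks", "walk"), ("walked", "walk")])` yields the same value used in the CREATE TABLE examples for this setting.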

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust

CREATE TABLE products(title text, price float) wordforms_list = 'walks > walk; walked > walk'

POST /cli -d "
CREATE TABLE products(title text, price float) wordforms_list = 'walks > walk; walked > walk'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'wordforms_list' => 'walks > walk; walked > walk'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) wordforms_list = \'walks > walk; walked > walk\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) wordforms_list = \'walks > walk; walked > walk\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) wordforms_list = \'walks > walk; walked > walk\'');

utilsApi.sql("CREATE TABLE products(title text, price float) wordforms_list = 'walks > walk; walked > walk'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) wordforms_list = 'walks > walk; walked > walk'", true);

utils_api.sql("CREATE TABLE products(title text, price float) wordforms_list = 'walks > walk; walked > walk'", Some(true)).await;

Exceptions

Last modified: February 07, 2026

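Exceptions (also known as synonyms) allow mapping one or more tokens (including tokens with characters that would normally be excluded) to a single keyword. They are similar to wordforms in that they also perform mapping, but there are a number of important differences.

A short summary of the differences from wordforms:

Exceptions	Wordforms
Case-sensitive	Case-insensitive
Can use special characters that are not in charset_table	Fully obey charset_table
Underperform on huge dictionaries	Designed to handle millions of entries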

exceptions = path/to/exceptions.txt

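Tokenizing exceptions file. Optional, default is empty. In RT mode, only absolute paths are allowed.

The expected file format is plain text, with one exception per line. The line format is as follows:

source-form => destination-form

Example file: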

at & t => at&t
AT&T => AT&T
Standarten   Fuehrer => standartenfuhrer
Standarten Fuhrer => standartenfuhrer
MS Windows => ms windows
Microsoft Windows => ms windows
C++ => cplusplus
c++ => cplusplus
C plus plus => cplusplus
\=\>abc\> => abc

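All forms here are case-sensitive and are not processed by the charset_table rules. Thus, with the sample exceptions file above, the text at&t will be tokenized as two keywords, at and t, because of the lowercase letters. On the other hand, AT&T will match exactly and produce a single AT&T keyword.

If you need to use > or = as normal characters, you can escape them by preceding each one with a backslash (\). Both characters should be escaped in this manner.

Note that the keywords from the destination form:

are always interpreted as a single word
are both case-sensitive and space-sensitive

In the sample above, the query ms windows will not match a document containing the text MS Windows. The query is interpreted as a query for two keywords, ms and windows, whereas the mapping for MS Windows is a single keyword ms windows with a space in the middle. On the other hand, standartenfuhrer will retrieve documents containing Standarten Fuhrer or Standarten Fuehrer (capitalized exactly like this), or any capitalization variant of the keyword itself, such as staNdarTenfUhreR. (It will not catch standarten fuhrer, however: because of case sensitivity, that text does not match any of the listed exceptions and is indexed as two separate keywords.)

Whitespace in the list of source forms matters, but its amount does not. Any amount of whitespace in the source form list matches any other amount of whitespace in the indexed document or query. For example, given an entry such as AT & T => AT&T, the AT & T source form matches the text AT    &    T regardless of how many spaces appear in either the source form or the indexed text. Such text is therefore indexed as a single special keyword, AT&T.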
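
The matching rule for source forms (case-sensitive, whitespace required but of any length) can be sketched with a regular expression. This is only an illustration under stated assumptions, not how Manticore matches exceptions internally; `compile_exception` is a hypothetical helper name.

```python
import re

def compile_exception(source_form):
    """Compile a source form into a case-sensitive, whitespace-flexible pattern."""
    tokens = source_form.split()
    # Each gap between tokens must contain whitespace, but any amount of it.
    return re.compile(r"\s+".join(re.escape(token) for token in tokens))
```

Under this sketch, `compile_exception("AT & T")` fully matches `"AT    &    T"` but neither `"AT&T"` (whitespace is missing) nor `"at & t"` (case differs).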

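Exceptions also allow capturing special characters (characters that are exceptions to the general charset_table rules, hence the name). Suppose you generally do not want to treat + as a valid character, but still want to be able to search for some exceptions to this rule, such as C++. The sample above does exactly that, completely independently of which characters are in the table and which are not.

When using a plain table, you must rotate the table to incorporate changes made to the exceptions file. For a real-time table, changes apply only to new documents.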

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG

CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'exceptions' => '/usr/local/manticore/data/exceptions.txt'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions = \'/usr/local/manticore/data/exceptions.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) exceptions = '/usr/local/manticore/data/exceptions.txt'", Some(true)).await;

table products {
  exceptions = /usr/local/manticore/data/exceptions.txt
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

exceptions_list = 'source-form => destination-form; ...'

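The exceptions_list setting lets you specify exceptions directly in the CREATE TABLE statement. It is supported only in RT mode.

Values must be separated by semicolons (;). Because exceptions may contain > or => as separators, and possibly other special characters, make sure to escape a semicolon with a backslash (\;) if it is part of the form itself.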

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust

CREATE TABLE products(title text, price float) exceptions_list = 'at & t => at&t; MS Windows => ms windows'

POST /cli -d "
CREATE TABLE products(title text, price float) exceptions_list = 'at & t => at&t; MS Windows => ms windows'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'exceptions_list' => 'at & t => at&t; MS Windows => ms windows'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) exceptions_list = \'at & t => at&t; MS Windows => ms windows\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions_list = \'at & t => at&t; MS Windows => ms windows\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) exceptions_list = \'at & t => at&t; MS Windows => ms windows\'');

utilsApi.sql("CREATE TABLE products(title text, price float) exceptions_list = 'at & t => at&t; MS Windows => ms windows'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) exceptions_list = 'at & t => at&t; MS Windows => ms windows'", true);

utils_api.sql("CREATE TABLE products(title text, price float) exceptions_list = 'at & t => at&t; MS Windows => ms windows'", Some(true)).await;

Morphology

Last modified: January 21, 2026

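Morphology preprocessors can be applied to words during indexing to normalize different forms of the same word and to improve segmentation. For example, an English stemmer can normalize both "dogs" and "dog" to "dog", so that searching for either keyword returns the same results.

Manticore has four built-in kinds of morphology preprocessors:

Lemmatizer: reduces a word to its root, or lemma. For example, "running" can be reduced to "run" and "octopi" to "octopus". Note that some words may have several corresponding root forms. For instance, "dove" can be either the past tense of "dive" or a noun denoting a bird, as in the sentence "A white dove flew over the cuckoo's nest." In such cases, a lemmatizer can generate all possible root forms.
Stemmer: reduces a word to its stem by removing or replacing certain known suffixes. The resulting stem is not necessarily a valid word. For example, the Porter English stemmer reduces "running" to "run" and "business" to "busi" (not a valid word), and does not reduce "octopi" at all.
Phonetic algorithms: replace words with phonetic codes that are identical for words that differ in spelling but sound alike.
Word-breaking algorithms: split text into words. Currently available only for Chinese.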

morphology = morphology1[, morphology2, ...]

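The morphology directive specifies a list of morphology preprocessors to apply to the words being indexed. This setting is optional; by default, no preprocessor is applied.

Manticore ships with built-in morphology preprocessors for:

English, Russian, and German lemmatizers
English, Russian, Arabic, and Czech stemmers
SoundEx and MetaPhone phonetic algorithms
A Chinese word-breaking algorithm
Snowball (libstemmer) stemmers for more than 15 other languages

Lemmatizers require dictionary .pak files, which can be installed with the manticore-language-packs package or downloaded from the Manticore website. In the latter case, the dictionaries must be placed in the directory specified by lemmatizer_base.

Additionally, the lemmatizer_cache setting can speed up lemmatization by spending more RAM on an uncompressed dictionary cache.

Chinese segmentation can be performed with ICU or Jieba (the latter requires the manticore-language-packs package). Both libraries provide more accurate segmentation than n-grams, but are slightly slower. The charset_table must include all Chinese characters, which can be achieved with the cont, cjk, or chinese character sets. When you set morphology=icu_chinese or morphology=jieba_chinese, documents are first preprocessed by ICU or Jieba. The tokenizer then processes the result according to charset_table, and finally the other morphology processors from the morphology option are applied. Only the parts of the text that contain Chinese are passed to ICU/Jieba for segmentation; the other parts can be modified by other means, such as different morphologies or charset_table.

The built-in English and Russian stemmers are faster than their libstemmer counterparts, but may produce slightly different results.

The Soundex implementation matches that of MySQL. The Metaphone implementation is based on the Double Metaphone algorithm and indexes the primary code.

To use the morphology option, specify one or more of the built-in options, including: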

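none - do not perform any morphology processing
lemmatize_ru - apply the Russian lemmatizer and pick a single root form
lemmatize_uk - apply the Ukrainian lemmatizer and pick a single root form (install it first on CentOS or Ubuntu/Debian). For the lemmatizer to work correctly, make sure the Ukrainian-specific characters are preserved in your charset_table, since by default they are not. To do so, override them like this: charset_table='non_cont,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491'. An interactive course on installing and using the Ukrainian lemmatizer is also available.
lemmatize_en - apply the English lemmatizer and pick a single root form
lemmatize_de - apply the German lemmatizer and pick a single root form
lemmatize_ru_all - apply the Russian lemmatizer and index all possible root forms
lemmatize_uk_all - apply the Ukrainian lemmatizer and index all possible root forms. See the installation notes above, and pay attention to the charset_table setting.
lemmatize_en_all - apply the English lemmatizer and index all possible root forms
lemmatize_de_all - apply the German lemmatizer and index all possible root forms
stem_en - apply Porter's English stemmer
stem_ru - apply Porter's Russian stemmer
stem_enru - apply Porter's English and Russian stemmers
stem_cz - apply the Czech stemmer
stem_ar - apply the Arabic stemmer
soundex - replace keywords with their SOUNDEX code
metaphone - replace keywords with their METAPHONE code
icu_chinese - apply Chinese text segmentation using ICU
jieba_chinese - apply Chinese text segmentation using Jieba (requires the manticore-language-packs package)
libstemmer_* - see the list of supported languages for details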

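Multiple stemmers can be specified, separated by commas. They are applied to incoming words in the order in which they are listed, and processing stops as soon as one of the stemmers modifies the word. Additionally, when the wordforms feature is enabled, the word is looked up in the wordforms dictionary first. If the dictionary contains a matching entry, stemmers are not applied at all. wordforms can therefore be used to implement stemming exceptions.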
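
The processing order described above can be sketched as follows. This is an illustrative Python model, not Manticore code: `normalize` is a hypothetical function, and the two toy "stemmers" exist only to show the control flow and are far cruder than real stemmers.

```python
def normalize(word, wordforms, stemmers):
    """Apply the wordforms lookup first, then the first stemmer that changes the word."""
    if word in wordforms:
        return wordforms[word]          # a wordforms hit bypasses stemming entirely
    for stem in stemmers:
        stemmed = stem(word)
        if stemmed != word:
            return stemmed              # stop at the first stemmer that modifies the word
    return word

# Toy stand-ins for real stemmers (assumptions for illustration only).
def strip_ing(word):
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word

def strip_s(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word
```

For instance, `normalize("walking", {}, [strip_ing, strip_s])` returns `"walk"`, while adding `{"walking": "stroll"}` as the wordforms dictionary makes the same call return `"stroll"` without consulting any stemmer.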

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG

CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'

POST /cli -d "CREATE TABLE products(title text, price float)  morphology = 'stem_en, libstemmer_sv'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology' => 'stem_en, libstemmer_sv'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'stem_en, libstemmer_sv\'');

utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'", true);

utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'stem_en, libstemmer_sv'", Some(true)).await;

table products {
  morphology = stem_en, libstemmer_sv
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

morphology_skip_fields = field1[, field2, ...]

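A list of fields for which morphology preprocessing is skipped. Optional, default is empty (preprocessors are applied to all fields).

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG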

CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'

POST /cli -d "
CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'name'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'morphology_skip_fields' => 'name',
            'morphology' => 'stem_en'
        ]);

utilsApi.sql('CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')

await utilsApi.sql('CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'')

res = await utilsApi.sql('CREATE TABLE products(title text, name text, price float) morphology_skip_fields = \'name\' morphology = \'stem_en\'');

utilsApi.sql("CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'", true);

utilsApi.Sql("CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'", true);

utils_api.sql("CREATE TABLE products(title text, name text, price float) morphology_skip_fields = 'name' morphology = 'stem_en'", Some(true)).await;

table products {
  morphology_skip_fields = name
  morphology = stem_en
  type = rt
  path = tbl
  rt_field = title
  rt_field = name
  rt_attr_uint = price
}

min_stemming_len = length

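The minimum word length at which stemming is enabled. Optional, default is 1 (stem everything).

Stemmers are not perfect and may sometimes produce undesired results. For instance, running the keyword "gps" through the Porter stemmer for English yields "gp", which is not the intended result. The min_stemming_len setting lets you suppress stemming based on the length of the source word, that is, avoid stemming words that are too short. Keywords shorter than the given threshold are not stemmed. Note that keywords whose length is exactly equal to the specified value are stemmed. So, to avoid stemming 3-character keywords, set the value to 4. For finer-grained control, see the wordforms feature.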
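
The threshold is inclusive, which is easy to get wrong by one. A minimal Python sketch of the rule as described (the function name `should_stem` is hypothetical, and length is assumed to be counted in characters):

```python
def should_stem(word, min_stemming_len=1):
    """Return True when a word is long enough to be passed to the stemmer."""
    # Words exactly as long as the threshold are still stemmed.
    return len(word) >= min_stemming_len
```

With `min_stemming_len = 4`, "gps" is left untouched, while a 4-character keyword is still stemmed.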

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG

CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'

POST /cli -d "
CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'min_stemming_len' => '4',
             'morphology' => 'stem_en'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) min_stemming_len = \'4\' morphology = \'stem_en\'');

utilsApi.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'", true);

utils_api.sql("CREATE TABLE products(title text, price float) min_stemming_len = '4' morphology = 'stem_en'", Some(true)).await;

table products {
  min_stemming_len = 4
  morphology = stem_en
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_exact_words = {0|1}

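This option enables indexing of the original keywords alongside their morphologically modified versions. However, original keywords that are remapped by wordforms or exceptions cannot be indexed. The default value is 0, meaning the feature is disabled.

This makes it possible to use the exact form operator in the query language. Enabling this feature increases the full-text index size and the indexing time, but does not affect search performance.

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG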

CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'

POST /cli -d "
CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'index_exact_words' => '1',
             'morphology' => 'stem_en'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_exact_words = \'1\' morphology = \'stem_en\'');

utilsApi.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'", true);

utils_api.sql("CREATE TABLE products(title text, price float) index_exact_words = '1' morphology = 'stem_en'", Some(true)).await;

table products {
  index_exact_words = 1
  morphology = stem_en
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

jieba_hmm = {0|1}

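Enables or disables HMM in the Jieba segmentation tool. Optional; the default is 1.

In Jieba, the HMM (Hidden Markov Model) option refers to an algorithm used for word segmentation. Specifically, it lets Jieba segment Chinese text by recognizing unknown words, especially those that are not present in its dictionary.

Jieba primarily uses a dictionary-based method to segment known words. When the HMM option is enabled, it additionally applies a statistical model to identify probable word boundaries for words or phrases that are not in its dictionary. This is particularly useful for segmenting new or rare words, names, and slang.

In summary, the jieba_hmm option helps improve segmentation accuracy at the expense of indexing performance. It must be used together with morphology = jieba_chinese; see Chinese, Japanese, Korean (CJK), and Thai languages.

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG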

CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'

POST /cli -d "
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'morphology' => 'jieba_chinese',
             'jieba_hmm' => '0'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_hmm = \'0\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_hmm = \'0\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_hmm = \'0\'');

utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'", true);

utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_hmm = '0'", Some(true)).await;

table products {
  morphology = jieba_chinese
  jieba_hmm = 0
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

jieba_mode = {accurate|full|search}

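Jieba segmentation mode. Optional; the default is accurate.

In accurate mode, Jieba splits the sentence into the most precise words using dictionary matching. This mode focuses on precision, ensuring that the segmentation is as accurate as possible.

In full mode, Jieba tries to split the sentence into every possible word combination, aiming to include all potential words. This mode focuses on maximizing recall: it identifies as many words as possible, even if some of them overlap or are rarely used. It returns all the words found in its dictionary.

In search mode, Jieba breaks the text into both whole words and smaller parts, combining precise segmentation with extra detail in the form of overlapping word fragments. This mode balances precision and recall, which makes it suitable for search engines.

jieba_mode should be used together with morphology = jieba_chinese. See Chinese, Japanese, Korean (CJK), and Thai languages.

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG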

CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'

POST /cli -d "
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'morphology' => 'jieba_chinese',
             'jieba_mode' => 'full'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_mode = \'full\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_mode = \'full\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_mode = \'full\'');

utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'", true);

utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_mode = 'full'", Some(true)).await;

table products {
  morphology = jieba_chinese
  jieba_mode = full
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

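jieba_user_dict_path = path/to/user/dict/file

Path to the Jieba user dictionary. Optional.

Jieba, a Chinese text segmentation library, uses dictionary files to assist with word segmentation. The format of these files is as follows: each line contains one word and consists of up to three parts separated by spaces: the word itself, its frequency, and a part-of-speech (POS) tag. The frequency and the POS tag are optional and can be omitted. The dictionary file must be UTF-8 encoded.

Example: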

创新办 3 i
云计算 5
凱特琳 nz
台中
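
To illustrate the line format, here is a small Python sketch that parses one user-dictionary line. It is not part of Jieba or Manticore; `parse_user_dict_line` is a hypothetical helper, and it assumes that a purely numeric second or third part is the frequency while anything else is the POS tag.

```python
def parse_user_dict_line(line):
    """Parse 'word [frequency] [pos_tag]' into a (word, frequency, tag) tuple."""
    parts = line.split()
    if not parts:
        return None                      # blank line
    word, frequency, tag = parts[0], None, None
    for part in parts[1:3]:
        if part.isdigit():
            frequency = int(part)        # numeric part: word frequency
        else:
            tag = part                   # anything else: part-of-speech tag
    return word, frequency, tag
```

Applied to the four example lines above, it yields a frequency and tag for the first, only a frequency for the second, only a tag for the third, and neither for the fourth.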

jieba_user_dict_path should be used together with morphology = jieba_chinese. For details, see Chinese, Japanese, Korean (CJK), and Thai languages.

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG

CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'

POST /cli -d "
CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
             'morphology' => 'jieba_chinese',
             'jieba_user_dict_path' => '/usr/local/manticore/data/user-dict.txt'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_user_dict_path = \'/usr/local/manticore/data/user-dict.txt\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_user_dict_path = \'/usr/local/manticore/data/user-dict.txt\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) morphology = \'jieba_chinese\' jieba_user_dict_path = \'/usr/local/manticore/data/user-dict.txt\'');

utilsApi.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'", true);

utils_api.sql("CREATE TABLE products(title text, price float) morphology = 'jieba_chinese' jieba_user_dict_path = '/usr/local/manticore/data/user-dict.txt'", Some(true)).await;

table products {
  morphology = jieba_chinese
  jieba_user_dict_path = /usr/local/manticore/data/user-dict.txt
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

Advanced HTML tokenization

Last modified: August 28, 2025

html_strip = {0|1}

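This option determines whether HTML markup should be stripped from the incoming full-text data. The default value is 0, which disables stripping. To enable stripping, set the value to 1.

HTML tags and entities are considered markup and are processed.

HTML tags are removed, while the content between them (for example, everything between <p> and </p>) is left intact. You can choose to keep and index tag attributes (for example, the HREF attribute of an A tag or the ALT attribute of an IMG tag). Some well-known inline tags, namely A, B, I, S, U, BASEFONT, BIG, EM, FONT, IMG, LABEL, SMALL, SPAN, STRIKE, STRONG, SUB, SUP, and TT, are removed completely. All other tags are treated as block-level tags and are replaced with whitespace. For example, the text te<b>st</b> is indexed as the single keyword 'test', while te<p>st</p> is indexed as two keywords, 'te' and 'st'.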
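
The difference between inline and block-level tags can be sketched in Python as follows. This is a deliberately simplified illustration, not the actual stripper: `strip_tags` is a hypothetical function that ignores entities, comments, and attribute indexing, and uses a naive regular expression to find tags.

```python
import re

# Inline tags listed in the text above; they are removed without a trace.
INLINE_TAGS = {
    "a", "b", "i", "s", "u", "basefont", "big", "em", "font", "img",
    "label", "small", "span", "strike", "strong", "sub", "sup", "tt",
}

def strip_tags(html):
    """Remove inline tags entirely and replace every other tag with a space."""
    def replace(match):
        name = match.group(1).lower()
        return "" if name in INLINE_TAGS else " "
    return re.sub(r"</?([A-Za-z][A-Za-z0-9]*)[^>]*>", replace, html)
```

Splitting the result on whitespace reproduces the two cases from the text: `te<b>st</b>` gives one token and `te<p>st</p>` gives two.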

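HTML entities are decoded and replaced with their corresponding UTF-8 characters. The stripper supports both the numeric form (for example, &#239;) and the text form (for example, &oacute; or &nbsp;) of entities, and supports all entities specified by the HTML4 standard.

The stripper is designed to work with properly formed HTML and XHTML, but may produce unexpected results on malformed input (such as HTML with stray <'s or unclosed >'s).

Note that only the tags themselves, as well as HTML comments, are stripped. To strip the contents of tags, including embedded scripts, see the html_remove_elements option. There are no restrictions on tag names, which means that anything that looks like a valid tag start, tag end, or comment is stripped.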

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG

CREATE TABLE products(title text, price float) html_strip = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_strip' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_strip = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) html_strip = '1'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) html_strip = '1'", true);

utils_api.sql("CREATE TABLE products(title text, price float) html_strip = '1'", Some(true)).await;

table products {
  html_strip = 1
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

html_index_attrs = img=alt,title; a=title;

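The html_index_attrs option lets you specify which HTML markup attributes should be indexed even though the rest of the HTML markup is stripped. The default value is empty, meaning no attributes are indexed. The format of the option is a per-tag enumeration of indexable attributes, as shown in the example above. The contents of the specified attributes are retained and indexed, which provides a way to extract additional information from your full-text data.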
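
The per-tag enumeration format can be read as a mapping from tag name to a list of attribute names. A minimal Python sketch of such a reading follows; `parse_html_index_attrs` is a hypothetical helper, and lowercasing the names is an assumption made for this illustration.

```python
def parse_html_index_attrs(value):
    """Turn 'img=alt,title; a=title;' into {'img': ['alt', 'title'], 'a': ['title']}."""
    result = {}
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue                     # tolerate the trailing semicolon
        tag, _, attrs = entry.partition("=")
        result[tag.strip().lower()] = [
            attr.strip().lower() for attr in attrs.split(",") if attr.strip()
        ]
    return result
```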

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG

CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_index_attrs' => 'img=alt,title; a=title;',
            'html_strip' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_index_attrs = \'img=alt,title; a=title;\' html_strip = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'", true);

utils_api.sql("CREATE TABLE products(title text, price float) html_index_attrs = 'img=alt,title; a=title;' html_strip = '1'", Some(true)).await;

table products {
  html_index_attrs = img=alt,title; a=title;
  html_strip = 1
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

html_remove_elements = element1[, element2, ...]

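A list of HTML elements whose contents, along with the elements themselves, are stripped. Optional, the default is an empty string (do not strip the contents of any element).

This option lets you remove the contents of elements, that is, everything between the opening and the closing tag. It is useful for removing embedded scripts, CSS, and so on. The short tag form for empty elements (for example, <br/>) is properly supported, and the text following such a tag is not removed.

The value is a comma-separated list of element (tag) names whose contents should be removed. Tag names are case-insensitive.

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG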

CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'html_remove_elements' => 'style, script',
            'html_strip' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) html_remove_elements = \'style, script\' html_strip = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'", true);

utils_api.sql("CREATE TABLE products(title text, price float) html_remove_elements = 'style, script' html_strip = '1'", Some(true)).await;

table products {
  html_remove_elements = style, script
  html_strip = 1
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_sp = {0|1}

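Controls the detection and indexing of sentence and paragraph boundaries. Optional, default is 0 (no detection or indexing).

This directive enables the detection and indexing of sentence and paragraph boundaries, which makes the SENTENCE and PARAGRAPH operators work. Sentence boundary detection is based on plain-text analysis and only requires setting index_sp = 1. Paragraph detection, however, relies on HTML markup and takes place during HTML stripping. Therefore, to index paragraph boundaries, both the index_sp directive and the html_strip directive must be set to 1.

The following rules are used to determine sentence boundaries:

Question marks (?) and exclamation marks (!) always indicate a sentence boundary.
A trailing dot (.) indicates a sentence boundary, except in the following cases:
- When followed by a letter. This is considered part of an abbreviation (for example, "S.T.A.L.K.E.R." or "Goldman Sachs S.p.A.").
- When followed by a comma. This is considered an abbreviation followed by a comma (for example, "Telecom Italia S.p.A., founded in 1994").
- When followed by a space and a lowercase letter. This is considered an abbreviation within a sentence (for example, "News Corp. announced in February").
- When preceded by a space and a capital letter, and followed by a space. This is considered a middle initial (for example, "John D. Doe").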
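
The sentence-boundary rules listed above can be written down as a small Python predicate. This is only a sketch of the stated rules, not the engine's actual detector; `is_sentence_boundary` is a hypothetical function that inspects the character at index `i` and its immediate neighbors.

```python
def is_sentence_boundary(text, i):
    """Apply the listed rules to the character at position i of text."""
    ch = text[i]
    if ch in "?!":
        return True                                  # always a boundary
    if ch != ".":
        return False
    nxt = text[i + 1] if i + 1 < len(text) else ""
    nxt2 = text[i + 2] if i + 2 < len(text) else ""
    if nxt.isalpha():
        return False                                 # "S.p.A": part of an abbreviation
    if nxt == ",":
        return False                                 # "S.p.A., founded"
    if nxt == " " and nxt2.islower():
        return False                                 # "Corp. announced"
    if i >= 2 and text[i - 1].isupper() and text[i - 2] == " " and nxt == " ":
        return False                                 # "John D. Doe": middle initial
    return True
```

For example, the dot in "It works. Next one" is reported as a boundary, while the dots in "John D. Doe" and "News Corp. announced" are not.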

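Paragraph boundaries are detected at every block-level HTML tag, including: ADDRESS, BLOCKQUOTE, CAPTION, CENTER, DD, DIV, DL, DT, H1, H2, H3, H4, H5, LI, MENU, OL, P, PRE, TABLE, TBODY, TD, TFOOT, TH, THEAD, TR, and UL.

Both sentences and paragraphs increment the keyword position counter by 1.

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG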

CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_sp' => '1',
            'html_strip' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_sp = \'1\' html_strip = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'", true);

utils_api.sql("CREATE TABLE products(title text, price float) index_sp = '1' html_strip = '1'", Some(true)).await;

table products {
  index_sp = 1
  html_strip = 1
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}

index_zones = h*, th, title

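A list of HTML/XML zones within a field to be indexed. The default is an empty string (no zones are indexed).

A "zone" is defined as everything between a matching opening and closing tag, and all spans that share the same tag name are referred to as one "zone". For example, everything between <H1> and </H1> in a document field belongs to the H1 zone.

The index_zones directive enables zone indexing, but the HTML stripper must also be enabled (by setting html_strip = 1). The value of index_zones should be a comma-separated list of tag names and wildcards (ending with an asterisk) to be indexed as zones.

Zones can be nested and can overlap, as long as every opening tag has a matching closing tag. Zones can also be used for matching with the ZONE operator, as described in extended_query_syntax.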
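
How a value such as h*, th, title selects tags can be sketched with Python's standard `fnmatch` module. This is an approximation for illustration only: `is_indexed_zone` is a hypothetical helper, case-insensitive comparison is an assumption, and `fnmatch` accepts more wildcard forms than the trailing asterisk described above.

```python
from fnmatch import fnmatchcase

def is_indexed_zone(tag, index_zones):
    """Check whether a tag name matches any entry of an index_zones value."""
    name = tag.lower()
    patterns = [p.strip().lower() for p in index_zones.split(",") if p.strip()]
    return any(fnmatchcase(name, pattern) for pattern in patterns)
```

With the value `"h*, th, title"`, the tags H1, H2, TH, and TITLE are selected, while DIV is not.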

Examples, in order: SQL, JSON, PHP, Python, Python-asyncio, JavaScript, Java, C#, Rust, CONFIG

CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'

POST /cli -d "
CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'"

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'index_zones' => 'h*,th,title',
            'html_strip' => '1'
        ]);

utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h*, th, title\' html_strip = \'1\'')

await utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h*, th, title\' html_strip = \'1\'')

res = await utilsApi.sql('CREATE TABLE products(title text, price float) index_zones = \'h*, th, title\' html_strip = \'1\'');

utilsApi.sql("CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'", true);

utilsApi.Sql("CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'", true);

utils_api.sql("CREATE TABLE products(title text, price float) index_zones = 'h*, th, title' html_strip = '1'", Some(true)).await;

table products {
  index_zones = h*, th, title
  html_strip = 1
  type = rt
  path = tbl
  rt_field = title
  rt_attr_uint = price
}


Last modified: August 28, 2025
