NLP and tokenization > Supported languages | Manticore Search Manual

≫ NLP and tokenization

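Manticore does not store text as-is for full-text search. Instead, it splits the text into words (called tokens) and builds several internal structures that make full-text search fast. These include a dictionary, which is used to quickly check whether a word exists in the index. Other structures track which documents and fields contain the word, and even the exact positions of the word within a field. All of these are used at search time to find relevant results.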

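This process of splitting and processing text is called tokenization. Tokenization happens both when data is added to the index and when a search is run. It works at both the character level and the word level.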

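At the character level, only certain characters are allowed. This behavior is controlled by charset_table. All other characters are replaced with a space (a space is treated as a word separator). charset_table also supports converting characters to lowercase or replacing one character with another. Characters can also be defined as ignored, blended, or treated as phrase boundaries.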
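The character-level step can be pictured with a minimal Python sketch. It is only an illustration: the toy table below (digits and a-z kept, A-Z folded to lowercase) and the function names are assumptions made for this example, not Manticore's actual charset_table contents or internals.

```python
# Illustrative sketch only: a toy character table, not Manticore's real charset_table.
def build_charset():
    """Return a mapping of allowed characters to their indexed form."""
    table = {}
    for code in range(ord("0"), ord("9") + 1):
        table[chr(code)] = chr(code)            # digits are kept as-is
    for code in range(ord("a"), ord("z") + 1):
        table[chr(code)] = chr(code)            # lowercase letters are kept as-is
        table[chr(code).upper()] = chr(code)    # uppercase letters fold to lowercase
    return table


def normalize_chars(text, charset):
    """Map allowed characters through the table; replace everything else with a space."""
    return "".join(charset.get(ch, " ") for ch in text)


charset = build_charset()
normalized = normalize_chars("Hello, World-2024!", charset)
tokens = normalized.split()  # spaces act as word separators
print(tokens)  # ['hello', 'world', '2024']
```

The comma, hyphen, and exclamation mark are not in the toy table, so they turn into separators, which is why "World-2024" ends up as two tokens.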

At the word level, the engine uses the min_word_len setting to decide the minimum length (in characters) a word must have to be indexed.
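As a rough illustration of the effect of min_word_len, the following Python sketch drops tokens shorter than a given length. The helper name is hypothetical and the logic is a simplification of what the engine does.

```python
# Illustrative sketch: keep only tokens that meet a minimum length in characters.
def apply_min_word_len(tokens, min_word_len):
    """Return the tokens whose length is at least min_word_len."""
    return [token for token in tokens if len(token) >= min_word_len]


tokens = ["a", "to", "car", "search"]
print(apply_min_word_len(tokens, 3))  # ['car', 'search']
```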

Manticore also supports matching different forms of a word. For example, to treat "car" and "cars" as the same word, you can use a morphology processor.

If you want different words to be treated as the same, for example "USA" and "United States", you can define this with the word forms feature.

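Very common words (such as "the", "and", "is") can slow down searches and increase index size. You can filter them out with stopwords, which makes searches faster and the index smaller.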
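The idea behind stopword filtering can be sketched in a few lines of Python. The stopword set below simply reuses the three example words from the paragraph above, and the function name is hypothetical; real stopword files are much longer.

```python
# Illustrative sketch: drop very common words before they reach the index.
EXAMPLE_STOPWORDS = frozenset({"the", "and", "is"})  # assumed list, taken from the text


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


tokens = "the cat and the dog is here".split()
kept = remove_stopwords(tokens, EXAMPLE_STOPWORDS)
print(kept)  # ['cat', 'dog', 'here']
```

Four of the seven tokens are dropped here, which hints at why filtering common words keeps the index smaller.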

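A more advanced filtering method is bigrams, which creates special tokens by combining a common word with an uncommon word. This can significantly speed up phrase searches when common words are involved.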
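The following Python sketch shows one way such pair tokens could be formed. It assumes a hypothetical list of frequent words and emits a pair only when exactly one of two adjacent words is frequent, which follows the wording above. The pairing rules the engine actually applies are configurable and may differ, so treat this as a simplification.

```python
# Illustrative sketch: build pair tokens from a common word next to an uncommon one.
def frequent_bigrams(tokens, frequent_words):
    """Return 'first second' pairs where exactly one of the two words is frequent."""
    pairs = []
    for first, second in zip(tokens, tokens[1:]):
        if (first in frequent_words) != (second in frequent_words):
            pairs.append(first + " " + second)
    return pairs


frequent = {"of", "the"}  # assumed frequent-word list for this example
print(frequent_bigrams("state of the art".split(), frequent))
# ['state of', 'the art']
```

A phrase query such as "the art" can then be answered from the rarer pair token instead of intersecting the very long posting list of "the".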

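If you are indexing HTML, it is usually best not to include HTML tags in the index, since they add a lot of unnecessary content. You can remove the tags with the HTML stripping feature while still indexing certain tag attributes or skipping specific elements entirely.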

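Keep in mind that Manticore has a maximum token length of 42 characters. Any word longer than that is truncated. The limit applies at both indexing and search time, so it is important to make sure your data and queries account for it.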
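A practical consequence of the limit is that two long words sharing their first 42 characters become indistinguishable. The Python sketch below illustrates this with a plain slice; the function name is hypothetical and the engine's own truncation logic is not shown in this manual.

```python
# Illustrative sketch: truncation to a fixed token length makes long words collide.
MAX_TOKEN_LEN = 42  # limit stated in the text above


def truncate_token(token, max_len=MAX_TOKEN_LEN):
    """Cut a token down to at most max_len characters."""
    return token[:max_len]


long_a = "x" * 42 + "alpha"
long_b = "x" * 42 + "beta"
print(len(truncate_token(long_a)))                        # 42
print(truncate_token(long_a) == truncate_token(long_b))   # True
```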

Supported languages

Last modified: August 28, 2025

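Manticore supports a wide range of languages. For most of them, basic support is enabled with charset_table = non_cont (the default). You can also use the non_cjk option, which is an alias for non_cont: charset_table = non_cjk.

For many languages, Manticore provides a stopwords file that can be used to improve search relevance.

In addition, advanced morphology is available for several languages. It can significantly improve search relevance by using dictionary-based lemmatization or stemming algorithms for better segmentation and normalization.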

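The table below lists all supported languages and shows how to enable:

basic support (the "Supported" column)
a stopwords file (the "Stopwords file name" column)
advanced morphology (the "Advanced morphology" column)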

Language	Supported	Stopwords file name	Advanced morphology	Notes
Afrikaans	charset_table=non_cont	af	-
Arabic	charset_table=non_cont	ar	morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar
Armenian	charset_table=non_cont	hy	-
Assamese	Specify charset_table manually	-	-
Basque	charset_table=non_cont	eu	-
Bengali	charset_table=non_cont	bn	-
Bishnupriya	Specify charset_table manually	-	-
Buhid	Specify charset_table manually	-	-
Bulgarian	charset_table=non_cont	bg	-
Catalan	charset_table=non_cont	ca	morphology=libstemmer_ca
Chinese with ICU	charset_table=chinese	zh	morphology=icu_chinese	More accurate than using ngrams
Chinese with Jieba	charset_table=chinese	zh	morphology=jieba_chinese, requires the package `manticore-language-packs`	More accurate than using ngrams
Chinese with ngrams	ngram_chars=chinese	zh	ngram_len=1	Faster indexing, but search performance may not be as good
Croatian	charset_table=non_cont	hr	-
Kurdish	charset_table=non_cont	ckb	-
Czech	charset_table=non_cont	cz	morphology=stem_cz (Czech stemmer)
Danish	charset_table=non_cont	da	morphology=libstemmer_da
Dutch	charset_table=non_cont	nl	morphology=libstemmer_nl
English	charset_table=non_cont	en	morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter's English stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_en (English from libstemmer)
Esperanto	charset_table=non_cont	eo	-
Estonian	charset_table=non_cont	et	-
Finnish	charset_table=non_cont	fi	morphology=libstemmer_fi
French	charset_table=non_cont	fr	morphology=libstemmer_fr
Galician	charset_table=non_cont	gl	-
Garo	Specify charset_table manually	-	-
German	charset_table=non_cont	de	morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de
Greek	charset_table=non_cont	el	morphology=libstemmer_el
Hebrew	charset_table=non_cont	he	-
Hindi	charset_table=non_cont	hi	morphology=libstemmer_hi
Lao	Specify charset_table manually	-	-
Ho	Specify charset_table manually	-	-
Hungarian	charset_table=non_cont	hu	morphology=libstemmer_hu
Indonesian	charset_table=non_cont	id	morphology=libstemmer_id
Irish	charset_table=non_cont	ga	morphology=libstemmer_ga
Italian	charset_table=non_cont	it	morphology=libstemmer_it
Japanese	ngram_chars=japanese	-	ngram_chars=japanese ngram_len=1	Requires ngram-based segmentation
Komi	Specify charset_table manually	-	-
Korean	ngram_chars=korean	-	ngram_chars=korean ngram_len=1	Requires ngram-based segmentation
Large Flowery Miao	Specify charset_table manually	-	-
Latin	charset_table=non_cont	la	-
Latvian	charset_table=non_cont	lv	-
Lithuanian	charset_table=non_cont	lt	morphology=libstemmer_lt
Maba	Specify charset_table manually	-	-
Maithili	Specify charset_table manually	-	-
Marathi	Specify charset_table manually	-	-
Marathi	charset_table=non_cont	mr	-
Mende	Specify charset_table manually	-	-
Mru	Specify charset_table manually	-	-
Myene	Specify charset_table manually	-	-
Nepali	Specify charset_table manually	-	morphology=libstemmer_ne
Ngambay	Specify charset_table manually	-	-
Norwegian	charset_table=non_cont	no	morphology=libstemmer_no
Odia	Specify charset_table manually	-	-
Persian	charset_table=non_cont	fa	-
Polish	charset_table=non_cont	pl	-
Portuguese	charset_table=non_cont	pt	morphology=libstemmer_pt
Romanian	charset_table=non_cont	ro	morphology=libstemmer_ro
Russian	charset_table=non_cont	ru	morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter's Russian stemmer); morphology=stem_enru (Porter's English and Russian stemmers); morphology=libstemmer_ru (from libstemmer)
Santali	Specify charset_table manually	-	-
Sindhi	Specify charset_table manually	-	-
Slovak	charset_table=non_cont	sk	-
Slovenian	charset_table=non_cont	sl	-
Somali	charset_table=non_cont	so	-
Sotho	charset_table=non_cont	st	-
Spanish	charset_table=non_cont	es	morphology=libstemmer_es
Swahili	charset_table=non_cont	sw	-
Swedish	charset_table=non_cont	sv	morphology=libstemmer_sv
Sylheti	Specify charset_table manually	-	-
Tamil	Specify charset_table manually	-	morphology=libstemmer_ta
Thai	charset_table=thai	th	-
Turkish	charset_table=non_cont	tr	morphology=libstemmer_tr
Ukrainian	charset_table=non_cont,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491	-	morphology=lemmatize_uk_all	Requires installation of the UK lemmatizer
Vietnamese	charset_table=non_cont	-	-	Uses the Latin alphabet. Vietnamese diacritics (ă, â, ê, ô, ơ, ư, đ, and tone marks) are automatically mapped to their base Latin characters by default, so no extra configuration is needed and "tiếng" matches "tieng"
Yoruba	charset_table=non_cont	yo	-
Zulu	charset_table=non_cont	zu	-
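To show how a table row translates into table settings, here is a small Python sketch that assembles a CREATE TABLE statement from three rows copied from the table. The dictionary, the function name, and the `articles` table are hypothetical. The option syntax mirrors the CREATE TABLE examples elsewhere in this manual, and it assumes the stopwords alias for the language is available in your installation.

```python
# Illustrative sketch: turn per-language settings from the table into a CREATE TABLE statement.
LANGUAGE_SETTINGS = {
    # Values copied from the table rows for these languages.
    "German": {"charset_table": "non_cont", "stopwords": "de", "morphology": "libstemmer_de"},
    "French": {"charset_table": "non_cont", "stopwords": "fr", "morphology": "libstemmer_fr"},
    "Polish": {"charset_table": "non_cont", "stopwords": "pl"},  # no advanced morphology listed
}


def create_table_sql(table_name, language):
    """Build a CREATE TABLE statement with the settings listed for the language."""
    settings = LANGUAGE_SETTINGS[language]
    options = " ".join("{} = '{}'".format(key, value) for key, value in settings.items())
    return "CREATE TABLE {}(title text) {}".format(table_name, options)


print(create_table_sql("articles", "German"))
# CREATE TABLE articles(title text) charset_table = 'non_cont' stopwords = 'de' morphology = 'libstemmer_de'
```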

Languages with continuous scripts

Last modified: January 20, 2026

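Manticore has built-in support for indexing languages with continuous scripts (that is, languages that do not use separators between words or sentences). This lets you process text in these languages in the following ways: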

Precise segmentation using the ICU library. Currently, only Chinese is supported.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'cont',
            'morphology' => 'icu_chinese'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'icu_chinese\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'icu_chinese\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'icu_chinese\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'", Some(true)).await;

CONFIG:

table products {
  charset_table = cont
  morphology = icu_chinese
  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}

Precise segmentation using the Jieba library. Like ICU, it currently supports only Chinese.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'cont',
            'morphology' => 'jieba_chinese'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'jieba_chinese\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'jieba_chinese\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'cont\' morphology = \'jieba_chinese\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'", Some(true)).await;

CONFIG:

table products {
  charset_table = cont
  morphology = jieba_chinese
  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}

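Basic support using the N-gram options ngram_len and ngram_chars. A separate character set table is available for each language that uses a continuous script (chinese, korean, japanese, thai). Alternatively, you can use the common cont character set table to support all CJK and Thai languages at once, or the cjk character set to include only the CJK languages.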

SQL:

CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'
/* Or, alternatively */
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cjk,thai'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'"
/* Or, alternatively */
POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cjk,thai'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'non_cont',
            'ngram_len' => '1',
            'ngram_chars' => 'cont'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cont\' ngram_len = \'1\' ngram_chars = \'cont\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cont\' ngram_len = \'1\' ngram_chars = \'cont\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'non_cont\' ngram_len = \'1\' ngram_chars = \'cont\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'", Some(true)).await;

CONFIG:

table products {
  charset_table = non_cont
  ngram_len = 1
  ngram_chars = cont
  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}
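To give a feel for what N-gram indexing with ngram_len = 1 does to mixed text, here is a rough Python sketch. It is not Manticore's tokenizer: the CJK Unified Ideographs range U+4E00 to U+9FFF stands in for ngram_chars, and the function names are hypothetical.

```python
# Illustrative sketch: each N-gram character becomes its own token; other text splits normally.
def is_ngram_char(ch):
    """Assumed stand-in for ngram_chars: the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def tokenize_with_unigrams(text):
    """Split text into lowercase words, emitting every N-gram character as a separate token."""
    tokens = []
    word = []
    for ch in text:
        if is_ngram_char(ch):
            if word:                      # close any word in progress
                tokens.append("".join(word))
                word = []
            tokens.append(ch)             # ngram_len = 1: one token per character
        elif ch.isalnum():
            word.append(ch.lower())
        elif word:                        # any other character acts as a separator
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


print(tokenize_with_unigrams("Manticore 搜索引擎 v6"))
# ['manticore', '搜', '索', '引', '擎', 'v6']
```

Because every character is indexed on its own, no dictionary is needed, which fits the table note that N-gram indexing is faster while search quality may be lower than with ICU or Jieba segmentation.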

In addition, there is built-in support for Chinese stopwords, available under the alias zh.

SQL:

CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'

JSON:

POST /cli -d "
CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'"

PHP:

$index = new \Manticoresearch\Index($client);
$index->setName('products');
$index->create([
            'title'=>['type'=>'text'],
            'price'=>['type'=>'float']
        ],[
            'charset_table' => 'chinese',
            'morphology' => 'icu_chinese',
            'stopwords' => 'zh'
        ]);

Python:

utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')

Python-asyncio:

await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'')

Javascript:

res = await utilsApi.sql('CREATE TABLE products(title text, price float) charset_table = \'chinese\' morphology = \'icu_chinese\' stopwords = \'zh\'');

Java:

utilsApi.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'", true);

C#:

utilsApi.Sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'", true);

Rust:

utils_api.sql("CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'", Some(true)).await;

CONFIG:

table products {
  charset_table = chinese
  morphology = icu_chinese
  stopwords = zh
  type = rt
  path = tbl
  rt_field = title
  rt_attr_float = price
}


Last modified: August 28, 2025
