Chinese, Japanese and Korean (CJK) and Thai languages

Manticore provides built-in support for indexing languages with continuous scripts (i.e., languages that do not use spaces or other marks between words or sentences). This allows you to process texts in these languages in the following ways:

  1. Precise segmentation using the ICU library. Currently, only Chinese is supported.
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'icu_chinese'
  2. Precise segmentation using the Jieba library. Like ICU, it currently supports only Chinese.
CREATE TABLE products(title text, price float) charset_table = 'cont' morphology = 'jieba_chinese'
  3. Basic support using the N-gram options ngram_len and ngram_chars. Each language that uses a continuous script has its own character set table (chinese, korean, japanese, thai). Alternatively, you can use the common cont character set table to support all CJK and Thai languages at once, or the cjk character set table to cover only the CJK languages.
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cont'

/* Or, alternatively */
CREATE TABLE products(title text, price float) charset_table = 'non_cont' ngram_len = '1' ngram_chars = 'cjk,thai'
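
To make the effect of ngram_len = '1' more concrete, here is a minimal Python sketch of unigram splitting: every character that belongs to the continuous-script set becomes its own token, while other words are kept whole and lowercased. It is only an illustration and not Manticore's tokenizer. The function name unigram_tokens and the Unicode ranges in CJK_RE are assumptions chosen for this example; they only approximate what the cjk character set table covers.

```python
import re

# Rough stand-in for ngram_chars = 'cjk'. These ranges are an assumption
# made for illustration, not Manticore's actual character table.
CJK_RE = re.compile(
    "["
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7af"  # Hangul syllables
    "]"
)


def unigram_tokens(text):
    """Emit each CJK character as its own token; keep other words whole."""
    tokens = []
    word = []  # characters of the non-CJK word currently being collected

    def flush():
        # Close the pending non-CJK word, if there is one.
        if word:
            tokens.append("".join(word))
            word.clear()

    for ch in text:
        if CJK_RE.match(ch):
            flush()
            tokens.append(ch)        # one token per CJK character
        elif ch.isalnum():
            word.append(ch.lower())  # ordinary words are lowercased
        else:
            flush()                  # whitespace and punctuation separate words
    flush()
    return tokens


print(unigram_tokens("Manticore 搜索引擎 v2"))
```

With this kind of splitting, a query for 搜索 can be matched as the adjacent tokens 搜 and 索, which is why the N-gram approach needs no dictionary but is less precise than ICU or Jieba segmentation.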

Additionally, there is built-in support for Chinese stopwords with the alias zh.

CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'