Chinese, Japanese and Korean (CJK) languages

Manticore has built-in support for indexing CJK texts. CJK text can be processed in two ways:

  • Precise segmentation using the ICU library (only Chinese is supported for now)
CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'
  • Basic support with N-grams, configured through the ngram_len and ngram_chars options. Separate charset tables (chinese, korean, japanese) are available for each CJK language; alternatively, the common cjk charset table can be applied.
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'
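
To make the effect of ngram_len = '1' more concrete, the following Python sketch approximates how unigram indexing treats mixed text: each CJK character becomes its own token, while runs of other alphanumeric characters stay together as whole words. This is only an illustration, not Manticore's tokenizer. The helper names is_cjk and unigram_tokens are hypothetical, and the code point ranges are a simplified assumption that covers far fewer characters than the real cjk character set.

```python
from typing import List


def is_cjk(ch: str) -> bool:
    """Rough CJK check (assumption: only three common Unicode ranges)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def unigram_tokens(text: str) -> List[str]:
    """Split text so that every CJK character is a separate token."""
    tokens: List[str] = []
    word: List[str] = []  # buffer for the current non-CJK word

    def flush() -> None:
        # Emit the buffered non-CJK word, if any, in lowercase.
        if word:
            tokens.append("".join(word).lower())
            word.clear()

    for ch in text:
        if is_cjk(ch):
            flush()
            tokens.append(ch)   # one token per CJK character (1-gram)
        elif ch.isalnum():
            word.append(ch)     # keep building the non-CJK word
        else:
            flush()             # whitespace or punctuation ends a word
    flush()
    return tokens


print(unigram_tokens("USB 充电器 5V"))
# ['usb', '充', '电', '器', '5v']
```

Because every CJK character is indexed separately, a query for a multi-character word has to match those characters as adjacent tokens. This is why N-gram support works without a dictionary, but is less precise than ICU segmentation.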

There are also built-in stopwords for Chinese, available under the alias zh.

CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'