Chinese, Japanese and Korean (CJK) languages

Manticore provides built-in support for indexing CJK texts. You can process them in two ways:

  1. Precise segmentation using the ICU library. Currently, only Chinese is supported.
CREATE TABLE products(title text, price float) charset_table = 'cjk' morphology = 'icu_chinese'
  2. Basic support using the N-gram options ngram_len and ngram_chars. For each CJK language, a separate character set table is available (chinese, korean, japanese), or you can use the common cjk character set table.
CREATE TABLE products(title text, price float) charset_table = 'non_cjk' ngram_len = '1' ngram_chars = 'cjk'
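
As a rough illustration of the N-gram approach, the following sketch assumes the products table defined just above and uses made-up sample data. With ngram_len = '1', each CJK character is indexed as its own token, so a query for a single character from the title is expected to match the row.

```sql
-- Hypothetical sample row; the title and price are illustrative only.
INSERT INTO products(title, price) VALUES ('数码相机', 199.0);

-- Each CJK character was indexed as a separate token,
-- so searching for one character of the title should find the row.
SELECT * FROM products WHERE MATCH('相');
```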

Additionally, there is built-in support for Chinese stopwords with the alias zh.

CREATE TABLE products(title text, price float) charset_table = 'chinese' morphology = 'icu_chinese' stopwords = 'zh'
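
To check how a particular string is tokenized under these settings, one option is CALL KEYWORDS, which returns the keywords that a table's settings produce from a piece of text. This is a minimal sketch, assuming the products table defined just above. The sample text is illustrative, and the exact segmentation depends on the ICU dictionary.

```sql
-- Show the tokens produced for the sample text by the table's
-- charset_table, morphology and stopwords settings.
CALL KEYWORDS('我喜欢这本书', 'products');
```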