Token filter plugins let you implement a custom tokenizer that makes tokens according to custom rules. There are two types:
- Index-time tokenizer, declared by `index_token_filter` in index settings
- Query-time tokenizer, declared by the `token_filter` OPTION directive
In the text processing pipeline, token filters run after the base tokenizer, which processes the text from a field or query and produces tokens from it.
The index-time tokenizer is created by `indexer` when indexing source data into an index, or by an RT index when processing `INSERT` or `REPLACE` statements.
A plugin is declared as `library name:plugin name:optional string of settings`. The plugin's init function can accept arbitrary settings passed as a string in the format `option1=value1;option2=value2;..`.
Example:

```ini
index_token_filter = my_lib.so:email_process:field=email;split=.io
```
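The settings string reaches the plugin verbatim, so the plugin itself is responsible for splitting it into key/value pairs. Here is a minimal sketch in C; the helper name `parse_options` is illustrative and not part of the plugin API:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative only: split "field=email;split=.io" into key/value pairs.
   The plugin API just hands the raw settings string to the init function. */
static void parse_options ( const char * options )
{
	char * buf = strdup ( options ? options : "" );
	char * save = NULL;
	for ( char * pair = strtok_r ( buf, ";", &save ); pair; pair = strtok_r ( NULL, ";", &save ) )
	{
		char * eq = strchr ( pair, '=' );
		if ( !eq )
			continue;
		*eq = '\0';
		const char * key = pair;     /* e.g. "field" */
		const char * value = eq + 1; /* e.g. "email" */
		/* store key/value in the plugin's userdata here */
		(void)key; (void)value;
	}
	free ( buf );
}
```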
The call workflow for an index-time token filter is as follows:

- `XXX_init()` gets called right after `indexer` creates the token filter, first with an empty fields list, and then again after `indexer` has received the index schema with the actual fields list. It must return zero on successful initialization, or an error description otherwise.
- `XXX_begin_document` gets called for every document, but only on RT index `INSERT`/`REPLACE`. It must return zero on success, or an error description otherwise. Additional parameters/settings can be passed to the function using the `token_filter_options` OPTION:

  ```sql
  INSERT INTO rt (id, title) VALUES (1, 'some text corp@space.io') OPTION token_filter_options='.io'
  ```

- `XXX_begin_field` gets called once for each field, prior to the field being processed by the base tokenizer, with the field number as its parameter.
- `XXX_push_token` gets called once for each new token produced by the base tokenizer, with the source token as its parameter. It must return the token, the count of extra tokens made by the token filter, and the delta position for the token.
- `XXX_get_extra_token` gets called multiple times when `XXX_push_token` reports extra tokens. It must return a token and a delta position for each extra token.
- `XXX_end_field` gets called once right after the source tokens from the current field run out.
- `XXX_deinit` gets called at the very end of indexing.
The following functions are mandatory: `XXX_begin_document`, `XXX_push_token`, and `XXX_get_extra_token`.
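Below is a minimal sketch of an index-time token filter in C that splits an email token into local and domain parts, roughly matching the `email_process` example above. The exact prototypes are assumptions modeled on the token filter interface in `sphinxudf.h`; check the header shipped with your version for the authoritative signatures.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Plugin state kept across calls via the userdata pointer. */
typedef struct
{
	char extra[64]; /* pending extra token for XXX_get_extra_token */
	int  has_extra;
} email_state_t;

/* Assumed prototype: allocate state; return 0 on success,
   or fill error_message and return non-zero. */
int email_process_init ( void ** userdata, int num_fields,
	const char ** field_names, const char * options, char * error_message )
{
	(void)num_fields; (void)field_names; (void)options;
	*userdata = calloc ( 1, sizeof(email_state_t) );
	if ( !*userdata )
	{
		strcpy ( error_message, "out of memory" );
		return 1;
	}
	return 0;
}

/* Assumed prototype: called per document on RT INSERT/REPLACE. */
int email_process_begin_document ( void * userdata, const char * options, char * error_message )
{
	(void)userdata; (void)options; (void)error_message;
	return 0;
}

/* Assumed prototype: return the (possibly rewritten) token, the count
   of extra tokens this filter produced, and the position delta. */
char * email_process_push_token ( void * userdata, char * token, int * extra, int * delta )
{
	email_state_t * st = (email_state_t *)userdata;
	*extra = 0;
	*delta = 1;

	char * at = strchr ( token, '@' );
	if ( at )
	{
		/* emit the domain part as one extra token at the next position */
		snprintf ( st->extra, sizeof(st->extra), "%s", at+1 );
		st->has_extra = 1;
		*extra = 1;
		*at = '\0'; /* keep only the local part in the current token */
	}
	return token;
}

/* Assumed prototype: drain the extra tokens reported by push_token. */
char * email_process_get_extra_token ( void * userdata, int * delta )
{
	email_state_t * st = (email_state_t *)userdata;
	if ( st->has_extra )
	{
		st->has_extra = 0;
		*delta = 1;
		return st->extra;
	}
	return NULL;
}

void email_process_deinit ( void * userdata )
{
	free ( userdata );
}
```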
A query-time tokenizer is created at search time, each time a full-text search is invoked, for every index involved.
The call workflow for a query-time token filter is as follows:

- `XXX_init()` gets called once per index prior to parsing the query, with two parameters: the maximum token length and the string set by the `token_filter` option:

  ```sql
  SELECT * FROM index WHERE MATCH ('test') OPTION token_filter='my_lib.so:query_email_process:io'
  ```

  It must return zero on successful initialization, or an error description otherwise.
- `XXX_push_token()` gets called once for each new token produced by the base tokenizer, with these parameters: the token produced by the base tokenizer, a pointer to the raw token in the source query string, and the raw token length. It must return the token and the delta position for the token.
- `XXX_pre_morph()` gets called once per token, right before it is passed to the morphology processor, with a reference to the token and a stopword flag. It may set the stopword flag to mark the token as a stopword.
- `XXX_post_morph()` gets called once per token, after it has been processed by the morphology processor, with a reference to the token and a stopword flag. It may set the stopword flag to mark the token as a stopword. It must return a flag; a non-zero value means the token as it was prior to morphology processing should be used.
- `XXX_deinit()` gets called at the very end of query processing.
Absence of any of the functions is tolerated.
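For completeness, here is a matching sketch of a query-time filter that strips a configured suffix from query tokens, in the spirit of the `query_email_process` example above. As before, the prototypes (including the parameter order of `push_token`) are assumptions modeled on the query-time token filter interface in `sphinxudf.h`; verify them against your header. The optional `pre_morph`/`post_morph` hooks are included only to show their shape and may be omitted.

```c
#include <stdlib.h>
#include <string.h>

/* Assumed prototype: receives the max token length and the settings string
   from OPTION token_filter; return 0 on success, or fill error_message. */
int query_email_process_init ( void ** userdata, int max_token_len,
	const char * options, char * error_message )
{
	(void)max_token_len; (void)error_message;
	/* keep a copy of the settings string, e.g. "io", as our state */
	*userdata = strdup ( options ? options : "" );
	return 0;
}

/* Assumed prototype: rewrite the token if it ends with our suffix;
   return the token and set its position delta. */
char * query_email_process_push_token ( void * userdata, char * token,
	int * delta, const char * raw_token, int raw_token_len )
{
	(void)raw_token; (void)raw_token_len;
	const char * suffix = (const char *)userdata;
	size_t tlen = strlen(token), slen = strlen(suffix);
	*delta = 1;
	if ( slen && tlen>slen && strcmp ( token+tlen-slen, suffix )==0 )
		token[tlen-slen] = '\0'; /* strip the configured suffix */
	return token;
}

/* Assumed prototype: may set the stopword flag before morphology runs. */
void query_email_process_pre_morph ( void * userdata, char * token, int * stopword )
{
	(void)userdata; (void)token; (void)stopword;
}

/* Assumed prototype: non-zero return means use the pre-morphology token. */
int query_email_process_post_morph ( void * userdata, char * token, int * stopword )
{
	(void)userdata; (void)token; (void)stopword;
	return 0;
}

void query_email_process_deinit ( void * userdata )
{
	free ( userdata );
}
```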