Token filter plugins allow you to implement a custom tokenizer that creates tokens according to custom rules. There are two types:
- Index-time tokenizer declared by index_token_filter in table settings
- Query-time tokenizer declared by token_filter OPTION directive
In the text processing pipeline, token filters will run after the base tokenizer processing occurs (which processes the text from fields or queries and creates tokens out of them).
The index-time tokenizer is created by indexer when indexing source data into a table, or by an RT table when processing INSERT or REPLACE statements.
The plugin is declared as library name:plugin name:optional string of settings. The init functions of the plugin can accept arbitrary settings, passed as a string of semicolon-separated name=value pairs, as in this example:
index_token_filter = my_lib.so:email_process:field=email;split=.io
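To make the declaration format concrete, here is a minimal sketch of how such a string could be taken apart. The helper name `parse_token_filter_spec` is hypothetical and not part of Manticore. The key=value layout of the settings is inferred from the example above, since the plugin itself decides how to interpret its settings string.

```python
def parse_token_filter_spec(spec):
    """Split 'library:plugin[:settings]' into (library, plugin, settings dict).

    Assumption: settings are a semicolon-separated list of key=value pairs,
    as in the example above. A real plugin may use any format it likes.
    """
    # Split on the first two colons only, so the settings may contain colons.
    parts = spec.split(":", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("expected 'library:plugin[:settings]', got %r" % spec)
    library, plugin = parts[0], parts[1]
    raw_settings = parts[2] if len(parts) == 3 else ""

    settings = {}
    for pair in raw_settings.split(";"):
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        # A bare word without '=' is kept with an empty value.
        settings[key.strip()] = value.strip() if sep else ""
    return library, plugin, settings
```

For the declaration above, this yields the library `my_lib.so`, the plugin `email_process`, and the settings `field` and `split`.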
The call workflow for index-time token filter is as follows:
- XXX_init() gets called right after indexer creates the token filter with an empty fields list, and then again after indexer gets the table schema with the actual fields list. It must return zero for successful initialization or an error description otherwise.
- XXX_begin_document gets called only for RT table INSERT/REPLACE, once for every document. It must return zero for a successful call or an error description otherwise. Additional parameters/settings can be passed to the function using OPTION token_filter_options, for example:
INSERT INTO rt (id, title) VALUES (1, 'some text [email protected]') OPTION token_filter_options='.io'
- XXX_begin_field gets called once for each field prior to processing the field with the base tokenizer, with the field number as its parameter.
- XXX_push_token gets called once for each new token produced by the base tokenizer, with the source token as its parameter. It must return the token, the count of extra tokens made by the token filter, and the delta position for the token.
- XXX_get_extra_token gets called multiple times in case XXX_push_token reports extra tokens. It must return the token and the delta position for that extra token.
- XXX_end_field gets called once right after the source tokens from the current field are processed.
- XXX_deinit gets called at the very end of indexing.
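The call order above can be modelled in plain Python. This is only a sketch of the sequence for the RT path, not the plugin ABI: real plugins are shared libraries with their own function signatures, and the class `EmailSplitFilter`, the driver `index_documents`, the whitespace "base tokenizer" and the delta values are all hypothetical choices made for illustration.

```python
class EmailSplitFilter:
    """Toy index-time filter: emits the parts of an e-mail as extra tokens."""

    def __init__(self):
        self.calls = []    # callback names, in the order they ran
        self.pending = []  # extra tokens waiting for get_extra_token()

    def init(self, fields, settings):
        self.calls.append("init")
        self.fields = list(fields)
        return 0  # zero means success

    def begin_document(self, options):
        self.calls.append("begin_document")
        return 0

    def begin_field(self, field_index):
        self.calls.append("begin_field")

    def push_token(self, token):
        self.calls.append("push_token")
        local, at, domain = token.partition("@")
        self.pending = [local, domain] if at and local and domain else []
        # (token, number of extra tokens, delta position)
        return token, len(self.pending), 1

    def get_extra_token(self):
        self.calls.append("get_extra_token")
        # Delta 0 keeps the extra token at the position of the source token
        # (an illustrative choice).
        return self.pending.pop(0), 0

    def end_field(self):
        self.calls.append("end_field")

    def deinit(self):
        self.calls.append("deinit")


def index_documents(filt, schema, documents, settings=""):
    """Drive the filter in the documented order; return (field, pos, token)."""
    # init runs twice: first with an empty field list, then with the schema.
    if filt.init([], settings) != 0 or filt.init(schema, settings) != 0:
        raise RuntimeError("token filter failed to initialize")
    postings = []
    for doc in documents:
        if filt.begin_document("") != 0:
            raise RuntimeError("begin_document failed")
        for field_index, name in enumerate(schema):
            filt.begin_field(field_index)
            position = 0
            # str.split() stands in for the base tokenizer.
            for source in doc.get(name, "").split():
                token, extra_count, delta = filt.push_token(source)
                position += delta
                postings.append((name, position, token))
                for _ in range(extra_count):
                    extra, delta = filt.get_extra_token()
                    position += delta
                    postings.append((name, position, extra))
            filt.end_field()
    filt.deinit()
    return postings
```

Running `index_documents(EmailSplitFilter(), ["title"], [{"title": "mail bob@example.com"}])` and then inspecting the filter's `calls` list shows each callback firing in the order listed above.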
The following functions are mandatory to be defined:
The query-time tokenizer is created on every search that invokes full-text matching, once for each table involved.
The call workflow for query-time token filter is as follows:
- XXX_init() gets called once per table prior to parsing the query, with two parameters: the max token length and a string set by the token_filter option, for example:
SELECT * FROM index WHERE MATCH ('test') OPTION token_filter='my_lib.so:query_email_process:io'
It must return zero for successful initialization or an error description otherwise.
- XXX_push_token() gets called once for each new token produced by the base tokenizer, with these parameters: the token produced by the base tokenizer, a pointer to the raw token in the source query string, and the raw token length. It must return the token and the delta position for the token.
- XXX_pre_morph() gets called once for the token right before it gets passed to the morphology processor, with a reference to the token and the stopword flag. It might set the stopword flag to mark the token as a stopword.
- XXX_post_morph() gets called once for the token after it is processed by the morphology processor, with a reference to the token and the stopword flag. It might set the stopword flag to mark the token as a stopword. It must return a flag; a non-zero value means the token from before morphology processing should be used.
- XXX_deinit() gets called at the very end of query processing.
Absence of the functions is tolerated.
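The query-time sequence can be modelled the same way. The sketch below is a pure-Python illustration, not the plugin interface: `SuffixGuardFilter`, `toy_stem`, `process_query` and the default `max_token_len` of 40 are hypothetical. Dropping a token as soon as it is flagged as a stopword is also an assumption about how the flag is consumed. Because Python has no pass-by-reference, the callbacks return the token and flag instead of modifying them in place.

```python
class SuffixGuardFilter:
    """Toy query-time filter: shields tokens with a given suffix from morphology."""

    STOPWORDS = {"the", "a"}

    def init(self, max_token_len, options):
        self.max_token_len = max_token_len
        self.suffix = options      # the string passed after the plugin name
        self.before_morph = ""
        return 0                   # zero means success

    def push_token(self, token, raw_token):
        # Return the (possibly rewritten) token and its delta position.
        return token[: self.max_token_len], 1

    def pre_morph(self, token, is_stopword):
        self.before_morph = token
        return token, is_stopword or token in self.STOPWORDS

    def post_morph(self, token, is_stopword):
        # A non-zero flag asks the caller to use the pre-morphology token.
        guarded = bool(self.suffix) and self.before_morph.endswith(self.suffix)
        return token, is_stopword, 1 if guarded else 0

    def deinit(self):
        self.before_morph = ""


def toy_stem(token):
    """Stand-in for the morphology processor: strips one trailing 's'."""
    return token[:-1] if token.endswith("s") else token


def process_query(filt, query, morphology, options="", max_token_len=40):
    """Run the callbacks in the documented order and return the final tokens."""
    if filt.init(max_token_len, options) != 0:
        raise RuntimeError("token filter failed to initialize")
    result = []
    for raw in query.split():  # str.split() stands in for the base tokenizer
        token, _delta = filt.push_token(raw.lower(), raw)
        token, stop = filt.pre_morph(token, False)
        if stop:
            continue
        stemmed, stop, keep_original = filt.post_morph(morphology(token), False)
        if stop:
            continue
        result.append(token if keep_original else stemmed)
    filt.deinit()
    return result
```

With `options="ss"`, the query "The cats glass" keeps "glass" intact while "cats" is still stemmed and "the" is dropped as a stopword.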
indextool is a helpful utility that extracts various information about a physical table, excluding distributed tables. Here's the general syntax for utilizing indextool:
indextool <command> [options]
These options are applicable to all commands:
- -c <file> (the short form) lets you override the default configuration file names.
- -q (the short form) suppresses the output of banners and such by indextool.
- -h (the short form) displays all parameters available in your specific build of indextool.
- -v displays the version information of your specific indextool build.
Here are the available commands:
- --checkconfig loads the config file and checks it for validity and syntax errors.
- --buildidf DICTFILE1 [DICTFILE2 ...] --out IDFILE constructs an IDF file from one or more dictionary dumps (refer to --dumpdict). The additional parameter --skip-uniq will omit unique words (df=1).
- --build-infixes TABLENAME generates infixes for a pre-existing dict=keywords table (updates .sph, .spi in place). Use this option for legacy table files already employing dict=keywords, but now requiring infix search support; updating the table files with indextool may be simpler or quicker than recreating them from scratch with indexer.
- --dumpheader FILENAME.sph promptly dumps the given table header file without disturbing any other table files or even the config file. The report offers a detailed view of all the table settings, especially the complete attribute and field list.
- --dumpconfig FILENAME.sph extracts the table definition from the specified table header file in an (almost) manticore.conf file-compliant format.
- --dumpheader TABLENAME dumps the table header by table name, looking up the header path in the config file.
- --dumpdict TABLENAME dumps the dictionary. An extra -stats switch will add the total document count to the dictionary dump. This is necessary for dictionary files used in IDF file creation.
- --dumpdocids TABLENAME dumps document IDs by table name.
- --dumphitlist TABLENAME KEYWORD dumps all instances (occurrences) of a specified keyword in a given table, with the keyword defined as text.
- --dumphitlist TABLENAME --wordid ID dumps all instances (occurrences) of a specific keyword in a given table, with the keyword represented as an internal numeric ID.
- --docextract TBL DOCID executes a standard table check pass of the entire dictionary/docs/hits, and gathers all the words and hits associated with the requested document. Subsequently, all the words are arranged according to their fields and positions, and the result is printed, grouped by field.
- --fold TABLENAME OPTFILE helps you understand how the tokenizer processes input. You can supply indextool with text from a file, if specified, or from stdin otherwise. The output will replace separators with spaces (based on your charset_table settings) and convert letters in words to lowercase.
- --htmlstrip TABLENAME applies the HTML stripper settings for a specified table to filter stdin, and sends the filtering results to stdout. Be aware that the settings will be fetched from manticore.conf, and not from the table header.
- --mergeidf NODE1.idf [NODE2.idf ...] --out GLOBAL.idf combines multiple .idf files into a single one. The extra parameter --skip-uniq will ignore unique words (df=1).
- --morph TABLENAME applies morphology to the given stdin and directs the result to stdout.
- --check TABLENAME evaluates the table data files for consistency errors that could be caused by bugs in indexer or hardware faults. --check is also functional on RT tables, RAM, and disk chunks. Additional options:
  - --check-id-dups checks for duplicate ids.
  - --check-disk-chunk CHUNK_NAME checks only a specific disk chunk of an RT table. The argument is the numeric extension of the RT table's disk chunk to be checked.
- --strip-path removes the path names from all file names referred to from the table (stopwords, wordforms, exceptions, etc). This is helpful when verifying tables built on a different machine with possibly varying path layouts.
- --rotate is only compatible with --check and determines whether to check the table waiting for rotation, i.e., with a .new extension. This is useful when you wish to validate your table before actually putting it into use.
- --apply-killlists loads and applies kill-lists for all tables listed in the config file. Changes are saved in .SPM files. Kill-list files (.SPK) are removed. This can be handy if you want to shift the application of kill-lists from server startup to the indexing stage.
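As a rough illustration of the kind of output --fold produces, the sketch below replaces separators with spaces and lowercases everything else. It is an approximation under a stated assumption: the set of "word" characters here is simply alphanumerics plus underscore, standing in for whatever your charset_table actually defines. The function name `fold` is hypothetical and does not call indextool.

```python
def fold(text, extra_word_chars="_"):
    """Lowercase word characters and turn every other character into a space.

    Assumption: alphanumerics plus `extra_word_chars` play the role of the
    characters allowed by charset_table; the real set is table-specific.
    """
    out = []
    for ch in text:
        if ch.isalnum() or ch in extra_word_chars:
            out.append(ch.lower())
        else:
            out.append(" ")  # separators become spaces, one for one
    return "".join(out)
```

For instance, `fold("Foo-Bar_42")` gives `"foo bar_42"` under these assumptions.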
The spelldump command is designed to retrieve the contents from a dictionary file that employs the MySpell format. This can be handy when you need to compile word lists for wordforms, as it generates all possible forms for you.
Here's the general syntax:
spelldump [options] <dictionary> <affix> [result] [locale-name]
The primary parameters are the main file and the affix file of the dictionary. Typically, these are named [language-prefix].dict and [language-prefix].aff, respectively. You can find these files in most standard Linux distributions or from numerous online sources.
The [result] parameter is where the extracted dictionary data will be stored, and [locale-name] specifies the locale to use.
There's also an optional -c [file] option, which lets you specify a file with case conversion details.
Here are some usage examples:
spelldump en.dict en.aff
spelldump ru.dict ru.aff ru.txt ru_RU.CP1251
spelldump ru.dict ru.aff ru.txt .1251
The resulting file will list all the words from the dictionary, arranged alphabetically and formatted like a wordforms file. You can then modify this file as per your specific requirements. Here's a sample of what the output file might look like:
zone > zone
zoned > zoned
zoning > zoning
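If you want to post-process the result file before using it as a wordforms list, a small script is usually enough. The sketch below assumes each line has the `source > destination` shape shown in the sample. The function names `parse_wordforms` and `drop_identity` are hypothetical helpers, not part of spelldump.

```python
def parse_wordforms(lines):
    """Parse 'source > destination' lines into (source, destination) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source, sep, destination = line.partition(">")
        if not sep:
            raise ValueError("expected 'source > destination', got %r" % line)
        pairs.append((source.strip(), destination.strip()))
    return pairs


def drop_identity(pairs):
    """Keep only mappings that actually change the word."""
    return [(src, dst) for src, dst in pairs if src != dst]
```

One possible use is to read the result file with `parse_wordforms(open("ru.txt"))`, filter or edit the pairs, and write them back in the same format.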
The wordbreaker tool is designed to deconstruct compound words, a common feature in URLs, into their individual components. For instance, it can dissect "lordoftherings" into four separate words or break down http://manofsteel.warnerbros.com into "man of steel warner bros". This ability enhances search functionality by eliminating the need for prefixes or infixes. To illustrate, a search for "sphinx" wouldn't yield "sphinxsearch" in the results. However, if you apply wordbreaker to disassemble the compound word and index the detached elements, the search will succeed without the file size expansion associated with prefix or infix usage in full-text indexing.
Here's an example of how to use wordbreaker:
echo manofsteel | bin/wordbreaker -dict dict.txt split
man of steel
The -dict dictionary file is used to separate the input stream into individual words. If no dictionary file is specified, wordbreaker will look for a file named wordbreaker-dict.txt in the current working directory. (Ensure that the dictionary file matches the language of the compound word you're working with.) The split command breaks words from the standard input and sends the results to the standard output. The bench commands are also available to assess the splitting quality and measure the performance of the splitting function, respectively.
Wordbreaker uses a dictionary to identify individual substrings within a given string. To distinguish between multiple potential splits, it considers the relative frequency of each word in the dictionary: a higher frequency makes a split at that word more likely. To generate a file of this nature, you can use indexer:
indexer --buildstops dict.txt 100000 --buildfreqs myindex -c /path/to/manticore.conf
This command produces a text file named dict.txt that contains the 100,000 most frequently occurring words from myindex, along with their respective counts. Since this output file is a simple text document, you can edit it manually whenever needed, adding or removing words as required.
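To show how word frequencies can decide between competing splits, here is a minimal sketch of a frequency-weighted segmentation using dynamic programming. This is a generic textbook technique, not wordbreaker's actual implementation. The one-"word count"-pair-per-line file layout, the function names `load_counts` and `split_compound`, and the `max_word_len` limit are assumptions made for the example.

```python
import math


def load_counts(lines):
    """Read 'word count' lines (assumed layout) into a dict of frequencies."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        word, count = parts
        counts[word] = counts.get(word, 0) + int(count)
    return counts


def split_compound(text, counts, max_word_len=20):
    """Return the most probable split of `text`, or None if none exists.

    Each candidate split is scored by the sum of log relative frequencies
    of its words, so frequent words and shorter splits are preferred.
    """
    total = sum(counts.values())
    # best[i] holds (score, words) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word not in counts or best[start][1] is None:
                continue
            score = best[start][0] + math.log(counts[word] / total)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[len(text)][1]
```

With a dictionary that contains "man", "of" and "steel" with high counts and "mano" and "fsteel" with low counts, `split_compound("manofsteel", counts)` prefers the three-word split because its combined frequency score is higher.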
The Manticore Search API is documented using the OpenAPI specification, which can be used to generate client SDKs. The machine-readable YAML file is available at https://raw.githubusercontent.com/manticoresoftware/openapi/master/manticore.yml
You can also view the specification visualized with the online Swagger Editor.