Openapi | Manticore Search Manual

Token filter plugins let you implement a custom tokenizer that creates tokens according to custom rules. There are two types:

Index-time tokenizer, declared by index_token_filter in the table settings
Query-time tokenizer, declared by the token_filter OPTION directive

In the text processing pipeline, token filters run after the base tokenizer, which processes the text from a field or query and creates tokens out of it.

An index-time tokenizer is created by indexer when indexing source data into a table, or by an RT table when processing INSERT or REPLACE statements.

A plugin is declared as library name:plugin name:optional string of settings. The init functions of the plugin can accept arbitrary settings, passed as a string in the format option1=value1;option2=value2;...

Example:

index_token_filter = my_lib.so:email_process:field=email;split=.io
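To make the declaration format concrete, here is a minimal Python sketch that splits such a string into its three parts. The function name parse_plugin_spec is hypothetical and not part of Manticore; how a real plugin interprets its settings string is up to the plugin's own init function, so treat this only as an illustration of the format.

```python
def parse_plugin_spec(spec):
    """Split 'library:plugin:settings' into its parts (settings are optional)."""
    # Split on the first two colons only, so the settings string stays intact.
    library, plugin, *rest = spec.split(":", 2)
    settings = {}
    if rest and rest[0]:
        for pair in rest[0].split(";"):
            if not pair:
                continue  # tolerate a trailing ';'
            key, _, value = pair.partition("=")
            settings[key] = value
    return library, plugin, settings


print(parse_plugin_spec("my_lib.so:email_process:field=email;split=.io"))
```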

The call workflow for index-time token filter is as follows:

XXX_init() gets called right after indexer creates the token filter with an empty fields list, and then again after indexer gets the table schema with the actual fields list. It must return zero for successful initialization or an error description otherwise.
XXX_begin_document gets called only for RT table INSERT/REPLACE, for every document. It must return zero for a successful call or an error description otherwise. Additional parameters/settings can be passed to the function using OPTION token_filter_options:
```
INSERT INTO rt (id, title) VALUES (1, 'some text corp@space.io') OPTION token_filter_options='.io'
```
XXX_begin_field gets called once for each field, before the field is processed by the base tokenizer, with the field number as its parameter.
XXX_push_token gets called once for each new token produced by the base tokenizer, with the source token as its parameter. It must return the token, the count of extra tokens made by the token filter, and the delta position for the token.
XXX_get_extra_token gets called multiple times if XXX_push_token reports extra tokens. It must return the token and the delta position for that extra token.
XXX_end_field gets called once, right after the source tokens from the current field are exhausted.
XXX_deinit gets called at the very end of indexing.

The following functions are mandatory: XXX_begin_document, XXX_push_token, and XXX_get_extra_token.
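The call order for a single field can be modelled in a few lines of Python. This is not a real plugin (real plugins are compiled shared libraries) and it does not use Manticore's actual signatures. The class EmailSplitFilter and the driver run_field are hypothetical names. The sketch assumes the base tokenizer emits corp@space.io as one token and that a delta position is added to the previous token's position.

```python
class EmailSplitFilter:
    """Hypothetical filter: keeps an e-mail token and adds its parts as extra tokens."""

    def __init__(self):
        self.extra = []

    def begin_field(self, field_number):
        self.extra = []

    def push_token(self, token):
        # Returns (token, count of extra tokens, delta position).
        self.extra = token.split("@") if "@" in token else []
        return token, len(self.extra), 1

    def get_extra_token(self):
        # Returns (extra token, delta position); delta 0 keeps the same position.
        return self.extra.pop(0), 0

    def end_field(self):
        self.extra = []


def run_field(token_filter, field_number, base_tokens):
    """Drive the filter in the documented order; returns (token, position) pairs."""
    out, position = [], 0
    token_filter.begin_field(field_number)
    for source in base_tokens:
        token, extra_count, delta = token_filter.push_token(source)
        position += delta
        out.append((token, position))
        # get_extra_token is called once per extra token reported by push_token.
        for _ in range(extra_count):
            extra, extra_delta = token_filter.get_extra_token()
            position += extra_delta
            out.append((extra, position))
    token_filter.end_field()
    return out


print(run_field(EmailSplitFilter(), 0, ["some", "text", "corp@space.io"]))
```

In this model the extra tokens corp and space.io share position 3 with the original token.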

A query-time tokenizer is created on search each time full-text search is invoked, once for every table involved.

The call workflow for query-time token filter is as follows:

XXX_init() gets called once per table prior to parsing the query, with two parameters: the max token length and the string set by the token_filter option
```
SELECT * FROM index WHERE MATCH ('test') OPTION token_filter='my_lib.so:query_email_process:io'
```
It must return zero for successful initialization or error description otherwise.
XXX_push_token() gets called once for each new token produced by the base tokenizer, with these parameters: the token produced by the base tokenizer, a pointer to the raw token in the source query string, and the raw token length. It must return the token and the delta position for the token.
XXX_pre_morph() gets called once for each token right before it is passed to the morphology processor, with a reference to the token and a stopword flag. It may set the stopword flag to mark the token as a stopword.
XXX_post_morph() gets called once for each token after it has been processed by the morphology processor, with a reference to the token and a stopword flag. It may set the stopword flag to mark the token as a stopword. It must return a flag; a non-zero value means the token prior to morphology processing should be used.
XXX_deinit() gets called at the very end of query processing.

Absence of any of the functions is tolerated.
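The interplay of the stopword flag and the post-morphology return value can be sketched as a Python model. The names QueryFilter, naive_morphology, and process_query are hypothetical, and the morphology step is a toy stand-in. The real callbacks are C functions whose exact signatures are not shown here.

```python
def naive_morphology(token):
    """Toy stand-in for the morphology processor: strips a trailing 's'."""
    return token[:-1] if token.endswith("s") else token


class QueryFilter:
    """Hypothetical query-time filter modelled on the callbacks above."""

    def __init__(self, max_token_len, options):
        # Assumption: the option string is a comma-separated list of protected tokens.
        self.protected = set(options.split(","))
        self.last = None

    def push_token(self, token):
        return token.lower(), 1  # (token, delta position)

    def pre_morph(self, token):
        self.last = token        # remember the token before morphology
        return token == "the"    # True marks the token as a stopword

    def post_morph(self, token):
        # Returns (stopword flag, non-zero flag => use the pre-morphology token).
        return False, int(self.last in self.protected)


def process_query(tokens, token_filter):
    result = []
    for raw in tokens:
        token, _delta = token_filter.push_token(raw)
        if token_filter.pre_morph(token):
            continue  # dropped as a stopword before morphology
        stemmed = naive_morphology(token)
        stopword, use_original = token_filter.post_morph(stemmed)
        if stopword:
            continue
        result.append(token if use_original else stemmed)
    return result


print(process_query(["The", "Cats", "news"], QueryFilter(40, "news")))
```

Here "the" is dropped as a stopword, "cats" goes through morphology, and "news" is kept in its pre-morphology form because post_morph returns a non-zero flag for it.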

Miscellaneous tools

Last modified: December 02, 2022

indextool is a helper tool used to dump miscellaneous information about a physical table (not template or distributed). The general usage is:

indextool <command> [options]

Options effective for all commands:

--config <file> (-c <file> for short) overrides the built-in config file names.
--quiet (-q for short) keeps indextool quiet: it will not output the banner, etc.
--help (-h for short) lists all of the parameters that can be called in your particular build of indextool.
-v show version information of your particular build of indextool.

The commands are as follows:

--checkconfig just loads and verifies the config file to check if it's valid, without syntax errors.
--buildidf DICTFILE1 [DICTFILE2 ...] --out IDFILE build IDF file from one or several dictionary dumps. Additional parameter --skip-uniq will skip unique (df=1) words.
--build-infixes TABLENAME build infixes for an existing dict=keywords table (upgrades .sph, .spi in place). You can use this option for legacy table files that already use dict=keywords, but now need to support infix searching too; updating the table files with indextool may prove easier or faster than regenerating them from scratch with indexer.
--dumpheader FILENAME.sph quickly dumps the provided table header file without touching any other table files or even the configuration file. The report provides a breakdown of all the table settings, in particular the entire attribute and field list.
--dumpconfig FILENAME.sph dumps the table definition from the given table header file in (almost) compliant sphinx.conf file format.
--dumpheader TABLENAME dumps table header by table name with looking up the header path in the configuration file.
--dumpdict TABLENAME dumps the dictionary. The additional -stats switch adds the total number of documents to the dump; this is required for dictionary files that are used to create IDF files.
--dumpdocids TABLENAME dumps document IDs by table name.
--dumphitlist TABLENAME KEYWORD dumps all the hits (occurrences) of a given keyword in a given table, with keyword specified as text.
--dumphitlist TABLENAME --wordid ID dumps all the hits (occurrences) of a given keyword in a given table, with keyword specified as internal numeric ID.
--docextract TBL DOCID runs the usual table check pass over the whole dictionary/docs/hits and collects all the words and hits belonging to the requested document. The words are then ordered by field and position, and the result is printed, grouped by field.
--fold TABLENAME OPTFILE This option is useful to see how the tokenizer actually processes input. You can feed indextool text from a file, if specified, or from stdin otherwise. The output will contain spaces instead of separators (according to your charset_table settings) and lowercased letters in words.
--htmlstrip TABLENAME filters stdin using HTML stripper settings for a given table, and prints the filtering results to stdout. Note that the settings will be taken from sphinx.conf, and not the table header.
--mergeidf NODE1.idf [NODE2.idf ...] --out GLOBAL.idf merge several .idf files into a single one. Additional parameter --skip-uniq will skip unique (df=1) words.
--morph TABLENAME applies morphology to the given stdin and prints the result to stdout.
--check TABLENAME checks the table data files for consistency errors that might be introduced by bugs in indexer and/or hardware faults. --check also works on RT tables, RAM and disk chunks. Additional options:
- --check-id-dups checks if there are duplicate ids
- --check-disk-chunk CHUNK_NAME checks only specific disk chunk of an RT table. The argument is a disk chunk numeric extension of the RT table to check.
--strip-path strips the path names from all the file names referenced from the table (stopwords, wordforms, exceptions, etc). This is useful for checking tables built on another machine with possibly different path layouts.
--rotate works only with --check and defines whether to check the table waiting for rotation, i.e. the one with the .new extension. This is useful when you want to check your table before actually using it.
--apply-killlists loads and applies kill-lists for all tables listed in the config file. Changes are saved in .SPM files. Kill-list files (.SPK) are deleted. This can be useful if you want to move the application of kill-lists from server startup to the indexing stage.

spelldump is used to extract contents of a dictionary file that uses ispell or MySpell format, which can help build word lists for wordforms - all of the possible forms are pre-built for you.

The general usage is:

spelldump [options] <dictionary> <affix> [result] [locale-name]

The two main parameters are the dictionary's main file and its affix file; usually these are named as [language-prefix].dict and [language-prefix].aff and will be available with most common Linux distributions, as well as various places online.

[result] specifies where the dictionary data should be output to, and [locale-name] additionally specifies the locale details you wish to use.

There is an additional option, -c [file], which specifies a file for case conversion details.

Examples of its usage are:

spelldump en.dict en.aff
spelldump ru.dict ru.aff ru.txt ru_RU.CP1251
spelldump ru.dict ru.aff ru.txt .1251

The results file will contain a list of all the words in the dictionary in alphabetical order, output in the format of a wordforms file, which you can then customize for your specific circumstances. An example of the result file:

zone > zone
zoned > zoned
zoning > zoning
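As an illustration of customizing such output, the following Python sketch reads lines in the "source > destination" format shown above into a dictionary and remaps one form. The function name parse_wordforms is hypothetical, and the sketch assumes every mapping line contains exactly one ">" separator.

```python
def parse_wordforms(text):
    """Map each source form to its destination form from 'src > dst' lines."""
    mapping = {}
    for line in text.splitlines():
        if ">" not in line:
            continue  # skip blank or unrelated lines
        source, _, destination = line.partition(">")
        mapping[source.strip()] = destination.strip()
    return mapping


sample = "zone > zone\nzoned > zoned\nzoning > zoning\n"
forms = parse_wordforms(sample)

# Customization example: fold an inflected form onto its base form.
forms["zoned"] = "zone"

for source, destination in sorted(forms.items()):
    print(f"{source} > {destination}")
```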

wordbreaker is used to split compound words, as is usual in URLs, into their component words. For example, this tool can split "lordoftherings" into its four component words, or http://manofsteel.warnerbros.com into "man of steel warner bros". This helps searching without requiring prefixes or infixes: searching for "sphinx" wouldn't match "sphinxsearch", but if you break the compound word and index the separate components, you'll get a match without the larger full-text index files that prefixes and infixes cost.

Examples of its usage are:

echo manofsteel | bin/wordbreaker -dict dict.txt split
man of steel

The input stream will be separated into words using the -dict dictionary file. If no dictionary is specified, wordbreaker looks in the working folder for a wordbreaker-dict.txt file. (The dictionary should match the language of the compound word.) The split command breaks words from the standard input and outputs the result to the standard output. There are also test and bench commands that let you test the splitting quality and benchmark the splitting functionality.

Wordbreaker needs a dictionary to recognize individual substrings within a string. To differentiate between different guesses, it uses the relative frequency of each word in the dictionary: higher frequency means higher split probability. You can generate such a file using the indexer tool:

indexer --buildstops dict.txt 100000 --buildfreqs myindex -c /path/to/sphinx.conf

which will write the 100,000 most frequent words, along with their counts, from myindex into dict.txt. The output file is a text file, so you can edit it by hand, if need be, to add or remove words.
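The idea that a higher word frequency means a higher split probability can be sketched with a small dynamic-programming splitter in Python. This is a generic unigram segmentation, not wordbreaker's actual algorithm. The function name split_compound and the toy frequency table are assumptions made for illustration.

```python
import math


def split_compound(text, freqs):
    """Most probable split of `text` into dictionary words, or None if impossible."""
    total = sum(freqs.values())
    # best[i] holds (log-probability, words) for the best split of text[:i].
    best = [(0.0, [])] + [None] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is None or word not in freqs:
                continue
            score = best[start][0] + math.log(freqs[word] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1] if best[-1] is not None else None


# Toy dictionary of word counts; real counts would come from a frequency file.
freqs = {"man": 500, "of": 900, "steel": 120, "mano": 2, "f": 1}
print(" ".join(split_compound("manofsteel", freqs)))
```

With these counts the split "man of steel" outscores the competing guess "mano f steel", because each of its words is far more frequent.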

OpenAPI specification

Last modified: January 25, 2023

The Manticore Search API is documented using the OpenAPI specification that can be used to generate client SDKs. A machine readable YAML file is available at https://raw.githubusercontent.com/manticoresoftware/openapi/master/manticore.yml

You can also look at the specification visualized with the online Swagger Editor here.

Telemetry

Last modified: August 02, 2022

At Manticore, we gather various anonymized metrics in order to enhance the quality of our products, including Manticore Search. By analyzing this data, we can not only improve the overall performance of our product, but also identify which features would be most beneficial to prioritize in order to provide even more value to our users. The telemetry system operates on a separate thread in non-blocking mode, taking snapshots and sending them once every few minutes.

We take your privacy seriously, and you can be assured that all metrics are completely anonymous and no sensitive information is transmitted. However, if you still wish to disable telemetry, you can do so by:

setting the environment variable TELEMETRY=0
or setting telemetry = 0 in the section searchd of your configuration file

Here's a list of all the metrics we collect:

Metric	Description
collector	🏷 `buddy`. Indicates that this metric comes through Manticore Buddy
os_name	🏷️ Name of the operating system
machine_id	🏷 Server identifier (the content of `/etc/machine-id` in Linux)
manticore_version	🏷️ Version of Manticore
columnar_version	🏷️ Version of Columnar lib if it is installed
secondary_version	🏷️ Version of the secondary lib if the Columnar lib is installed
buddy_version	🏷️ Version of the Buddy
invocation	Sent when the Buddy is launched
show_queries	Indicates that the `show queries` command was executed
backup	Indicates that the `backup` query was executed
insert_query	Indicates that the auto schema logic was executed
command_*	All metrics with this prefix are sent from the `show status` query of the manticore daemon
uptime	The uptime of the manticore search daemon
workers_total	The number of workers that Manticore uses
cluster_*	Cluster-related metrics from the `show status` results
table_*_count	The number of tables created for each type: plain, percolate, rt, or distributed
field_count	The count for each field type for tables with rt and percolate types
columnar	Indicates that the columnar lib was used
columnar_field_count	The number of fields that use the columnar lib

The Manticore backup tool sends anonymized metrics to the Manticore metrics server by default in order to help the maintainers improve the product. If you wish to disable telemetry, you can do so by running the tool with the --disable-metric flag or by setting the environment variable TELEMETRY=0.

Below is a list of all metrics that we collect:

Metric	Description
collector	🏷 `backup`. Means this metric comes from the backup tool
os_name	🏷️ Name of the operating system
machine_id	🏷 Server identifier (the content of `/etc/machine-id` in Linux)
backup_version	🏷️ Version of the backup tool that was used
manticore_version	🏷️ Version of Manticore
columnar_version	🏷️ Version of Columnar lib if it is installed
secondary_version	🏷️ Version of the secondary lib if the Columnar lib is installed
invocation	Sent when backup was invoked
failed	Sent in the event of a failed backup
done	Sent when the backup/restore was successful
arg_*	The arguments used to run the tool (excluding index names, etc.)
backup_store_versions_fails	Indicates a failure to save the Manticore version in the backup
backup_table_count	The total number of backed up tables
backup_no_permissions	Failed to back up due to insufficient permissions to the destination directory
backup_total_size	The total size of the full backup
backup_time	The duration of the backup
restore_searchd_running	Failed to run the restore process due to searchd already being running
restore_no_config_file	No config file found in the backup during restore
restore_time	The duration of the restore
fsync_time	The duration of the fsync
restore_target_exists	Occurs when there is already a folder or index in the destination folder to restore to
terminations	Indicates that the process was terminated
signal_*	The signal used to terminate the process
tables	The number of tables in Manticore
config_unreachable	The specified configuration file does not exist
config_data_dir_missing	Failed to parse the `data_dir` from the specified configuration file
config_data_dir_is_relative	The `data_dir` path in the configuration file of the Manticore instance is relative


Last modified: January 04, 2023
