Token filter plugins let you implement a custom tokenizer that makes tokens according to custom rules. There are two types:
- Index-time tokenizer declared by `index_token_filter` in table settings
- Query-time tokenizer declared by the `token_filter` OPTION directive

In the text processing pipeline, token filters run after the base tokenizer, which processes the text from the field or query and produces tokens from it.
The index-time tokenizer is created by `indexer` when indexing source data into a table, or by an RT table when processing `INSERT` or `REPLACE` statements.
The plugin is declared as `library name:plugin name:optional string of settings`. The plugin's init function can accept arbitrary settings passed as a string in the format `option1=value1;option2=value2;...`.
Example:
index_token_filter = my_lib.so:email_process:field=email;split=.io
The call workflow for index-time token filter is as follows:
- `XXX_init()` gets called right after `indexer` creates the token filter with an empty field list, and then again after indexer receives the table schema with the actual field list. It must return zero for successful initialization or an error description otherwise.
- `XXX_begin_document` gets called only for RT table `INSERT`/`REPLACE`, for every document. It must return zero for a successful call or an error description otherwise. Additional parameters/settings can be passed to the function using the OPTION `token_filter_options`:
  `INSERT INTO rt (id, title) VALUES (1, 'some text corp@space.io') OPTION token_filter_options='.io'`
- `XXX_begin_field` gets called once for each field prior to processing the field with the base tokenizer, with the field number as its parameter.
- `XXX_push_token` gets called once for each new token produced by the base tokenizer, with the source token as its parameter. It must return the token, the count of extra tokens made by the token filter, and the delta position for the token.
- `XXX_get_extra_token` gets called multiple times in case `XXX_push_token` reports extra tokens. It must return a token and the delta position for that extra token.
- `XXX_end_field` gets called once right after the source tokens from the current field are exhausted.
- `XXX_deinit` gets called at the very end of indexing.
The following functions are mandatory: `XXX_begin_document`, `XXX_push_token`, and `XXX_get_extra_token`.
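For illustration, here is a minimal sketch of an index-time token filter in C, using the plugin name email_process from the example above. The function signatures below are simplified assumptions made for this sketch, not the authoritative plugin API; check the plugin header shipped with your Manticore version for the exact prototypes before building a real plugin.

```c
// Minimal sketch of an index-time token filter plugin ("email_process").
// NOTE: signatures are simplified assumptions for illustration only;
// consult the plugin header shipped with Manticore for the exact prototypes.
#include <stdio.h>
#include <string.h>

static char g_extra[128]; // buffer holding one pending extra token
static int  g_has_extra;  // non-zero while an extra token is pending

// Called after the filter is created; 'options' carries the settings string
// from index_token_filter (e.g. "field=email;split=.io").
int email_process_init ( void ** userdata, int num_fields,
                         const char ** field_names, const char * options,
                         char * error )
{
    (void)userdata; (void)num_fields; (void)field_names;
    (void)options; (void)error;
    return 0; // zero means successful initialization
}

// Called for every document on RT INSERT/REPLACE; 'options' receives the
// OPTION token_filter_options string.
int email_process_begin_document ( void * userdata, const char * options,
                                   char * error )
{
    (void)userdata; (void)options; (void)error;
    return 0;
}

// Called for every token the base tokenizer produces. This sketch splits
// "corp@space.io" into "corp" and reports "space.io" as one extra token.
char * email_process_push_token ( void * userdata, char * token,
                                  int * extra, int * delta )
{
    (void)userdata;
    char * at = strchr ( token, '@' );
    if ( at )
    {
        snprintf ( g_extra, sizeof(g_extra), "%s", at+1 );
        *at = '\0';     // truncate the original token at '@'
        g_has_extra = 1;
        *extra = 1;     // one extra token will follow
    } else
        *extra = 0;
    *delta = 1;         // default position increment
    return token;
}

// Called while push_token reports pending extra tokens.
char * email_process_get_extra_token ( void * userdata, int * delta )
{
    (void)userdata;
    if ( g_has_extra )
    {
        g_has_extra = 0;
        *delta = 0;     // place the extra token at the same position
        return g_extra;
    }
    return NULL;        // no more extra tokens
}
```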
A query-time tokenizer is created at search time, each time a full-text query is run, for every table involved.
The call workflow for query-time token filter is as follows:
- `XXX_init()` gets called once per table prior to parsing the query, with two parameters: the maximum token length and the string set by the `token_filter` option:
  `SELECT * FROM index WHERE MATCH ('test') OPTION token_filter='my_lib.so:query_email_process:io'`
  It must return zero for successful initialization or an error description otherwise.
- `XXX_push_token()` gets called once for each new token produced by the base tokenizer, with these parameters: the token produced by the base tokenizer, a pointer to the raw token in the source query string, and the raw token length. It must return the token and the delta position for the token.
- `XXX_pre_morph()` gets called once per token right before it is passed to the morphology processor, with a reference to the token and a stopword flag. It may set the stopword flag to mark the token as a stopword.
- `XXX_post_morph()` gets called once per token after it has been processed by the morphology processor, with a reference to the token and a stopword flag. It may set the stopword flag to mark the token as a stopword. It must return a flag; a non-zero value means to use the token as it was prior to morphology processing.
- `XXX_deinit()` gets called at the very end of query processing.
Absence of any of the functions is tolerated.
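For illustration, here is a matching minimal sketch of a query-time token filter in C, using the plugin name query_email_process from the OPTION example above. As with the index-time sketch, the signatures are simplified assumptions; consult the plugin header shipped with your Manticore version for the exact prototypes.

```c
// Minimal sketch of a query-time token filter plugin ("query_email_process").
// NOTE: signatures are simplified assumptions for illustration only.
#include <string.h>

// Called once per table before the query is parsed; 'max_token_len' is the
// maximum token length and 'options' is the string set by OPTION token_filter
// (e.g. "io" in the example above).
int query_email_process_init ( void ** userdata, int max_token_len,
                               const char * options, char * error )
{
    (void)userdata; (void)max_token_len; (void)options; (void)error;
    return 0; // zero means successful initialization
}

// Called for each token produced by the base tokenizer; 'raw' points into the
// source query string and 'raw_len' is the raw token length.
char * query_email_process_push_token ( void * userdata, char * token,
                                        const char * raw, int raw_len,
                                        int * delta )
{
    (void)userdata; (void)raw; (void)raw_len;
    *delta = 1;   // keep the default position increment
    return token; // pass the token through unchanged
}

// Called after morphology processing; returning non-zero tells the engine to
// use the token as it was before morphology was applied.
int query_email_process_post_morph ( void * userdata, char * token,
                                     int * stopword )
{
    (void)userdata; (void)stopword;
    return strchr ( token, '.' ) ? 1 : 0; // keep domain-like tokens unstemmed
}
```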
indextool is a helper utility that dumps various information about a physical table (it does not work with template or distributed tables). The general syntax for using indextool is:
indextool <command> [options]
Options effective for all commands:
- `--config <file>` (`-c <file>` for short) overrides the built-in config file names.
- `--quiet` (`-q` for short) keeps indextool quiet - it will not output banner, etc.
- `--help` (`-h` for short) lists all of the parameters that can be used in your particular build of `indextool`.
- `-v` shows version information of your particular build of `indextool`.
The commands are as follows:
- `--checkconfig` just loads and verifies the config file to check whether it's valid and free of syntax errors.
- `--buildidf DICTFILE1 [DICTFILE2 ...] --out IDFILE` builds an IDF file from one or several dictionary dumps. The additional parameter `--skip-uniq` will skip unique (df=1) words.
- `--build-infixes TABLENAME` builds infixes for an existing dict=keywords table (upgrades .sph, .spi in place). You can use this option for legacy table files that already use dict=keywords but now need to support infix searching too; updating the table files with indextool may prove easier or faster than regenerating them from scratch with indexer.
- `--dumpheader FILENAME.sph` quickly dumps the provided table header file without touching any other table files or even the configuration file. The report provides a breakdown of all the table settings, in particular the entire attribute and field list.
- `--dumpconfig FILENAME.sph` dumps the table definition from the given table header file in (almost) compliant sphinx.conf file format.
- `--dumpheader TABLENAME` dumps the table header by table name, looking up the header path in the configuration file.
- `--dumpdict TABLENAME` dumps the dictionary. The additional `-stats` switch will add the total number of documents to the dictionary dump. It is required for dictionary files that are used for creation of IDF files.
- `--dumpdocids TABLENAME` dumps document IDs by table name.
- `--dumphitlist TABLENAME KEYWORD` dumps all the hits (occurrences) of a given keyword in a given table, with the keyword specified as text.
- `--dumphitlist TABLENAME --wordid ID` dumps all the hits (occurrences) of a given keyword in a given table, with the keyword specified as an internal numeric ID.
- `--docextract TBL DOCID` runs the usual table check pass of the whole dictionary/docs/hits, and collects all the words and hits belonging to the requested document. Then all of the words are placed in order according to their fields and positions, and the result is printed, grouped by field.
- `--fold TABLENAME OPTFILE` is useful to see how the tokenizer actually processes input. You can feed indextool with text from a file, if specified, or from stdin otherwise. The output will contain spaces instead of separators (according to your `charset_table` settings) and lowercased letters in words.
- `--htmlstrip TABLENAME` filters stdin using the HTML stripper settings for a given table, and prints the filtering results to stdout. Note that the settings will be taken from sphinx.conf, and not the table header.
- `--mergeidf NODE1.idf [NODE2.idf ...] --out GLOBAL.idf` merges several .idf files into a single one. The additional parameter `--skip-uniq` will skip unique (df=1) words.
- `--morph TABLENAME` applies morphology to the given stdin and prints the result to stdout.
- `--check TABLENAME` checks the table data files for consistency errors that might be introduced either by bugs in `indexer` and/or hardware faults. `--check` also works on RT tables, RAM and disk chunks. Additional options:
  - `--check-id-dups` checks if there are duplicate ids
  - `--check-disk-chunk CHUNK_NAME` checks only a specific disk chunk of an RT table. The argument is the numeric disk chunk extension of the RT table to check.
- `--strip-path` strips the path names from all the file names referenced from the table (stopwords, wordforms, exceptions, etc.). This is useful for checking tables built on another machine with a possibly different path layout.
- `--rotate` works only with `--check` and defines whether to check a table waiting for rotation, i.e. one with a .new extension. This is useful when you want to check your table before actually using it.
- `--apply-killlists` loads and applies kill-lists for all tables listed in the config file. Changes are saved in .SPM files. Kill-list files (.SPK) are deleted. This can be useful if you want to move the application of kill-lists from server startup to the indexing stage.
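For example, to verify the consistency of a table using a non-default configuration file (the table name and config path below are placeholders for your own values):

indextool --check mytable --config /path/to/manticore.conf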
spelldump is used to extract the contents of a dictionary file that uses the ispell or MySpell format, which can be useful in building word lists for wordforms - all of the possible forms are pre-built for you.
The general syntax is:
spelldump [options] <dictionary> <affix> [result] [locale-name]
The two main parameters are the dictionary's main file and its affix file; these are usually named [language-prefix].dict and [language-prefix].aff and can be found in most common Linux distributions and various online sources.
[result] is where the extracted dictionary data will be output, and [locale-name] specifies the locale details you wish to use.
There is also an optional -c [file] option, which specifies a file for case conversion details.
Examples of usage are:
spelldump en.dict en.aff
spelldump ru.dict ru.aff ru.txt ru_RU.CP1251
spelldump ru.dict ru.aff ru.txt .1251
The result file will contain a list of all the words in the dictionary, sorted alphabetically, in the format of a wordforms file, which you can then tailor to your specific needs. An example of what the result file could look like:
zone > zone
zoned > zoned
zoning > zoning
wordbreaker is used to split compound words, such as those commonly found in URLs, into their component words. For example, this tool can split "lordoftherings" into its four component words, or http://manofsteel.warnerbros.com into "man of steel warner bros". This helps in searching, as it eliminates the need for prefixes or infixes. For example, searching for "sphinx" would not match "sphinxsearch", but if you break the compound word and index the separate components, you would get a match without the increased file sizes that come with using prefixes and infixes in full-text indexing.
Examples of usage include:
echo manofsteel | bin/wordbreaker -dict dict.txt split
man of steel
The input stream will be separated into words using the -dict dictionary file. If no dictionary is specified, wordbreaker looks in the working folder for a wordbreaker-dict.txt file. (The dictionary should match the language of the compound word.) The split command breaks words from the standard input and outputs the result to the standard output. There are also test and bench commands that allow you to test the splitting quality and benchmark the splitting functionality.
Wordbreaker requires a dictionary to recognize individual substrings within a string. To differentiate between different guesses, it uses the relative frequency of each word in the dictionary, with higher frequency meaning a higher split probability. You can generate such a file using the indexer tool:
indexer --buildstops dict.txt 100000 --buildfreqs myindex -c /path/to/sphinx.conf
which will write the 100,000 most frequent words along with their counts from myindex into dict.txt. The output file is a text file, so it can be edited by hand if necessary to add or remove words.
The Manticore Search API is documented using the OpenAPI specification, which can be used to generate client SDKs. The machine-readable YAML file is available at https://raw.githubusercontent.com/manticoresoftware/openapi/master/manticore.yml
You can also view the specification visualized in the online Swagger Editor.
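As an illustration, a client SDK could be generated from the spec with any OpenAPI-compatible generator; the tool and target language below (openapi-generator, TypeScript) are just one possible choice, not something prescribed by Manticore:

openapi-generator-cli generate -i manticore.yml -g typescript-fetch -o ./manticore-client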
At Manticore, we gather various anonymized metrics to enhance the quality of our products, including Manticore Search. By analyzing this data, we can not only improve the overall performance of our product but also identify which features would be most beneficial to prioritize in order to provide even more value to our users. The telemetry system operates on a separate thread in a non-blocking mode, taking snapshots and sending them once every few minutes.
We take your privacy seriously, and you can rest assured that all metrics are completely anonymous and no sensitive information is transmitted. However, if you still wish to disable telemetry, you have the option to do so by:
- Setting the environment variable `TELEMETRY=0`
- Or setting `telemetry = 0` in the `searchd` section of your configuration file
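For example (the config path below is only a placeholder, and the rest of your searchd section stays as it is):

TELEMETRY=0 searchd --config /path/to/manticore.conf

or, in the configuration file:

searchd {
    # ... your other searchd settings ...
    telemetry = 0
}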
Here is a list of all the metrics we collect:
| Metric | Description |
|---|---|
| collector | 🏷 buddy. Indicates that this metric is collected through Manticore Buddy |
| os_name | 🏷️ Name of the operating system |
| machine_id | 🏷 Server identifier (the content of /etc/machine-id in Linux) |
| manticore_version | 🏷️ Version of Manticore |
| columnar_version | 🏷️ Version of the Columnar library if it is installed |
| secondary_version | 🏷️ Version of the secondary library if the Columnar library is installed |
| buddy_version | 🏷️ Version of Manticore Buddy |
| invocation | Sent when Manticore Buddy is launched |
| show_queries | Indicates that the show queries command was executed |
| backup | Indicates that the backup query was executed |
| insert_query | Indicates that the auto schema logic was executed |
| command_* | All metrics with this prefix are sent from the show status query of the Manticore daemon |
| uptime | The uptime of the Manticore Search daemon |
| workers_total | The number of workers used by Manticore |
| cluster_* | Cluster-related metrics from the show status results |
| table_*_count | The number of tables created for each type: plain, percolate, rt, or distributed |
| field_count | The count for each field type for tables with rt and percolate types |
| columnar | Indicates that the Columnar library was used |
| columnar_field_count | The number of fields that use the Columnar library |
The Manticore backup tool sends anonymized metrics to the Manticore metrics server by default in order to help improve the product. If you don't want to send telemetry, you can disable it by running the tool with the --disable-metric flag or by setting the environment variable TELEMETRY=0.
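For example (assuming the tool is invoked as manticore-backup; the backup directory below is a placeholder):

manticore-backup --disable-metric --backup-dir=/path/to/backups

or

TELEMETRY=0 manticore-backup --backup-dir=/path/to/backups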
The following is a list of all collected metrics:
| Metric | Description |
|---|---|
| collector | 🏷 backup. Indicates that this metric comes from the backup tool |
| os_name | 🏷️ Name of the operating system |
| machine_id | 🏷 Server identifier (the content of /etc/machine-id in Linux) |
| backup_version | 🏷️ Version of the backup tool used |
| manticore_version | 🏷️ Version of Manticore |
| columnar_version | 🏷️ Version of Columnar library if installed |
| secondary_version | 🏷️ Version of the secondary library if Columnar library is installed |
| invocation | Sent when backup was initiated |
| failed | Sent in case of failed backup |
| done | Sent when backup/restore is successful |
| arg_* | The arguments used to run the tool (excluding index names, etc.) |
| backup_store_versions_fails | Indicates failure in saving Manticore version in the backup |
| backup_table_count | Total number of backed up tables |
| backup_no_permissions | Failed backup due to insufficient permissions to destination directory |
| backup_total_size | Total size of the full backup |
| backup_time | Duration of the backup |
| restore_searchd_running | Failed to run restore process due to searchd already running |
| restore_no_config_file | No config file found in the backup during restore |
| restore_time | Duration of the restore |
| fsync_time | Duration of fsync |
| restore_target_exists | Occurs when a folder or index already exists in the destination folder for restore |
| terminations | Indicates that the process was terminated |
| signal_* | The signal used to terminate the process |
| tables | Number of tables in Manticore |
| config_unreachable | Specified configuration file does not exist |
| config_data_dir_missing | Failed to parse data_dir from specified configuration file |
| config_data_dir_is_relative | data_dir path in Manticore instance's configuration file is relative |