Token filter plugins let you implement a custom tokenizer that produces tokens according to custom rules. There are two types:
- Index-time tokenizer, declared by index_token_filter in the index settings
- Query-time tokenizer, declared by the token_filter OPTION directive
In the text processing pipeline, token filters run after the base tokenizer, which processes the text from a field or query and creates tokens out of it.
An index-time tokenizer is created by indexer when it indexes source data into an index, or by an RT index when it processes INSERT or REPLACE statements.
The plugin is declared as library name:plugin name:optional string of settings. The init functions of the plugin can accept arbitrary settings, passed as a string, for example:
index_token_filter = my_lib.so:email_process:field=email;split=.io
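To make the three-part declaration concrete, here is a minimal Python sketch that splits such a string into library, plugin name, and settings. The function names are hypothetical, and the key=value;key=value reading of the settings is an assumption drawn only from the example above; the plugin's own init function decides how the settings string is really interpreted.

```python
def parse_plugin_spec(spec):
    """Split 'library:plugin[:settings]' into its three parts.

    The settings part is optional and is returned as a raw string.
    """
    parts = spec.split(":", 2)  # at most three parts; settings may contain ':'
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("expected 'library:plugin[:settings]', got %r" % spec)
    settings = parts[2] if len(parts) == 3 else ""
    return parts[0], parts[1], settings


def parse_settings(settings):
    """ASSUMPTION: settings are ';'-separated key=value pairs, as in the example."""
    pairs = {}
    for item in settings.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


library, plugin, settings = parse_plugin_spec(
    "my_lib.so:email_process:field=email;split=.io"
)
print(library, plugin, parse_settings(settings))
```

Keeping the settings as a raw string until the last step mirrors the fact that the format is up to the plugin.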
The call workflow for an index-time token filter is as follows:
XXX_init() gets called right after indexer creates the token filter, with an empty field list, and then again after indexer gets the index schema, with the actual field list. It must return zero on successful initialization, or an error description otherwise.
XXX_begin_document gets called only for RT index INSERT/REPLACE, once for every document. It must return zero on success, or an error description otherwise. Additional parameters/settings can be passed to the function using OPTION token_filter_options:
INSERT INTO rt (id, title) VALUES (1, 'some text email@example.com') OPTION token_filter_options='.io'
XXX_begin_field gets called once for each field, before the field is processed by the base tokenizer, with the field number as its parameter.
XXX_push_token gets called once for each new token produced by the base tokenizer, with the source token as its parameter. It must return the token, the count of extra tokens made by the token filter, and the delta position for the token.
XXX_get_extra_token gets called multiple times if XXX_push_token reports extra tokens. It must return the token and the delta position for that extra token.
XXX_end_field gets called once, right after the source tokens from the current field are exhausted.
XXX_deinit gets called at the very end of indexing.
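The call order above can be modeled with a small Python simulation. This is only a sketch of the sequence of calls, not the plugin's actual C interface: the class, the method signatures, the whitespace "base tokenizer", and the choice to give extra tokens a position delta of 0 are all assumptions made for illustration.

```python
class EmailSplitFilter:
    """Toy filter: a token containing '@' also yields its parts as extra tokens."""

    def init(self, fields, options):
        self.fields = fields
        self.options = options
        return 0  # zero means success

    def begin_document(self, options):
        self.doc_options = options  # e.g. the token_filter_options string
        return 0

    def begin_field(self, field_index):
        self.extra = []

    def push_token(self, token):
        # Returns (token, number of extra tokens, position delta).
        self.extra = token.split("@") if "@" in token else []
        return token, len(self.extra), 1

    def get_extra_token(self):
        # ASSUMPTION: extra tokens share the source token's position (delta 0).
        return self.extra.pop(0), 0

    def end_field(self):
        self.extra = []

    def deinit(self):
        pass


def index_document(flt, fields, doc, doc_options=""):
    """Drive the filter in the order listed above; return (field, pos, token) hits."""
    hits = []
    if flt.begin_document(doc_options) != 0:
        raise RuntimeError("begin_document failed")
    for field_index, name in enumerate(fields):
        flt.begin_field(field_index)
        pos = 0
        for source in doc.get(name, "").split():  # stand-in for the base tokenizer
            token, extra_count, delta = flt.push_token(source)
            pos += delta
            hits.append((name, pos, token))
            for _ in range(extra_count):
                extra, extra_delta = flt.get_extra_token()
                pos += extra_delta
                hits.append((name, pos, extra))
        flt.end_field()
    return hits


flt = EmailSplitFilter()
flt.init([], "")         # first call: empty field list
flt.init(["title"], "")  # second call: actual field list from the schema
hits = index_document(flt, ["title"], {"title": "some text email@example.com"}, ".io")
flt.deinit()
print(hits)
```

The driver calls get_extra_token exactly as many times as push_token reported, which is the contract the list above describes.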
The following functions are mandatory to be defined:
A query-time tokenizer gets created on search, each time full-text search is invoked, by every index involved.
The call workflow for a query-time token filter is as follows:
XXX_init() gets called once per index, before the query is parsed, with two parameters: the max token length and the string set by the token_filter option:
SELECT * FROM index WHERE MATCH ('test') OPTION token_filter='my_lib.so:query_email_process:io'
It must return zero for successful initialization or error description otherwise.
XXX_push_token() gets called once for each new token produced by the base tokenizer, with these parameters: the token produced by the base tokenizer, a pointer to the raw token in the source query string, and the raw token length. It must return the token and the delta position for the token.
XXX_pre_morph() gets called once for each token right before it is passed to the morphology processor, with a reference to the token and the stopword flag. It may set the stopword flag to mark the token as a stopword.
XXX_post_morph() gets called once for each token after it has been processed by the morphology processor, with a reference to the token and the stopword flag. It may set the stopword flag to mark the token as a stopword. It must return a flag; a non-zero value means that the token as it was prior to morphology processing should be used.
XXX_deinit() gets called at the very end of query processing.
Absence of any of the functions is tolerated.
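The query-time hooks can be sketched the same way. The Python model below is illustrative only: the method signatures, the toy morphology step, and the decision to let a dropped stopword still advance the position are assumptions, not the plugin's real C interface. Missing methods are simply skipped, reflecting that absent functions are tolerated.

```python
class QueryFilter:
    """Toy query-time filter; names and behavior are hypothetical."""

    PROTECTED = {"news"}  # tokens that must keep their pre-morphology form
    STOPWORDS = {"the"}

    def init(self, max_token_len, options):
        self.max_token_len = max_token_len
        self.options = options  # the string set by the token_filter option
        return 0

    def push_token(self, token, raw_token):
        # Returns (token, position delta).
        return token[: self.max_token_len], 1

    def pre_morph(self, token):
        # Returns (token, stopword flag).
        self.last = token
        return token, token in self.STOPWORDS

    def post_morph(self, token):
        # Returns (token, stopword flag, use-pre-morphology-token flag).
        return token, False, self.last in self.PROTECTED


def toy_morphology(token):
    # Stand-in for the morphology processor: strip a trailing 's'.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def process_query(flt, query):
    """Run query tokens through the hooks in the order described above."""
    result = []
    pos = 0
    for raw in query.split():  # stand-in for the base tokenizer
        token, delta = raw.lower(), 1
        if hasattr(flt, "push_token"):
            token, delta = flt.push_token(token, raw)
        pos += delta
        stopword = False
        if hasattr(flt, "pre_morph"):
            token, stopword = flt.pre_morph(token)
        if stopword:
            continue  # ASSUMPTION: the position advances, the token is dropped
        morphed = toy_morphology(token)
        if hasattr(flt, "post_morph"):
            morphed, stopword, keep_original = flt.post_morph(morphed)
            if keep_original:
                morphed = token  # use the token as it was before morphology
        if not stopword:
            result.append((pos, morphed))
    return result


flt = QueryFilter()
flt.init(40, "io")
print(process_query(flt, "The news reports"))
```

Passing an object with none of the hooks, such as object(), leaves only the base tokenizer and morphology stand-ins in effect.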
indextool is a helper tool used to dump miscellaneous information about a physical index. The general usage is:
indextool <command> [options]
Options effective for all commands:
--config <file> (-c <file> for short) overrides the built-in config file names.
--quiet (-q for short) keeps indextool quiet: it will not output the banner, etc.
--help (-h for short) lists all of the parameters that can be called in your particular build of indextool.
-v shows version information of your particular build of indextool.
The commands are as follows:
--checkconfig just loads and verifies the config file to check that it is valid, without syntax errors.
--buildidf DICTFILE1 [DICTFILE2 ...] --out IDFILE builds an IDF file from one or several dictionary dumps. The additional parameter --skip-uniq will skip unique (df=1) words.
--build-infixes INDEXNAME builds infixes for an existing dict=keywords index (upgrades .sph, .spi in place). You can use this option for legacy index files that already use dict=keywords but now need to support infix searching too; updating the index files with indextool may prove easier or faster than regenerating them from scratch with indexer.
--dumpheader FILENAME.sph quickly dumps the provided index header file without touching any other index files or even the configuration file. The report provides a breakdown of all the index settings, in particular the entire attribute and field list.
--dumpconfig FILENAME.sph dumps the index definition from the given index header file in (almost) compliant sphinx.conf format.
--dumpheader INDEXNAME dumps the index header by index name, looking up the header path in the configuration file.
--dumpdict INDEXNAME dumps the dictionary. The additional -stats switch will add the total number of documents to the dictionary dump. It is required for dictionary files that are used for creation of IDF files.
--dumpdocids INDEXNAME dumps document IDs by index name.
--dumphitlist INDEXNAME KEYWORD dumps all the hits (occurrences) of a given keyword in a given index, with the keyword specified as text.
--dumphitlist INDEXNAME --wordid ID dumps all the hits (occurrences) of a given keyword in a given index, with the keyword specified as an internal numeric ID.
--fold INDEXNAME OPTFILE is useful to see how the tokenizer actually processes input. You can feed indextool with text from a file, if specified, or from stdin otherwise. The output will contain spaces instead of separators (according to your charset_table settings) and lowercased letters in words.
--htmlstrip INDEXNAME filters stdin using the HTML stripper settings for a given index, and prints the filtering results to stdout. Note that the settings will be taken from sphinx.conf, and not from the index header.
--mergeidf NODE1.idf [NODE2.idf ...] --out GLOBAL.idf merges several .idf files into a single one. The additional parameter --skip-uniq will skip unique (df=1) words.
--morph INDEXNAME applies morphology to the given stdin and prints the result to stdout.
--check INDEXNAME checks the index data files for consistency errors that might be introduced either by bugs in indexer and/or by hardware faults. --check also works on RT indexes, RAM and disk chunks. Additional options:
--check-id-dups checks if there are duplicate ids.
--check-disk-chunk CHUNK_NAME checks only a specific disk chunk of an RT index. The argument is the numeric extension of the RT index disk chunk to check.
--strip-path strips the path names from all the file names referenced from the index (stopwords, wordforms, exceptions, etc). This is useful for checking indexes built on another machine with possibly different path layouts.
--rotate works only with --check and defines whether to check the index waiting for rotation, i.e. with the .new extension. This is useful when you want to check your index before actually using it.
--apply-killlists loads and applies kill-lists for all indexes listed in the config file. Changes are saved in .SPM files. Kill-list files (.SPK) are deleted. This can be useful if you want to move the application of kill-lists from server startup to the indexing stage.
spelldump is used to extract the contents of a dictionary file that uses MySpell format, which can help build word lists for wordforms: all of the possible forms are pre-built for you.
The general usage is:
spelldump [options] <dictionary> <affix> [result] [locale-name]
The two main parameters are the dictionary's main file and its affix file; usually these are named [language-prefix].dict and [language-prefix].aff, and they are available with most common Linux distributions as well as in various places online.
[result] specifies where the dictionary data should be output to, and [locale-name] additionally specifies the locale details you wish to use.
There is an additional option, -c [file], which specifies a file for case conversion details.
Examples of its usage are:
spelldump en.dict en.aff
spelldump ru.dict ru.aff ru.txt ru_RU.CP1251
spelldump ru.dict ru.aff ru.txt .1251
The results file will contain a list of all the words in the dictionary in alphabetical order, output in the format of a wordforms file, which you can customize for your specific circumstances. An example of the result file:
zone > zone
zoned > zoned
zoning > zoning
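As one way to customize such a file, the following Python sketch reads "source > target" lines and keeps only the mappings that actually change a word. The function names are hypothetical, and the fourth input line is an invented example added only to show the filtering; it is not spelldump output.

```python
def parse_wordforms(lines):
    """Return (source, target) pairs from 'source > target' lines."""
    pairs = []
    for line in lines:
        source, sep, target = line.partition(">")
        if sep and source.strip() and target.strip():
            pairs.append((source.strip(), target.strip()))
    return pairs


def drop_identity(pairs):
    """Keep only mappings where the source differs from the target."""
    return [(s, t) for s, t in pairs if s != t]


lines = [
    "zone > zone",
    "zoned > zoned",
    "zoning > zoning",
    "zones > zone",  # hypothetical line, added only to illustrate filtering
]
pairs = parse_wordforms(lines)
print(drop_identity(pairs))
```

Blank or malformed lines are skipped silently, which is convenient for hand-edited files but may hide typos.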
wordbreaker is used to split compound words, as is usual in URLs, into their component words. For example, this tool can split "lordoftherings" into its four component words, or http://manofsteel.warnerbros.com into "man of steel warner bros". This helps searching without requiring prefixes or infixes: searching for "sphinx" wouldn't match "sphinxsearch", but if you break the compound word and index the separate components, you'll get a match without the cost of the larger index files that prefixes and infixes require.
Examples of its usage are:
echo manofsteel | bin/wordbreaker -dict dict.txt split
man of steel
The input stream will be separated into words using the -dict dictionary file. If no dictionary is specified, wordbreaker looks in the working folder for a wordbreaker-dict.txt file. (The dictionary should match the language of the compound word.) The split command breaks words from the standard input and writes the result to the standard output. There are also test and bench commands that let you test the splitting quality and benchmark the splitting functionality.
Wordbreaker needs a dictionary to recognize individual substrings within a string. To choose between different guesses, it uses the relative frequency of each word in the dictionary: a higher frequency means a higher split probability. You can generate such a file using indexer:
indexer --buildstops dict.txt 100000 --buildfreqs myindex -c /path/to/sphinx.conf
which will write the 100,000 most frequent words, along with their counts, from myindex into dict.txt. The output file is a text file, so you can edit it by hand, if need be, to add or remove words.
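To illustrate how word frequencies can drive the choice between competing splits, here is a minimal Python sketch of frequency-based splitting using dynamic programming over a unigram model. It is not wordbreaker's actual algorithm. The "word count" line format and the counts in the sample dictionary are assumptions made for the example.

```python
import math


def load_dict(lines):
    """Parse 'word count' lines (assumed format) into a {word: count} mapping."""
    freqs = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            freqs[parts[0]] = int(parts[1])
    return freqs


def split_compound(text, freqs):
    """Return the most probable split of text into dictionary words, or None."""
    if not freqs:
        return None
    total = sum(freqs.values())
    max_len = max(map(len, freqs))
    # best[i] holds (log-probability, word list) for the best split of text[:i].
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if best[start] is None or word not in freqs:
                continue
            score = best[start][0] + math.log(freqs[word] / total)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1] if best[-1] else None


# Hypothetical dictionary contents; the counts are invented for illustration.
freqs = load_dict(["man 500", "of 9000", "steel 120", "st 10", "eel 30"])
print(" ".join(split_compound("manofsteel", freqs)))
```

Because each word contributes a negative log-probability, a split into fewer and more frequent words scores higher, so "steel" is preferred over "st" plus "eel" here.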
The Manticore Search API is documented using the OpenAPI specification, which can be used to generate client SDKs. A machine-readable YAML file is available at https://raw.githubusercontent.com/manticoresoftware/openapi/master/manticore.yml
You can also view the specification in the online Swagger Editor.