Builtin Tools for fulltext search in Crate

Overview

Analyzers are used for creating fulltext-indexes. They take the content of a field and split it into a stream of tokens, which are what is actually searched. Analyzers filter, reorder, and/or transform the content of a field before it becomes the final stream of tokens.

An analyzer consists of one tokenizer, zero or more token-filters, and zero or more char-filters.

When field content is analyzed to become a stream of tokens, the char-filters are applied first. They are used to filter special characters from the stream of characters that makes up the content.

The tokenizer then splits the (possibly filtered) stream of characters into tokens.

Token-filters can add tokens, delete tokens or transform them.

With these elements in place, analyzers provide fine-grained control over building the token stream used for fulltext search. For example, you can use language-specific analyzers, tokenizers, and token-filters to get proper search results for data provided in a certain language.
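
As a sketch of how these parts fit together, a custom analyzer combining a char-filter, a tokenizer, and token-filters could be declared like this (the analyzer name and the chosen elements are illustrative; see CREATE ANALYZER for the full syntax):

-- strip HTML, split on whitespace, then lowercase and stem the tokens
CREATE ANALYZER myanalyzer (
    TOKENIZER whitespace,
    TOKEN_FILTERS (
        lowercase,
        kstem
    ),
    CHAR_FILTERS (
        html_strip
    )
);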

The builtin analyzers, tokenizers, token-filters, and char-filters are listed below. They can be used as-is or extended.

See also

Indices and Fulltext Search for examples showing how to create tables which make use of analyzers.

Create custom analyzer for an example showing how to create a custom analyzer.

CREATE ANALYZER for the syntax reference.

Builtin Analyzer

standard

type='standard'

An analyzer of type standard is built using the standard Tokenizer with the standard Token Filter, lowercase Token Filter, and stop Token Filter.

It lowercases all tokens, uses no stopwords, and excludes tokens longer than 255 characters. This analyzer uses Unicode text segmentation, which is defined by UAX#29.

For example, the standard analyzer converts the sentence

The quick brown fox jumps Over the lAzY DOG.

into the following tokens

quick, brown, fox, jumps, lazy, dog

Parameters

stopwords
A list of stopwords to initialize the stop filter with. Defaults to the English stop words.
max_token_length
The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255.
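
As a sketch, a custom analyzer could extend standard and override these parameters (the analyzer name and values are illustrative):

-- extend the builtin standard analyzer with custom parameters
CREATE ANALYZER short_tokens EXTENDS standard WITH (
    stopwords = ['the', 'over'],
    max_token_length = 100
);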

default

type='default'

This is the same as the standard analyzer.

simple

type='simple'

Uses the lowercase tokenizer.

whitespace

type='whitespace'

Uses the whitespace tokenizer.

stop

type='stop'

Uses the lowercase tokenizer with the stop token filter.

Parameters

stopwords
A list of stopwords to initialize the stop token filter with. Defaults to the English stop words.
stopwords_path
A path (either relative to config location, or absolute) to a stopwords file configuration.

keyword

type='keyword'

Creates a single token from the field contents.

pattern

type='pattern'

An analyzer of type pattern that flexibly separates text into terms via a regular expression.

Parameters

lowercase
Whether terms should be lowercased. Defaults to true.
pattern
The regular expression pattern, defaults to \W+.
flags
The regular expression flags.

Note

The regular expression should match the token separators, not the tokens themselves.

Flags should be pipe-separated, e.g. CASE_INSENSITIVE|COMMENTS. Check the Java Pattern API for more details about flag options.
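
A sketch of a pattern analyzer that splits comma-separated values (the analyzer name and pattern are only an example):

-- the pattern matches the separators (commas), not the tokens
CREATE ANALYZER csv_terms EXTENDS pattern WITH (
    pattern = ',',
    lowercase = true
);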

language

type='<language-name>'

The following types are supported:

arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.

Parameters

stopwords
A list of stopwords to initialize the stop filter with. Defaults to the English stop words.
stopwords_path
A path (either relative to config location, or absolute) to a stopwords file configuration.

The following analyzers support setting a custom stem_exclusion list:

arabic, armenian, basque, brazilian, bulgarian, catalan, czech, danish, dutch, english, finnish, french, galician, german, hindi, hungarian, indonesian, italian, latvian, lithuanian, norwegian, portuguese, romanian, russian, spanish, swedish, turkish.
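
A hypothetical language analyzer with a stem_exclusion list might look like this (the analyzer name and word list are illustrative):

-- keep the listed words from being stemmed by the german analyzer
CREATE ANALYZER german_catalog EXTENDS german WITH (
    stem_exclusion = ['angebot', 'angebote']
);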

snowball

type='snowball'

Uses the standard Tokenizer, with standard filter, lowercase filter, stop filter, and snowball filter.

Parameters

stopwords
A list of stopwords to initialize the stop filter with. Defaults to the English stop words.
language
See the language parameter of the snowball token filter.
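
For example, a German snowball analyzer could be created like this (the analyzer name is illustrative):

CREATE ANALYZER german_snowball EXTENDS snowball WITH (
    language = 'german'
);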

Builtin Tokenizer

standard

type='standard'

A tokenizer of type standard, providing a grammar-based tokenizer that works well for most European-language documents. The tokenizer implements the Unicode text segmentation algorithm, as specified in Unicode Standard Annex #29.

Parameters

max_token_length
The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255.

edge ngram

type='edge_ngram'

This tokenizer is very similar to ngram but only keeps n-grams which start at the beginning of a token.

Parameters

min_gram
Minimum size in codepoints of a single n-gram. Defaults to 1.
max_gram
Maximum size in codepoints of a single n-gram. Defaults to 2.
token_chars

Character classes to keep in the tokens; the tokenizer will split on characters that don't belong to any of these classes. Defaults to [] (keep all characters).

Classes: letter, digit, whitespace, punctuation, symbol
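
A sketch of an edge_ngram tokenizer wired into a custom analyzer, e.g. for prefix-style matching (analyzer name, tokenizer name, and parameter values are illustrative):

CREATE ANALYZER autocomplete (
    -- emit 2 to 10 character prefixes, splitting on anything that is
    -- neither a letter nor a digit
    TOKENIZER prefix_tok WITH (
        type = 'edge_ngram',
        min_gram = 2,
        max_gram = 10,
        token_chars = ['letter', 'digit']
    ),
    TOKEN_FILTERS (
        lowercase
    )
);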

keyword

type='keyword'

Emits the entire input as a single token.

Parameters

buffer_size
The term buffer size. Defaults to 256.

letter

type='letter'

Divides text at non-letters.

lowercase

type='lowercase'

Performs the function of the letter tokenizer and lowercasing together. It divides text at non-letters and converts the tokens to lower case.

ngram

type='ngram'

Tokenizes the input into n-grams of the configured sizes.

Parameters

min_gram
Minimum size in codepoints of a single n-gram. Defaults to 1.
max_gram
Maximum size in codepoints of a single n-gram. Defaults to 2.
token_chars

Character classes to keep in the tokens; the tokenizer will split on characters that don't belong to any of these classes. Defaults to [] (keep all characters).

Classes: letter, digit, whitespace, punctuation, symbol

whitespace

type='whitespace'

Divides text at whitespace.

pattern

type='pattern'

Separates text into terms via a regular expression.

Parameters

pattern
The regular expression pattern, defaults to \W+.
flags
The regular expression flags.
group
Which group to extract into tokens. Defaults to -1 (split).

Note

The regular expression should match the token separators, not the tokens themselves.

Flags should be pipe-separated, e.g. CASE_INSENSITIVE|COMMENTS. Check the Java Pattern API for more details about flag options.

thai

type='thai'

Splits Thai text correctly and treats all other languages like the standard tokenizer does.

uax email url

type='uax_url_email'

Exactly like the standard tokenizer, but treats email addresses and URLs as single tokens.

Parameters

max_token_length
The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255.

path hierarchy

type='path_hierarchy'

Takes something like this:

/something/something/else

And produces tokens:

/something
/something/something
/something/something/else

Parameters

delimiter
The character delimiter to use, defaults to /.
replacement
An optional replacement character to use. Defaults to the delimiter.
buffer_size
The buffer size to use, defaults to 1024.
reverse
Generates tokens in reverse order, defaults to false.
skip
Controls initial tokens to skip, defaults to 0.
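
A minimal sketch using this tokenizer (analyzer and tokenizer names are illustrative):

CREATE ANALYZER file_paths (
    TOKENIZER paths WITH (
        type = 'path_hierarchy',
        delimiter = '/'
    )
);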

Builtin Token Filter

standard

type='standard'

Normalizes tokens extracted with the standard Tokenizer.

apostrophe

type='apostrophe'

Strips all characters after an apostrophe, and the apostrophe itself.

ascii folding

type='asciifolding'

Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the “Basic Latin” Unicode block) into their ASCII equivalents, if one exists.

length

type='length'

Removes words that are too long or too short for the stream.

Parameters

min
The minimum token length. Defaults to 0.
max
The maximum token length. Defaults to Integer.MAX_VALUE.

lowercase

type='lowercase'

Normalizes token text to lower case.

Parameters

language
For options, see language Analyzer.

ngram

type='ngram'

Parameters

min_gram
Defaults to 1.
max_gram
Defaults to 2.

edge ngram

type='edge_ngram'

Parameters

min_gram
Defaults to 1.
max_gram
Defaults to 2.
side
Either front or back. Defaults to front.

porter stem

type='porter_stem'

Transforms the token stream as per the Porter stemming algorithm.

Note

The input to the stemming filter must already be in lower case, so you will need to use the lowercase token filter or the lowercase tokenizer earlier in the analyzer chain in order for this to work properly. For example, when using a custom analyzer, make sure the lowercase filter comes before the porter_stem filter in the list of filters.

shingle

type='shingle'

Constructs shingles (token n-grams), combinations of tokens as a single token, from a token stream.

Parameters

max_shingle_size
The maximum shingle size. Defaults to 2.
min_shingle_size
The minimum shingle size. Defaults to 2.
output_unigrams
If true the output will contain the input tokens (unigrams) as well as the shingles. Defaults to true.
output_unigrams_if_no_shingles
If output_unigrams is false the output will contain the input tokens (unigrams) if no shingles are available. Note if output_unigrams is set to true this setting has no effect. Defaults to false.
token_separator
The string to use when joining adjacent tokens to form a shingle. Defaults to " " (a single space).

stop

type='stop'

Removes stop words from token streams.

Parameters

stopwords
A list of stop words to use. Defaults to the English stop words.
stopwords_path
A path (either relative to config location, or absolute) to a stopwords file configuration. Each stop word should be in its own “line” (separated by a line break). The file must be UTF-8 encoded.
ignore_case
Set to true to lower case all words first. Defaults to false.
remove_trailing
Set to false in order to not ignore the last term of a search if it is a stop word. Defaults to true.
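
A sketch of a custom stop filter inside an analyzer (the names and the stopword list are illustrative):

CREATE ANALYZER minimal_english (
    TOKENIZER standard,
    TOKEN_FILTERS (
        lowercase,
        my_stop WITH (
            type = 'stop',
            stopwords = ['the', 'a', 'an'],
            remove_trailing = false
        )
    )
);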

word delimiter

type='word_delimiter'

Splits words into subwords and performs optional transformations on subword groups.

Parameters

generate_word_parts
If true causes parts of words to be generated: “PowerShot” ⇒ “Power” “Shot”. Defaults to true.
generate_number_parts
If true causes number subwords to be generated: “500-42” ⇒ “500” “42”. Defaults to true.
catenate_words
If true causes maximum runs of word parts to be catenated: “wi-fi” ⇒ “wifi”. Defaults to false.
catenate_numbers
If true causes maximum runs of number parts to be catenated: “500-42” ⇒ “50042”. Defaults to false.
catenate_all
If true causes all subword parts to be catenated: “wi-fi-4000” ⇒ “wifi4000”. Defaults to false.
split_on_case_change
If true causes “PowerShot” to be two tokens (“Power-Shot” remains two parts regardless). Defaults to true.
preserve_original
If true includes original words in subwords: “500-42” ⇒ “500-42” “500” “42”. Defaults to false.
split_on_numerics
If true causes “j2se” to be three tokens; “j” “2” “se”. Defaults to true.
stem_english_possessive
If true causes trailing “‘s” to be removed for each subword: “O’Neil’s” ⇒ “O”, “Neil”. Defaults to true.
protected_words
A list of words protected from being delimited.
protected_words_path
A relative or absolute path to a file configured with protected words (one on each line). If relative, automatically resolves to config/ based location if exists.
type_table
A custom type mapping table.
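
A hypothetical configuration for input like “wi-fi-4000” (all names and flag values are illustrative):

CREATE ANALYZER product_codes (
    TOKENIZER whitespace,
    TOKEN_FILTERS (
        -- index the word parts, the catenated form, and the original token
        split_codes WITH (
            type = 'word_delimiter',
            catenate_all = true,
            preserve_original = true
        ),
        lowercase
    )
);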

stemmer

type='stemmer'

A filter that stems words (similar to snowball, but with more options).

Parameters

language/name
arabic, armenian, basque, brazilian, bulgarian, catalan, czech, danish, dutch, english, finnish, french, german, german2, greek, hungarian, italian, kp, kstem, lovins, latvian, norwegian, minimal_norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish, minimal_english, possessive_english, light_finnish, light_french, minimal_french, light_german, minimal_german, hindi, light_hungarian, indonesian, light_italian, light_portuguese, minimal_portuguese, portuguese, light_russian, light_spanish, light_swedish.

keyword marker

type='keyword_marker'

Protects words from being modified by stemmers. Must be placed before any stemming filters.

Parameters

keywords
A list of words to use.
keywords_path
A path (either relative to config location, or absolute) to a list of words.
ignore_case
Set to true to lower case all words first. Defaults to false.
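
A sketch showing the required ordering, with keyword_marker placed before the stemmer (names and word lists are illustrative):

CREATE ANALYZER english_stems (
    TOKENIZER standard,
    TOKEN_FILTERS (
        lowercase,
        -- protect these terms from the stemmer that follows
        protect_terms WITH (
            type = 'keyword_marker',
            keywords = ['crate', 'analyzer']
        ),
        english_stemmer WITH (
            type = 'stemmer',
            language = 'english'
        )
    )
);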

kstem

type='kstem'

A high performance stemming filter for English. All terms must already be lowercased (use the lowercase filter) for this filter to work correctly.

snowball

type='snowball'

A filter that stems words using a Snowball-generated stemmer.

Parameters

language
Possible values: Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, German2, Hungarian, Italian, Kp, Lovins, Norwegian, Porter, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

synonym

type='synonym'

Allows synonyms to be handled easily during the analysis process. Synonyms are configured using a configuration file.

Parameters

synonyms_path
Path to the synonyms configuration file.
ignore_case
Defaults to false.
expand
Defaults to true.
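
A sketch of a synonym filter, assuming a synonyms.txt file exists relative to the config location (names and the path are illustrative):

CREATE ANALYZER with_synonyms (
    TOKENIZER standard,
    TOKEN_FILTERS (
        lowercase,
        my_synonyms WITH (
            type = 'synonym',
            synonyms_path = 'synonyms.txt'
        )
    )
);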

compound word

type='dictionary_decompounder' or type='hyphenation_decompounder'

Decomposes compound words.

Parameters

word_list
A list of words to use.
word_list_path
A path (either relative to config location, or absolute) to a list of words.
min_word_size
Minimum word size (integer). Defaults to 5.
min_subword_size
Minimum subword size (integer). Defaults to 2.
max_subword_size
Maximum subword size (integer). Defaults to 15.
only_longest_match
Whether to only keep the longest matching subword (boolean). Defaults to false.

reverse

type='reverse'

Reverses each token.

elision

type='elision'

Removes elisions.

Parameters

articles
A set of stop word articles, for example ['j', 'l'] for content like J'aime l'odeur.

truncate

type='truncate'

Truncates tokens to a specific length.

Parameters

length
The number of characters to truncate to. Defaults to 10.

unique

type='unique'

Used to index only unique tokens during analysis. By default it is applied to the whole token stream.

Parameters

only_on_same_position
If set to true, it will only remove duplicate tokens on the same position.

pattern capture

type='pattern_capture'

Emits a token for every capture group in the regular expression.

Parameters

preserve_original
If set to true (the default), the original token is also emitted.

pattern replace

type='pattern_replace'

Handles string replacements based on a regular expression.

Parameters

pattern
Regular expression whose matches will be replaced.
replacement
The replacement, can reference the original text with $1-like (the first matched group) references.

trim

type='trim'

Trims the whitespace surrounding a token.

limit token count

type='limit'

Limits the number of tokens that are indexed per document and field.

Parameters

max_token_count
The maximum number of tokens that should be indexed per document and field. The default is 1.
consume_all_tokens
If set to true, the filter exhausts the stream even if max_token_count tokens have already been consumed. The default is false.

hunspell

type='hunspell'

Basic support for Hunspell stemming. Hunspell dictionaries will be picked up from the dedicated directory <path.conf>/hunspell. Each dictionary is expected to have its own directory named after its associated locale (language). This dictionary directory is expected to hold both the *.aff and *.dic files (all of which will automatically be picked up).

Parameters

ignore_case
If true, dictionary matching will be case insensitive (defaults to false)
strict_affix_parsing
Determines whether errors while reading an affix rules file will cause an exception or simply be ignored (defaults to true).
locale
A locale for this filter. If this is unset, the lang or language parameter is used instead, so one of these has to be set.
dictionary
The name of a dictionary contained in <path.conf>/hunspell.
dedup
If only unique terms should be returned, this needs to be set to true. Defaults to true.
recursion_level
Configures the recursion level a stemmer can go into. Defaults to 2. Some languages (for example czech) give better results when set to 1 or 0, so you should test it out.

common grams

type='common_grams'

Generates bigrams for frequently occurring terms. Single terms are still indexed. It can be used as an alternative to the stop token filter when we don't want to completely ignore common terms.

Parameters

common_words
A list of common words to use.
common_words_path
A path (either relative to config location, or absolute) to a list of common words. Each word should be in its own “line” (separated by a line break). The file must be UTF-8 encoded.
ignore_case
If true, common words matching will be case insensitive (defaults to false).
query_mode
Generates bigrams then removes common words and single terms followed by a common word (defaults to false).

Note

Either common_words or common_words_path must be given.

normalization

type='<language>_normalization'

Normalizes special characters of several languages.

Available languages:

  • arabic
  • german
  • hindi
  • indic
  • sorani
  • persian
  • scandinavian

scandinavian folding

type='scandinavian_folding'

Folds Scandinavian characters like ø to o or å to a. Though this might result in different words, it makes it easier to match across the different Scandinavian languages.

delimited payload

type='delimited_payload_filter'

Splits tokens by a delimiter (default |) into the real token, which is indexed, and a payload, which is stored additionally in the index. For example, Trillian|65535 will be indexed as Trillian with 65535 as its payload.

Parameter

encoding
How the payload should be interpreted. Possible values are float for float values, int for integer values, and identity for keeping the payload as a byte array (string).
delimiter
The string used to separate the token and its payload.

keep

type='keep'

Keeps only the tokens defined via the keep_words settings of this filter (see below); all other tokens are filtered out. This filter works like an inverse of the stop token filter.

Parameter

keep_words
A list of words to keep and index as tokens.
keep_words_path
A path (either relative to config location, or absolute) to a list of words to keep and index. Each word should be in its own “line” (separated by a line break). The file must be UTF-8 encoded.

stemmer override

type='stemmer_override'

Overrides any stemmer for keywords that match a custom mapping, defined by rules or rules_path. One of these settings has to be set.

Parameter

rules
A list of rules for overriding, in the form of [<source>=><replacement>] e.g. "foo=>bar"
rules_path
A path to a file with one rule per line, like above.
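
A sketch with inline rules, placed before the actual stemmer so that the mapped terms are protected from further stemming (names and rules are illustrative):

CREATE ANALYZER custom_stems (
    TOKENIZER standard,
    TOKEN_FILTERS (
        lowercase,
        my_overrides WITH (
            type = 'stemmer_override',
            rules = ['running=>run', 'mice=>mouse']
        ),
        porter_stem
    )
);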

cjk bigram

type='cjk_bigram'

Handles Chinese, Japanese, and Korean (CJK) bigrams.

Parameters

output_bigrams
Boolean flag to enable a combined unigram+bigram approach. Default is false, so single CJK characters that do not form a bigram are passed as unigrams. All non-CJK characters are output unmodified.
ignored_scripts
Scripts to ignore. Possible values: han, hiragana, katakana, hangul.

cjk width

type='cjk_width'

A filter that normalizes CJK width differences.

language stem

type='arabic_stem' or
type='brazilian_stem' or
type='czech_stem' or
type='dutch_stem' or
type='french_stem' or
type='german_stem' or
type='russian_stem'

A group of filters that apply language-specific stemmers to the token stream. To prevent terms from being stemmed, put a keyword_marker token filter before this filter in the token filter chain.

decimal_digit

type='decimal_digit'

A token filter that folds Unicode digits to 0-9.

Builtin Char Filter

mapping

type='mapping'

Parameters

mappings
A list of mappings as strings of the form [<source>=><replacement>] e.g. "ph=>f"
mappings_path
A path to a file with one mapping per line, like above.
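
For example, a mapping char filter inside a custom analyzer might look like this (names and mappings are illustrative):

CREATE ANALYZER phonetic_ish (
    TOKENIZER standard,
    TOKEN_FILTERS (
        lowercase
    ),
    CHAR_FILTERS (
        -- rewrite characters before tokenization
        mymapping WITH (
            type = 'mapping',
            mappings = ['ph=>f', 'qu=>q']
        )
    )
);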

html strip

type='html_strip'

Strips out HTML elements from an analyzed text.

pattern replace

type='pattern_replace'

Manipulates the characters in a string before analysis with a regex.

Parameters

pattern
Regex whose matches will be replaced.
replacement
Replacement string; can reference the replaced text with $1-like references ($1 being the first matched group).
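
A minimal sketch that strips digit runs from the text before tokenization (names and the regex are illustrative):

CREATE ANALYZER no_digits (
    TOKENIZER standard,
    CHAR_FILTERS (
        strip_digits WITH (
            type = 'pattern_replace',
            pattern = '[0-9]+',
            replacement = ''
        )
    )
);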