
Elasticsearch Analyzers

Notes on Elasticsearch analyzers, covering the default analyzer and custom analyzers. The content comes from 中华石杉's Elasticsearch Top Expert series course (Core Knowledge part) on Bilibili; the English content comes from Elasticsearch: The Definitive Guide [2.x].

The Default Analyzer

The standard analyzer, which is the default analyzer used for full-text fields, is a good choice for most Western languages. It consists of the following:

  • standard tokenizer: splits the input text on word boundaries

  • standard token filter: intended to tidy up the tokens emitted by the tokenizer (but currently does nothing)

  • lowercase token filter: converts all tokens into lowercase

  • stop token filter (disabled by default): removes stopwords - common words that have little impact on search relevance, such as a, the, and, is

By default, the stopwords filter is disabled.
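
For example (output abbreviated, in the same style as the examples below), running a sentence through the default analyzer with the _analyze API shows the word-boundary splitting and lowercasing, with the stopword the kept because the stop filter is disabled:

GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown fox!"
}

{
  "tokens" : [
    { "token" : "the",   "position" : 0 },
    { "token" : "quick", "position" : 1 },
    { "token" : "brown", "position" : 2 },
    { "token" : "fox",   "position" : 3 }
  ]
}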

Modifying Analyzer Settings

Here we create a new analyzer called es_std, based on the standard analyzer but with the Spanish stopwords token filter enabled:

PUT /spanish_docs
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_spanish_"
        }
      }
    }
  }
}

The es_std analyzer is not global - it exists only in the spanish_docs index where we have defined it.

The abbreviated results show that the Spanish stopword El has been removed correctly:

GET /spanish_docs/_analyze
{
  "analyzer": "es_std",
  "text": "El veloz zorro marrón"
}

{
  "tokens" : [
    { "token" : "veloz",  "position" : 1 },
    { "token" : "zorro",  "position" : 2 },
    { "token" : "marrón", "position" : 3 }
  ]
}
Custom Analyzers

  • Character filters: An analyzer may have zero or more character filters, which tidy up a string before it is tokenized. For example, the html_strip character filter can be used to remove all HTML tags and to convert HTML entities like &Aacute; into the corresponding Unicode character Á.

  • Tokenizers: An analyzer must have a single tokenizer, which breaks up the string into individual terms or tokens. For example:

      • standard tokenizer: splits on word boundaries

      • keyword tokenizer: outputs exactly the same string as it received, without any tokenization

      • whitespace tokenizer: splits on whitespace

      • pattern tokenizer: splits text on a matching regular expression

  • Token filters: After tokenization, the resulting token stream is passed through any specified token filters, in the order in which they are specified. Token filters may change, add, or remove tokens; they can also be tried out individually, as shown after this list. For example:

      • lowercase token filter

      • stop token filter

      • stemming token filters: "stem" words to their root form

      • ascii_folding filter: removes diacritics, converting a term like "très" into "tres"

      • ngram and edge_ngram token filters: produce tokens suitable for partial matching or autocomplete
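
Individual tokenizers and token filters can be tried out with the _analyze API before being combined into an analyzer. As a minimal sketch (output abbreviated), the keyword tokenizer plus the lowercase token filter emits the whole input as a single lowercased token:

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "Quick Brown Fox"
}

{
  "tokens" : [
    { "token" : "quick brown fox", "position" : 0 }
  ]
}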

Custom Analyzer Syntax

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { ... custom character filters ...},
      "tokenizer": { ... custom tokenizers ...},
      "filter": { ... custom token filters ...},
      "analyzer": { ... custom analyzers ... }
    }
  }
}

Example:

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [ "&=> and "]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "text": "The quick & brown fox",
  "analyzer": "my_analyzer"
}

{
  "tokens" : [
    { "token" : "quick", "position" : 1 },
    { "token" : "and",   "position" : 2 },
    { "token" : "brown", "position" : 3 },
    { "token" : "fox",   "position" : 4 }
  ]
}

The analyzer is not much use unless we tell Elasticsearch where to use it. We can apply it to a field in the mapping:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "my_analyzer"
    }
  }
}

Because of Elasticsearch version upgrades, this request now fails. Two changes are needed:

  • Types cannot be provided in put mapping requests, unless the include_type_name parameter is set to true - drop the my_type segment from the URL.

  • No handler for type [string] declared on field [title] - the string field type no longer exists; use text instead.

PUT /my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
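
To verify that the title field now goes through my_analyzer, the field form of the _analyze API can be used as a quick sanity check; it should produce the same quick, and, brown, fox tokens as the explicit "analyzer": "my_analyzer" request above:

GET /my_index/_analyze
{
  "field": "title",
  "text": "The quick & brown fox"
}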