
Elasticsearch Analyzers

Notes on Elasticsearch analyzers, covering the default analyzer and custom analyzers. The content comes from 中华石杉's Elasticsearch Top Expert series course (Core Knowledge part) on Bilibili; the English content comes from Elasticsearch: The Definitive Guide [2.x].

The Default Analyzer

The standard analyzer, which is the default analyzer used for full-text fields, is a good choice for most Western languages. It consists of the following:

  • standard tokenizer: splits the input text on word boundaries

  • standard token filter: intended to tidy up the tokens emitted by the tokenizer (but currently does nothing)

  • lowercase token filter: converts all tokens into lowercase

  • stop token filter (disabled by default): removes stopwords - common words that have little impact on search relevance, such as a, the, and, is

By default, the stopwords filter is disabled.
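
For example (output abbreviated, in the same style as the examples below), running a sentence through the default analyzer with the _analyze API shows the word-boundary splitting and lowercasing, with the stopword the kept because the stop filter is disabled:

GET /_analyze
{
  "analyzer": "standard",
  "text": "The QUICK brown fox!"
}

{
  "tokens" : [
    { "token" : "the",   "position" : 0 },
    { "token" : "quick", "position" : 1 },
    { "token" : "brown", "position" : 2 },
    { "token" : "fox",   "position" : 3 }
  ]
}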

Modifying Analyzer Settings

Here we create a new analyzer called es_std, based on the standard analyzer but with the Spanish stopwords token filter enabled:

PUT /spanish_docs
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_spanish_"
        }
      }
    }
  }
}

The es_std analyzer is not global - it exists only in the spanish_docs index where we have defined it.

The abbreviated results show that the Spanish stopword El has been removed correctly:

GET /spanish_docs/_analyze
{
  "analyzer": "es_std",
  "text": "El veloz zorro marrón"
}

{
  "tokens" : [
    { "token" : "veloz",  "position" : 1 },
    { "token" : "zorro",  "position" : 2 },
    { "token" : "marrón", "position" : 3 }
  ]
}
Custom Analyzers

  • Character filters: An analyzer may have zero or more character filters, which tidy up a string before it is tokenized. For example, the html_strip character filter can be used to remove all HTML tags and to convert HTML entities like &Aacute; into the corresponding Unicode character Á.

  • Tokenizers: An analyzer must have a single tokenizer, which breaks up the string into individual terms or tokens. For example:

      • standard tokenizer: splits on word boundaries

      • keyword tokenizer: outputs exactly the same string as it received, without any tokenization

      • whitespace tokenizer: splits on whitespace

      • pattern tokenizer: splits text on a matching regular expression

  • Token filters: After tokenization, the resulting token stream is passed through any specified token filters, in the order in which they are specified. Token filters may change, add, or remove tokens; they can also be tried out individually, as shown after this list. For example:

      • lowercase token filter

      • stop token filter

      • stemming token filters: "stem" words to their root form

      • ascii_folding filter: removes diacritics, converting a term like "très" into "tres"

      • ngram and edge_ngram token filters: produce tokens suitable for partial matching or autocomplete
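
Individual tokenizers and token filters can be tried out with the _analyze API before being combined into an analyzer. As a minimal sketch (output abbreviated), the keyword tokenizer plus the lowercase token filter emits the whole input as a single lowercased token:

GET /_analyze
{
  "tokenizer": "keyword",
  "filter": ["lowercase"],
  "text": "Quick Brown Fox"
}

{
  "tokens" : [
    { "token" : "quick brown fox", "position" : 0 }
  ]
}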

Custom Analyzer Syntax

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": { ... custom character filters ...},
      "tokenizer": { ... custom tokenizers ...},
      "filter": { ... custom token filters ...},
      "analyzer": { ... custom analyzers ... }
    }
  }
}

Example:

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": [ "&=> and "]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "text": "The quick & brown fox",
  "analyzer": "my_analyzer"
}

{
  "tokens" : [
    { "token" : "quick", "position" : 1 },
    { "token" : "and",   "position" : 2 },
    { "token" : "brown", "position" : 3 },
    { "token" : "fox",   "position" : 4 }
  ]
}

The analyzer is not much use unless we tell Elasticsearch where to use it. We can apply it to a field in the mapping:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type": "string",
      "analyzer": "my_analyzer"
    }
  }
}

Because of Elasticsearch version upgrades, this request now fails. Two changes are needed:

  • Types cannot be provided in put mapping requests, unless the include_type_name parameter is set to true - drop the my_type segment from the URL.

  • No handler for type [string] declared on field [title] - the string field type no longer exists; use text instead.

PUT /my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
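
To verify that the title field now goes through my_analyzer, the field form of the _analyze API can be used as a quick sanity check; it should produce the same quick, and, brown, fox tokens as the explicit "analyzer": "my_analyzer" request above:

GET /my_index/_analyze
{
  "field": "title",
  "text": "The quick & brown fox"
}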