Bootstrap

Elasticsearch Inverted Index

Elasticsearch Inverted Index,内容来自 B 站中华石杉 Elasticsearch 顶尖高手系列课程核心知识篇,英文内容来自 Elasticsearch: The Definitive Guide [2.x]

Inverted Index

two documents, each with a content field containing the following:

first, split the content field of each document into separate words (which we call term, or tokens), create a sorted list of all the unique terms, and then list in which document each term appears. The result looks something like this:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

if we want to search for quick brown, we just need to find the documents in which each term appears:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1

Both documents match, but the first document has more matches than the second.

after normalize the terms into a standard format (lowercased, stemmed, synonyms...)

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
------------------------

You can find only terms that exist in your index, so both only indexed text and the query string must be normalized into the same form.

This process of tokenization and normalization is called analysis.

Making Text Searchable

倒排索引是适合用于进行搜索的

The data structure that best supports the multiple-values-per-field requirement is the inverted index.

The inverted index contains a sorted list of all the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.

Term  | Doc 1 | Doc 2 | Doc 3 | ...
------------------------------------
brown |   X   |       |  X    | ...
fox   |   X   |   X   |  X    | ...
quick |   X   |   X   |       | ...
the   |   X   |       |  X    | ...

historically, an inverted index was used to index whole unstructured text documents. A document in Elasticsearch is a structured JSON document with fields and values. In reality, every indexed field in a JSON document has its own inverted index

倒排索引的结构

word		doc1		doc2
dog		   *		   *
hello		 *
you				       *

the inverted index needs to know about all documents in the collection in order for it to function as intended.

Immutablity

The inverted index that is written to disk is immutable: it doesn't change. Ever.

倒排索引不可变的好处

倒排索引不可变的坏处:每次都要重新构建整个索引

downsides: You can't change it. If you want to make the new documents searchable, you have to rebuild the entired index.