# Elasticsearch Analyzer

The analyzer is what makes Elasticsearch work as a search engine. Before a document is added to an index, the analyzer breaks it down (character filtering and tokenization) and builds an inverted index, which Elasticsearch then uses to search by term.

# Analyzer Basics

In Elasticsearch, the basic pipeline of an analyzer is Character Filter > Tokenizer > Token Filter; a combined example follows the list below.

  • Character Filter: filtering; the chosen character filters strip out unwanted characters, e.g. HTML tags or punctuation
  • Tokenizer: tokenization; the one stage every analyzer must have, responsible for the actual word splitting
  • Token Filter: processes each token produced by the tokenizer, e.g. lowercase conversion
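
To see all three stages in one place, the _analyze API also accepts the components explicitly. A minimal sketch, using only built-in components (the html_strip character filter, the standard tokenizer, and the lowercase token filter):

GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>The QUICK Brown Foxes!</b>"
}

The HTML tags are stripped first, the remaining text is split into The / QUICK / Brown / Foxes, and the lowercase filter then yields [ the, quick, brown, foxes ].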

# Analyzer Usage

Let's use a simple example to see how Elasticsearch tokenizes text. First, insert The Old Man and the Sea into an index:

POST book/_doc/1
{
    "title":"The Old Man and the Sea"
}

Then use the following API to check the tokenization result:

GET <index>/_analyze
{
  "field": "欄位名",
  "text": "內容"
}
GET book/_analyze
{
  "field": "title",
  "text": "The Old Man and the Sea"
}

The default analyzer in Elasticsearch is the standard analyzer ("analyzer": "standard"), which splits the text into individual terms. The result is as follows:

{
    "tokens" : [
        {
            "token" : "the",
            "start_offset" : 0,
            "end_offset" : 3,
            "type" : "<ALPHANUM>",
            "position" : 0
        },
        {
            "token" : "old",
            "start_offset" : 4,
            "end_offset" : 7,
            "type" : "<ALPHANUM>",
            "position" : 1
        },
        {
            "token" : "man",
            "start_offset" : 8,
            "end_offset" : 11,
            "type" : "<ALPHANUM>",
            "position" : 2
        },
        {
            "token" : "and",
            "start_offset" : 12,
            "end_offset" : 15,
            "type" : "<ALPHANUM>",
            "position" : 3
        },
        {
            "token" : "the",
            "start_offset" : 16,
            "end_offset" : 19,
            "type" : "<ALPHANUM>",
            "position" : 4
        },
        {
            "token" : "sea",
            "start_offset" : 20,
            "end_offset" : 23,
            "type" : "<ALPHANUM>",
            "position" : 5
        }
    ]
}

You can also specify an analyzer directly for testing:

GET _analyze
{
    "analyzer": "分析器種類",
    "text" : "內容"
}
GET _analyze
{
    "analyzer": "standard",
    "text" : "The Old Man and the Sea"
}

# Built-in Analyzers

Elasticsearch ships with the following built-in analyzers:

  • Standard Analyzer
  • Simple Analyzer
  • Whitespace Analyzer
  • Stop Analyzer
  • Keyword Analyzer
  • Pattern Analyzer
  • Language Analyzers
  • Fingerprint Analyzer

You can also define your own Custom analyzer; see the official documentation.
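
As a preview, here is a minimal sketch of a custom analyzer that wires the three stages together in the index settings; my-custom-index and my_analyzer are hypothetical names, and every component used is built in:

PUT /my-custom-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}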

Below is a summary of each built-in analyzer's example output, all based on the official sample sentence "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."; putting them side by side makes the differences easier to compare:

# Standard Analyzer

This is the default analyzer used when none is specified. It splits text on word boundaries (as defined by Unicode text segmentation) and lowercases the tokens.

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]
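
The standard analyzer also takes a few parameters, such as max_token_length and stopwords. A hedged sketch with the hypothetical names my-standard-index and my_standard:

PUT /my-standard-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}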

# Simple Analyzer

The simple analyzer breaks text into tokens at any non-letter character, such as numbers, spaces, hyphens and apostrophes, discards non-letter characters, and changes uppercase to lowercase.

In short: split at numbers, whitespace, and symbols, drop the non-letter characters, and lowercase everything.

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

# Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Simple and blunt: it breaks on whitespace only, and does not lowercase.

POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

# Stop Analyzer

The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It defaults to using the english stop words.

Identical to the simple analyzer, but with stop word removal added; by default the stop analyzer uses the _english_ stop word list.

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

In the output, the is filtered out as a stop word (2 disappears because, as with the simple analyzer, non-letter characters are discarded). The official documentation has a dedicated chapter on stop words.

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]
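
The stop word list itself is configurable. A minimal sketch with the hypothetical names my-stop-index and my_stop_analyzer (stopwords is a documented option of the stop analyzer):

PUT /my-stop-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}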

# Keyword Analyzer

The keyword analyzer is a “noop” analyzer which returns the entire input string as a single token.

In one sentence: it does not tokenize at all.

POST _analyze
{
    "analyzer": "keyword",
    "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]

# Pattern Analyzer

The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators not the tokens themselves. The regular expression defaults to \W+ (or all non-word characters).

It tokenizes according to a regular expression that matches the separators; the default is \W+. A rather exotic analyzer; if I ever get to use it I will write it up separately.

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
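
The expression is configurable per analyzer. A hedged sketch that tokenizes comma-separated values, with the hypothetical names my-pattern-index and my_csv_analyzer (pattern and lowercase are documented options):

PUT /my-pattern-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_csv_analyzer": {
          "type": "pattern",
          "pattern": ",\\s*",
          "lowercase": true
        }
      }
    }
  }
}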

# Language Analyzers

A set of analyzers aimed at analyzing specific language text.

Analyzers for specific languages; the supported languages are listed in the official documentation. Note that Chinese is not among them; Chinese is a special case in Elasticsearch and deserves its own section.
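
To get a feel for what a language analyzer does, run english on the same sample sentence; it removes English stop words and stems each token (this example and its output come from the official documentation):

GET _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Output:

[ 2, quick, brown, fox, jump, over, lazi, dog, bone ]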

# Index Mapping & Analyzer

You can map the analyzer you want for each field when creating an index, but note that once a field is mapped, its analyzer cannot be changed; the only way around this is to reindex.
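
For reference, a minimal reindex sketch: create a new index with the desired mapping first, then copy the documents over (book-v2 is a hypothetical target index):

POST _reindex
{
  "source": { "index": "book" },
  "dest": { "index": "book-v2" }
}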

# Index Mapping Basics

As an experiment, add a field title3 of type text with the simple analyzer to the book index:

PUT /book/_mapping/
{
    "properties": {
        "title3": {
            "type": "text",
            "analyzer": "simple"
        }
    }
}

Query the index mapping settings:

GET /book/_mapping/

You can see that title3's analyzer is simple. The first field, title, was created by default (dynamic mapping); note that its entry under properties contains a fields block, which is a Multi-field. The next section explains Multi-fields.

{
    "book" : {
        "mappings" : {
            "properties" : {
                "title" : {
                    "type" : "text",
                    "fields" : {
                        "keyword" : {
                            "type" : "keyword",
                            "ignore_above" : 256
                        }
                    }
                },
                "title2" : {
                    "type" : "text",
                    "analyzer" : "standard"
                },
                "title3" : {
                    "type" : "text",
                    "analyzer" : "simple"
                }
            }
        }
    }
}
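
You can verify the mapped analyzer with the field-based _analyze call shown earlier; analyzing against title3 should go through simple, so stop words like the are kept:

GET book/_analyze
{
  "field": "title3",
  "text": "The Old Man and the Sea"
}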

# Multi-fields

With Multi-fields we can index the same field with different analyzers, which improves search accuracy.

  • First, create an index whose main analyzer is simple, while the sub-field under fields uses english:
PUT /book-multi-fields
{
  "mappings":{
    "properties":{
      "title":{
        "type":"text",
        "analyzer":"simple",
        "fields":{
          "english":{
            "type":"text",
            "analyzer":"english"
          }
        }
      }
    }
  }
}

Insert one document, "title":"The Old Man and the Sea", into the book-multi-fields index:

POST /book-multi-fields/_doc/1
{
  "title":"The Old Man and the Sea"
}

Then try the following three searches:

GET /book-multi-fields/_search
{
  "query": {
    "match": {
      "title": "the"
    }
  }
}
GET /book-multi-fields/_search
{
  "query": {
    "match": {
      "title.english": "the"
    }
  }
}
GET /book-multi-fields/_search
{
  "query": {
    "match": {
      "title.english": "sea"
    }
  }
}

The first query hits, the second misses, and the third hits. This is because when the condition is on title, the simple analyzer keeps the; when it is on title.english, the english analyzer filters the out as a stop word.
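
In practice the two variants are often queried together, so that a match on either form counts. A hedged sketch using multi_match with most_fields, one common way to combine multi-field scores:

GET /book-multi-fields/_search
{
  "query": {
    "multi_match": {
      "query": "old sea",
      "fields": ["title", "title.english"],
      "type": "most_fields"
    }
  }
}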

# Custom Analyzer

To be added.