
What Is an Elasticsearch Analyzer? A Hands-On Walkthrough


Analysis is the process Elasticsearch applies to text: both when documents are written (indexed) and when queries are run, the text is broken down into tokens.
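
Because analysis runs at both index time and search time, an analyzer is usually attached to a text field in the mapping. The following is only a minimal sketch, assuming a made-up index my_blog and a made-up field content; search_analyzer is optional and defaults to the indexing analyzer:

PUT my_blog
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "standard"
      }
    }
  }
}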

Analysis is carried out by an Analyzer, which works in three stages (a sketch combining all three follows the list):

  • Character filter: pre-processes the raw text, for example stripping HTML markup
  • Tokenizer: splits the text into tokens according to a set of rules
  • Token filter: transforms the tokens, for example removing stop words or lowercasing
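
To see how the three stages fit together, here is a minimal sketch of a custom analyzer, assuming a made-up index my_custom_index and analyzer name my_html_analyzer: an html_strip character filter, the standard tokenizer, and the lowercase and stop token filters:

PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop" ]
        }
      }
    }
  }
}

It can be tested with the same _analyze API used throughout this post:

GET my_custom_index/_analyze
{
  "analyzer": "my_html_analyzer",
  "text": "hello for 2 <b>in your</b> why-not?"
}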

Elasticsearch ships with the following built-in analyzers:

  • standard analyzer: the default; splits text on word boundaries and lowercases
  • simple analyzer: splits on any non-letter character and lowercases
  • stop analyzer: like simple, plus removal of English stop words (the, a, this, ...)
  • whitespace analyzer: splits on whitespace only; does not lowercase and does not strip HTML
  • keyword analyzer: no tokenization; the whole input is kept as a single token
  • pattern analyzer: splits on a regular expression, \W+ (non-word characters) by default, and lowercases

Now that we know how analysis works, let's run each of these analyzers against the same sample text:

standard analyzer

The default analyzer:

GET _analyze
{
  "analyzer": "standard",
  "text":"hello for 2 <b>in your</b> why-not?"
}

In the result, every term is split out as its own token; note that the <b> tags are broken apart rather than stripped, and the number 2 is kept:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "for",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "2",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<NUM>",
      "position" : 2
    },
    {
      "token" : "b",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "in",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "your",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "b",
      "start_offset" : 24,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "why",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "not",
      "start_offset" : 31,
      "end_offset" : 34,
      "type" : "<ALPHANUM>",
      "position" : 8
    }
  ]
}

simple analyzer

Splits on any non-letter character and lowercases:

GET _analyze
{
  "analyzer": "standard",
  "text":"hello for 2 <b>in your</b> why-not?"
}

In the result the number 2 is gone, because the simple analyzer treats anything that is not a letter as a delimiter:

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "for",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "b",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "in",
      "start_offset" : 15,
      "end_offset" : 17,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "your",
      "start_offset" : 18,
      "end_offset" : 22,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "b",
      "start_offset" : 24,
      "end_offset" : 25,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "why",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "not",
      "start_offset" : 31,
      "end_offset" : 34,
      "type" : "word",
      "position" : 7
    }
  ]
}

For the remaining analyzers, only the request is listed; try them yourself (a rough sketch of the expected tokens follows each request):

stop analyzer

Lowercases, splits on non-letter characters, and removes English stop words (the, a, this, ...):

GET _analyze
{
  "analyzer": "stop",
  "text":"hello for 2 <b>in your</b> why-not?"
}
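
As a rough expectation rather than an exact response: the stop analyzer behaves like the simple analyzer but, with the default English stop-word list, should also drop for, in, and not, leaving tokens along these lines (offsets and positions omitted):

{
  "tokens" : [
    { "token" : "hello" },
    { "token" : "b" },
    { "token" : "your" },
    { "token" : "b" },
    { "token" : "why" }
  ]
}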

whitespace analyzer

Splits on whitespace only; it does not lowercase and does not strip HTML:

GET _analyze
{
  "analyzer": "whitespace",
  "text":"hello for 2 <b>in your</b> why-not?"
}
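
A rough expectation rather than an exact response: only whitespace acts as a delimiter, so the HTML tags and punctuation stay attached to their neighbouring words, giving tokens along these lines (offsets and positions omitted):

{
  "tokens" : [
    { "token" : "hello" },
    { "token" : "for" },
    { "token" : "2" },
    { "token" : "<b>in" },
    { "token" : "your</b>" },
    { "token" : "why-not?" }
  ]
}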

keyword analyzer

No tokenization at all; the entire input is emitted as a single token:

GET _analyze
{
  "analyzer": "keyword",
  "text":"hello for 2 <b>in your</b> why-not?"
}

Output:

{
  "tokens" : [
    {
      "token" : "hello for 2 <b>in your</b> why-not?",
      "start_offset" : 0,
      "end_offset" : 35,
      "type" : "word",
      "position" : 0
    }
  ]
}

pattern analyzer

Splits on a regular expression, \W+ (any run of non-word characters) by default, and lowercases:

GET _analyze
{
  "analyzer": "pattern",
  "text":"hello for 2 <b>in your</b> why-not?"
}
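
A rough expectation rather than an exact response: with the default \W+ pattern every run of non-word characters is a delimiter, so for this particular text the tokens should come out the same as with the standard analyzer (offsets and positions omitted):

{
  "tokens" : [
    { "token" : "hello" },
    { "token" : "for" },
    { "token" : "2" },
    { "token" : "b" },
    { "token" : "in" },
    { "token" : "your" },
    { "token" : "b" },
    { "token" : "why" },
    { "token" : "not" }
  ]
}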

