What Is an Elasticsearch Analyzer? A Hands-On Walkthrough
Analysis is the process Elasticsearch applies to text, both when documents are written (indexing) and when queries are run. It is carried out by an analyzer, which performs three steps (a sketch combining all three follows this list):
- Character filter: pre-processes the raw text, for example stripping HTML markup
- Tokenizer: splits the text into individual tokens according to a set of rules
- Token filter: modifies the tokens, for example removing stop words or lowercasing
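The three steps can be chained into a custom analyzer. Here is a minimal sketch, assuming a throwaway index (the names my_analyzer_demo and my_analyzer are made up for illustration), that strips HTML with a character filter, splits with the standard tokenizer, then lowercases and removes stop words with token filters:
PUT my_analyzer_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],   // character filter: strip HTML tags
          "tokenizer": "standard",         // tokenizer: split on word boundaries
          "filter": ["lowercase", "stop"]  // token filters: lowercase, drop stop words
        }
      }
    }
  }
}
You can then test it with GET my_analyzer_demo/_analyze, passing "analyzer": "my_analyzer" and a "text" value, just like the built-in examples below.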
Elasticsearch ships with the following built-in analyzers:
Analyzer | Behavior |
---|---|
standard analyzer | The default analyzer. Splits on word boundaries and lowercases |
simple analyzer | Splits on non-letter characters and lowercases |
stop analyzer | Like simple, plus stop word filtering (the, a, this, ...) |
whitespace analyzer | Splits on whitespace only; does not lowercase and does not strip HTML |
keyword analyzer | No tokenization; the input is emitted unchanged as a single token |
pattern analyzer | Splits on a regular expression, \W+ (non-word characters) by default, and lowercases |
Now that we know how analysis works, let's run each of these analyzers through an example:
standard analyzer
The default analyzer:
GET _analyze
{
"analyzer": "standard",
"text":"hello for 2 <b>in your</b> why-not?"
}
In the result, every part of the string has been split into tokens:
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "for",
"start_offset" : 6,
"end_offset" : 9,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "2",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<NUM>",
"position" : 2
},
{
"token" : "b",
"start_offset" : 13,
"end_offset" : 14,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "in",
"start_offset" : 15,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "your",
"start_offset" : 18,
"end_offset" : 22,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "b",
"start_offset" : 24,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "why",
"start_offset" : 27,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "not",
"start_offset" : 31,
"end_offset" : 34,
"type" : "<ALPHANUM>",
"position" : 8
}
]
}
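Note that the two b tokens come from the <b> and </b> tags: the standard analyzer does not strip HTML (that is a character filter's job, as in the custom analyzer sketch above). The standard analyzer also accepts a few parameters; here is a minimal sketch, assuming a throwaway index named standard_demo (made up for illustration), that caps token length and enables the English stop word list:
PUT standard_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard": {
          "type": "standard",
          "max_token_length": 5,    // split tokens longer than 5 characters
          "stopwords": "_english_"  // enable the default English stop words
        }
      }
    }
  }
}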
simple analyzer
Splits on non-letter characters and lowercases:
GET _analyze
{
"analyzer": "standard",
"text":"hello for 2 <b>in your</b> why-not?"
}
In the result, the number has disappeared (the simple analyzer keeps only letters):
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "for",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 1
},
{
"token" : "b",
"start_offset" : 13,
"end_offset" : 14,
"type" : "word",
"position" : 2
},
{
"token" : "in",
"start_offset" : 15,
"end_offset" : 17,
"type" : "word",
"position" : 3
},
{
"token" : "your",
"start_offset" : 18,
"end_offset" : 22,
"type" : "word",
"position" : 4
},
{
"token" : "b",
"start_offset" : 24,
"end_offset" : 25,
"type" : "word",
"position" : 5
},
{
"token" : "why",
"start_offset" : 27,
"end_offset" : 30,
"type" : "word",
"position" : 6
},
{
"token" : "not",
"start_offset" : 31,
"end_offset" : 34,
"type" : "word",
"position" : 7
}
]
}
The following analyzers are shown as requests only; try them out yourself:
stop analyzer
Lowercases and filters out stop words (the, a, this, ...):
GET _analyze
{
"analyzer": "stop",
"text":"hello for 2 <b>in your</b> why-not?"
}
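With the default English stop word list, for, in, and not should be dropped here, leaving hello, b, your, b, why. The list is configurable; here is a minimal sketch, assuming a throwaway index named stop_demo (made up for illustration), that supplies its own stop words:
PUT stop_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["for", "in", "your"]  // replaces the default stop word list
        }
      }
    }
  }
}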
whitespace analyzer
Splits on whitespace only; it does not lowercase and does not strip HTML:
GET _analyze
{
"analyzer": "whitespace",
"text":"hello for 2 <b>in your</b> why-not?"
}
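Because whitespace is the only split point, the tokens should come back as hello, for, 2, <b>in, your</b>, and why-not?, with the HTML markup, the hyphen, and the original casing all intact.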
keyword analyzer
No tokenization; the whole input is kept as a single token:
GET _analyze
{
"analyzer": "keyword",
"text":"hello for 2 <b>in your</b> why-not?"
}
The output:
{
"tokens" : [
{
"token" : "hello for 2 <b>in your</b> why-not?",
"start_offset" : 0,
"end_offset" : 35,
"type" : "word",
"position" : 0
}
]
}
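Since the whole value becomes one token, the keyword analyzer suits fields that should only match exactly. A minimal mapping sketch, assuming a made-up index keyword_demo with a tag field:
PUT keyword_demo
{
  "mappings": {
    "properties": {
      "tag": {
        "type": "text",
        "analyzer": "keyword"  // index the field value as one exact token
      }
    }
  }
}
(The dedicated keyword field type achieves the same effect and is the more common choice in mappings.)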
pattern analyzer
Splits on a regular expression, \W+ (non-word characters) by default, and lowercases:
GET _analyze
{
"analyzer": "pattern",
"text":"hello for 2 <b>in your</b> why-not?"
}
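With the default \W+ pattern the result should be close to the standard analyzer's: hello, for, 2, b, in, your, b, why, not, all lowercased. The pattern is configurable; here is a minimal sketch, assuming a throwaway index named pattern_demo (made up for illustration), that splits on commas instead:
PUT pattern_demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern": {
          "type": "pattern",
          "pattern": ",",     // split tokens on commas
          "lowercase": true   // this is the default; shown for clarity
        }
      }
    }
  }
}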