Note
Scorer is an experimental feature.
New in version 5.0.1.
scorer_tf_idf is a scorer based on the TF-IDF (term frequency-inverse document frequency) score function.
Put simply, TF-IDF is TF (term frequency) divided by DF (document frequency). "TF" means that a record with more occurrences of a term scores higher. "TF divided by DF" means that occurrences of an important term count for more than occurrences of a common term.
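The simplified description above can be sketched in Python as follows. This is an illustration only: the tiny corpus, the whitespace tokenizer and the helper names (`tokenize`, `term_frequency`, `document_frequency`, `simple_tf_idf`) are assumptions made for this example, and the plain TF / DF ratio is the simplification used in this section, not Groonga's actual implementation.

```python
# Illustrative sketch only; not Groonga's implementation.
# The corpus and all helper names are hypothetical.

def tokenize(text):
    """Split on whitespace and lowercase (a stand-in for a real tokenizer)."""
    return text.lower().split()

def term_frequency(term, document):
    """TF: how many times the term occurs in one document."""
    return tokenize(document).count(term.lower())

def document_frequency(term, documents):
    """DF: how many documents contain the term at least once."""
    return sum(1 for d in documents if term.lower() in tokenize(d))

def simple_tf_idf(term, document, documents):
    """The simplified 'TF divided by DF' score described above."""
    df = document_frequency(term, documents)
    if df == 0:
        return 0.0
    return term_frequency(term, document) / df

documents = [
    "disk failure",
    "disk full",
    "disk disk check",
    "network failure",
]

# "disk" is common (3 of 4 documents); "network" is rare (1 of 4).
common = simple_tf_idf("disk", "disk full", documents)
rare = simple_tf_idf("network", "network failure", documents)
```

With one occurrence each, the rare term "network" gets a higher score than the common term "disk", which is the effect that dividing by DF is meant to produce.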
The default score function in Groonga is TF (term frequency). It ignores term importance but is fast.
TF-IDF takes term importance into account but is slower than TF.
In many cases TF-IDF computes a more suitable score than TF, but it is not perfect.
If a document repeats the same keyword many times, as in "They are keyword, keyword, keyword ... and keyword", the repetition raises the score under both TF and TF-IDF. Search engine spammers may use this technique, and TF-IDF does not guard against it.
Okapi BM25 can handle this case, but it is slower than TF-IDF and is not implemented in Groonga yet.
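The keyword-repetition problem can be seen in a small Python sketch. It uses a common textbook form of TF-IDF, `tf * log(N / df)`, which is an assumption for illustration and may differ from the formula Groonga uses. The corpus and function names are hypothetical.

```python
import math

def tf_score(term, document):
    """TF: raw number of occurrences of the term in the document."""
    return document.lower().split().count(term.lower())

def tf_idf_score(term, document, documents):
    """Textbook TF-IDF: tf * log(N / df). Assumed form, not Groonga's."""
    df = sum(1 for d in documents if term.lower() in d.lower().split())
    if df == 0:
        return 0.0
    return tf_score(term, document) * math.log(len(documents) / df)

honest = "keyword explained"
stuffed = "keyword keyword keyword keyword keyword"
documents = [honest, stuffed, "something else", "another text"]

# Print TF and TF-IDF for the honest and the keyword-stuffed document.
for name, doc in (("honest", honest), ("stuffed", stuffed)):
    tf = tf_score("keyword", doc)
    tf_idf = tf_idf_score("keyword", doc, documents)
    print(name, tf, round(tf_idf, 3))
```

In this sketch the IDF part depends only on the term, so it is identical for both documents. The stuffed document therefore scores five times higher under TF and under TF-IDF alike.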
You don't need to solve scoring with the score function alone. A score function depends heavily on the search query. You may also be able to use metadata of the matched record.
For example, Google uses PageRank for scoring. You may be able to use the data type ("title" data is more important than "memo" data), tags, geolocation and so on.
Don't think only about the score function when you design scoring.
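One way to picture combining a text score with metadata is a weighted sum. The following Python sketch is purely hypothetical: the column weights, the `metadata_boost` parameter and the function name `combined_score` are invented for illustration and are not part of Groonga.

```python
# Hypothetical weights: a match in "title" counts more than one in "memo".
COLUMN_WEIGHTS = {"title": 10.0, "memo": 1.0}

def combined_score(text_scores, metadata_boost=0.0):
    """Combine per-column text scores with a metadata-based boost.

    text_scores: mapping of column name -> score from any score function.
    metadata_boost: extra score derived from metadata such as tags.
    """
    weighted = sum(COLUMN_WEIGHTS.get(column, 1.0) * score
                   for column, score in text_scores.items())
    return weighted + metadata_boost

title_hit = combined_score({"title": 1.0})
memo_hit = combined_score({"memo": 3.0})
tagged_memo_hit = combined_score({"memo": 3.0}, metadata_boost=8.0)
```

Here a single hit in "title" outranks three hits in "memo", and a metadata boost can lift the "memo" record above it. The actual weights would have to be tuned for each application.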
This section describes how to use the scorer.
Here are a schema definition and sample data to show the usage.
Sample schema:
Execution example:
table_create Logs TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Logs message COLUMN_SCALAR Text
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
--default_tokenizer TokenBigram \
--normalizer NormalizerAuto
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms message_index COLUMN_INDEX|WITH_POSITION Logs message
# [[0, 1337566253.89858, 0.000355720520019531], true]
Sample data:
Execution example:
load --table Logs
[
{"message": "Error"},
{"message": "Warning"},
{"message": "Warning Warning"},
{"message": "Warning Warning Warning"},
{"message": "Info"},
{"message": "Info Info"},
{"message": "Info Info Info"},
{"message": "Info Info Info Info"},
{"message": "Notice"},
{"message": "Notice Notice"},
{"message": "Notice Notice Notice"},
{"message": "Notice Notice Notice Notice"},
{"message": "Notice Notice Notice Notice Notice"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 13]
You can specify scorer_tf_idf in match_columns as follows:
Execution example:
select Logs \
--match_columns "scorer_tf_idf(message)" \
--query "Error OR Info" \
--output_columns "message, _score" \
--sortby "-_score"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 5
# ],
# [
# [
# "message",
# "Text"
# ],
# [
# "_score",
# "Int32"
# ]
# ],
# [
# "Info Info Info Info",
# 3
# ],
# [
# "Error",
# 2
# ],
# [
# "Info Info Info",
# 2
# ],
# [
# "Info Info",
# 1
# ],
# [
# "Info",
# 1
# ]
# ]
# ]
# ]
The score of Info Info Info and the score of Error are both 2, even though Info Info Info contains three Info terms. This is because Error is a more important term than Info. The number of documents that include Info is 4, while the number of documents that include Error is 1. A term that appears in fewer documents is a more characteristic term, and a characteristic term is an important term.
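The document counts can be checked against the sample data with a short Python sketch. The whitespace tokenization and the `textbook_idf` helper (`log(N / df)`) are assumptions for illustration. Groonga's internal formula may differ, so the values computed here are not expected to match the `_score` values in the output.

```python
import math

# The 13 messages loaded into the Logs table in the sample data.
messages = [
    "Error",
    "Warning", "Warning Warning", "Warning Warning Warning",
    "Info", "Info Info", "Info Info Info", "Info Info Info Info",
    "Notice", "Notice Notice", "Notice Notice Notice",
    "Notice Notice Notice Notice", "Notice Notice Notice Notice Notice",
]

def document_frequency(term):
    """Number of messages that contain the term at least once."""
    return sum(1 for m in messages if term in m.split())

def textbook_idf(term):
    """log(N / df): an assumed textbook weight, not Groonga's exact formula."""
    return math.log(len(messages) / document_frequency(term))

df_info = document_frequency("Info")    # documents containing "Info"
df_error = document_frequency("Error")  # documents containing "Error"

# The rarer term gets the larger per-occurrence weight.
error_is_rarer = textbook_idf("Error") > textbook_idf("Info")
```

Because Error appears in one document and Info in four, each occurrence of Error carries more weight than each occurrence of Info, which is why a single Error can score as high as several Info occurrences.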