Ranking¶
Setup¶
pip install ydf -U
What is Ranking?¶
Ranking, also called Learning to Rank, is the task of determining the order of items. For instance, when you search for a query on Google, it ranks web pages and displays the top-ranked results first.

A common way to represent a ranking dataset is with a "relevance" score: the order of the items is defined by their relevance, and items with greater relevance should appear before items with lower relevance. The cost of a mistake depends on the difference in relevance between the misordered items. For example, misordering two items with relevance 3 and 4 is not as bad as misordering two items with relevance 1 and 5.

YDF expects ranking datasets to be presented in a "flat" format. A dataset of queries and corresponding documents might look like this:
query | document_id | feature_1 | feature_2 | relevance |
---|---|---|---|---|
cat | 1 | 0.1 | blue | 4 |
cat | 2 | 0.5 | green | 1 |
cat | 3 | 0.2 | red | 2 |
dog | 4 | NA | red | 0 |
dog | 5 | 0.2 | red | 0 |
dog | 6 | 0.6 | green | 1 |
The relevance/label is a floating-point value, typically between 0 and 5 (often between 0 and 4), where 0 means "completely unrelated", 4 means "very relevant", and 5 means "same as the query".
In this example, document 1 is very relevant to the query "cat", while document 2 is only loosely related to cats. No document is really about "dog" (the highest relevance is 1, for document 6). Nevertheless, the "dog" query is still expected to return document 6 first, since it is the document that talks the most about dogs.
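To make the flat format concrete, here is a minimal sketch of how the table above could be assembled as a pandas DataFrame, with one row per query/document pair and the relevance as a numerical column (the column names are only illustrative):

import pandas as pd

# One row per (query, document) pair; the relevance column is the ranking label.
ranking_ds = pd.DataFrame({
    "query": ["cat", "cat", "cat", "dog", "dog", "dog"],
    "document_id": [1, 2, 3, 4, 5, 6],
    "feature_1": [0.1, 0.5, 0.2, None, 0.2, 0.6],
    "feature_2": ["blue", "green", "red", "red", "red", "green"],
    "relevance": [4, 1, 2, 0, 0, 1],
})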
Training a ranking model¶
The task of a model (e.g., classification, regression, ranking, uplifting) is determined by the task argument of the learner.
# Load libraries
import ydf # Yggdrasil Decision Forests
import pandas as pd # We use Pandas to load small datasets
import numpy as np
# Download and load the ranking datasets as Pandas DataFrames
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/synthetic_ranking_train.csv")
test_ds = pd.read_csv(f"{ds_path}/synthetic_ranking_test.csv")
# Print the first 5 training examples
train_ds.head(5)
| | GROUP | LABEL | cat_int_0 | cat_int_1 | cat_str_0 | cat_str_1 | num_0 | num_1 | num_2 | num_3 |
---|---|---|---|---|---|---|---|---|---|---|
0 | G0 | 0.493644 | NaN | 11.0 | V_18 | V_7 | 0.923738 | 0.373921 | 0.154973 | 0.892344 |
1 | G0 | 1.461350 | 28.0 | 5.0 | V_15 | V_28 | 0.627094 | 0.907925 | 0.556397 | 0.839919 |
2 | G0 | 0.662606 | 6.0 | 22.0 | NaN | V_2 | 0.690948 | 0.129315 | 0.832686 | 0.318354 |
3 | G0 | 2.510630 | 7.0 | 1.0 | V_5 | V_12 | 0.698481 | NaN | 0.899466 | 0.831899 |
4 | G0 | 0.691813 | 15.0 | 24.0 | V_7 | V_27 | 0.102744 | 0.237528 | 0.379345 | 0.699236 |
In this dataset, each row represents a query/document pair. The query is identified by the "GROUP" column, and "LABEL" is the relevance: it indicates how well the document matches the query.
The features of the query and of the document are merged together in the other columns (cat_int_0, cat_int_1, etc.).
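As a quick sanity check (a sketch using only standard pandas operations on the dataframe loaded above), you can look at how many documents each query group contains and at the range of the relevance labels:

# Number of documents (rows) per query group.
print(train_ds["GROUP"].value_counts().head())

# Range of the relevance labels.
print(train_ds["LABEL"].min(), train_ds["LABEL"].max())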
We can train a ranking model:
model = ydf.GradientBoostedTreesLearner(
    label="LABEL",
    ranking_group="GROUP",
    task=ydf.Task.RANKING,
).train(train_ds)
Train model on 3990 examples
Model trained in 0:00:00.695050
By default, YDF evaluates ranking models with NDCG (normalized discounted cumulative gain).
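To build intuition for what NDCG measures, the sketch below computes it by hand on the relevances of the "cat" documents from the earlier table, using the common exponential-gain formulation; YDF's exact implementation details (e.g., truncation depth) may differ.

def dcg(relevances):
    # Discounted cumulative gain with exponential gains.
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, relevances.size + 2))
    return np.sum((2.0**relevances - 1.0) / discounts)

def ndcg(relevances_in_predicted_order):
    # NDCG is the DCG of the predicted order divided by the DCG of the ideal order.
    ideal = np.sort(relevances_in_predicted_order)[::-1]
    return dcg(relevances_in_predicted_order) / dcg(ideal)

print(ndcg([1, 4, 2]))  # < 1.0: the most relevant document is not ranked first.
print(ndcg([4, 2, 1]))  # = 1.0: the ideal order.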
evaluation = model.evaluate(test_ds)
print(evaluation)
NDCG: 0.726741
num examples: 1010
num examples (weighted): 1010
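The trained model can also score new query/document pairs with model.predict; a higher predicted score means a more relevant document. The snippet below is a sketch (assuming the test dataframe keeps its "GROUP" column) that orders the documents of one test group by decreasing predicted score:

# Predict a relevance score for each test example.
test_with_scores = test_ds.copy()
test_with_scores["score"] = model.predict(test_ds)

# Rank the documents of a single query group by decreasing predicted score.
first_group = test_with_scores["GROUP"].iloc[0]
ranked = test_with_scores[test_with_scores["GROUP"] == first_group]
print(ranked.sort_values("score", ascending=False).head())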