Ranking¶
Setup¶
pip install ydf -U
What is Ranking?¶
Ranking, also called Learning to Rank, is the task of determining the order of items. For instance, when you search for a query on Google, it ranks web pages and displays the top-ranked results first.

A common way to represent a ranking dataset is with a "relevance" score: the order of the items is defined by their relevance, and items with greater relevance should appear before items with lower relevance. The cost of a mistake depends on the difference in relevance between the misordered items. For example, misordering two items with relevance 3 and 4 is not as bad as misordering two items with relevance 1 and 5.

YDF expects ranking datasets to be presented in a "flat" format. A dataset of queries and corresponding documents might look like this:
query | document_id | feature_1 | feature_2 | relevance |
---|---|---|---|---|
cat | 1 | 0.1 | blue | 4 |
cat | 2 | 0.5 | green | 1 |
cat | 3 | 0.2 | red | 2 |
dog | 4 | NA | red | 0 |
dog | 5 | 0.2 | red | 0 |
dog | 6 | 0.6 | green | 1 |
The relevance/label is a floating-point value, typically between 0 and 5 (often between 0 and 4), where 0 means "completely unrelated", 4 means "very relevant", and 5 means "same as the query".
In this example, document 1 is very relevant to the query "cat", while document 2 is only loosely related to cats. No document is really about "dog" (the highest relevance is 1, for document 6). Nevertheless, the "dog" query is still expected to return document 6 first, since it is the document that talks the most about dogs.
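To make the flat format concrete, here is a minimal sketch of how the table above could be assembled as a pandas DataFrame, with one row per query/document pair and the relevance as a numerical column (the column names are only illustrative):

import pandas as pd

# One row per (query, document) pair; the relevance column is the ranking label.
ranking_ds = pd.DataFrame({
    "query": ["cat", "cat", "cat", "dog", "dog", "dog"],
    "document_id": [1, 2, 3, 4, 5, 6],
    "feature_1": [0.1, 0.5, 0.2, None, 0.2, 0.6],
    "feature_2": ["blue", "green", "red", "red", "red", "green"],
    "relevance": [4, 1, 2, 0, 0, 1],
})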
Training a ranking model¶
The task of a model (e.g., classification, regression, ranking, uplifting) is determined by the task argument of the learner.
# Load libraries
import ydf # Yggdrasil Decision Forests
import pandas as pd # We use Pandas to load small datasets
import numpy as np
# Download and load the ranking datasets as Pandas DataFrames
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/synthetic_ranking_train.csv")
test_ds = pd.read_csv(f"{ds_path}/synthetic_ranking_test.csv")
# Print the first 5 training examples
train_ds.head(5)
| | GROUP | LABEL | cat_int_0 | cat_int_1 | cat_str_0 | cat_str_1 | num_0 | num_1 | num_2 | num_3 |
---|---|---|---|---|---|---|---|---|---|---|
0 | G0 | 0.493644 | NaN | 11.0 | V_18 | V_7 | 0.923738 | 0.373921 | 0.154973 | 0.892344 |
1 | G0 | 1.461350 | 28.0 | 5.0 | V_15 | V_28 | 0.627094 | 0.907925 | 0.556397 | 0.839919 |
2 | G0 | 0.662606 | 6.0 | 22.0 | NaN | V_2 | 0.690948 | 0.129315 | 0.832686 | 0.318354 |
3 | G0 | 2.510630 | 7.0 | 1.0 | V_5 | V_12 | 0.698481 | NaN | 0.899466 | 0.831899 |
4 | G0 | 0.691813 | 15.0 | 24.0 | V_7 | V_27 | 0.102744 | 0.237528 | 0.379345 | 0.699236 |
In this dataset, each row represents a query/document pair. The query is identified by the "GROUP" column, and "LABEL" is the relevance: it indicates how well the document matches the query.
The features of the query and of the document are merged together in the other columns (cat_int_0, cat_int_1, etc.).
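As a quick sanity check (a sketch using only standard pandas operations on the dataframe loaded above), you can look at how many documents each query group contains and at the range of the relevance labels:

# Number of documents (rows) per query group.
print(train_ds["GROUP"].value_counts().head())

# Range of the relevance labels.
print(train_ds["LABEL"].min(), train_ds["LABEL"].max())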
We can train a ranking model:
model = ydf.GradientBoostedTreesLearner(
    label="LABEL",
    ranking_group="GROUP",
    task=ydf.Task.RANKING,
).train(train_ds)
Train model on 3990 examples
Model trained in 0:00:00.695050
By default, YDF evaluates ranking models with NDCG (normalized discounted cumulative gain).
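To build intuition for what NDCG measures, the sketch below computes it by hand on the relevances of the "cat" documents from the earlier table, using the common exponential-gain formulation; YDF's exact implementation details (e.g., truncation depth) may differ.

def dcg(relevances):
    # Discounted cumulative gain with exponential gains.
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, relevances.size + 2))
    return np.sum((2.0**relevances - 1.0) / discounts)

def ndcg(relevances_in_predicted_order):
    # NDCG is the DCG of the predicted order divided by the DCG of the ideal order.
    ideal = np.sort(relevances_in_predicted_order)[::-1]
    return dcg(relevances_in_predicted_order) / dcg(ideal)

print(ndcg([1, 4, 2]))  # < 1.0: the most relevant document is not ranked first.
print(ndcg([4, 2, 1]))  # = 1.0: the ideal order.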
evaluation = model.evaluate(test_ds)
print(evaluation)
NDCG: 0.726741
num examples: 1010
num examples (weighted): 1010
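The trained model can also score new query/document pairs with model.predict; a higher predicted score means a more relevant document. The snippet below is a sketch (assuming the test dataframe keeps its "GROUP" column) that orders the documents of one test group by decreasing predicted score:

# Predict a relevance score for each test example.
test_with_scores = test_ds.copy()
test_with_scores["score"] = model.predict(test_ds)

# Rank the documents of a single query group by decreasing predicted score.
first_group = test_with_scores["GROUP"].iloc[0]
ranked = test_with_scores[test_with_scores["GROUP"] == first_group]
print(ranked.sort_values("score", ascending=False).head())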