pip install ydf -U
What is model tuning?¶
Model tuning, also known as automated model hyperparameter optimization or AutoML, involves finding the optimal hyperparameters for a learner to maximize the performance of a model. YDF supports model tuning out-of-the-box.
YDF model tuning has two modes. A user can either manually specify the hyperparameters to optimize and their candidate values, or rely on pre-configured tuning. Pre-configured tuning is simpler, while manual configuration gives you more control. We demonstrate both options in this tutorial.
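As a preview of the pre-configured mode, the following sketch lets YDF define the search space itself via the `automatic_search_space` argument of `ydf.RandomSearchTuner` (the label name is illustrative):

import ydf

# Pre-configured tuning: YDF selects which hyper-parameters to
# optimize and their candidate values automatically.
tuner = ydf.RandomSearchTuner(num_trials=50, automatic_search_space=True)
learner = ydf.GradientBoostedTreesLearner(label="income", tuner=tuner)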
Tuning can be done on a single machine or across multiple machines using distributed training. This tutorial focuses on tuning on a single machine. Local tuning is simple to set up and can produce excellent results on small datasets.
Distributed model tuning¶
Distributed tuning can be advantageous for models that take a long time to train or for large hyper-parameter search spaces. It requires configuring workers and setting the `workers` constructor argument of the learner. Once the workers are set up, the model tuning strategy is the same as for tuning on a local machine. For more information, see the distributed training tutorial.
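For illustration only, a distributed tuning setup could look like the following sketch. It assumes two workers have already been started on remote machines (e.g., with `ydf.start_worker(9001)` on each of them); the addresses below are placeholders.

import ydf

tuner = ydf.RandomSearchTuner(num_trials=50, automatic_search_space=True)
learner = ydf.GradientBoostedTreesLearner(
    label="income",
    tuner=tuner,
    # Placeholder addresses of already-running YDF workers.
    workers=["192.168.0.10:9001", "192.168.0.11:9001"],
)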
Download dataset¶
We use the adult dataset.
import ydf # Yggdrasil Decision Forests
import pandas as pd # We use Pandas to load small datasets
# Download a classification dataset and load it as a Pandas DataFrame.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")
# Print the first 5 training examples
train_ds.head(5)
|   | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|--------|
| 0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
| 1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
| 2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
| 3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |
Local tuning with manually set hyper-parameters¶
The hyper-parameters of a learner are listed in the API reference and on the hyper-parameter page. The guide How to improve a model also gives recommendations on the hyper-parameters that are most impactful to optimize. In this example, we train a gradient boosted trees model and optimize the following hyper-parameters: `shrinkage`, `subsample`, and `max_depth`.
The tuning objective is selected automatically for the model. For instance, for the `GradientBoostedTreesLearner` used in this example, the loss is minimized.
Let's configure the tuner:
tuner = ydf.RandomSearchTuner(num_trials=50)
tuner.choice("shrinkage", [0.2, 0.1, 0.05])
tuner.choice("subsample", [1.0, 0.9, 0.8])
tuner.choice("max_depth", [3, 4, 5, 6])
<ydf.learner.tuner.SearchSpace at 0x7f3eb4372310>
We create a learner using this tuner and train a model:
Note: Parameters that are not tuned can be specified directly on the learner.
Note: To print the tuning logs during tuning, enable logging with `ydf.verbose(2)`.
learner = ydf.GradientBoostedTreesLearner(
label="income",
num_trees=100, # Used for all the trials.
tuner=tuner,
)
model = learner.train(train_ds)
Train model on 22792 examples
Model trained in 0:00:03.998356
The tuning logs, i.e., the list of hyper-parameter combinations that were tested and their scores, are available in the "tuning" tab of the model description. Note that only 36 trials are listed even though `num_trials=50`: the search space defined above contains just 3 × 3 × 4 = 36 unique combinations.
model.describe()
Task : CLASSIFICATION
Label : income
Features (14) : age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country
Weights : None
Trained with tuner : Yes
Model size : 543 kB
Number of records: 22792
Number of columns: 15

Number of columns by type:
    CATEGORICAL: 9 (60%)
    NUMERICAL: 6 (40%)

Columns:

CATEGORICAL: 9 (60%)
    0: "income" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"<=50K" 17308 (75.9389%)
    2: "workclass" CATEGORICAL num-nas:1257 (5.51509%) has-dict vocab-size:8 num-oods:3 (0.0139308%) most-frequent:"Private" 15879 (73.7358%)
    4: "education" CATEGORICAL has-dict vocab-size:17 zero-ood-items most-frequent:"HS-grad" 7340 (32.2043%)
    6: "marital_status" CATEGORICAL has-dict vocab-size:8 zero-ood-items most-frequent:"Married-civ-spouse" 10431 (45.7661%)
    7: "occupation" CATEGORICAL num-nas:1260 (5.52826%) has-dict vocab-size:14 num-oods:4 (0.018577%) most-frequent:"Prof-specialty" 2870 (13.329%)
    8: "relationship" CATEGORICAL has-dict vocab-size:7 zero-ood-items most-frequent:"Husband" 9191 (40.3256%)
    9: "race" CATEGORICAL has-dict vocab-size:6 zero-ood-items most-frequent:"White" 19467 (85.4115%)
    10: "sex" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"Male" 15165 (66.5365%)
    14: "native_country" CATEGORICAL num-nas:407 (1.78571%) has-dict vocab-size:41 num-oods:1 (0.00446728%) most-frequent:"United-States" 20436 (91.2933%)

NUMERICAL: 6 (40%)
    1: "age" NUMERICAL mean:38.6153 min:17 max:90 sd:13.661
    3: "fnlwgt" NUMERICAL mean:189879 min:12285 max:1.4847e+06 sd:106423
    5: "education_num" NUMERICAL mean:10.0927 min:1 max:16 sd:2.56427
    11: "capital_gain" NUMERICAL mean:1081.9 min:0 max:99999 sd:7509.48
    12: "capital_loss" NUMERICAL mean:87.2806 min:0 max:4356 sd:403.01
    13: "hours_per_week" NUMERICAL mean:40.3955 min:1 max:99 sd:12.249

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
A tuner automatically selects the hyper-parameters of a learner.
trial | score | duration (s) | shrinkage | subsample | max_depth |
---|---|---|---|---|---|
16 | -0.574861 | 2.49348 | 0.2 | 1 | 5 |
31 | -0.576405 | 3.53616 | 0.2 | 1 | 6 |
15 | -0.577211 | 2.4727 | 0.1 | 1 | 5 |
33 | -0.578941 | 3.69053 | 0.2 | 0.9 | 5 |
32 | -0.579071 | 3.54803 | 0.2 | 0.9 | 6 |
35 | -0.579637 | 3.99118 | 0.1 | 1 | 6 |
19 | -0.581703 | 2.68832 | 0.2 | 0.8 | 6 |
34 | -0.582941 | 3.90171 | 0.1 | 0.8 | 6 |
14 | -0.583348 | 2.46785 | 0.2 | 0.8 | 5 |
27 | -0.583466 | 3.23896 | 0.2 | 0.9 | 4 |
10 | -0.58463 | 2.14364 | 0.2 | 1 | 4 |
22 | -0.584824 | 2.97681 | 0.1 | 0.9 | 6 |
13 | -0.585809 | 2.46436 | 0.1 | 0.9 | 5 |
12 | -0.587067 | 2.29765 | 0.1 | 0.8 | 5 |
8 | -0.590813 | 1.97632 | 0.2 | 0.8 | 4 |
24 | -0.593991 | 3.0293 | 0.05 | 1 | 6 |
9 | -0.595175 | 2.14037 | 0.1 | 1 | 4 |
21 | -0.596592 | 2.91333 | 0.05 | 0.8 | 6 |
28 | -0.597159 | 3.2767 | 0.1 | 0.9 | 4 |
20 | -0.597244 | 2.90384 | 0.05 | 0.9 | 6 |
6 | -0.597766 | 1.96352 | 0.1 | 0.8 | 4 |
5 | -0.603554 | 1.71404 | 0.2 | 1 | 3 |
23 | -0.60517 | 3.01335 | 0.2 | 0.9 | 3 |
18 | -0.605849 | 2.54463 | 0.05 | 0.9 | 5 |
0 | -0.606706 | 1.49037 | 0.2 | 0.8 | 3 |
17 | -0.607283 | 2.511 | 0.05 | 0.8 | 5 |
30 | -0.608091 | 3.47695 | 0.05 | 1 | 5 |
25 | -0.619956 | 3.17843 | 0.1 | 0.9 | 3 |
3 | -0.620752 | 1.63833 | 0.1 | 0.8 | 3 |
4 | -0.621349 | 1.70712 | 0.1 | 1 | 3 |
7 | -0.625488 | 1.96705 | 0.05 | 0.8 | 4 |
29 | -0.626953 | 3.43528 | 0.05 | 0.9 | 4 |
11 | -0.62982 | 2.16092 | 0.05 | 1 | 4 |
1 | -0.656424 | 1.57613 | 0.05 | 0.8 | 3 |
26 | -0.656732 | 3.20212 | 0.05 | 1 | 3 |
2 | -0.656747 | 1.62633 | 0.05 | 0.9 | 3 |
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION
Label: income
Loss (BINOMIAL_LOG_LIKELIHOOD): 0.574861
Accuracy: 0.87251  CI95[W][0 1]
ErrorRate: 0.12749

Confusion Table:
truth\prediction  <=50K  >50K
<=50K              1570    94
>50K                194   401
Total: 2259
Variable importances measure the importance of an input feature for a model.
1. "age" 0.257622 ################ 2. "capital_gain" 0.249047 ############# 3. "relationship" 0.244032 ########### 4. "occupation" 0.242881 ########### 5. "hours_per_week" 0.238530 ########## 6. "education" 0.237441 ######### 7. "marital_status" 0.234935 ######## 8. "capital_loss" 0.231145 ####### 9. "fnlwgt" 0.226059 ###### 10. "native_country" 0.225767 ###### 11. "workclass" 0.220718 #### 12. "education_num" 0.219033 #### 13. "sex" 0.211384 # 14. "race" 0.206124
1. "capital_gain" 11.000000 ################ 2. "age" 10.000000 ############## 3. "hours_per_week" 10.000000 ############## 4. "relationship" 9.000000 ############ 5. "marital_status" 7.000000 ######### 6. "education" 6.000000 ######## 7. "capital_loss" 6.000000 ######## 8. "fnlwgt" 5.000000 ###### 9. "workclass" 3.000000 ### 10. "education_num" 3.000000 ### 11. "sex" 3.000000 ### 12. "occupation" 1.000000 13. "race" 1.000000
1. "occupation" 144.000000 ################ 2. "age" 121.000000 ############# 3. "education" 113.000000 ############ 4. "capital_gain" 111.000000 ############ 5. "capital_loss" 90.000000 ######### 6. "native_country" 87.000000 ######### 7. "fnlwgt" 84.000000 ######### 8. "relationship" 73.000000 ####### 9. "marital_status" 68.000000 ####### 10. "hours_per_week" 64.000000 ###### 11. "workclass" 49.000000 ##### 12. "education_num" 28.000000 ## 13. "sex" 14.000000 # 14. "race" 5.000000
1. "relationship" 1675.422986 ################ 2. "capital_gain" 1040.150118 ######### 3. "education_num" 687.196583 ###### 4. "occupation" 526.056194 ##### 5. "marital_status" 469.469421 #### 6. "age" 289.979275 ## 7. "capital_loss" 281.277707 ## 8. "education" 259.256109 ## 9. "hours_per_week" 181.939375 # 10. "native_country" 108.750643 # 11. "workclass" 64.136268 12. "fnlwgt" 46.873309 13. "sex" 30.074515 14. "race" 2.153583
Those variable importances are computed during training. More, and possibly more informative, variable importances are available when analyzing a model on a test dataset.
Only printing the first tree.
Tree #0: "relationship" is in [BITMAP] {<OOD>, Husband, Wife} [s:0.036623 n:20533 np:9213 miss:1] ; pred:-8.31766e-09 ├─(pos)─ "education_num">=12.5 [s:0.0343752 n:9213 np:2773 miss:0] ; pred:0.233866 | ├─(pos)─ "capital_gain">=5095.5 [s:0.0125728 n:2773 np:434 miss:0] ; pred:0.545366 | | ├─(pos)─ "occupation" is in [BITMAP] {<OOD>, Prof-specialty, Exec-managerial, Craft-repair, Adm-clerical, Sales, Other-service, Machine-op-inspct, Transport-moving, Handlers-cleaners, ...[2 left]} [s:0.000434532 n:434 np:429 miss:1] ; pred:0.832346 | | | ├─(pos)─ pred:0.834828 | | | └─(neg)─ pred:0.619473 | | └─(neg)─ "capital_loss">=1782.5 [s:0.0101181 n:2339 np:249 miss:0] ; pred:0.492116 | | ├─(pos)─ pred:0.813402 | | └─(neg)─ pred:0.453839 | └─(neg)─ "capital_gain">=5095.5 [s:0.0205419 n:6440 np:303 miss:0] ; pred:0.0997371 | ├─(pos)─ "age">=60.5 [s:0.00421502 n:303 np:43 miss:0] ; pred:0.810859 | | ├─(pos)─ pred:0.634856 | | └─(neg)─ pred:0.839967 | └─(neg)─ "occupation" is in [BITMAP] {Prof-specialty, Exec-managerial, Adm-clerical, Sales, Tech-support, Protective-serv} [s:0.0100346 n:6137 np:2334 miss:0] ; pred:0.0646271 | ├─(pos)─ pred:0.205598 | └─(neg)─ pred:-0.0218904 └─(neg)─ "capital_gain">=7073.5 [s:0.0143125 n:11320 np:199 miss:0] ; pred:-0.190336 ├─(pos)─ "age">=21.5 [s:0.00807667 n:199 np:194 miss:1] ; pred:0.795647 | ├─(pos)─ "capital_gain">=7565.5 [s:0.00761118 n:194 np:184 miss:0] ; pred:0.811553 | | ├─(pos)─ pred:0.833976 | | └─(neg)─ pred:0.398979 | └─(neg)─ pred:0.178485 └─(neg)─ "education" is in [BITMAP] {<OOD>, Bachelors, Masters, Prof-school, Doctorate} [s:0.00229611 n:11121 np:2199 miss:1] ; pred:-0.207979 ├─(pos)─ "age">=31.5 [s:0.00725859 n:2199 np:1263 miss:1] ; pred:-0.10157 | ├─(pos)─ pred:-0.0207104 | └─(neg)─ pred:-0.210678 └─(neg)─ "capital_loss">=2218.5 [s:0.000534265 n:8922 np:41 miss:0] ; pred:-0.234206 ├─(pos)─ pred:0.14084 └─(neg)─ pred:-0.235938
The model can then be evaluated as usual.
model.evaluate(test_ds)
Label \ Pred | <=50K | >50K |
---|---|---|
<=50K | 6974 | 438 |
>50K | 781 | 1576 |
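Individual metrics are also exposed as attributes of the returned evaluation object, for example (the exact set of metrics depends on the task):

evaluation = model.evaluate(test_ds)
# Scalar metrics are attributes of the evaluation object.
print("Test accuracy:", evaluation.accuracy)
print("Test loss:", evaluation.loss)

A richer analysis, including permutation variable importances computed on the test data, is available with model.analyze(test_ds).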