pip install ydf -U
import ydf
import numpy as np
What are Multi-dimensional features?¶
Multi-dimensional features are model inputs with multiple dimensions. For example, feeding multiple timestamps of a time series or the values of different pixels of an image are multi-dimensional features. They differ from single-dimensional features, which have only one dimension. Each dimension of a multi-dimensional feature is treated as an individual single-dimensional feature.
Multi-dimensional features are fed as multi-dimensional arrays, such as NumPy arrays or TensorFlow tensors. The following toy example shows how to feed a multi-dimensional feature to a model.
Create a multi-dimensional dataset¶
The simplest way to create a multi-dimensional dataset is to use a dictionary of multi-dimensional NumPy arrays.
def create_dataset(num_examples):
    # Generates random feature values.
    dataset = {
        # f1 is a 4-dimensional feature.
        "f1": np.random.uniform(size=(num_examples, 4)),
        # f2 is a single-dimensional feature.
        "f2": np.random.uniform(size=num_examples),
    }
    # Add a synthetic label.
    noise = np.random.uniform(size=num_examples)
    dataset["label"] = (
        np.sum(dataset["f1"], axis=1) + dataset["f2"] * 0.2 + noise
    ) >= 2.0
    return dataset
print("A dataset with 5 examples:")
create_dataset(num_examples=5)
A dataset with 5 examples:
{'f1': array([[0.5373759 , 0.18098291, 0.74489824, 0.27706572],
        [0.4517745 , 0.37578001, 0.45156836, 0.05413219],
        [0.77036813, 0.1640734 , 0.47994649, 0.06315383],
        [0.44115416, 0.95749836, 0.80662146, 0.78114808],
        [0.40393628, 0.22786682, 0.32477702, 0.18309577]]),
 'f2': array([0.02058218, 0.94332705, 0.25678716, 0.02122367, 0.04498769]),
 'label': array([False,  True, False,  True, False])}
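The same inputs can also be expressed as separate single-dimensional columns. The sketch below uses hypothetical column names "f1_0" to "f1_3"; when given the multi-dimensional array directly, YDF derives its own per-dimension names instead (e.g. "f1.0_of_4", visible in the model description later in this tutorial).
# Sketch: an equivalent dataset with one single-dimensional column per
# dimension of "f1". The names "f1_0" ... "f1_3" are arbitrary.
flat_dataset = create_dataset(num_examples=5)
for i in range(flat_dataset["f1"].shape[1]):
    flat_dataset[f"f1_{i}"] = flat_dataset["f1"][:, i]
del flat_dataset["f1"]
print(sorted(flat_dataset.keys()))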
Train model¶
Training a model on multi-dimensional features is similar to training a model on single-dimensional features.
train_ds = create_dataset(num_examples=10000)
model = ydf.GradientBoostedTreesLearner(label="label").train(train_ds)
Train model on 10000 examples
Model trained in 0:00:02.789326
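Once trained, the model is used like any other YDF model: multi-dimensional features are passed the same way at inference time. A minimal sketch, assuming the standard predict() and evaluate() model methods:
eval_ds = create_dataset(num_examples=1000)

# For this binary classification model, predict() returns the probability of
# the positive class for each example.
probabilities = model.predict(eval_ds)
print(probabilities[:5])

# evaluate() computes test metrics such as accuracy and loss.
print(model.evaluate(eval_ds).accuracy)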
Model understanding¶
When interpreting the model, each dimension of the multi-dimensional feature is treated independently. For example, describing the model shows each dimension individually.
model.describe()
Task : CLASSIFICATION
Label : label
Features (5) : f1.0_of_4 f1.1_of_4 f1.2_of_4 f1.3_of_4 f2
Weights : None
Trained with tuner : No
Model size : 767 kB
Number of records: 10000
Number of columns: 6

Number of columns by type:
    NUMERICAL: 5 (83.3333%)
    CATEGORICAL: 1 (16.6667%)

Columns:

NUMERICAL: 5 (83.3333%)
    1: "f1.0_of_4" NUMERICAL mean:0.49459 min:4.63251e-05 max:0.999917 sd:0.289597
    2: "f1.1_of_4" NUMERICAL mean:0.498703 min:5.8423e-06 max:0.999997 sd:0.289197
    3: "f1.2_of_4" NUMERICAL mean:0.498227 min:7.85791e-05 max:0.999943 sd:0.288629
    4: "f1.3_of_4" NUMERICAL mean:0.496773 min:9.6696e-05 max:0.99987 sd:0.28987
    5: "f2" NUMERICAL mean:0.504066 min:3.89178e-05 max:0.999976 sd:0.289052

CATEGORICAL: 1 (16.6667%)
    0: "label" CATEGORICAL has-dict vocab-size:3 no-ood-item most-frequent:"true" 8140 (81.4%)

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.
The following evaluation is computed on the validation or out-of-bag dataset.
Task: CLASSIFICATION
Label: label
Loss (BINOMIAL_LOG_LIKELIHOOD): 0.48424
Accuracy: 0.88592  CI95[W][0 1]
ErrorRate: 0.11408

Confusion Table:
truth\prediction  false  true
false               114    65
true                 46   748
Total: 973
Variable importances measure the importance of an input feature for a model.
1. "f1.3_of_4" 0.381709 ################ 2. "f1.0_of_4" 0.364288 ############## 3. "f1.1_of_4" 0.347260 ############# 4. "f1.2_of_4" 0.310757 ########## 5. "f2" 0.187976
1. "f1.3_of_4" 20.000000 ################ 2. "f1.0_of_4" 16.000000 ######### 3. "f1.1_of_4" 10.000000 4. "f1.2_of_4" 10.000000
1. "f1.2_of_4" 368.000000 ################ 2. "f1.3_of_4" 367.000000 ############### 3. "f1.0_of_4" 340.000000 ############# 4. "f1.1_of_4" 318.000000 ############ 5. "f2" 164.000000
1. "f1.2_of_4" 1184.639552 ################ 2. "f1.1_of_4" 1180.490537 ############### 3. "f1.0_of_4" 1087.300719 ############## 4. "f1.3_of_4" 1061.770106 ############## 5. "f2" 124.009355
These variable importances are computed during training. Additional, and often more informative, variable importances are available when analyzing the model on a test dataset.
Only printing the first tree.
Tree #0:
"f1.3_of_4">=0.340523 [s:0.0125718 n:9027 np:5931 miss:1] ; pred:-1.59908e-08
├─(pos)─ "f1.0_of_4">=0.355412 [s:0.00663164 n:5931 np:3784 miss:1] ; pred:0.0534567
|   ├─(pos)─ "f1.1_of_4">=0.285883 [s:0.00223373 n:3784 np:2645 miss:1] ; pred:0.0939348
|   |   ├─(pos)─ "f1.2_of_4">=0.211332 [s:0.000410131 n:2645 np:2088 miss:1] ; pred:0.114401
|   |   |   ├─(pos)─ "f1.3_of_4">=0.343045 [s:7.77674e-05 n:2088 np:2082 miss:1] ; pred:0.121303
|   |   |   |   ├─(pos)─ pred:0.121615
|   |   |   |   └─(neg)─ pred:0.0129024
|   |   |   └─(neg)─ "f1.1_of_4">=0.404073 [s:0.00320274 n:557 np:479 miss:1] ; pred:0.0885265
|   |   |       ├─(pos)─ pred:0.103596
|   |   |       └─(neg)─ pred:-0.00401777
|   |   └─(neg)─ "f1.2_of_4">=0.186632 [s:0.0112601 n:1139 np:915 miss:1] ; pred:0.0464084
|   |       ├─(pos)─ "f1.2_of_4">=0.528798 [s:0.00319255 n:915 np:556 miss:0] ; pred:0.0810544
|   |       |   ├─(pos)─ pred:0.111015
|   |       |   └─(neg)─ pred:0.0346534
|   |       └─(neg)─ "f1.0_of_4">=0.702914 [s:0.0383716 n:224 np:96 miss:0] ; pred:-0.0951145
|   |           ├─(pos)─ pred:0.0541452
|   |           └─(neg)─ pred:-0.207059
|   └─(neg)─ "f1.1_of_4">=0.412053 [s:0.0223679 n:2147 np:1253 miss:1] ; pred:-0.0178841
|       ├─(pos)─ "f1.2_of_4">=0.204555 [s:0.0101297 n:1253 np:1010 miss:1] ; pred:0.065479
|       |   ├─(pos)─ "f1.2_of_4">=0.404707 [s:0.00241061 n:1010 np:772 miss:1] ; pred:0.0980558
|       |   |   ├─(pos)─ pred:0.116045
|       |   |   └─(neg)─ pred:0.0397044
|       |   └─(neg)─ "f1.3_of_4">=0.667282 [s:0.0338417 n:243 np:114 miss:0] ; pred:-0.0699227
|       |       ├─(pos)─ pred:0.0592101
|       |       └─(neg)─ pred:-0.18404
|       └─(neg)─ "f1.2_of_4">=0.494196 [s:0.0422598 n:894 np:448 miss:1] ; pred:-0.134723
|           ├─(pos)─ "f1.3_of_4">=0.561545 [s:0.0132409 n:448 np:285 miss:0] ; pred:0.000627708
|           |   ├─(pos)─ pred:0.0580524
|           |   └─(neg)─ pred:-0.0997774
|           └─(neg)─ "f1.3_of_4">=0.702899 [s:0.0247338 n:446 np:213 miss:0] ; pred:-0.270681
|               ├─(pos)─ pred:-0.162138
|               └─(neg)─ pred:-0.369906
└─(neg)─ "f1.1_of_4">=0.456287 [s:0.0326619 n:3096 np:1725 miss:1] ; pred:-0.102407
    ├─(pos)─ "f1.0_of_4">=0.465293 [s:0.0172008 n:1725 np:920 miss:1] ; pred:0.00391262
    |   ├─(pos)─ "f1.2_of_4">=0.150376 [s:0.00671146 n:920 np:781 miss:1] ; pred:0.0848681
    |   |   ├─(pos)─ "f1.1_of_4">=0.675697 [s:0.000847067 n:781 np:480 miss:0] ; pred:0.107675
    |   |   |   ├─(pos)─ pred:0.122883
    |   |   |   └─(neg)─ pred:0.0834216
    |   |   └─(neg)─ "f1.3_of_4">=0.0952032 [s:0.0266206 n:139 np:103 miss:1] ; pred:-0.0432749
    |   |       ├─(pos)─ pred:0.0203768
    |   |       └─(neg)─ pred:-0.225389
    |   └─(neg)─ "f1.2_of_4">=0.588246 [s:0.0368965 n:805 np:331 miss:0] ; pred:-0.0886079
    |       ├─(pos)─ "f1.2_of_4">=0.705104 [s:0.00581962 n:331 np:244 miss:0] ; pred:0.0630749
    |       |   ├─(pos)─ pred:0.0931343
    |       |   └─(neg)─ pred:-0.0212296
    |       └─(neg)─ "f1.1_of_4">=0.640417 [s:0.0161828 n:474 np:313 miss:0] ; pred:-0.19453
    |           ├─(pos)─ pred:-0.134324
    |           └─(neg)─ pred:-0.311575
    └─(neg)─ "f1.2_of_4">=0.519391 [s:0.0405007 n:1371 np:637 miss:0] ; pred:-0.236179
        ├─(pos)─ "f1.0_of_4">=0.316183 [s:0.0395178 n:637 np:418 miss:1] ; pred:-0.0936254
        |   ├─(pos)─ "f1.0_of_4">=0.686852 [s:0.0172247 n:418 np:186 miss:0] ; pred:0.00132543
        |   |   ├─(pos)─ pred:0.0980488
        |   |   └─(neg)─ pred:-0.0762201
        |   └─(neg)─ "f1.2_of_4">=0.893097 [s:0.03348 n:219 np:43 miss:0] ; pred:-0.274856
        |       ├─(pos)─ pred:-0.0305784
        |       └─(neg)─ pred:-0.334537
        └─(neg)─ "f1.0_of_4">=0.667598 [s:0.0436245 n:734 np:222 miss:0] ; pred:-0.359894
            ├─(pos)─ "f1.1_of_4">=0.19785 [s:0.0336715 n:222 np:119 miss:1] ; pred:-0.150583
            |   ├─(pos)─ pred:-0.0379291
            |   └─(neg)─ pred:-0.280736
            └─(neg)─ "f1.0_of_4">=0.402493 [s:0.00914017 n:512 np:213 miss:1] ; pred:-0.45065
                ├─(pos)─ pred:-0.375903
                └─(neg)─ pred:-0.503897
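The same information is also available programmatically. A minimal sketch, assuming the standard variable_importances() and print_tree() model methods:
# Mapping from each importance name to a ranked list of (value, feature name)
# pairs; each dimension of "f1" appears as its own entry (e.g. "f1.0_of_4").
print(model.variable_importances().keys())

# Print the first tree of the forest; the splits reference individual
# dimensions of "f1".
model.print_tree()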
Analyzing the model on a test dataset also shows each dimension individually.
test_ds = create_dataset(num_examples=10000)
model.analyze(test_ds)
Variable importances measure the importance of an input feature for a model.
1. "f1.1_of_4" 0.064800 ################ 2. "f1.3_of_4" 0.064300 ############### 3. "f1.2_of_4" 0.062700 ############### 4. "f1.0_of_4" 0.058700 ############## 5. "f2" 0.004000
1. "f1.0_of_4" 0.032397 ################ 2. "f1.3_of_4" 0.032241 ############### 3. "f1.1_of_4" 0.031047 ############### 4. "f1.2_of_4" 0.030587 ############### 5. "f2" 0.001307
1. "f1.3_of_4" 0.113808 ################ 2. "f1.0_of_4" 0.113546 ############### 3. "f1.1_of_4" 0.112715 ############### 4. "f1.2_of_4" 0.110428 ############### 5. "f2" 0.005334
1. "f1.0_of_4" 0.032394 ################ 2. "f1.3_of_4" 0.032237 ############### 3. "f1.1_of_4" 0.031045 ############### 4. "f1.2_of_4" 0.030584 ############### 5. "f2" 0.001307
1. "f1.3_of_4" 0.381709 ################ 2. "f1.0_of_4" 0.364288 ############## 3. "f1.1_of_4" 0.347260 ############# 4. "f1.2_of_4" 0.310757 ########## 5. "f2" 0.187976
1. "f1.3_of_4" 20.000000 ################ 2. "f1.0_of_4" 16.000000 ######### 3. "f1.1_of_4" 10.000000 4. "f1.2_of_4" 10.000000
1. "f1.2_of_4" 368.000000 ################ 2. "f1.3_of_4" 367.000000 ############### 3. "f1.0_of_4" 340.000000 ############# 4. "f1.1_of_4" 318.000000 ############ 5. "f2" 164.000000
1. "f1.2_of_4" 1184.639552 ################ 2. "f1.1_of_4" 1180.490537 ############### 3. "f1.0_of_4" 1087.300719 ############## 4. "f1.3_of_4" 1061.770106 ############## 5. "f2" 124.009355