cdef class Tree

Array-based representation of a binary decision tree.
    The binary tree is represented as a number of parallel arrays. The i-th
    element of each array holds information about the node `i`. Node 0 is the
    tree’s root. You can find a detailed description of all arrays in
    `_tree.pxd`. NOTE: Some of the arrays only apply to either leaves or split
    nodes, resp. In this case the values of nodes of the other type are
    arbitrary!

A Tree has a few scalar attributes, namely node_count, capacity and max_depth. Besides those three, there are 8 attributes that are arrays, and each of them has node_count as its (first) dimension.

1. children_left

children_left[i] stores the node id of the left child of the i-th node; if node i is a leaf and has no children, children_left[i] is set to TREE_LEAF = -1. By convention, the left child always receives the samples whose feature value is less than or equal to the threshold. Children are also always assigned node ids greater than their parent split node, so children_left[i] > i.

2. children_right

children_right is the mirror image of children_left: children_right[i] stores the node id of the right child of node i (the samples whose feature value is above the threshold), again with TREE_LEAF = -1 for leaves.

3. feature

feature[i] holds the feature to split on for the internal node i -> Key

4. threshold

threshold[i] holds the threshold for the internal node i -> Value

5. value

value is a bit more complex: it has the shape [node_count, n_outputs, max_n_classes]. The documentation says it “Contains the constant prediction value of each node”. For a classifier that means the weighted class counts of the samples reaching the node; for a regressor it is the mean target value.

6. impurity:

impurity[i] holds the impurity (e.g. Gini or entropy) at node i

7. n_node_samples:

holds the number of training samples reaching node i

8. weighted_n_node_samples:

holds the weighted number of training samples reaching node i
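To make the parallel-array layout concrete, here is a minimal sketch that fits a small tree on the iris dataset and inspects these attributes through the public `tree_` object (the exact node ids and thresholds depend on your sklearn version):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
t = clf.tree_  # the Cython Tree object

TREE_LEAF = -1

print(t.node_count, t.max_depth)  # scalar attributes
print(t.children_left)            # TREE_LEAF = -1 marks leaves
print(t.feature, t.threshold)     # key -> value for each split node
print(t.value.shape)              # (node_count, n_outputs, max_n_classes)

# children always get larger ids than their parent split node
internal = t.children_left != TREE_LEAF
parents = np.arange(t.node_count)[internal]
assert np.all(t.children_left[internal] > parents)
```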
In the Tree class, not many methods are meant to be used directly by users, and the sklearn developers achieved this by declaring most of them with cdef, which makes them callable only from Cython. However, a few methods are declared with cpdef, so they are callable from Python too, and these should be familiar to most sklearn users.
1. predict(X)
2. apply(X)
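apply(X) returns, for every sample, the id of the leaf node it ends up in. A quick sketch using the estimator-level wrapper, checking against children_left that the returned ids really are leaves:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

leaves = clf.apply(X)  # public wrapper around Tree.apply
print(leaves[:10])

# every returned id must be a leaf (children_left == TREE_LEAF)
is_leaf = clf.tree_.children_left == -1
assert np.all(is_leaf[leaves])
```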
3. decision_path(X)
The decision_path code is very similar to the other methods: it loops through all the samples and, for each one, records the nodes visited. Within decision_path, however, the results are written through indptr and indices, along with raw pointers to those two arrays, indptr_ptr and indices_ptr.
If you are confused by indices, indptr and data, don’t worry: those are the key variables behind the CSR (compressed sparse row) matrix format in SciPy. Instead of reading the official documentation, you can check out a great Stack Overflow question where user Tanguy explains how those variables fit together.
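A small sketch tying decision_path back to indptr and indices: each sample's row boundaries live in indptr, and the node ids it visits live in indices, starting from the root (node 0):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

path = clf.decision_path(X)  # a scipy.sparse CSR matrix
print(path.shape)            # (n_samples, node_count)
print(path.indptr[:3])       # row boundaries, one slice per sample

# node ids visited by sample 0
nodes_0 = path.indices[path.indptr[0]:path.indptr[1]]
print(nodes_0)
assert nodes_0[0] == 0       # every path starts at the root
```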
4. compute_feature_importances()
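compute_feature_importances is what backs the public feature_importances_ property: the total impurity decrease contributed by each feature, weighted by the samples reaching each split, and (by default) normalized to sum to 1. A sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# calling the cpdef method directly; normalize=True is the default
imp = clf.tree_.compute_feature_importances()
print(imp)

assert np.isclose(imp.sum(), 1.0)
assert np.allclose(imp, clf.feature_importances_)
```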
