Sample Mask

Screen Shot 2019-11-30 at 11.38.09 AM.png

random.rand

random_state here is an instance of numpy's RandomState class, which can generate random numbers following many types of distributions. In this case, the author uses random_state.rand(d0) to generate a one-dimensional array with as many elements as n_total_samples, in which each value is between 0 and 1.

Meanwhile, they also prepopulate sample_mask, which has the same shape as the rand array, with 0s.

Then, during the for loop, they decide whether to flip each sample's mask from 0 to 1 using a pretty interesting condition.

rand[i] * (n_total_samples - i) < (n_total_in_bag - n_bagged)

or, if we rearrange it:

rand[i] < (n_total_in_bag - n_bagged) / (n_total_samples - i)

n_total_in_bag - n_bagged => how many samples still need to be bagged

n_total_samples - i => how many samples are left to be considered

At the beginning the numerator is large and decreases as we flip more masks, and the denominator is also large and also decreases as we continue. n_bagged starts at 0, and as i increases the threshold keeps increasing, so the probability that rand[i] falls below it increases too. Whenever the condition is met, we flag the sample as masked and add 1 to n_bagged. We continue until n_bagged equals n_total_in_bag; at that point the threshold becomes literally 0 and the condition can never be met again, because the random number generator only spits out numbers in [0, 1). Also, if there are more samples remaining to be bagged than samples left to be considered, the threshold will be strictly greater than 1 and the sample will definitely be masked, which probably happens a lot at the beginning.
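
To make this concrete, here is a small sketch I put together of that selection logic (my own reconstruction, not the exact sklearn code; variable names follow the discussion above):

    import numpy as np

    def generate_sample_mask(n_total_samples, n_total_in_bag, random_state=None):
        # Sketch of the trick discussed above: pick exactly n_total_in_bag
        # samples without replacement, in a single pass.
        rand = np.random.RandomState(random_state).rand(n_total_samples)
        sample_mask = np.zeros(n_total_samples, dtype=bool)
        n_bagged = 0
        for i in range(n_total_samples):
            # threshold = remaining to bag / remaining to consider
            if rand[i] * (n_total_samples - i) < (n_total_in_bag - n_bagged):
                sample_mask[i] = True
                n_bagged += 1
        return sample_mask

    mask = generate_sample_mask(10, 4, random_state=0)
    print(mask, mask.sum())   # always exactly 4 True values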

mask

Screen Shot 2019-11-30 at 12.23.51 PM.png

Here is a screenshot to demonstrate how np.array indexing with a mask works. As you can see, ndarrays support boolean masks as well as numeric index slicing. However, there is one operator, the "~" sign, which Python users might not come across very often. It is actually the invert operator.

Upsidedown-gi

yeah, this is exactly what I mean by inversion!

Jokes aside, "~" is the same as operator.__invert__. Regarding bitwise operations, you can find the definitions in the Python wiki's bitwise operations documentation here. "~x" is basically the same as -x-1, e.g. ~3 => -4 and ~-4 => 3.

You can also find more information about indexing from ndarray indexing documentation.
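
A tiny example of both ideas, boolean masking and the "~" inversion, in case the screenshot is hard to read:

    import numpy as np

    a = np.arange(5)                    # array([0, 1, 2, 3, 4])
    mask = np.array([True, False, True, False, True])

    print(a[mask])                      # [0 2 4]  -> keep masked elements
    print(a[~mask])                     # [1 3]    -> "~" inverts the boolean mask

    # on plain integers, ~x is the bitwise invert, i.e. -x - 1
    print(~3, ~-4)                      # -4 3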

sklearn.ensemble Gradient Boosting Tree _gb.py

After spending the previous few posts looking into decision trees, now is the time to see a few powerful ensemble methods built on top of them. One of the most widely used is the gradient boosting tree. You can read more about ensembles in the sklearn ensemble user guide; this post will focus on reading the source code of the gradient boosting implementation, which lives in sklearn/ensemble/_gb.py.

As introduced in the file docstring:

Gradient Boosted Regression Trees

This module contains methods for fitting gradient boosted regression trees for
both classification and regression.

The module structure is the following:

- The ``BaseGradientBoosting`` base class implements a common ``fit`` method
  for all the estimators in the module. Regression and classification
  only differ in the concrete ``LossFunction`` used.
- ``GradientBoostingClassifier`` implements gradient boosting for
  classification problems.
- ``GradientBoostingRegressor`` implements gradient boosting for
  regression problems.
Almost the first thousand lines of code are deprecated loss functions that were moved to _gb_losses.py, which we can skip for now.

Screen Shot 2019-11-28 at 2.39.47 PM.png

So let's start by reading the BaseGradientBoosting class. The vast majority of the inputs/attributes of BaseGradientBoosting are also inputs to the decision tree, and the ones that are not present in BaseDecisionTree are actually the key elements for understanding gradient boosting.

Screen Shot 2019-11-28 at 2.58.49 PM

Now let's take a look at the methods too.

Screen Shot 2019-11-28 at 3.02.55 PM

_check_params is the basic checker that makes sure the inputs/attributes are within a reasonable range and raises an error if not.

_init_state, _clear_state, _resize_state and _is_initialized are all related to the lifecycle of the fitted state. One thing to watch out for is that there are three key data structures storing that state: estimators_, train_score_ and oob_improvement_ (out of bag improvements). They all have the same number of rows as the number of estimators, in which each row stores the metrics related to one boosting stage, and for estimators_, each column stores the tree for that particular class.

Screen Shot 2019-11-28 at 3.09.24 PM

Screen Shot 2019-11-28 at 3.13.22 PM

All the methods except fit, apply and feature_importances_ are intended as internal methods not meant for end users, hence prefixed with a single underscore.

apply

Apply trees in the ensemble to X, return leaf indices

Screen Shot 2019-11-29 at 12.22.48 PM

_validate_X_predict is an internal method of the BaseDecisionTree class which checks the data types and shapes to ensure they are compatible. Then we create an ndarray, leaves, which has as many rows as the number of input samples. For each input row, there are as many entries as the estimators (stages) we have built, and finally, for each estimator, as many entries as there are classes.

The double for loop here basically iterates through all the estimators and all the classes and populates the leaves variable using the apply method of the underlying estimator.
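
To make the shape of that loop concrete, here is a hedged sketch in plain Python (my own illustration, not the exact method):

    import numpy as np

    def ensemble_apply(estimators_, X, n_classes):
        # estimators_ has shape (n_stages, n_classes); each cell is a fitted tree
        n_stages = estimators_.shape[0]
        leaves = np.zeros((X.shape[0], n_stages, n_classes))
        for i in range(n_stages):
            for j in range(n_classes):
                # each underlying decision tree has its own apply()
                leaves[:, i, j] = estimators_[i, j].apply(X, check_input=False)
        return leaves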

feature_importances_

Screen Shot 2019-11-29 at 12.29.15 PM.png

The feature_importances_ property first uses a double for loop inside a list comprehension to iterate through all the stages and all the trees built within each stage. This is probably the first time the term "stage" has shown up in this post. However, we briefly covered the estimators_ attribute above, in which each row represents one fitted estimator, so we can draw the conclusion that each stage corresponds to one boosting iteration. In fact, that is how boosting works, as indicated in the user guide:

“The train error at each iteration is stored in the train_score_ attribute of the gradient boosting model. The test error at each iterations can be obtained via the staged_predict method which returns a generator that yields the predictions at each stage. Plots like these can be used to determine the optimal number of trees (i.e. n_estimators) by early stopping. The plot on the right shows the feature importances which can be obtained via the feature_importances_ property.”

It is then literally the arithmetic average of the feature importances across all the relevant trees. Just as a friendly reminder, the compute_feature_importances method lives in _tree.pyx and is shown below:

Screen Shot 2019-11-29 at 12.41.21 PM.png
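
Conceptually, the property boils down to something like this (a simplified sketch of the averaging, not the exact sklearn code):

    def ensemble_feature_importances(estimators_):
        # collect every fitted tree from every stage, average their per-tree
        # importances element-wise, then normalize so they sum to 1
        all_trees = [tree for stage in estimators_ for tree in stage]
        total = sum(tree.tree_.compute_feature_importances(normalize=False)
                    for tree in all_trees)
        avg = total / len(all_trees)
        return avg / avg.sum()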

fit

The fit method is THE most important method of _gb.py, and it calls a lot of other internal methods, so we will first introduce two short helpers before we dive into its own implementation:

_raw_predict_init and _raw_predict

Screen Shot 2019-11-29 at 12.43.49 PM.png

These two methods are used jointly to generate the raw predictions given an input matrix X; the return value is also a matrix, with the same number of rows as the input matrix and the same number of columns as there are classes. The idea is very simple, but this part is the core of the boosting process, because boosting is in nature a constant iteration of making predictions, calculating the error, correcting it and doing it all again. Hence, predict_stages is implemented in high-performance Cython.
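
In spirit, the raw prediction is initialized from a prior and then each stage's trees are added on top, scaled by the learning rate. A hedged sketch in plain Python rather than Cython (the real code asks the loss function / init estimator for the prior):

    import numpy as np

    def raw_predict_sketch(estimators_, learning_rate, X, n_classes, prior=0.0):
        # start from some prior value, then let every stage's trees nudge the
        # prediction a little further, scaled by the learning rate
        raw_predictions = np.full((X.shape[0], n_classes), prior, dtype=float)
        for stage in estimators_:               # one boosting iteration (stage)
            for k, tree in enumerate(stage):    # one regression tree per class
                raw_predictions[:, k] += learning_rate * tree.predict(X)
        return raw_predictions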

Screen Shot 2019-11-29 at 12.49.48 PM

predict_stages iterates through all the estimators and all the classes and accumulates the predictions from each tree. This part of the code probably deserves its own post, but let's put a pin in it and stop at this Cython code, knowing that there is something complex going on here doing some predictions "really fast".

Screen Shot 2019-11-29 at 12.50.37 PM

Screen Shot 2019-11-29 at 12.50.29 PM

fit

The fit method starts by checking the inputs X and y. After that, it has some logic carving out the validation set, which is driven by a technique called stratified splitting.

You can find more about stratification and cross validation in this Stackoverflow question or directly in the user guide. Meanwhile, knowing that fit has some mechanism for splitting the data is all you need to understand the following code.
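
For reference, this is roughly what a stratified split looks like with sklearn's public API (just an illustration of the idea, not the internal code):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(10, 2)
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

    # with stratify=y, the class proportions in y are preserved in both splits,
    # which is the idea behind the validation carve-out in fit
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    print(np.bincount(y_train), np.bincount(y_val))   # class balance kept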

Screen Shot 2019-11-29 at 2.21.58 PM.png

_fit_stages

Screen Shot 2019-11-29 at 2.30.41 PM.png

_fit_stage

Screen Shot 2019-11-29 at 2.36.29 PM.png

Gradient Boosting Classifier and Regressor

Once the BaseGradientBoosting class has been introduced, it is extended slightly into the classifier and regressor that end users can utilize directly.

The main difference between the two is the loss functions being used. For the regressor, it is ls, lad, huber or quantile, while the loss function for the classifier is either the deviance or the exponential loss.
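
For example, a minimal usage sketch with the loss options mentioned above (parameter values are just illustrative, and the loss names are the ones from the 2019-era API; newer sklearn versions have renamed some of them):

    from sklearn.datasets import make_classification, make_regression
    from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

    X_clf, y_clf = make_classification(n_samples=200, random_state=0)
    clf = GradientBoostingClassifier(loss='deviance', n_estimators=100,
                                     learning_rate=0.1).fit(X_clf, y_clf)

    X_reg, y_reg = make_regression(n_samples=200, random_state=0)
    reg = GradientBoostingRegressor(loss='huber', n_estimators=100,
                                    learning_rate=0.1).fit(X_reg, y_reg)

    print(clf.score(X_clf, y_clf), reg.score(X_reg, y_reg))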

Screen Shot 2019-11-29 at 2.43.38 PM

Screen Shot 2019-11-29 at 2.43.28 PM

Now we have covered pretty much the whole _gb.py: how the relevant classes related to gradient boosting are implemented. At the same time, we accrued a fair amount of technical debt, which I list here as a few deeper dives in case readers are interested.

  • loss functions
  • cross validation – stratify
  • Cython in place predict

Python Remove Comment – Tokenize

Today, while I was doing some code review, I wanted to gauge the amount of effort by estimating how many lines of code there are. For example, if you are at the root folder of some Python library, like flask, you can easily count the number of lines in each file:

(python37) $ wc -l flask/*
      60 flask/__init__.py
      15 flask/__main__.py
     145 flask/_compat.py
    2450 flask/app.py
     569 flask/blueprints.py
     ...
      65 flask/signals.py
     137 flask/wrappers.py

    7703 total

However, when you open up one of the files, you realize the vast majority of the content is either docstrings or comments, and the code review isn't quite as intimidating as it looks at first glance.

Then you ask yourself: how do you strip out the comments and docstrings and count the effective lines of code? I didn't manage to find a satisfying answer on Stackoverflow but came across this little gist on GitHub by BroHui.

At the beginning, I was thinking of an approach based on basic string manipulation like regular expressions, but the author instead leverages the built-in libraries to take advantage of lexical analysis. I had actually never used these two libraries, token and tokenize, before, so it turned out to be a great learning experience.

First, let’s take a look at what a token is.

TokenInfo(type=1 (NAME), string='import', start=(16, 0), end=(16, 6), line='import requests\n')
TokenInfo(type=1 (NAME), string='requests', start=(16, 7), end=(16, 15), line='import requests\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(16, 15), end=(16, 16), line='import requests\n')

For example, one line of Python import code gets parsed and broken down into different words/tokens. Each TokenInfo not only contains the basic token type, but also the physical start/end location of the token with row and column counts.
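
You can reproduce output like the above yourself with the standard tokenize module:

    import io
    import tokenize

    source = "import requests\n"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tok)
    # TokenInfo(type=1 (NAME), string='import', start=(1, 0), end=(1, 6), ...)
    # ...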

After understanding tokenization, it isn't too hard to draw the connection between identifying comments and docstrings and dealing with them. For a comment, it is pretty straightforward: we can identify it by the token type COMMENT (numeric value 55 in this Python version). A docstring is actually a string sitting on its own line(s) of code without any other elements other than indentation.

Keep in mind that we are walking through tokens one by one, so you really need to write back the original content of everything you keep.

Frankly speaking, I cannot quite wrap my head around the flags the author uses to keep track of the previous token and the first two if-statement cases. However, I don't think that matters that much, so let's take note of it and focus on the application.
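
Before looking at the sample, here is a minimal sketch of the overall approach (my own simplification of the gist, not the author's exact code; the function name is made up):

    import io
    import tokenize

    def strip_comments_and_docstrings(source):
        # hedged sketch: drop COMMENT tokens, plus STRING tokens that show up
        # right after an INDENT/NEWLINE, i.e. strings standing on their own
        # line, which is how docstrings look to the tokenizer
        out = []
        prev_toktype = tokenize.INDENT
        last_lineno, last_col = -1, 0
        for toktype, ttext, (slineno, scol), (elineno, ecol), _ in \
                tokenize.generate_tokens(io.StringIO(source).readline):
            if slineno > last_lineno:
                last_col = 0
            if scol > last_col:
                out.append(" " * (scol - last_col))   # keep original spacing
            if toktype == tokenize.COMMENT:
                pass                                  # drop comments
            elif toktype == tokenize.STRING and prev_toktype in (
                    tokenize.INDENT, tokenize.NEWLINE):
                pass                                  # drop (likely) docstrings
            else:
                out.append(ttext)
            prev_toktype = toktype
            last_lineno, last_col = elineno, ecol
        return "".join(out)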

Here I created a small code sample with test docstrings and comments in blue.

Screen Shot 2019-11-26 at 10.23.20 PM.png

This is the output of tokenization, and I also highlighted the lines that interest us.

Screen Shot 2019-11-26 at 10.24.46 PM

This is the final output after the parsing. However, you might want to completely remove the comment lines, or even make the result more compact by removing blank lines. We can modify the code above by replacing mod.write with pass, and also identify "NL" tokens and remove them completely.

Screen Shot 2019-11-26 at 10.27.10 PM

AWS – RESTful API – Part II API Gateway

This is part II of building a RESTful API using AWS Lambda and API Gateway. If you have not read the first post, check it out here. In this post, I will share how to use the Lambda function that we created in the previous post and expose it as a publicly visible endpoint.

Amazon API Gateway is an AWS service for creating, publishing, maintaining, monitoring, and securing REST and WebSocket APIs at any scale

You can refer to the AWS API Gateway user guide here; it is very good content from the AWS developers.

Create

First of all, we need to create an API and give it a name. The default REST type is a good fit but, as you can see, API Gateway also supports WebSocket when you need APIs for streaming near-real-time data such as video or audio. When you create an API, you can also import it from different sources.

Screen Shot 2019-11-23 at 8.12.37 PM

 

Resource & Method

After you create an API, the next step is to create resources and methods attached to it. The resource is pretty much the suffix of the API URL; in this case we will use "exist" so that the user knows we are checking whether the keyword exists or not. Then we can pick an HTTP method. In this case, we will use HTTP POST and require the user to submit the URL and KEYWORD in the request body. Theoretically, we don't have to use POST, and many other methods including GET would be sufficient.

Attached is a screenshot of creating a GET method under exist on top of an existing Lambda function; it is as easy as clicking the Lambda Function dropdown, which will pick up the one we created in the previous post.

Screen Shot 2019-11-23 at 8.15.41 PM

 

Test and Deploy

After we create the method, API Gateway generates this diagram in which the request flow is displayed visually. I assume that as more components get integrated into API Gateway, this becomes even more intuitive and helpful to developers.

Screen Shot 2019-11-23 at 8.13.12 PM

In the Client block, there is a lightning icon that you can click to test the API as a client. I edited the request body so it contains a sample user input, a JSON object that contains the URL and keyword. After clicking test, it immediately came back with the right response.

Frankly speaking, it looks easy when it works. However, it did take me a while to get this part ready, because I was not sure which method to use and, at the same time, whether I should use query strings or the request body. During my exploration, the logging on the bottom right was super helpful; by putting more logic into your logging and error handling when you develop the Lambda handler, this integration should not take long.

My Lambda function has been working pretty well. It took me a few rounds of modifying the original code, rezipping the environment and re-uploading it using the AWS CLI.

 

Screen Shot 2019-11-23 at 8.17.59 PM

Once my API is working, it is still not available to the public until you deploy it. There are several stages you can set up, like alpha, beta, QA or production, whatever you prefer. Then it will be available to the public and you can invoke it from anywhere.

The default is that you don't need authorization. However, in a production environment, you will need authentication and authorization using tokens or other mechanisms to make sure your API is protected. Not only because you don't want your service to be abused by unintended users, but also to restrict the audience so that your usage won't spike because of bots, which would directly lead to a spike in cost for you too.

Screen Shot 2019-11-23 at 8.18.47 PM

 

Client Test

After deploying it, I can make an API call from my local terminal using curl.

Screen Shot 2019-11-23 at 8.24.55 PM
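
Equivalently, here is a hedged sketch of the same call from Python (the invoke URL, stage name and JSON field names are placeholders for illustration, not the real endpoint):

    import requests

    # placeholder invoke URL: https://<api-id>.execute-api.<region>.amazonaws.com/<stage>/exist
    endpoint = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/exist"
    payload = {"url": "https://example.com", "keyword": "lambda"}

    resp = requests.post(endpoint, json=payload, timeout=10)
    print(resp.status_code, resp.text)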

Now you have a public API up and running!

However, to fully explore the capability of Lambda and API Gateway, I did a stress test by making distributed API calls without caching.

I was using the library grequests as it claims to do distributed parallelism in a truly asynchronous way. In the end, the performance was not disappointing and the latency was never an issue. To be honest, I am not sure I did this part right: theoretically, my personal blog should also have been flooded by this test, but somehow I did not see that traffic at all. I was wondering whether the Lambda function did not get fully executed. Also, grequests won't display the response body, only the response status, which is a bit mysterious.

(The following code takes a paragraph of my blog post, splits it into words, then extends the list 10 times so it has a lot of words in total. Each element then triggers a request.)
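
A hedged sketch of that test script (not the exact code; the endpoint and field names are placeholders):

    import grequests

    endpoint = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/exist"

    paragraph = "as the cloud providers are coming up with more and more services ..."
    words = paragraph.split() * 10            # extend the word list 10 times

    # build one POST request per word, then fire them off concurrently
    reqs = (grequests.post(endpoint, json={"url": "https://example.com", "keyword": w})
            for w in words)
    responses = grequests.map(reqs, size=20)  # size caps the concurrency
    print([r.status_code if r is not None else None for r in responses])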

Screen Shot 2019-11-23 at 8.23.35 PM

In the end, I logged into the dashboard and saw that the usage spiked by thousands of requests, which definitely came from this testing script.

Screen Shot 2019-11-23 at 8.26.05 PM

Pricing

AWS Lambda pricing

AWS API Gateway Pricing 

Both of these services use a usage-based pricing model, usually a few pennies per million calls. So you should probably take the pricing model into consideration when designing your application. For example, if you have a large volume of traffic, consider batching small API calls instead of making many of them, or consolidating APIs so that one call does more of the computation, etc.

Don’t forget to delete your development environment on AWS to make sure you don’t get charged afterward.

Conclusion

It was a great experience for me to try out AWS Lambda and API Gateway: super straightforward to use, and I never had to worry about anything at the OS level or below. At the same time, it is a great alternative for controlling the operating cost of an IT project, as your project will cost virtually nothing until the usage catches up. Also, it forces you to focus your attention on development, on the "what" rather than the "how". I know that AWS probably has an amazing SLA for all its services, and by using managed services rather than doing it yourself, you get more time/reason to unit test and battle test your own product rather than "don't make a problem until it is a problem", because I know many teams are cautious about doing battle testing / chaos monkey on their own house of cards 🙂

AWS – RESTful API – Part I Lambda

Introduction

As the cloud providers come up with more and more services, the life of a software developer just becomes easier and easier. Getting something up and running usually requires expertise in many technical areas that one individual can hardly cover alone. Now that much of the groundwork is packaged and handled well by the cloud, a small team or even an individual can focus on the "coding" rather than the "administration". In this post, I will document my first experience of leveraging AWS Lambda and API Gateway to build a RESTful API, using Python.

As a Python developer, the usual go-to solution is to prototype an API using frameworks like Django or Flask. However, getting something up and running on your laptop is not sufficient. In order for any API to be online, there are many requirements: it needs to be hosted in an environment outside your laptop, with logging and auditing, monitoring, load balancing, authentication and authorization, and even auto scaling. It is hard work and sometimes, actually most of the time, not very exciting. I certainly see more people who prefer developing new features during the day over keeping the lights on during the night. Cloud providers are here to facilitate developers and take that groundwork away.

In this little project, I am planning to build an API where the client can submit a URL and check whether a certain keyword exists on that page or not. Of course, you can find most of the instructions just by reading AWS Lambda's user guide.

Hello World

First, you can start by logging into your AWS console and navigate to the page for Lambda.

Screen Shot 2019-11-23 at 6.38.15 PM

The Lambda console is extremely clean, just like the service itself :). The easiest place to get started is to create a function using one of its blueprints. The sample functions range from easy ones like an echo API to complex APIs with hundreds of lines of code using machine learning. Hello-World-Python is a great starting point.

Screen Shot 2019-11-23 at 6.41.23 PM

If you have used Flask or Django before, the resemblance is uncanny. The event object in the handler is very much like the request object from Flask: it stores the payload.

One thing worth highlighting is that AWS Lambda supports not only Python but many other languages; it even supports several versions of Python, from classic 2.7 to 3.8 as of the date this post was written. Here is another blueprint using Node.js.

Screen Shot 2019-11-23 at 6.49.58 PM.png

After creating a handler, you can submit it and you are almost ready to go.

Screen Shot 2019-11-23 at 6.52.45 PM

Immediately, you will have a Lambda function, along with Amazon CloudWatch Logs. Your print statements, or even highly customized logging events later on, will be stored there for debugging purposes. At the bottom of the console, there is an embedded IDE in which you can do some basic development. In order to "run" your code, you can create a test case by passing in some data to test your Lambda function.

Screen Shot 2019-11-23 at 6.53.45 PM

And you can check out the test result to make sure everything is fine.

Screen Shot 2019-11-23 at 6.59.22 PM

Virtual Env

For our use case, it won't be as simple as hello world, as we will need some 3rd-party libraries, like requests to scrape the user-submitted site and maybe a parser like beautifulsoup. Each programming language has its own way of handling dependencies; the Python go-to solution is usually to create a dedicated environment for your application, so you can pinpoint exactly which libraries you end up using, and then figure out a way to ship that whole environment or its requirements to a new environment. AWS took this to heart and provides a good solution based on virtualenv. You can find the detailed instructions here. You basically follow these steps:

  1. create a virtual environment
  2. develop the function there with necessary libraries installed
  3. zip the libraries along with the python function you wrote
  4. submit to lambda

My code is super simple.

Screen Shot 2019-11-23 at 7.09.28 PM.png
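
For readers who cannot see the screenshot, a handler along these lines would do the job (a hedged reconstruction, not my exact code; the event field names are assumptions):

    import json
    import requests

    def lambda_handler(event, context):
        # assumed payload shape: {"url": "...", "keyword": "..."}
        body = json.loads(event["body"]) if isinstance(event.get("body"), str) else event
        url = body["url"]
        keyword = body["keyword"]

        page = requests.get(url, timeout=10).text
        return {
            "statusCode": 200,
            "body": json.dumps({"url": url, "keyword": keyword,
                                "exist": keyword in page}),
        }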

And in the end, the zip file (function.zip) is only 4.7MB, so we can upload it as-is. However, if your environment is a bit larger, for example if you use some heavy-duty libraries like sklearn or tensorflow, you can easily exceed the 50MB limit, in which case you have to store it in S3 first.

Screen Shot 2019-11-23 at 7.14.16 PM

There are still a few configurations that you can make in the Lambda function console:

  • concurrency
  • time out in seconds
  • audit using cloudtrail
  • memory usage
  • error handling

However, you don't have to touch any of these if you don't want to change the defaults.

AWS CLI

Again, whenever you use your mouse, you know that next time you will still have to do it by hand. For example, if you are developing future versions of the same API or even want to create more Lambda functions, logging into the console might become a bit tedious and error-prone. After a few times, you can script your workflow using the AWS CLI from the command line, or even using Python with boto3, to achieve the same goal.

The installation is pretty trivial and the setup is also a one-time thing. The first time you use the AWS CLI, you do have to copy-paste a few commands from the tutorial or tinker with the command line help to get all the arguments right. However, like programming, once you get it right once, the next time is just cookie cutting and can also be automated if needed.
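
For example, re-uploading a freshly zipped environment can be scripted with boto3 like this (the function name and file path are placeholders):

    import boto3

    lambda_client = boto3.client("lambda")

    # push a new zip (libraries + handler) to an existing Lambda function
    with open("function.zip", "rb") as f:
        lambda_client.update_function_code(
            FunctionName="keyword-exist",   # placeholder function name
            ZipFile=f.read(),
        )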

Conclusion

Now we have a Lambda function just like that; however, Lambda by itself is not necessarily a web service yet. Lambda can be integrated with many other components within the AWS ecosystem, and in the next post, we will use AWS API Gateway to put a wrapper on top of it so it becomes an endpoint that is visible to the public.

 

 

 

Best First Tree Builder

In the _tree.pyx file, there is a very interesting implementation of a type of TreeBuilder called the BestFirstTreeBuilder. The __cinit__ is mostly boilerplate, but all the intelligence actually sits in the build method.

In the BestFirstTreeBuilder, they use the PriorityHeap data structure to decide which node of the binary tree to expand next. There are two key variables, max_split_nodes and max_leaf_nodes, which are used to declare the initial capacity.

Screen Shot 2019-10-27 at 9.52.43 AM

In a complete binary tree, if the root is at depth 0, then at depth n you will have 2^n leaves at the bottom of the tree, and all the nodes above them will be split nodes. Those split nodes number 1 + 2 + 4 + ... = 2^0 + 2^1 + 2^2 + ... + 2^(n-1) = 2^n - 1 in total. That is why max_split_nodes = max_leaf_nodes - 1 and the initial capacity is the sum of the two.

Screen Shot 2019-10-27 at 9.58.24 AM
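
A quick sanity check of that arithmetic:

    # for a complete binary tree whose bottom level holds max_leaf_nodes leaves,
    # the number of internal (split) nodes is max_leaf_nodes - 1
    for max_leaf_nodes in (1, 2, 4, 8, 16):
        max_split_nodes = max_leaf_nodes - 1
        initial_capacity = max_split_nodes + max_leaf_nodes
        print(max_leaf_nodes, max_split_nodes, initial_capacity)   # e.g. 8 7 15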

Then the rest of the build method will recursively build the tree following the “priority”.

Screen Shot 2019-10-27 at 10.10.30 AM

There are two methods which we will cover here. self._add_split_node will compute the split node and _add_to_frontier will add the split to the frontier.

 

 

Pickle

Here is a bit more about pickle and how you can customize the pickling process: use __getstate__ to customize what gets written by pickle.dumps and use __setstate__ to customize how the object is restored by pickle.loads.
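
A minimal sketch of the two hooks (illustrative only):

    import pickle

    class Model:
        def __init__(self):
            self.weights = [1, 2, 3]
            self._cache = {"expensive": "stuff"}   # not worth pickling

        def __getstate__(self):
            # called on pickle.dumps: choose what goes into the byte stream
            state = self.__dict__.copy()
            state.pop("_cache")
            return state

        def __setstate__(self, state):
            # called on pickle.loads: rebuild the object from that state
            self.__dict__.update(state)
            self._cache = {}

    m2 = pickle.loads(pickle.dumps(Model()))
    print(m2.weights, m2._cache)   # [1, 2, 3] {}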

It is not recommended but you can also use __reduce__ if you know what you are doing.

__getstate__ and __setstate__

Screen Shot 2019-10-27 at 11.00.40 PM

__reduce__

Screen Shot 2019-10-27 at 11.18.30 PM

cdef class Tree

Array-based representation of a binary decision tree.
    The binary tree is represented as a number of parallel arrays. The i-th
    element of each array holds information about the node `i`. Node 0 is the
    tree’s root. You can find a detailed description of all arrays in
    `_tree.pxd`. NOTE: Some of the arrays only apply to either leaves or split
    nodes, resp. In this case the values of nodes of the other type are
    arbitrary!

For a tree, it certainly has a few non-array attributes like node_count, capacity and max_depth. Other than those three, there are 8 attributes that are arrays, and they all have node_count elements (a small inspection example follows this list).

1. children_left

each element stores the node id of the left child of the i-th node; if the i-th node doesn't have any children, it holds the value TREE_LEAF = -1. By convention, the left child always holds the samples whose feature values are below the threshold. Also, every child has a node id greater than that of its parent split node, so children_left[i] > i.

2. children_right

very similar to children_left

3. feature

feature[i] holds the feature to split on for the internal node i -> Key

4. threshold

threshold[i] holds the threshold for the internal node i -> Value

5. value

value is a bit more complex; it has the shape [node_count, n_outputs, max_n_classes]. The documentation says it "Contains the constant prediction value of each node".

6. impurity:

impurity at node i

7. n_node_samples:

holds the number of training samples reaching node i

8. weighted_n_node_samples:

holds the weighted number of training samples reaching node i
Screen Shot 2019-10-27 at 5.06.52 PM

Screen Shot 2019-10-27 at 5.13.23 PM
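
To see these parallel arrays from Python, you can fit a small tree and poke at its tree_ attribute (an illustrative sketch):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y).tree_

    print(tree.node_count, tree.max_depth)
    print(tree.children_left)      # node id of each node's left child (-1 = leaf)
    print(tree.children_right)
    print(tree.feature)            # feature index used to split at each node
    print(tree.threshold)          # split threshold at each node
    print(tree.n_node_samples)     # training samples reaching each node
    print(tree.value.shape)        # (node_count, n_outputs, max_n_classes)
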
In the Tree class, there are not that many methods that are meant to be used by end users, and the sklearn developers achieved this by making most methods available only from Cython, declaring them with cdef. However, there are still a few methods defined with cpdef which should be familiar to most sklearn users.
1. predict(X)
Screen Shot 2019-10-27 at 8.38.47 PM
2. apply(X)
Screen Shot 2019-10-27 at 5.25.46 PM
3. decision_path(X)
The decision_path code is very similar to the other methods: it loops through all the samples and, for each sample, populates the outcome. However, within decision_path, it uses indptr and indices, as well as pointers to these two arrays, indptr_ptr and indices_ptr, to address the data.
Screen Shot 2019-10-27 at 8.56.10 PM
If you are confused by indices, indptr and data, don't worry, these are the key variables of the CSR (compressed sparse row) matrix in Scipy (see the small CSR example after this list).

Instead of reading the official documentation, you can look at this great Stackoverflow question. Here is a screenshot provided by user Tanguy explaining how those variables fit together.
csr.png
4. compute_feature_importances()
Screen Shot 2019-10-27 at 5.17.53 PM
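
To make indices, indptr and data concrete, as promised above, here is a tiny CSR example:

    import numpy as np
    from scipy.sparse import csr_matrix

    dense = np.array([[1, 0, 2],
                      [0, 0, 3],
                      [4, 5, 0]])
    sparse = csr_matrix(dense)

    print(sparse.data)     # [1 2 3 4 5]   non-zero values, row by row
    print(sparse.indices)  # [0 2 2 0 1]   column index of each value
    print(sparse.indptr)   # [0 2 3 5]     row i's values live in data[indptr[i]:indptr[i+1]]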

SKLEARN SOURCE CODE TREE – PART II

This post dives into the sklearn decision tree building process.

The node_split method within the BestSplitter class is probably the most essential implementation, I would say, within the whole decision tree module. Not only does it entail the implementation of the splitting process, it also demonstrates a myriad of core computer science concepts. Now let's go through these 200 LOC and enjoy.

Screen Shot 2019-10-14 at 11.30.47 PM

As the method docstring says, it is meant to "Find the best split on node samples[start:end]". The variable samples stores all the samples, so node_split is not necessarily running against all the samples at every split. During each split, it only targets the start:end range, which will get reorganized later.

Meanwhile, at each node_split call, the inputs include not only the latest impurity; there is also a pointer, split, to the split record, and last but not least, for efficiency purposes, the splitter keeps track of the number of constant_features so we can speed up the splitting by avoiding constant features, hence the comments below:

        # Sample up to max_features without replacement using a
        # Fisher-Yates-based algorithm (using the local variables `f_i` and
        # `f_j` to compute a permutation of the `features` array).
        #
        # Skip the CPU intensive evaluation of the impurity criterion for
        # features that were already detected as constant (hence not suitable
        # for good splitting) by ancestor nodes and save the information on
        # newly discovered constant features to spare computation on descendant
        # nodes.

First, let’s look at the outermost loop and its condition:

  1. f_i > n_total_constants
  2. n_visited_features < max_features
  3. n_visited_features <= n_found_constants + n_drawn_constants

condition 1 and (condition 2 or condition 3)

max_features is an input variable that can be changed by the user, defaulting to the total number of features. When it is a float, a corresponding fraction of all the features is considered at each split; when it is an int, it is a hard-coded number of features.

The implementation takes constant features into account and keeps several flags/counters to track them in order to improve performance. However, that might not help the reader, so we will assume none of the columns are constant, meaning any constant-related counter is zero.

In that case, the while loop condition becomes (f_i > 0 and n_visited_features < max_features).

n_visited_features is initialized to zero and increases by 1 at each loop iteration.

f_i is initialized to n_features outside the loop, and at each iteration f_j is randomly drawn from (n_drawn_constants, f_i - n_found_constants), which reduces to (0, f_i) when there are no constants. Of course, if we do have constants, f_j is meant to randomly select a feature that is neither already drawn nor a known constant.

When a feature is drawn, the code checks what kind of feature it is and updates the related counter accordingly.

For example, as this is random sampling without replacement, a feature that has already been drawn should not be drawn again; hence f_j has to be a random integer greater than n_drawn_constants, i.e. f_j > n_drawn_constants.

At the same time, if f_j is smaller than n_known_constants, which means we already know it is a constant even though it has not been drawn yet (a cache), then we mark this constant column as drawn by swapping columns, appending it to the end of the drawn_constants range and incrementing the n_drawn_constants counter by one.

Otherwise, we don't know yet whether it is constant; we update f_j by adding the number of found constants so that f_j now points to the f_j-th not-yet-identified column, and we use it as the current feature.

Once we select a feature, we copy the column values to a new array Xf and sort the values. After sorting, we check whether the column is close enough to being constant by comparing the min Xf[start] and the max Xf[end-1] and seeing if the difference is smaller than FEATURE_THRESHOLD. If so, we increment n_found_constants by 1 and n_total_constants by 1.

And finally, if it is not a known constant, or if it is newly drawn and not constant, we go and look for the split points, decreasing f_i by 1 and swapping the values of features f_i and f_j. In this way, the randomly selected features always pile up at the end of the list, slowly building up toward the start of the queue.
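
Ignoring the constant-feature bookkeeping, the sampling-without-replacement part boils down to a Fisher-Yates style shuffle over the feature indices (a simplified sketch, not the Cython code):

    import numpy as np

    def sample_features(n_features, max_features, random_state=0):
        # simplified Fisher-Yates draw: repeatedly pick a random index from the
        # not-yet-drawn prefix [0, f_i) and swap it to the end of that prefix
        rng = np.random.RandomState(random_state)
        features = np.arange(n_features)
        f_i = n_features
        drawn = []
        while f_i > 0 and len(drawn) < max_features:
            f_j = rng.randint(0, f_i)               # random not-yet-drawn feature
            f_i -= 1
            features[f_i], features[f_j] = features[f_j], features[f_i]
            drawn.append(features[f_i])             # drawn features pile up at the end
        return drawn

    print(sample_features(10, 4))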

Screen Shot 2019-10-14 at 10.55.54 PMScreen Shot 2019-10-14 at 11.10.55 PM

When the feature is properly sampled, they first initialize the position variable p to start, the index of the first sample, and then go through each record. In addition, when moving on to the next record, they use FEATURE_THRESHOLD to make sure the next value is not numerically equal; if it is, they skip ahead to the next record that is materially different.

When sklearn is trying to find a split point over a continuous range, there is a very important user input called min_samples_leaf, which defaults to 1. When you build a decision tree, you can theoretically train your model to 100% accuracy on the training data by creating a tree in which each unique record falls into its own path. Your model will for sure be big, and at the same time you might be cursed with overfitting, so that at prediction time the accuracy is drastically lower. To avoid that, min_samples_leaf can be used to make sure you only create a split point when each child holds at least min_samples_leaf samples. Here is the official documentation.

Screen Shot 2019-10-26 at 10.38.57 PM

The same idea applies to min_weight_leaf, where instead of a sample count, the requirement applies to the sum of the sample weights.

After these two checks, they calculate the criterion improvement and compare it against the running best until they have looped through all positions, and they save the best position as the split.

Screen Shot 2019-10-14 at 10.39.45 PM

After they find the best split point, they "reorganize" and actually split the samples using it. Keep in mind that X still holds the raw samples and is NOT sorted. The loop while p (:= start) < partition_end (:= end) is an interesting way of splitting the data into two subgroups in which the left part of the array is always smaller than the threshold and the right part is always larger.

Here is a small Python script to demonstrate how it works, just to clarify, the data is still not completely sorted.

Screen Shot 2019-10-26 at 11.05.31 PM.png
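
Since screenshots are hard to read, here is a sketch along the same lines (my own reconstruction of the idea, not the original script): values below the threshold end up on the left, the rest on the right, but neither side is fully sorted.

    def partition(values, threshold, start, end):
        # two-pointer partition: walk p from the left and partition_end from the
        # right, swapping any value >= threshold over to the right-hand side
        p, partition_end = start, end
        while p < partition_end:
            if values[p] < threshold:
                p += 1
            else:
                partition_end -= 1
                values[p], values[partition_end] = values[partition_end], values[p]
        return values

    print(partition([7, 1, 5, 2, 9, 3, 8], threshold=5, start=0, end=7))
    # [3, 1, 2, 9, 5, 8, 7]  -> left part < 5, right part >= 5, neither sorted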

Screen Shot 2019-10-14 at 11.19.24 PM

After the "reorganization", the memory of constant features is copied over and the best split point is returned.

Screen Shot 2019-10-26 at 11.15.35 PM.png

In summary, in this post we have covered the node_split method within the BestSplitter class. We looked at how it uses Fisher-Yates to randomly sample the features, how it finds the best splitting point for a given feature, and finally how it cleans everything up and passes the result on to the children.

 

sklearn Source Code tree – Part I

This post, actually this upcoming series of posts, will be focused on gaining more knowledge of the exact implementation of sklearn: not only how the algorithms are implemented in depth, but also the best practices and style of one of the most popular Python libraries, or even machine learning libraries, out there. Today's focus is _tree.pyx.

To really dive into the details of trees, one has to be familiar with the underlying data structures used to implement a decision tree, or a tree in general. Under _utils.pxd, you can easily find the declarations of two key data structures, Stack and PriorityHeap, and their relevant atomic units, StackRecord and PriorityHeapRecord.

Screen Shot 2019-10-13 at 9.59.39 PM

No surprise, Stack is the commonly used data structure that supports FILO logic, hence the push and pop methods. Each record is actually fairly interesting, and we will explain later what each attribute is used for.

After that, there is another data structure called PriorityHeap.

Screen Shot 2019-10-13 at 10.07.12 PM

I came across a great post about priority queues backed by a binary heap; you can find that more in-depth reading and a Python implementation here. At a very high level, the regular Stack is used for the depth-first tree builder and the PriorityHeap is used for the best-first tree builder. In the ideal world, both tree builders lead to the same final tree, but one learns faster and is usually preferred when we need to cut the learning process off early with pruning (like decision stumps in GBM). To simplify, we will start by focusing on the depth-first tree builder.

Now let’s switch our eyeballs to the _tree.pxd.

Like StackRecord, the atomic unit of a tree is the Node, and each node is made of its left and right children (identified by their IDs), the split feature, the threshold, the impurity gain during the split, and a few other fields.

Then let's take a look at the Tree class's attributes. Node* and double* are the two pointers/arrays that store the true content of a decision tree.

Screen Shot 2019-10-13 at 10.23.34 PM

Now that we have skimmed through the basic data structures, let's switch to the _tree.pyx implementation and take a look.

Screen Shot 2019-10-13 at 10.42.51 PM

The whole _tree.pyx isn't that complex, only ~1600 LOC, and if we are only interested in the Tree class implementation and the easiest tree builder, DepthFirstTreeBuilder, we only need to read a few hundred lines of code. So let's get started.

At the beginning, they first declare a TreeBuilder class as the basic interface, which further gets extended into different types of tree builders (depth-first or best-first). It only has an internal method _check_input to ensure the data is contiguous.

Screen Shot 2019-10-13 at 10.29.41 PM.png

Across the whole implementation, there are numerous places that, for performance reasons, make calls to compressed sparse matrix routines and the like. Those functions play a pivotal role in making a Python library fast enough, but they deserve a dedicated series of their own and are less relevant to the tree implementation, so we will skip them for now.

Screen Shot 2019-10-13 at 10.51.05 PM

The constructor of DepthFirstTreeBuilder includes several key parameters for building a tree. Splitter is one of the various splitter implementations, which we will cover later. Now let's go through the build method and see how each attribute drives the building process.

max_depth determines the maximum depth of the decision tree. As a decision tree is a binary tree, when it is complete the number of nodes grows exponentially. For example, if you have 1 level, there is 1 node in total, which is the root; with 2 levels, you have 3 nodes; with 3 levels, 7 nodes; and with N levels, you will need 1 + 2 + 4 + ... = 2^0 + 2^1 + 2^2 + ... + 2^(N-1) = 2^N - 1 nodes in total.

And as you can tell from the first few steps of the build method, that is exactly how max_depth is used.

Screen Shot 2019-10-13 at 10.56.29 PM.png

The next steps will be actually building the tree by working on the popped record and pushing its two children iteratively.

Screen Shot 2019-10-13 at 11.07.57 PM

As you can tell from the code, the stack pops each record and replaces it with its two children, if any. That is also the reason why the stack has the size INITIAL_STACK_SIZE, which is 10, the same as the depth of the initial tree capacity. In this way, it will first build/traverse the leftmost branch and then, bottom up, slowly transition to the right and traverse the whole tree with only a stack of 10 records.
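
Stripping away all the splitting details, the control flow of that depth-first build loop looks roughly like this (a conceptual sketch, not the Cython code):

    def depth_first_build(root_range, split, max_depth):
        # each stack record describes a node waiting to be processed:
        # (start, end, depth); popping a record may push its two children
        stack = [root_range + (0,)]                   # (start, end, depth) of the root
        while stack:
            start, end, depth = stack.pop()
            is_leaf = depth >= max_depth or end - start <= 1
            print("node", (start, end, depth), "leaf" if is_leaf else "split")
            if not is_leaf:
                pos = split(start, end)               # where the splitter cuts the range
                stack.append((pos, end, depth + 1))   # right child
                stack.append((start, pos, depth + 1)) # left child, popped first (LIFO)

    # toy run: always split the sample range in half
    depth_first_build((0, 8), split=lambda s, e: (s + e) // 2, max_depth=2)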

Now, let's take a look at how the splitter gets called in the depth-first tree building process.

Screen Shot 2019-10-13 at 11.36.10 PM

In the next post, we will spend more time looking into the node_split method and tree._add_node method to further understand the tree building details.