StyleGAN2 training speed-up – minibatch, num_gpus

The training of StyleGAN2 arguably defeats the purpose of open sourcing it: it requires so much computing power that regular folks won't come close to training their own models the way they are meant to be trained.

You can read about this on the README page of StyleGAN2's GitHub project: training usually requires several high-performance GPUs, like the Tesla V100, running for several days. And even for the limited audience who happen to have access to high-performance machines like the Nvidia DGX-2, the repository can be a little daunting, because there are still thousands of lines of TensorFlow code that you probably need a few days to dig into.

In this exercise, I will share my experience training some images on a DGX: how it went, and what I did to improve the throughput.

There is one thing I noticed when I ran the training job. The fault is that only eight out of the 16 GPUs are being used, and for those eight GPUs, the memory usage is only around 9 GB out of 32 GB each. That is a problem: each active GPU is using less than a third of its total memory, and only half of the GPUs are working at all, which comes out to an overall memory utilization of roughly 14%.

[Screenshot: GPU status showing 8 of 16 GPUs in use, each at roughly 9 GB of 32 GB memory]

Looking into the main function of run_training.py, we realize there isn't really much configuration the author meant for you to change; the only obvious knob is the number of GPUs, and in the code it is limited to 1, 2, 4 and 8.
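
For reference, this is how a training run is launched, adapted from the repository's README (config-f being the full StyleGAN2 configuration):

    python run_training.py --num-gpus=8 --data-dir=~/datasets --config=config-f \
        --dataset=ffhq --mirror-augment=true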

Digging further into the code, we see several parameters that look interesting and promising to tinker with. For example, anything related to minibatches is probably a good starting point. In this case, I changed two parameters, minibatch_size_base and minibatch_gpu_base, which control how many images to load for a minibatch and how many samples are processed by one GPU at a time, respectively. To be frank, I increased the values without really knowing what was going on behind the scenes (minibatch_size_base from 32 to 128 and minibatch_gpu_base from 4 to 8).
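
The change itself is small. In my copy these defaults are set in run_training.py; below is a sketch of the edited lines (surrounding code elided, and the exact location may differ between revisions of the repository):

    # run_training.py -- scheduling defaults (sketch of the edit)
    sched.minibatch_size_base = 128  # default 32: images per minibatch across all GPUs
    sched.minibatch_gpu_base = 8     # default 4: samples processed by one GPU at a time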

[Screenshot: the edited minibatch_size_base and minibatch_gpu_base values]

After the change, I do have to say that at the beginning, when the TensorFlow graph was being built, it took a very long time, something like 5 to 10 minutes. But once the training started ticking, the throughput doubled, and the memory usage increased materially, from about 9 GB to 17 GB per GPU.

You might then ask the question: does any of this actually improve the total training time?

[Screenshots: training logs and GPU usage for the default vs. modified configurations]

As you can tell, in about the same wall-clock time, the new configuration has already reached tick 420 with about 3433 kimg processed, while the default configuration processed 319 ticks and 2572 kimg. That is about a 33% throughput increase overall.

What bothers me a little is that, since I am now using all 16 GPUs, I expected the performance to at least double; on top of that, each GPU is actually working harder now, which in theory should add up to something like four times the performance. But 33% is better than nothing.

 

sklearn.ensemble Gradient Boosting Tree _gb.py

After spending the previous few posts looking into decision trees, now is the time to see a few powerful ensemble methods built on top of them. One of the most widely applicable is the gradient boosted tree. You can read more about ensembles in the sklearn ensemble user guide; this post will focus on reading the source code of the gradient boosting implementation, which lives in sklearn.ensemble._gb.py.

As introduced in the file docstring:

Gradient Boosted Regression Trees
This module contains methods for fitting gradient boosted regression trees for
both classification and regression.
The module structure is the following:
– The ``BaseGradientBoosting`` base class implements a common ``fit`` method
for all the estimators in the module. Regression and classification
only differ in the concrete ``LossFunction`` used.
– ``GradientBoostingClassifier`` implements gradient boosting for
classification problems.
– ``GradientBoostingRegressor`` implements gradient boosting for
regression problems.
Almost the first thousand lines of the file are deprecated loss functions that were moved to _gb_losses.py, so we can skip them for now.
[Screenshot: the deprecated loss-function code at the top of _gb.py]
So let's start by reading the BaseGradientBoosting class. The vast majority of the inputs/attributes of BaseGradientBoosting are also inputs to the decision tree, and the ones that are not present in BaseDecisionTree are precisely the key elements for understanding gradient boosting.
[Screenshot: BaseGradientBoosting constructor parameters]
Now let’s take a look at the methods too.
[Screenshot: BaseGradientBoosting methods]
_check_params is the basic checker that makes sure the inputs/attributes are within reasonable ranges, raising an error if not.
_init_state, _clear_state, _resize_state and _is_initialized are all related to the lifecycle of the model's state. One thing to watch out for is that there are three key data structures storing that state: estimators_, train_score_ and oob_improvement_ (out-of-bag improvements). They all have as many rows as there are estimators, where each row stores the metrics related to one estimator; for estimators_, each column additionally stores the state for one particular class.
[Screenshots: the state lifecycle methods and the estimators_, train_score_ and oob_improvement_ arrays]
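
To make those shapes concrete, here is a minimal sketch of what _init_state allocates, simplified from _gb.py (the real code also sets up the init_ estimator and handles warm starts):

    import numpy as np

    # Inside BaseGradientBoosting._init_state: one row per boosting stage;
    # for estimators_, one column per class.
    self.estimators_ = np.empty((self.n_estimators, self.loss_.K), dtype=object)
    self.train_score_ = np.zeros((self.n_estimators,), dtype=np.float64)
    if self.subsample < 1.0:
        # Only tracked when training on subsamples, i.e. out-of-bag data exists.
        self.oob_improvement_ = np.zeros((self.n_estimators,), dtype=np.float64)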
All the methods except fit, apply and the feature_importances_ property are intended as internal methods not used by end users, and are hence prefixed with a single underscore.

apply

Apply trees in the ensemble to X, return leaf indices

[Screenshot: the apply method]

_validate_X_predict is the internal method of the BaseDecisionTree class that checks the data types and shapes and ensures they are compatible. Then we create an ndarray, leaves, with one row per input sample; for each input row there are as many entries as there are estimators, and for each estimator there is one entry per class.

The double for loop here iterates through all the estimators and all the classes, populating the leaves variable via the apply method of each underlying estimator, as sketched below.
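
A simplified sketch of apply that mirrors that double loop (it assumes X has already passed _validate_X_predict):

    # Inside BaseGradientBoosting.apply:
    n_estimators, n_classes = self.estimators_.shape
    leaves = np.zeros((X.shape[0], n_estimators, n_classes))
    for i in range(n_estimators):
        for j in range(n_classes):
            # Each fitted tree reports the index of the leaf each sample lands in.
            leaves[:, i, j] = self.estimators_[i, j].apply(X, check_input=False)
    return leaves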

feature_importances_

[Screenshot: the feature_importances_ property]

The feature_importances_ property first uses a double for loop list comprehension to iterate through all the estimators and all the trees built within each stage. This is probably the first time the term "stage" appears in this post, but we briefly covered the estimators_ attribute above, in which each element represents one estimator; from that we can draw the conclusion that each stage corresponds to one estimator. In fact, that is how boosting works, as indicated in the user guide:

“The train error at each iteration is stored in the train_score_ attribute of the gradient boosting model. The test error at each iterations can be obtained via the staged_predict method which returns a generator that yields the predictions at each stage. Plots like these can be used to determine the optimal number of trees (i.e. n_estimators) by early stopping. The plot on the right shows the feature importances which can be obtained via the feature_importances_ property.”

It is then literally the arithmetic average of the feature importances across all the relevant trees. Just as a friendly reminder, the compute_feature_importances method lives in _tree.pyx and is shown below:

[Screenshot: compute_feature_importances in _tree.pyx]
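
Putting it together, the averaging step itself looks roughly like this, simplified from the feature_importances_ property (single-node trees are skipped because they contain no splits):

    # Inside the feature_importances_ property:
    relevant_trees = [tree
                      for stage in self.estimators_
                      for tree in stage
                      if tree.tree_.node_count > 1]
    importances = [tree.tree_.compute_feature_importances(normalize=False)
                   for tree in relevant_trees]
    avg = np.mean(importances, axis=0, dtype=np.float64)
    return avg / np.sum(avg)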

fit

fit is THE most important method of _gb.py, and it calls a lot of other internal methods, so let's first introduce two short ones before we dive into its own implementation:

_raw_predict_init and _raw_predict

[Screenshot: _raw_predict_init and _raw_predict]

These methods are used jointly to generate the raw predictions given an input matrix X; the returned value is a matrix with the same number of rows as the input and one column per class. The idea is very simple, but this part is the core of the boosting process: boosting by nature is a constant iteration of making predictions, calculating the errors, correcting them, and doing it all over again. Hence predict_stages is implemented in high-performance Cython.
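
In code form, the two-step flow is roughly this (simplified from _gb.py):

    # Inside BaseGradientBoosting:
    def _raw_predict(self, X):
        # Start from the predictions of the init estimator...
        raw_predictions = self._raw_predict_init(X)
        # ...then let every stage's trees add their scaled corrections in place.
        predict_stages(self.estimators_, X, self.learning_rate, raw_predictions)
        return raw_predictions  # shape: (n_samples, n_classes)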

[Screenshot: the predict_stages Cython function]

predict_stages iterates through all the estimators and all the classes and accumulates the predictions from each tree. This part of the code probably deserves its own post, but let's put a pin in it and stop at the Cython layer, knowing that there is something complex going on here that makes predictions "really fast".

[Screenshots: the remainder of the Cython prediction code]
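
For intuition, here is what predict_stages computes, expressed in plain Python (the real version is Cython operating directly on the trees' internal arrays for speed):

    # Plain-Python equivalent of predict_stages (conceptual sketch):
    def predict_stages_py(estimators, X, learning_rate, raw_predictions):
        n_estimators, n_classes = estimators.shape
        for i in range(n_estimators):
            for k in range(n_classes):
                # Each stage adds a scaled correction to the running predictions.
                tree = estimators[i, k]
                raw_predictions[:, k] += learning_rate * tree.predict(X).ravel()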

fit

The fit method starts by checking the inputs X and y. After that, it has some logic for carving out the validation set, which is driven by a technique called stratified sampling.

You can find more about stratification and cross validation from this Stackoverflow question or directly from the user guide. Meanwhile, knowing that fit has some mechanism for splitting the data is all you need to understand the following code.
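
A sketch of the split performed inside fit when early stopping is enabled (simplified; sklearn stratifies on y for classifiers only):

    from sklearn.base import is_classifier
    from sklearn.model_selection import train_test_split

    # Inside BaseGradientBoosting.fit (simplified):
    stratify = y if is_classifier(self) else None
    X_train, X_val, y_train, y_val = train_test_split(
        X, y,
        test_size=self.validation_fraction,
        random_state=self.random_state,
        stratify=stratify)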

[Screenshot: the fit method]

_fit_stages

[Screenshot: _fit_stages]
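
In spirit, _fit_stages is the outer boosting loop, heavily simplified here (the real code also handles subsampling, out-of-bag tracking, verbose reporting and early stopping):

    # Inside BaseGradientBoosting._fit_stages (conceptual sketch):
    for i in range(begin_at_stage, self.n_estimators):
        raw_predictions = self._fit_stage(
            i, X, y, raw_predictions, sample_weight, sample_mask, random_state)
        # Record the training loss after this stage.
        self.train_score_[i] = loss_(y, raw_predictions, sample_weight)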

_fit_stage

[Screenshot: _fit_stage]
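
_fit_stage fits one regression tree per class on the negative gradient of the loss, then folds the new tree's predictions back in. A heavily simplified sketch (the real code also handles multi-class relabeling, sparse inputs and sample weights more carefully):

    from sklearn.tree import DecisionTreeRegressor

    # Inside BaseGradientBoosting._fit_stage (conceptual sketch):
    for k in range(loss.K):
        # The "residual" is the negative gradient of the loss function.
        residual = loss.negative_gradient(y, raw_predictions, k=k)
        tree = DecisionTreeRegressor(criterion='friedman_mse',
                                     max_depth=self.max_depth,
                                     random_state=random_state)
        tree.fit(X, residual, sample_weight=sample_weight)
        # Update leaf values and add the tree's predictions to raw_predictions,
        # scaled by the learning rate.
        loss.update_terminal_regions(tree.tree_, X, y, residual, raw_predictions,
                                     sample_weight, sample_mask,
                                     learning_rate=self.learning_rate, k=k)
        self.estimators_[i, k] = tree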

Gradient Boosting Classifier and Regressor

After the BaseGradientBoosting class is introduced, the module extends it slightly into the classifier and the regressor, which end users can utilize directly.

The main difference between these two is the set of loss functions being used. For the regressor, it is ls, lad, huber or quantile, while the loss function for the classifier is either the deviance or the exponential loss.

[Screenshots: GradientBoostingClassifier and GradientBoostingRegressor]
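
For reference, those loss options map directly to each estimator's loss parameter (names as of the sklearn version read here; newer releases have renamed some of them):

    from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

    # Regression losses: 'ls' (least squares), 'lad', 'huber', 'quantile'.
    reg = GradientBoostingRegressor(loss='huber')
    # Classification losses: 'deviance' (logistic) or 'exponential' (AdaBoost-like).
    clf = GradientBoostingClassifier(loss='deviance')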

Now we have covered pretty much the whole of _gb.py and how the relevant classes behind gradient boosting are implemented. At the same time, we have accrued a fair amount of technical debt, which I list here as a few deeper dives in case readers are interested:

  • loss functions
  • cross validation – stratify
  • Cython in-place predict

Vincent – A Python Library to Build d3-Quality Plots

Vega is a visualization grammar developed by the folks from Trifacta (some history). You can get a brief idea of what Vega is in just one minute by going to this online editor, which maps the JSON syntax to its corresponding visualization. What's more, the community has also developed the Python package Vincent to let Python plot beautiful, d3-quality graphs with Vega running behind the scenes.

I installed Anaconda Python on my Ubuntu box since pandas is a dependency for Vincent, and Anaconda ships with pandas. There were hundreds of warnings while I was running `sudo pip install vincent`, and it also took 10+ minutes, pausing now and then, but in the end it finished, FYI.

My first impression is that Vincent is actually not as awesome and friendly as rCharts in R. After tinkering with it for a while, I did not figure out how to make it work in an iPython notebook, and instead generated the Vega template and vega.json to get a plot like the one below. It looks pretty slick, but there is no hover-over or the other interactive features that I assumed would be delivered. I still think I will prefer rCharts over Vincent, unless one day I have to develop in the Python environment. iPython Notebook + Vincent + Flask? Maybe…
But at this moment, I like Shiny + rCharts + ggplot2!

[Figure: vincent_lineplot – the generated Vega line chart]
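
For reference, the flow that produced a chart like the one above was roughly the following, a minimal sketch with made-up data (to_json can also emit an HTML template that loads the spec):

    import pandas as pd
    import vincent

    # Build a simple line chart from a DataFrame (hypothetical data).
    df = pd.DataFrame({'y': [v ** 2 for v in range(10)]})
    line = vincent.Line(df)
    line.axis_titles(x='Index', y='y')

    # Write vega.json plus an HTML scaffold that renders it.
    line.to_json('vega.json', html_out=True, html_path='vega_template.html')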