# UDACITY DEEP LEARNING – Regularization

The regularization is a trick where you try to avoid “overcomplexing” your model, especially during the cases in which the weights are extraordinary big. Having certain weights at certain size might minimize the overall function, however, that unique sets of weights might lead to “overfitting” where the model does not really perform well when new data come in. In that case, people came up with several ways to control the overall size of the weights by appending a term to the existing cost function called regularization. It could be a pure sum of the absolute value of all the weights or it could be a sum of the square (norm) of all the weights. Here is usually a constant you assign to regularization in the cost function, the bigger the number is, the more you want to regulate the overall size of all the weights. vice versa, if the constant is really small, say 10^(-100), it is almost close to zero, which is equivalent of not having regularization. Regularization usually helps prevent overfitting, generalize the model and even increase the accuracy of your model. What will be a good regularization constant is what we are going to look into today.

Here is the source code of regularizing the logistic regression model:

logits = tf.matmul(tf_train_dataset, weights) + biases
loss_base = tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits)
regularizer = tf.nn.l2_loss(weights)
loss = tf.reduce_mean(loss_base + beta * regularizer)
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

As you can see, the code is pretty much the same as the one without regularization, except we add a component of “beta * tf.nn.l2_loss(weights)”. To better understand how the regularization piece contribute to the overall accurarcy, I packaged the training into one function for each reusability. And then, I change the value from extremely small to fairly big and recorded the test accuracy, the training accuracy and validation accuracy on the last batch as a reference, and plotted them in different colors for each visualization.

The red line is what we truly want to focus on, which is the accuracy of the model running against test data. As we increase the value from tiny (10^-5). There is a noticeable but not outstanding bump in the test accuracy, and reaches its highest test accuracy during the range of 0.001 to 0.01. After 0.1, the test accuracy decrease significantly as we increase beta. At certain stage, the accuracy is almost 10% after beta=1. We have seen before that our overall loss is not a big number < 10. And even the train and valid accuracy drop to low point when the beta is relatively big. In summary, we can make the statement that regularization can avoid overfitting, contribute positively to your accuracy after fine tuning and potentially ruin your model if you are not careful.

Now, let’s take a look at the how adding regularization performed on a neural network with one hidden layer.

First, we want to highlight that this graph is in a different scale (y axis from 78 to 88) from the one above (0 to 80). We can see that the test accuracy fluctuate quite a bit in a small range between 86% to 89% but we cannot necessarily see a strong correlation between beta and test accuracy. One explanation could be that our model is already good enough and hard to see any substantial change. Our neuralnet with one hidden layer of a thousand nodes using relu is already sophisticated, without regularization, it can already reach an accuracy of 87% easily.

After all, regularization is something we should all know what it does, and when and where to apply it.

# Udacity Deep Learning – The Hidden Layer

The homework of fullyconnected session require the students to:

Turn the logistic regression example with SGD into a 1-hidden layer neural network with rectified linear units nn.relu() and 1024 hidden nodes. This model should improve your validation / test accuracy.

The last block of code was neural network where is simply a network connecting input directly to the output, of course, softmax in the middle. And the change we need to make here is to instead of mapping all the inputs (784) to (10) outputs, we first need to create a layer which maps 784 to 1024 and then another layer to map 1024 to 10. Nothing fancy, but we simply need to add relu after the first Wx+b before passing on to the next layer as the activation function.

First, let’s take a look at how the old no hidden layer SGD looks like:

# Variables.
weights = tf.Variable(
tf.truncated_normal([image_size * image_size, num_labels])
)
biases = tf.Variable(
tf.zeros([num_labels])
)

# Training computation.
logits = tf.matmul(tf_train_dataset, weights) + biases
loss = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(
labels=tf_train_labels,
logits=logits
)
)

# Optimizer.

# Predictions for the training, validation, and test data.
train_prediction = tf.nn.softmax(logits)
valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
test_prediction = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

Now, let’s build on top of this and see if we can architect a new network based on the homework requirement. Frankly speaking, the answer I am going to provide here was mostly inspired by this notebook or even more, the answer is only a fraction of what Mr. Damien offered in his code.

First of all, our network was like softmax(w * x + b) before. Now we are going to have one extra layer. The math will look like softmax(w_2 * h+b_2) where h = relu(w_1 * x + b_1). Or put everything in one:

softmax(w_2 * relu(w_1 * x + b_1) + b_2)

It is a bit complex, but not too much. we first need to redefine the variable weights and bias in a way where all of the four variables w_1, w_2, b_1, b_2 are included.

w_1 now is of the dimension (784, 1028) and b_1 has the size of (1, 1028)
w_2 is of dimension (1028, 10) and b_2 is of size (1, 10). Keep the right dimension in mind, we need to create four variables where two of the weights need to be initialized by normal distribution where the biases can be initialized by zeros, just like the old days.

After that, here comes to developer’s personal preference. You can put all the variables into a dictionary say “myvariables” where it stores all the weights and biases. Or you can create two variables “weights”  and store the two weight variables within that weights dictionary. And the same for biases. Or create one variable all separately and pass them around whenever you need them. First of all, all the design preferences will work in the end if you implement them properly. Damien’s code was following the second design type, which looks clean and tide. Here I am going to package all the variables in one dictionary just to be different.

# Variables
myvars = {
'w_h': tf.Variable(tf.random_normal([n_input, n_hidden])),
'b_h': tf.Variable(tf.random_normal([n_hidden])),
'w_o': tf.Variable(tf.random_normal([n_hidden, num_labels])),
'b_o': tf.Variable(tf.random_normal([num_labels]))
}

Now we are done initializing our variables, the next step is to build the model, or how we are going to predict using all the variables that we just defined. In the network without hidden layers, it was easy and actually a one liner

logits = tf.matmul(tf_train_dataset, weights) + biases

In our case, it will look like this:

layer_hidden = tf.add(
tf.matmul(x, myvars['w_h']),
myvars['b_h']
)
logits = tf.matmul(
tf.nn.relu(layer_hidden),
myvars['w_o']
) + myvars['b_o']

Theoretically you can put everything into one liner to make the statement that “it is still a one liner”, but I highly suggest we at least break them down into components that are more readable. In this case, breaking them down into layers is probably a good idea.

Actually, this block of code will be reused quite a few times other than defining the logits, you will use the model when building accuracy, testing against validation, test, ..etc. So let’s build them into a function:

def model(x, myvars):
logits = tf.matmul(tf.nn.relu(layer_hidden), myvars['w_o']) + myvars['b_o']
return logits

The definition of loss function and even optimizer is totally independent of how the neural internally looks like internally:

pred = multilayer_perceptron(tf_train_dataset, myvars)
loss = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(
logits=pred,
labels=tf_train_labels
)
)
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

After all, we need to define the accuracy against training, validation and testing. Since we have packaged our model into one function, our implementation for a network even with one hidden layer is even cleaned than the homework itself. This is how it looks like right now:

train_prediction = tf.nn.softmax(pred)
valid_prediction = tf.nn.softmax(model(valid_dataset, myvars))
test_prediction  = tf.nn.softmax(model(test_dataset,myvars))

Now we have our new model. And the way we structured our code now, you only need to modify model function and add more variables to myvars even want to add more layers. Everything else should stay exactly the same. When we run our code, the accuracy got improved quite a bit, by about 10% (from 80% to 90%). And the performance difference between GPU and CPU is really substantial. Attached are two screenshots to demonstrate the performance difference between GPU (4 secs) and CPU (37 seconds).

This slideshow requires JavaScript.

# Udacity Deep Learning – Desktop into remote GPU server

After playing with tensorflow example codes quite a bit, I think I am ready, and actually cannot wait to unleash the power of GPU. There are so many benchmarks here and there where people brag about how GPU kicks CPU’s ass when it comes to massive computation tasks. I threw a few hundred bucks into my first GPU purely to play top notch video games – Starcraft II back when it first came out, thanks to Blizzard entertainment. Using GPU for machine learning is purely a afterthought, looking back.

This weekend, I brought a GPU workstation back home. However, after being put off by my wife telling me that “I do not want you to make mess in our living room, you can not put anything on the table….”. The sad part, the workstation does not have wireless internet connection, and what is even worse, we only have ethernet connection right next to the router in the living room. Take all constraints into consideration, the only option left here is to leave the desktop as a remote server and access it through other devices (other computers within the same network).

Thank goodness, it was quite easy to set up any Ubuntu machine to be accessed through ssh. Actually, there is no need to set up remote desktop, since we can use the terminal for anything graphic related like web surfing. The only thing we need to do is set up the desktop as a server, people can ssh in, and for the Python related developer, we also need to set up the desktop as a jupyter notebook server where we are going to conduct our experiment.

As you can see in the end, we merely need to plug a power and ethernet cable to the desktop. The moment you power it on, you should be able to access the desktop via SSH and you can even power it off remotely by issuing comment “sudo shutdown”. Now you can work in your bed and never even come off your bed to shut it down!

Also, to make it slightly easier for consistency access, I suggest you log into the router to reserve an IP so that the desktop will not be assigned to a different IP every time it is reboot or what, due to the fact that DHCP is the default many default devices.

After that, you can SSH into your computer, conda install or pip install to set up the tensorflow CUDA development environment. However, it might still be a PITA to develop without any IDE, and in this case, I will be using jupyter notebook. Jupyter notebook does not handle multi-tenancy that well, say you have multiple people accessing the server at the same time. Jupyterhub is supposed to be the tool to ease that pain and the installation and set up are fairly easy. Since I will the only user within the network, I literally did nothing but ran the command “jupyterhub” and now we have a jupyter notebook running that you can simply enter the address “10.0.0.54:8000” on any laptop within the network (wifi, ethernet,..). And here is how it looks like on my MacBook Pro accessing that GPU desktop:

This slideshow requires JavaScript.

For those curious mind, look at the slideshow above proving that I am running the Udacity fullyconnected notebook and it is actually running on top of a lovely GPU!

The source code is from here. Since the source code was coded to demonstrate the fundamentals of tensorflow which is not necessarily to run the code on GPU. There will not be any proof right out of box any step is actually using GPU (if available GPU and CPU coexist, tensorflow actually will prioritize GPU).

Here is a snippet of code where I want to highlight. tf.ConfigProto, device_count to be 0 will force the machine to use CPU which you can compare with when you have an time consuming task (the notebook was quite fast using either CPU or GPU).

config = tf.ConfigProto(
log_device_placement=True,
device_count = {'GPU': 0}
)
graph = tf.Graph()
with graph.as_default():
...
with tf.Session(graph=graph, config=config) as session:
tf.global_variables_initializer().run()
...
_, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
...

Also, log_device_placement set to be true will enable detailed logging about operations.

I think this is the end of this post. Nothing new, nothing ground breaking, but quite a fun to see everything got put together, right?

# UDACITY DEEP LEARNING – Gradient Descent

Following the previous posts, this one will be dedicated to go through the block of code which implemented the gradient descent.

First, I want to share my intuitive understanding of how gradient descent works using an analogy. Think about the goal of the optimizer is to find a set of parameters (w, b) in order to minimize the loss function across all the training data. An oversimplified analogy could be trying to find a position within your city that has the lowest altitude. In this case, the altitude is the loss function and the position is the (w,b) which you can control and tweak. The computation for the loss function could be very time consuming across all the training data (think about I want to locate the exact altitude for any given point to the precision of centermeter, or even nanometer). This is precise and accurate of course, but unnecessarily time consuming. After a whole day, you might only be able to measure the altitude of your home to that level of accuracy. The idea of stochastic gradient descent is to trade off the precision for much better speed gain. It is under the assumption that even using a subset of the data, it should still be representative of the general trend of where the loss function is leaning towards and should provide you with a measurement that is in the ballpark. Or in our analogy, I can easily tell you the rough altitude for any given point to the precision of inches or even feet, or meters, at no time. In that case, you do not need to struggle picking the hairs but can quickly realize, your whole neighborhood is not in the lowest point, and this enable you to quickly head to certain directions that lead you to the low land.

Look at this paragraph of code, first of all, it defined a utility function named “accuracy” which calculate the prediction accuracy based comparing the labels with the predictions. The prediction variable is the output from the softmax function which is, in essence, the probability of certain class happening out of all the possible classes.

For example, we have three classes (‘a’, ‘b’, ‘c’), the predicted logits could be

3.0, 1.0, 0.2

. And the corresponding softmax output (predictions) will be (refer to this stackoverflow question)

[ 0.8360188   0.11314284  0.05083836]

np.argmax function will return the index of the largest element.

compare the maximum index of the prediction with the one for labels will give you the true / false at row level and you can easily calculate the overall accuracy then.

After that, they run 801 loops and each loop will calculate the loss function for all the data points and report out the accuracy every 100 records for easier readability.

As you can see, the implementation is so neat and clean which tensorflow packages all the variable updates behind the scene.

Now let’s look at the source code of the stochastic gradient descent. it is very much like the previous code and the only difference is every step/loop, we only feed a batch of data to the optimizer to calculate the loss function.

For example, the batch size here is 128 and the subset data set is 10,000. This should improve the loss function calculate by 80x faster. And instead of reusing the same data again and again, we will loop through all the training data in a batching way so in that case, we will gain the training performance gain and we used all the data points at least somewhere in our code. The overall accuracy was as good as plain gradient descent.

In the next post, we will try to solve the problem of improving this logistic regression and evolve into a neural network with hidden layers and other cool stuff.

“Turn the logistic regression example with SGD into a 1-hidden layer neural network with rectified linear units nn.relu() and 1024 hidden nodes. This model should improve your validation / test accuracy.”

Challenge Accepted!

# UDACITY DEEP LEARNING – Tensorflow Optimizer

In the previous post, we briefly covered how the iPython notebook read the pickled inputs and transformed into a workable format. Today, we are going to cover the following block of how to create an optimizer which will be used during the iterations.

## GRAPH

Any tensorflow job could be represented by a graph, which is a combination of operations (computation) and tensors (data). There is always a default graph object got established to store everything. You can also explicitly create a graph like “g = tf.Graph()” and then you can use g everywhere to refer to that specific one. If you want to save the headache of worrying about switching between graphs, you can use the “with” statement and all the operations within that block will be saved to the graph handler.

Also, “a picture is better than a thousand words”, there is a tool called tensorboard which is extremely easy to use to help you visualize and explore the tensorflow graph. In this case, I am simply adding extra two lines of code at the end of the graph with statement block.

The writer will write the graph to a file and writer.flush will ensure all the operations which got asynchronously written to the log file is flushed out to the disk.

Then you can open up your terminal and run tensorboard command:

\$ tensorboard --logdir=~/Desktop/tensorflow/tensorflow/examples/udacity/tf_log/
Starting TensorBoard 47 at http://0.0.0.0:6006
(Press CTRL+C to quit)

And this is how it looks like that block of Python code.

The labels on the tensorboard does not necessarily reflect how variables are named in your code. For example, we first created four constants, the true tensor name only shows up when you print out the four constants.

tf_train_dataset:  Tensor("Const:0", shape=(10000, 784), dtype=float32)
tf_train_labels Tensor("Const_1:0", shape=(10000, 10), dtype=float32)
tf_valid_dataset Tensor("Const_2:0", shape=(10000, 784), dtype=float32)
tf_test_dataset Tensor("Const_3:0", shape=(10000, 784), dtype=float32)

Then, it is also pretty cumbersome to map each variable to the corresponding unit in the graph. Here is how the graph looks like after I highlighted some of the key variables.

## LOSS

The loss function or cost function for a deep learning network are usually adopting a cross entropy function. It is simply

C=−(1/n) ∑[y*ln(a)+(1−y)*ln(1−a)]

y is the expected outcome or label. In a one hot coding fashion, y is either 0 or 1 which simplified the C = -1/n * ln(a), and here “a” is the predicted probability, which is the normalized outcome done by the softmax function.

In this paragraph of code, it first created a variable weight of size (28*28~784, 10) and the bias variable of size (10). Weight was initialized by a normal distribution but throw away all the variables outside two standard deviation range while biases are initialized to be zero to get started. Just to recapture, our tf_train_dataset variable is now a N by 784 size matrix where N is the number of records/images the user specify.

logits = x * w + b = (N,784) * (784, 10) + (10) = (N, 10) + (10) = (N, 10)

Now logits is a matrix of N rows and 10 columns. Each row contains 10 numbers which store a unnormalized format of the “probability” of which hand written digits that record might be. The highest number definitely indicates the column/label it falls under is mostly likely to be the recognized digits, however, since all the numbers are simply the outcome after W*x+b which is necessary to be bounded to be between (0,1) and further more, they do not add up to 1. Those normalization all happens behind the scene in the nn.softmax_cross_entropy_with_logits function. As an end user, we only need to make sure we memorize what logits truly stands for and how you calculate logits. Tensorflow.nn will take care of the rest.

## Optimizer

In this example, the author used the GradienDescentOptimizer. 0.5 is the learning rate.

# Predictions for the training, validation, and test data.
# These are not part of training, but merely here so that we can report
# accuracy figures as we train.
train_prediction = tf.nn.softmax(logits)
valid_prediction = tf.nn.softmax(tf.matmul(tf_valid_dataset, weights) + biases)
test_prediction  = tf.nn.softmax(tf.matmul(tf_test_dataset, weights) + biases)

Last but not least, they called the softmax function on training/valid/testing dataset in order to calculate the predictions in order to measure the accuracy.

In the next post, we will see how tensorflow frame work will iterate through batches of training dataset and improve the accuracy as we learn.

# Udacity Deep Learning – Format Data

Deep learning has been widely adopted to recognize images. A necessary step prior to use other people’s model or build your own is to massage your data in a way that is machine learning friendly. Udacity is offering a deep learning class with Google where they published a Jupyter notebook which covered the whole process in tens of lines of code. There is a snippet of the code in lesson 2 – Fully Connected which I want to highlight, and hopefully this method will come handy when you work with data.

They have a function called “reformat” which is very interesting. First of all, the train_data variable is a nested array of size (N, M, M) where N is the number of records, think it as the number of images that you want feed. We are also assuming we are feeding normalized images in a unified square format of the size M * M (M pixels both horizontally and vertically). In this case, let’s imagine we have 3 pictures of size 2*2. And the ndarray might looks like this:

[
[  <- first image
[ <- first row of pixels of first image
0, <- first column of first row (the very top left pixel of the first image)
0  <- second column of first row
],
[0,0]  <- second row of pixels of first image
],
[  [0,1],  [1,0]  ],    <- second image
[  [1,1],  [1,1]  ]    <- third image
]

And a requirement is to convert this highly nested structure into an easier format where each element contains all the data points for one image in a unified way. One way is that we still keep the method of looping through pixels left to right and then top down. So the final format should be

[
[0,0,0,0],
[0,1,1,0],
[1,1,1,1]
]

Ndarray has a method called reshape can totally do this magic easily.

We only need to tell reshape that we want each element to store an image, which has the size of 2 * 2 = 4, and leave the first element as -1 so that numpy will be intelligent enough to infer how many total elements they need to create to hold all the images, which is three in this case.

Of course, if you know the end result dimension for sure, you can pass in the first argument explicitly while omitting the second or provide two arguments at the same time. As you can tell, the end result are the same. After that, astype(np.float32) will convert all the elements from type int to be numpy.float32. We are now done for the training data grooming. The next step will be working with the labels. The label variable even simpler which it is a 1D array which each element is an integer between 0 and 9. For example, the first element could be 4, which means that the first hand written image has been recognized as a hand written 4, however, a key step to work with classification problem is to convert “factors” or “classes” into a one-hot encoding format.

There are two ways I came across who people do it. One way is to first create an empty expanded matrix of the right size and then flip the corresponding element from 0 to 1. The second approach is what the author has been using in this notebook. Let’s take a look at both of them.

First, say our label variable is [1,2,5] which we have ten classes from 0 to 9. The end result will be to transform [1,2,5] into the following result:

[
#0  1  2  3  4  5  6  7  8  9   <- the corresponding bit of each record need to be flipped
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
]

Approach One:

Approach Two:

The second approach really nailed this down but fully utilizing how to slice a ndarray. Clearly, np.arrange(10) is of the size (10,1) and label[:,None] is of the size (3,1,1). Doing a equal comparison between the two will force this magic to happen. This looks very short and efficient but lack quite a readability for people who have never seen the magic, read more about numpy broadcast will help you understand what is happening behind the scene.

In the end, hopefully you have learned a thing or two massaging data.

You can view the source code of tokenizer.c from Github. I highly suggest you download that file to your computer and view it in a legit C development IDE in order to easily navigate through and potentially fold or unfold some big blob of code for easier readability. I am not going to compile/debug/run the code in this post. So tools like sublime will suffice. (Also, I am not a C programmer by any means and only took 2 courses back in college, so do not feel intimated at all because we are probably on the same boat).

The code starts with disclaims that is always a good practice to articulate the license and responsibility for using the code. After that, it state the high level purpose of the code:

*************************************************************************
** An tokenizer for SQL
**
** This file contains C code that splits an SQL input string up into
** individual tokens and sends those tokens one-by-one over to the
** parser for analysis.
*/

So now we must expect the code will take a string as input, and split into tokens, and call some parsing function to analyze them one by one.

/* Character classes for tokenizing
**
** In the sqlite3GetToken() function, a switch() on aiClass[c] is implemented
** using a lookup table, whereas a switch() directly on c uses a binary search.
** The lookup table is much faster. To maximize speed, and to ensure that
** a lookup table is used, all of the classes need to be small integers and
** all of them need to be used within the switch.
*/
#define CC_X 0 /* The letter ‘x’, or start of BLOB literal */
#define CC_KYWD 1 /* Alphabetics or ‘_’. Usable in a keyword */

#define CC_ILLEGAL 27 /* Illegal character */

Then this blob of code defined several constants, which wherever the identifier like “CC_X”, “CC_ILLEGAL”, or actually any constants defined here starting with prefix “CC_”, will all be substituted  by the token-string which is the small integer 0 to 27. This will increase the readability in the code and guarantee the optimized performance when it is actually executed. To learn more about why using small integers is faster, check out this article written by Zuoliu Ding.

static const unsigned char aiClass[] = {
#ifdef SQLITE_ASCII
/* x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 xa xb xc xd xe xf */
/* 0x */ 27, 27, 27, 27, 27, 27, 27, 27, 27, 7, 7, 27, 7, 7, 27, 27,
/* 1x */ 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
/* 2x */ 7, 15, 8, 5, 4, 22, 24, 8, 17, 18, 21, 20, 23, 11, 26, 16,
/* 3x */ 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 19, 12, 14, 13, 6,
/* 4x */ 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 5x */ 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 9, 27, 27, 27, 1,
/* 6x */ 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
/* 7x */ 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 27, 10, 27, 25, 27,
/* 8x */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/* 9x */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/* Ax */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/* Bx */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/* Cx */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/* Dx */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/* Ex */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
/* Fx */ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
#endif

When I first saw this paragraph of code, I just found it to be so well formatted that anyone who has the minimum amount of patience and understand it perfectly. It is a lookup table which can quickly map the first FF (16 * 16 = 256) characters to its corresponding character class. Then there is this discussion of ASCII EBCDIC which we will omit because the code has a condition group “#ifdef” which will be initialized only based on the underlying coding format.

Then there is a basic normalization which is to convert upper case to lower case which is a bit more complex in EBCDIC. This also explains why sqlite is not case sensitive for certain tokens like keywords, you can use “SELECT”, “select” or even “Select” which makes no difference.

/*
** The sqlite3KeywordCode function looks up an identifier to determine if
** it is a keyword. If it is a keyword, the token code of that keyword is
** returned. If the input is not a keyword, TK_ID is returned.
**
** The implementation of this routine was generated by a program,
** mkkeywordhash.c, located in the tool subdirectory of the distribution.
** The output of the mkkeywordhash.c program is written into a file
** named keywordhash.h and then included into this source file by
** the #include below.
*/
#include “keywordhash.h”

Here is a link to the genrated keywordhash.h.

Again, depending on the platform, if we are using ASCII encoding format. This paragraph of code is the same as below, much shorter right?

#define IdChar(C) ((sqlite3CtypeMap[(unsigned char)C]&0x46)!=0)
#ifndef SQLITE_OMIT_COMPILEOPTION_DIAGS
int sqlite3IsIdChar(u8 c){ return IdChar(c); }
#endif

That is plenty of preparation work that we just went through. Now let’s take a look at this exciting function “sqlite3GetToken” that will

/*
** Return the length (in bytes) of the token that begins at z[0].
** Store the token type in *tokenType before returning.
*/

for example, if z is a string “select * from table1”, then z[0] is of course “s”, and this sqlite3GetToken will return two things:

1. length of the token: 6 is clearly the length of the first token “select”
2. token type is clearly a keyword which is CC_KYWD

Then you can imagine it will start looping this type of execution from the left to the right.

/* line 177 */int sqlite3GetToken(const unsigned char *z, int *tokenType){
int i, c;
switch( aiClass[*z] ){ /* Switch on the character-class of the first byte
** of the token. See the comment on the CC_ defines
** above. */
case CC_SPACE: {
testcase( z[0]==’ ‘ );
testcase( z[0]==’\t’ );
testcase( z[0]==’\n’ );
testcase( z[0]==’\f’ );
testcase( z[0]==’\r’ );
for(i=1; sqlite3Isspace(z[i]); i++){}
*tokenType = TK_SPACE;
return i;
}
case CC_MINUS: {

case CC_KYWD: {
for(i=1; aiClass[z[i]]<=CC_KYWD; i++){}
if( IdChar(z[i]) ){
/* This token started out using characters that can appear in keywords,
** but z[i] is a character not allowed within keywords, so this must
** be an identifier instead */
i++;
break;
}
*tokenType = TK_ID;
return keywordCode((char*)z, i, tokenType);
}
case CC_X: {
#ifndef SQLITE_OMIT_BLOB_LITERAL
testcase( z[0]==’x’ ); testcase( z[0]==’X’ );
if( z[1]==’\” ){
*tokenType = TK_BLOB;
for(i=2; sqlite3Isxdigit(z[i]); i++){}
if( z[i]!=’\” || i%2 ){
*tokenType = TK_ILLEGAL;
while( z[i] && z[i]!=’\” ){ i++; }
}
if( z[i] ) i++;
return i;
}
#endif
/* If it is not a BLOB literal, then it must be an ID, since no
}
case CC_ID: {
i = 1;
break;
}
default: {
*tokenType = TK_ILLEGAL;
return 1;
}
}
while( IdChar(z[i]) ){ i++; }
*tokenType = TK_ID;
return i;
/*line 448*/ }

If you look at the source code of this function, it accounts for (448-177) / 598 ~ 45% which is almost half of the tokenizer.c. And looking at the structure of the code, it is nothing other than a giant switch case statement. As you can see from the code, “i” is a very important variable that stores the length of the current token and return it when it matches.

For example, looking at the first case,

case CC_SPACE: {
testcase( z[0]==’ ‘ );
testcase( z[0]==’\t’ );
testcase( z[0]==’\n’ );
testcase( z[0]==’\f’ );
testcase( z[0]==’\r’ );
for(i=1; sqlite3Isspace(z[i]); i++){}
*tokenType = TK_SPACE;
return i;
}

So first, it checks if the “first” character is whitespace or not, then it goes through a for loop where the number of loops is returned by “sqlite3Ispace(z[i])”. So if there are 4 space contiguously and i was initialized to be 1, then sqlite3Ispace(z[1]) will check if the second element is space or not, which holds true, then i got incremented to be 2, and it goes to the second loop, and this will keep iterating till i = 4, which z[4] is actually the fourth element which is not a space any more. then the for loop statement will not hold true and break which left i to be 4. The tokenType will be returned to be TK_SPACE and the length of the token will be 4.

As you can see in the source code, many of the case statements were implemented in a looping format of either for loop or while loop meant to iterate through all the characters and complete this token.

Now we understand how this function sqlite3GetToken works, the last function which is sqlite3RunParser fits together with sqlite3GetToken.

There are several group conditions defined in the rest of the code. I took a screenshot of a while loop which I personally think best describes the functionality of how a tokenizer operates.

First, a few variables were defined a few lines above this screenshot where zSql is a pointer to the input SQL statement, for example, zSql[0] is the very first character “s” if the input statement is “select * from table1”. Then sqlite3GetToken function will take the input of the sql statatement and return the length of the token, and meanwhile updates the pointer of tokenType to refer to the right tokenType for the token. Hopefully this is not too confusing after the explanation of the study of sqlite3GetToken first. Next, there is a variable mxSqlLen has we first need to understand how it is initialized. It was defined as a int first and then initialized to be the value of “db->aLimit[SQLITE_LIMIT_SQL_LENGTH]”. db is the connection to the sqlite database and let’s take a look at the rest. Here is a link which listed a few SQLite run time limits which SQLITE_LIMIT_SQL_LENGTH is actually an identifier whose value is 1. At the same time, aLimit is a structure which stores all the limits and the value of index 1 which is the second element in the structure stores the maximum sql statement length. In the end, we know that mxSqlLen was initialized to be the possible maximum sql statement length. For every loop, sqlite3GetToken will retrieve the next token length, and this variable will subtract the length of that token. The following if statement also showed that as we tokenize the SQLstatement, if the length is too long, it will exceed the limit and error out and break. This is a good break condition check but probably the code is rarely executed for most of the scenarios. The else statement describes the scenario when it is reaching to the end of  the statement. In the end of the else statement is to subtract n from zSql, so the pointer zSql now should point to the next token of the input sql statement.

This 20 lines of code is probably the most important part. Here is the decision tree where it determines if we need to pass the token to be parsed or not. In the if statement, it listed the scenario that if the token type if interrupt or illegal, then exit by either interrupt or error out, otherwise, it is a space, it will just update the zSql pointer to point to the next token by skipping the space token. In the else statement, it means it a valid a token that need to be passed to the parser. It will hand over the pointer and length of the next token to the parser.

The rest of tokenizer.c are mostly group condition which are suggested to read if you want to explore all conditions when specific flags are set up, etc.

In this tokenizer.c, we have covered two functions very extensively, sqlite3GetToken and sqlite3RunParser, hopefully it gave you a rough idea of how a tokenizer is implemented and how it is hooked up with the next step in the workflow – Parsing.

In the next post, we will look into how the parser are generated by lemon and potentially a lot of information around how parser works in general.

# SQLite – Tokenizer Requirements

A simple Google search led me to Wikipedia who defined Tokenization  as the process of “breaking a stream of text up into words, phrases, symbols and other meaning elements called tokens”. The next step is of course, to pass the list of tokens as the inputs for further processing like parsing as indicated in the architecture of SQLite.

The SQLite tokenizer is implemented in the src/tokenizer.c file of its source code, for easy access, we will use its Github repo as the future master place when we discuss code. That file was written in 2001 ( 16 years ago!) and has only a few hundred lines of C code including comments. But before, we even start going through the code, we want to have a better understanding of the purpose for this SQLite tokenizer. “what should my code do?”

As we have articulated above, this tokenizer.c should taken in a stream of characters, and then do something to break them into tokens, and categorize them into different types of tokens “words”, “phrases”, .etc. So there are a few key questions that we have to agree on or define before we move on.

What is the definition of a “token” within the realm of SQLite, that might include several sub questions like what should be included in a token, and what should not be included in a token, and what is allowed as being the “separator”, .etc. Then the second big topic will be what are the key categories of tokens, variable? SQL keywords? separators? I found this great “draft” that we can follow in order to better answer those questions.

## Character Classes

In this draft documentation, it first lay down the foundation of defining all the characters (actually unicode) into 6 fundamental classes, whitespace, alphabetic, numeric, alphanumeric, hexadecimal, special.

There are two classes that I want to highlight and discuss more.

First is the definition of whitespace, based on the definition from a unicode lookup website.

• u0009 character tabulation
• u000a line feed (LF)
• u000c form feed (FF)
• u000d carriage return (CR)
• u0020 space

I think if you were born after the typewriter era, please check out this stackoverflow question and this wikipedia page to better understand what is the difference forms of switching to a new line. Also, a picture borrowed from Stackexchange can probably explained in a much better way.

Second is the interesting definition of alphanumeric, or actually alphabetic essentially, it is not only A-Za-z  but also include “_” and any other character larger than u007f, so it includes all the “weird” characters including emojis like white smiling face and foreign languages .etc.  Since the definition of alphabetic is so wide hence the definition of special has actually been narrowed down to only the limited number of “weird” characters at the beginning of the unicode world.

## Token Categorization

After understanding how character classes are defined, the draft listed all the requirements related to token. There they categorized tokens into several categories:

• Whitespace tokens
• Identifier tokens
• Literals : string, integer literals .etc.
• Variables :  Variables are used as placeholders in SQL statements for constant values that are to be bound at start-time
• Operator tokens: “-“, “=”, “(“, etc.
• Keyword tokens: “AFTER”, “SELECT”, “JOIN”, etc.

The tokenizer requirements are extremely helpful to read through just to get a sense of how the code is “supposed” to behave like in a plain English way. For example, we will go through one or two tokenizer requirements listed in the draft as an exercise:

The first requirement there is described as below:

Processing is left-to-right. This seems obvious, but it needs to be explicitly stated.

H41010: SQLite shall divide input SQL text into tokens working from left to right.

In the draft, all the requirements come in this format where a description comes first and then there is a more concise form in the tone of”shall”  starting with an identified in the format of “Hddddd”.  It is good that the formats are consistent but actually this format is not the “ideal state”. There is an official SQLite requirements page which documented ALL the requirements in one place. You are more than welcome to read through only the first section “About SQLite Requirements” just to get a sense of why the tokenizer requirements document is only a draft 🙂

OK, that is quite a high level overview of the tokenizing process along with some of the requirements. However, before jump into reading the source code, it will be great if there is a way where we can play with the “SQLite Tokenizer” directly in order to have a hands-on and intuitive understanding of what a tokenizer does by “writing code”! Sadly, a quick Google search did not provide any fast way of really hooking up the SQLite tokenizer directly to a tool that end user can easily use, there are tools out there like Python sqlite3 which can interact with SQLite pretty fast from the application perspective, but not as deep as the level we want.

So maybe we can go through the source code of tokenizer.c and see if there is a good place where we can inject some print statement to print out the output of the tokenizer before it feed into the parser. Since the source code is a few hundred of lines and we might even want to discuss and delve into the details. We will write up another post and go through the source code.

# SQLite – Architecture

I am planning to spend sometime learning the internals about a open source database, and SQLite has been recommended to me as the “light” and well written tool to get started. Thanks for the community of SQLite, there are several good tutorials, articles and thorough technical documentations available on SQLite’s documentation page.

I started with the “Architecture of SQLite” and it provided a very good high level overview of how the components of SQLite pieces together.

Here is a great graph highlighting the ecosystem of SQLite and its major components. To me, there are two major components that I am specifically interested in at first glance, first is the SQLCompiler, I have never taken any true computer science class which I am assuming “compiler” is a core class that focus on this. I think better understanding of a SQL compiler works will be a good starting point. Second is the Backend which is the true “meat” of any database, how the data are allocated to the disk efficiently, and how the search happens, how you can optimized the allocations using canonical data structures and algorithms to provide outstanding performance is extremely interesting to me.

I am not going to reiterate too much what this article has already stated in this post. I will start with Tokenizer and hopefully have a better understanding of what a tokenizer is and how it is implemented in SQLite in the next chapter.

ln file1 file2  # this will create a hard link file2 linked to the content of file1
stat f1    # this command will get the inode information