Python socketserver ForkingMixIn and ThreadingMixIn

In the previous post, we showed that can do more than one thing at a time using threaded and process arguments. In this article, let’s dig a little deeper into how Flask did it.

That magic must be attributed to two interesting implementations ForkingMixIn and ThreadingMixIn under the socketserver module. socketserver is a Python built-in library for networking and web server. If you have never used the library yourself, it is because most people use higher-level frameworks that actually use socketserver behind the scene. For example, Flask is using Werkzeug, Werkzeug is using http.server for serving, http.server uses socketserver. By looking into the source code of socketserver, it only uses very low level libraries like socket, os, threading, hence, having a knowledge how socketserver works definitely will help with the understanding how Python web serving works.


The implementation of ThreadingMixIn is very straightforward. Every time when `process_request` is called, a new Python thread is created and starts the thread.


I was not familiar with the concept of forking when I first read it. I thought it must be calling library like multiprocessing just as it did in ThreadingMixIn by calling threading library. After some research, it turned out forking is such an important concept in operating system that it is probably the most key concept behind process management. Even multiprocessing itself is probably using os.fork behind the scene.

If you have never used fork before the code of a if statement based on pid might be confusing. os.fork is function that the return value will be the process id. The magical part is after os.fork got called, a branded new process will be forked / cloned. Previously you had one, and now you have two cloned processes. In order to distinguish the old one from the new one, pid will be assigned to 0 in the new process so you know it is newly created. In the old process, instead of 0, the value of pid will be a process id of the newly created process.

Knowing that, we know to put the instructions for the parent process in the if statement, and put the instructions for the newly created child process into the else statement. That is exactly how the code was written, the parent process only keep track of child process id. The child process will finish the request and exit it with the proper status.


To make sure I fully understand how os.fork works. I wrote the following script. In fact, I ask a process to calculate factorial for some big number making sure it completely tie a CPU for a good period of time. If it is just multithreading sharing the same CPU, nothing will be accelerated.

I ran four factorial calculations in total, with a os.fork operation between each.

At first glance, the logs might be a bit hard to read so formatted them a bit into this spreadsheet. The 81(85) is the main process. And it took 9.6 seconds to calculate the factorial of 200K. Then once the fork is called, you can see the 81(86) got spawned and 85 immediately start calculating, right next line indicated the new spawn process recognized itself and also started calculating just < a us later (569 us = 671651-671082). Now we are two processes running in parallel, 85 and 86.

In the next round (separated by the red line), 85 finished first, spawned 88, and both of them got to work also within a ms (513us = 671850 – 671337). Then right after it, 86 finished, spawned 89 and both of got to work within a ms (571us = 685073-684502).

The same story repeat itself in the last round with 8 processes existing at the same time busy calculating factorial. One interesting observation is that all the first 3 rounds finish in similar time with about 9.6 seconds, then 19.68 > 19.2=2*9.6, and 29.24 > 28.8=3*9.6. The biggest difference is at the last round with (13.97=43.206446 – 29.239630), that is 45.5% (13.97/9.6) performance deterioration, why?

That is because CPUs were busy, I have a 6core i7 on my MacBook. If all 6 cores were busy doing work, that will finish 6 factorial calculation, if the rest 2 got assigned to two cores, it will take another 9.6 seconds so the total should be 19.2 seconds, which it did not. If everything is perfectly optimized, and all 6 cores all contribute to the calculation, it should take 8 * 9.6 / 6 = 12.8 seconds. So the theoretical minimum is 12.8 and maximum is 19.2 and the reality is 13.97. I have to say that is the proof that there is some cross CPU collaboration with some overhead.

Well, that is enough fun for today. In conclusion, we had a good view of the two implementations for threading based and process based mixin. We looked into the os.fork in more detail and demonstrated how forking got used to distribute computation heavy jobs.

Is Flask synchronous – threaded vs processes

Flask has been claimed as synchronous on many occasions, yet still possible to get async working but takes extra work. So, let’s see how a naive web server synchronously handle requests. We expect a bunch of requests were received, and the server will process them one at a time, meaning that when a request is being served, other requests cannot be processed and will have to wait.

Let’s see it in action. We first define an endpoint where a simple 1 second delay is applied everytime a request is received.

L9~L21 defined the endpoint. We captured several interesting data points like process id, thread id and execution time for demonstration purpose. At L26, we have to specify `threaded=False` in order for it to be single threaded.

By looking into the source code of, it is running a development server leveraging werkzeug.serving.run_simple method. Even Flask’s own documentation has repeated several times its own built-in documentation is not for production. In production, Flask can only be used as the web application framework and we do need a WSGI server.

Below is a diagram demonstrating the underlying dependencies behind, how the development server got built from some of the most foundationational built-in libraries like os.forking, threading, etc.

If we put aside the question of how the server got built for a second, let’s spawn some requests and test the performance.

The code above will spawn 10 requests simultaneously. There are several other ways of doing it and the easiest way to make asynchronous request is grequset, and the code is as easy as above. There is a whole discussion between imap and map about are they truly issuing 10 requests at a time that worth noting, but the code above works for me.

There is a clear difference between the threaded=True and False on the server side. When threaded=False, even if all the requests were issued simultaneously, the requests were processed one by one.

After we changed the to be threaded=True, this is how the responses look like now.

All the requests come back at 1 second (1.457s) at once.

This is the difference between using threads on the server side. We can also test it using different processes. The end outcome looks similar as all the requests are processed quickly but the difference is the server is creating many new processes, issuing one process for a given new requests.

These two different approaches has their own pros can cons. There is a great comparison between thread and process.

After seeing what just happened, next time people say flask is synchronous and it can handle only one request at a time. Knowing that by configuring thread, process at, you know for sure we will be able to handle more than a request at a time. Should we use it, clearly not, even the is not supposed to be called but can/cannot is different from should/should not.

Have a bet with your coworkers and win yourself a beer!

a good stack(deque&list) and a good queue (deque) in Python

Stack and queue are probably the mostly commonly used data structures, for example, a stack is required in the depth first search and a queue is required in a breath first search. However, when it comes to the explicit implementation, one usually settle with the old friend – list, since it is so powerful and have native operations micmic both structures behavior. Here is a quick post about the performance comparison.


the stack data structure should support Last In First Out (LIFO), and basic operations should include quick put and get. Both list and deque are good for a stack given their performance are very comparable. However, each of them has their own advantages over the other in an obvious way, so pick wisely based on the use case.

Here is a performance comparison demonstrating both list and deque can be used for stack, both constant O(1) for one side put and get.

This is another performance test that shows deque’s random access O(n) is much slower (300x slower) than a list O(1).


the queue data structure should support the concept of First In First Out (FIFO), and basic operations include put and get too, but the put and get should happen at different ends.

You can use list append to push objects which is fast, but in order to get the first element, you will need to use the pop(index=0) which requires shifting all rest of elements.

You can find that using the deque data structure achieves the best performance of 152ms to add 1M objects and only 120ms to retrieve all those objects. For the queue.Queue, it took 2.118s (14x slower) to store and 2.214s (18x slower) to get all objects. Last for the list, it is pretty quick to add all the objects 162ms but the pop(0) took close to 3 minutes (1450x slower)!

Lesson learned, use deque if you need a queue.

choose wisely - County EM

gzip is append”able”

gzip is a commonly used data compression file format. There is also an application gzip that works with this file format. Recently I came across a use case where I need to apply some data transformation to a large dataset stored in the cloud. The data is so large that I cannot fit all the raw input data to the disk, not even the output data, but compression of the output, I should be able to based on estimation. This article is to discuss some of the characteristics of gzip file in general by walking through my use case.

I thought compression is not divisible. You need to first have ALL the raw input, then you can apply the compression, the compression will analyze the global frequency and build the dictionary to encode all the data. In that case, I might be able to process the raw data little by little, but still won’t be able to fit the uncompressed output, let alone compression. However, if the compression can be done piecewise, I can process a small fraction of the input, transform it, compress it, and iterate through this process until finish and put everything together in the end. Most importantly, this means I can even parallelize.

Now the whole question depends on whether a compression can be divided. The short answer is Yes. It can be.

Gzip fileformat is defined in RFC1952.

2.2. File format

      A gzip file consists of a series of "members" (compressed data
      sets).  The format of each member is specified in the following
      section.  The members simply appear one after another in the file,
      with no additional information before, between, or after them.

My previous impression of analyzing frequency could be true within a member, but based on the documentation, a gzip file could be a simple concatenation of multiple gzip file where each “member” is independent. By reading through the deflate documentation, the compression is actually happening block by block, the block size can vary but when it compress, it process about 64KB input at a time.

gzip -> members -> blocks

[114]. we first write the Titanic dataframe to a csv compressed in gzip format, p.s., if the file suffix is gz, pandas will autoapply compression to it. Then we created a loop to repetitively append the compressed datastream to the same gzip file.

[115]. is the same process but it is uncompressed, basically how to rbind csv files.

This code snippet demonstrated that we can keep “concatenating gzip files”. We still need to be aware of what underlying data format we are using. In this case, I am using a CSV which is very easy. The only tricky part is to make sure all the columns are consistent and you don’t mistakenly include the headers when you append data.

In the end, you have one gz file, after you uncompress it, you still have one large CSV file.


GZIP file format specification version 4.3
Python gzip, zlib
DEFLATE Compressed Data Format Specification version 1.3

Python CST and AST

There are two key concepts in parsing any Python code, one being the concrete syntax tree and the other being abstract syntax tree. This post provide some introductory information about what they are, their roles in the parsing process, commonalities and differences between them.

In fact, as this post is written in 2021, it is a possibility, or probability that the term CST might be rendered obsolete because of the new parser implementation parser in newer versions of Python. parser is a built-in python language that has been there for a long time that works with CST. However, given the latest implementation, it won’t be a two stage process CODE -> CST -> AST anymore. As CST is still a very important concept in understanding any programming language, I think it will still be a good idea to learn about it since it will be a while for us all to move into 3.9 and CST and AST still serves its general educational purpose when it comes programming language.

Python Software Foundation News: Replacing CPython's parser - Python  Language Summit 2020

CST is the exact parsed tree representation of your code conforming to the language grammar, while the AST is a simplified CST keeping only its gist. Based on this Stackoverflow post, depending on how the developer writes the code, eg. a = 1 + 2 * 3, there could be different variations like a = 1 + (2*3) or a = (1 + (2*3)), the CST for each variation is different as the code literally are different, the CST will contain all the different placement of parenthesis. After all the operation precedence got taken into considerations, the CSTs will be simplified and end up being the same AST. Of course, for the same CST, if the underlying executions are different, the same code can lead to different AST.

In the following example, two implementations have different CST but all share the same AST in the end.


book: CPython-Internals

Jupyter Notebook Markdown

Today I was studying the notebooks that published along with the book AI Powered Search and realized that author was using the “fence code block” quite a lot. We all know the jupyter notebook comes with the markdown which is a content focused language. To be content focused, it means the users use the precanned markdown syntax to explicitly set the relationship between contexts, like what is a header, what is a paragraph. In that way, you don’t have to worry about the formatting yourself, as there is a fixed way and consistent way of doing it. If you ever failed to add in a new line in your content, you will understand that this “inability of adding a new line” is a feature and benefit that comes with Markdown (just like latex) instead of a bug (“this sucks, I cannot even add in a new line”).

Attached are several markdowns of using the fenced code blocks.

Can you guess what will be the output after it is rendered?

Even if there seems like two new lines between L1 and L4. Due to the very reason mentioned above, Markdown will treat as two paragraphs and only insert a new line. And the fenced code block will be treated as inline, if the opening triple backticks are on its own line, like L11, it will be formatted as a standalone code block with a different formatting. Inline code blocks comes with greyed background while standalone comes with a white background but indented differently. In addition, when you have fenced code block with a programming language next to the open backticks, the text within the code block will be highlighted per the syntax of that programming language.

Container Sizing

When you consider deploying your model into production, a key step is to understand the resource requirement. It is not only the machine learning prediction part, but everything that goes inside the container. This requires us do some profiling and understand the performance of the container under different scenarios of load.

Here we are going to profile the container which has the local deployment of the sample AzureML project – diabetes prediction which is a ridge model.

The concept is simple. We artificially generate some requests, test the performance under different resource allocation – how long it took to serve these requests. If the computing resource is insufficient, your container will be slow. On the other hand, if the resource is over allocated, your performance won’t improve further and extra resource will only be a waste of money.

The first question will be how to create the load. Ideally, you will want to the payload to be as close to production requests patterns as possible. For example, if you reuse the same payload and only change the number of requests, there could be activities like caching that skew your tests results (your code looks faster in testing but act slower in production). And at the same time, the true performance of a web server can hardly be measured synchronously. You can also make asynchronous calls to maximize the throughput.

Here is a small test script to demonstrate the idea. You can pass in the number of workers and the total number of requests for a test, and the final returned value is how long it took to finish all these requests, ideally the shorter, the better.

After you have a test script, the next step is to try out different configurations and take some measurements.

In this test, I am only changing different cpu_limit assigned to the containter.

cpu_limit is a threshold that your container at most can consume. There are many other parameters could limit the actual cpu resource available to your container. For example, setting cpu_limit to be 24 doesn’t guarantee any type of cpu resource available. In a typical laptop, you might have only 4 cores in the first place. In that case, setting cpu_limit to be 24 makes no sense. Another is competing activities, if there are other activities during the testing, the cpu available to your container could take another hit.

During the test, if the payload is low, and your application is perfectly capable of handling it, you might not reach either the physical and cpu_limit assigned to it. During the stress test, you probably want to avoid other irrelevant activities so the cpu resource available is sufficient and it is only the cpu_limit that you apply actually dominates.

In my testing script, I generated a list of cpu_limit starting as small as 0.01 cpu ~ 10 millicpu. (I actually started with large cpu assignment because they finish faster). And for each cpu_limit, I tried out different number of workers to simulate the behavior of required concurrency.

Following is the result of the first test, instead of 1000 calls per test, I used 100 calls just to be fast. It is quite straight forward that the more CPU resource you assign to it, the better the performance. For example, when you assign 0.512 cpu, it took only less than 3 seconds to make a 100 calls. And by giving it more cpu, the performance is not further improved. Meanwhile for the worker count, the performance is improvement is not obvious as you expected. One imagine that if the tasks is io bounded, having two workers will double the throughput etc.

In the next test, I increased the number of calls from 100 to 1000. And you can immediately see the performance improvement from the client side. When you assign sufficient cpu, the performance will keep improving from the client side the more workers you assign, until no more.

How surprising! The more resource we give it, the faster we get? 🙂 This is like a typical optimization problem that for a fixed amount of CPU resources, how to size the container so we can handle the most throughput.

Let’s do some math, say if we have total TOTAL_CPUS, and each cpu is being sized with the cpu_limit as SIZE. Then we can afford TOTAL_CPUS/SIZE number of containers without running out of resources. Based on the our choice of SIZE, we will be able to understand how many requests each container can handle.

In order to find the optimal point, let’s first model the response time versus the cpu_limit. We can fit a curve to it. In this case, I picked the exponential decay curve and it works pretty well.

My take away is that the smaller you can make a container, actually the more efficient it is despite the fact that it will be slow. As you can see from the blue line, the latency is getting better and better as you throw more cpus to it, but that limit the total number of containers that you can fit into an environment, and the overall total time almost grow linearly.

Rather than find the optimal point, this is almost like a greedy theory in which you just need to understand what is the acceptable SLA, and almost work backwards and chose the minimal amount of cpu to guarantee that SLA. In addition to that, any extra cpu assignment won’t help at all.

Shortest Distance – A Dynamic Programming Approach

As we have discussed previously, finding the most efficient keyboard could be viewed as a shortest distance problem where we arrange all the keys in a way that will minimize the total traveling distance given a specified typing task.

At first glance, it is different from commonly used graph algorithms as they mostly start with a graph (V, E) and our problem is actually to find the graph given the path. This blog post is meant to implement a solution using dynamic programming, reducing a challenging problems into smaller ones and then we will discuss the efficiency next.

Previously we have mentioned that there are n! permutations of different key layouts. For example, we face n different possibilities when we decide the the position of A, then if we fix B, there could be n-1 positions available as one position just got occupied by A, so on and so forth. We have n*(n-1)*…*1 = n! possibilities.

To refresh everyone’s memory, here is a graph of computational complexity. Factorial growth is obviously the worst growth pattern, so bad that with only 60 elements, you can end up with 60! ~ 10e82 which is the same number as the total atoms count in the known universe. Actually, there are more than 100 keys on a full size standard keyboard. So here we go, we know that a brute force approach will stand no chance of searching for the best layout.

The Simplicity of Computational Complexity: Street Fighter II VS. the Big O  | Hacker Noon

Now let’s discuss what are some of the unique properties that might enable us to avoid some searches. Or if we view our search paths as a tree, what are some techniques that we can use to prune the tree. Or can we reduce a huge problem into smaller subproblems.

Say for example, we need to map N characters to N keys and there also exists a typing task which is a sequence of those characters which we need to enter. There must exist one or at least one layouts that will yield the shortest distance. So let’s find a random layout called L. And we can decompose the total cost into several components. In the layout, let’s start with a key say K1.

Total Distance for L = Distance (sequences that contains K1) and Distance (sequences that don’t contain K1)

To be more specific, the sequence is a simple string of characters which we can break down into pairs of characters (from, to). For example “hello” can be broken down into “he”, “el”, “ll”, “lo”. If we pick character “e” as the target, the total distance can be categorized into two groups, (“he”, “el”) and (“ll”, “lo”). The reason being now we can reduce it to a smaller problem, one that is independent and easier to compute. In Distance (sequences that don’t contain K1), it has one less key, one less position, and the sequence doesn’t contain K1.

Distance (sequences that contains K1) is not determined yet as it is dependent on the positions on the rest of the keys. This is also the part where we can potentially improve some efficiency as the frequencies are not evenly distributed, you probably have pairs like “th” much more often than “xz”. If we will be able to follow certain order (maybe by frequency) when loop through all the keys, the most influencers will be focused on first. Even if we cannot afford a full search and we still follow a strategy like breath first search, those unknowns might only contain less frequent used keys, positions and pairs which could be immaterial so it might turn out to be a good approximation.

D(K, P) = sum( D(k, p) + D(K-k, P-p) ) for p belongs to P for a given k

If we need to first and only fix one key, and this key can have certain number of positions, for each position, we can call the same function to a smaller problem for the rest of the keys and rest of positions.

By looking into this search tree, we can find opportunities for improvement. For example, In K2 layer, there should be N * (N-1) total children, but in reality, we can cascade down to layer N-2, we already see computational duplications that can benefit from memoization. For example, when we fix K1 at Pos_i and then fix K2 at Pos_j contains the same subproblem as we fix K1 at Pos_j and then fix K2 at Pos_i. How many redundant calls we can eliminate, N*(N-1) – C(N,2).

Now if we go deeper, for any given layer i, we theoretically will have N * (N-1) * (N-2) * … * (N+1-i) in the most naive way but there will be tremendous amount of redundancies. We will have (N-i) many keys that need to be mapped in the next layer.

There is an interesting property of combination. Which C(N, i) = C(N, N-i) = N! / (i! * (N-i)!). Intuitively speaking, we have N nodes in the first layer, and we can also say at the very last layers, we only need to calculate how to place that last key on the keyboard which is still N different choices. Given this symmetric property, the real execution actually has a shape similar to an envelope where the head and tail are skinny and grow as it approaches the middle.

So in layer1, there will be N = C(N,1) nodes. In the last layer, we also need only N = C(N,1) nodes. So if we are going to arrange 26 keys. The widest layer will be right in the middle. C(26, 13) = 26! / 13! / 13! ~ 10e6 ~ 10 million, this is much smaller than 26! ~ 10e26.

So the total nodes across all the layers can be summed up as:

C(N,1) + C(N,2) + … C(N,N) = 2^n, actually the complexity is exponential rather than factorial. This is a huge progress by just using memoization.

As you can see, this only includes the recursion part, it still has not included the known key part. As we fix more and more keys, it position will need to propagate back to the previous layers to update the distance. So, let’s think about how complex that will be.

When we fix one key, the typing pairs related to that one key will be stored separately and excluded from future recursive calls, and its value can only be gradually updated as more and more keys got fixed.

In the first layer, we have N positions, and all those N branches share the same characters pairs as they are the same subset related to the key that we are fixating. The only difference being that key is being stored at different positions.

In the 2nd layer, we have N – 1 positions left with N – 1 characters. And we fix another key, as we are positioning the second character at N -1 positions, it will go back to the previous step and being able to calculate the distance with pairs that contain these first two keys. In reality, we will have N*(N-1) different unique position pairs that we will be able to populate. There is an interesting property here as switch the position between two keys won’t change the distance, so in reality, we only need to calculate N*(N-1) or C(N,2) positions pairs. At the same time, it will identify and include all the pairs with key2 (without key1), store it at this layer and send the rest of the pairs (without key1, key2) to the next layer.

In the 3rd layer, we fix another key with N-2 positions available. As we now have the 3rd key positioned, it will send its position to the previous layers. It will not only send its position to the 2nd layer so the pairs with key2, key3 can now being computed, and theoretically speaking, there is also C(N,2) positions or pairs that we need to calculate. Also send back to the first layer so key1 and key3 and also now being computed. So the total is C(N,2) for key1 and C(N,2) for key2, which the total “call back” are like 2*C(N,2).

So on and so forth, we can calculate the total work for this kind of call back

Total Call Back = 1 * C(N,2) + 2 * C(N,2) + … N * C(N,2) = (1+2 + … + N) C*(N,2) = (1+N)*N/2 * N * N (N-1)/2 ~ N^4

Comparing with the recursively calls 2^N, N^4 is still polynomial and its computational complexity is probably so small that can be ignored as we increase N. (eg. if N=30, 2^N=1B, and N^4 < 1M).


  1. Is there something else that we can leverage by looking into the properties of our specific problem?
    1. As we are searching for the minimum, there could be cases that we might be able to prune the a whole branch of tree if there are cases that “we have not arranged these keys yet, but assuming that they all share the same possible minimal distance, the sum still exceed our record minimum, then we can give up those branches completely”.
  2. Looking into the distance calculation, it is strictly looking up spatial distance in a 2D Euclidean space through like a distance matrix, if that is the case, is it possible to vectorize some of the calculations so we are not going to boring for loops to improve efficiency.
  3. Taking legacy QWERTY layout into consideration, a complete different layout without material improvement won’t be well accepted. Is it possible that we start with qwerty and then switch keys 1, 2, 3 at a time so the majority of the layout stay intact but we find the ones that with the highest return on investment given the minimum amount of changes.
  4. There are many other approaches to solve NP-hard problems which I other people advised me this problem belongs to. Like certain type of heuristic search, simulated annealing, genetic algorithm or different types of reinforcement learning. Those all will be interesting topics to look into.

Before we think about how to find a better algorithm, this is actually already too much talk and we better off write some program to evaluate our reasoning making sure it is not flawed. In the next post, we will be writing some programs and validate many of the statements here.

Thinking about a more efficient keyboard layout other than qwerty

Likely you are using a keyboard with the letters ‘q’ , ‘w’, ‘e’, ‘r’ on the top left, which is also commonly being referred to as the qwerty layout. This layout was created in 1870s. You may have asked the same question as everyone first started learning typing “why A is not next to B, why the layout looks like total random”. The answer to it was the inventor back then claims this randomness actually helps improve the typing speed by distributing the tasks equally to different fingers, and also due to some mechanical designs limitation that jamming too many frequently typed keys together is not a good idea.

Nowadays, many of those premises no longer holds true and constraint also being removed. Now you can literally make the keys as small as you want, or concentrate certain keys as close as you want without worrying about physical limitations. At the same time, the content that end users entered has also shifted so much that certain keystrokes got typed more frequently than 150 years ago. For example, in the modern digital world, the symbols got typed much more often and sometimes dramatically exceeded even certain letters. For a programmer, they use symbols like semicolon as frequently as how writers use period because semicolon actually means end of a statement in several programming languages. In a typical tweet, people have to use characters like “@” and “#” maybe without using any traditional punctuations.

There has been some efforts to adjust the keyboard layouts by adding certain keys like multimedia keys but the qwerty has certainly never been challenged to a degree that the society cares enough to make a change. Maybe because it is not a problem at all. I agree that for those who work extra hours they certainly not think because they cannot type fast enough. And maybe it is just such a prevalent layout that the cost of change and inconvenience that it will cause outweigh the benefit (many softwares like video games, editors actually choose certain shortcuts because they were convenient under qwerty).

The question now becomes, can you quantify the efficiency or even prove that for a different layout, there will be a significant efficiency gain, or if we take it one step further, out of all the different layouts, is there a way that we can find the most optimal layout that worth the effort to ditch what everyone has been using for the past 150 years.

Key Performance Indicator

There is a saying “if you cannot measure, you cannot manage”, so let’s first start by coming up with a way of measuring efficiency. There are common metrics like (word per minute – WPM) that people use to brag to their friends how fast a typer they are. If we introduce a different layout to two groups of people who are not biased (maybe students or kids who are just learning typing). That will be a great experiment. However, just like covid vaccine test, in order to be scientific, we will need to conduct experiments and the cost of that is just too prohibitive at the early selection stage. But sometime down the road, it certainly will be required.

Now let’s think about the typing process, it boils down to a sequence of actions about moving fingers fast and accurately to push down certain button. There has already been tremendous amount of conversations around what kind of keys are the easier to type, then there comes along inventions like mechanical keyboard, what kind of cherry switch is the most suitable, clicky versus linear etc. However, it is quite obvious that time spent is not only about pushing the button, but also moving the fingers to the right position. Think about how easy it is to type “asdf” because it is literally all next to each other. Now try to type “?\=`”, you will need to lift your pinkie finger to the corner of the keyboard and move it around even with the assistance from your eyes too.

So maybe one angle when we are looking for the best key layout is actually to seek for one that that the user to not move their fingers that much.

For those of you who has an engineering background, this probably already sounds similar to an optimization problem and some of you might already start thinking about “shortest distance”, “traveling salesman”. Yeah, there are many mature algorithms which is designed specifically to solve similar problems and find the shortest distance.

The question is now “can we find a keyboard layout that yields shorter traveling distance than ‘qwerty’ or even better, can we find the best layout that guarantees the shortest travelling distance given modern typing task”.

There are many assumptions that we are going to make in this exploration process and those assumptions might be too naive which lead us to flawed conclusion, but they will dramatically help simplify the problem at the beginning and we can challenge them later and build more complexities as we want.

Key Positions and Size

In a typical keyboard, most keys are same size (square or round) key caps that are perfectly aligned horizontally and tiled vertically. Certain keys even come with a different size like space bar being the biggest, left and right shift, enter, etc. For now, let’s assume that all keys have the same size and are positioning perfectly on a meshgrid, like a chess board. In that way, it is much easier to calculate any distance for two given keys without worrying about the specifics. Another benefit is that the keys can be stored in certain data structure and being calculated in an efficient way like ndarrays when it comes to computing.


Each key on your keyboard usually carries two meanings like you can change the case for letters and entering a different symbol by holding the shift key. This is certainly an important topic as it is those symbols that previously got assigned to the corners or even share the same physical switch that grows its usage lately. If there is a need to separate out certain characters, or even promote certain character to be at a prominent positions are certainly changes that worth exploring. In that case, we will treat all printable characters on a keyboard as its own entity with several exceptions.

Maybe for numeric characters, it worth the effort to exclude them for now as regardless of which frequency to use, they gains its benefit of staying in the numerical order regardless of its usage. At the same time, the case for letters might keep sharing the same key as I can totally imagine what kind of criticism that we will receive if a share the same key as upper case Z. What the hell? There might certainly relationship in symbols like pairs of brackets, smaller than or greater than. In reality, there are even autocomplete features that very few developers have to type the closing parenthesis or brackets because they come autocompleted by IDE. In that case, maybe temporarily removing those closing symbols also makes sense.

So now this is what we ended up with 58 characters:

26 letters
32 symbols: ~!@#$%^&*()_+`-=[]\;’,./{}|:”<>?

Distance Definition

This is probably the most important part of this analysis as it directly determines how our final score will get calculated.

Simple Travel Distance

A naive scenario will be treating all different keys as physical locations. Calculating the total traveling distance for typing a sentence will be equivalent to calculating the total distance of a taxi trip which we just need to sum up the distance between subsequent characters in this physical space.

In order to type the word “hello”, it includes 5 keystrokes and 4 inter button movements. The totally traveling distance will be the sum of moving D(h, e), D(e, l), D(l, l) and D(l, o). One might say that D(l,l) requires no travel at all as it is just pushing the same button twice.

In our case, we can easily calculate the Euclidean distance between e and h as their horizontal difference is 3 and vertical difference being 1. So based on the Pythagorean theorem that the direct distance will be calculated as sqrt(3^2 + 1^2) = 3.16 = D(e,h). In this way, we can calculate the distances being

D(h,e) = sqrt(3^2 + 1^2) = 3.16
D(e,l) = sqrt(6^2 + 1^2) = 6.08
D(l,l) = 0
D(l,0) = sqrt(1) = 1
=> D(hello) = 3.16 + 6.08 + 1 = 10.24

Well, pretty straightforward right? so let’s see if we can write a small Python script so it can be calculated easily.

After we populated the keyboard layout for qwerty and extract the coordinates, now we can write a function that takes in any given layout, any given text, and then we will calculate the total traveling distance. (note: the np.linalg.norm is a pretty handy and high performing function to calculate the Euclidean distance between coordinates)

It matches our calculation!

Different Layouts in Action

Now we have some basic functions to help us evaluate performance. Why don’t we feed it some real life layouts and content and see how it works in action.

So far, other than qwerty, there are indeed some alternatives like coleman or dvorak. Let’s create those two layouts and compare it with qwerty.

In order to work with some real life content, there are some small adjustments that I made to the code which I previously mentioned to skip through travels which involves the characters that we have not entered yet. And also created a counter so the returned travel distance is normalized so we can compare it across even different content or text length.

Here we grabbed the Shakespeare’s sonnets from the Gutenberg project and it is interesting to see that our popular qwerty outperforms other mainstream layouts by quite a bit.

However, now this prompts a few other questions:

  1. is there any other layout that we can generate that proves to outperform qwerty. In theory, we can generate all the possible permutations of layouts and calculate the travel distance for each and find the best one, is it computational feasible, is there a good algorithm that we can use, if not, will there be any approximation technique that we can deploy?
  2. here we are assuming that we actually have only 1 finger. In reality, there are regions and the distance should be calculated in a different way that letter “f” and “j” are really fast traveling wise as both of your index fingers are resting there.
  3. we should also try to work with some different content just to play with as there might be some adjustments depending on your profession that certain keys might need to be adjusted

Hopefully this helps you understand the keyboard is actually a fascinating topic and I will try to cover some of the pending questions in more depth in future writings.

An algorithm question – shortest travelling distance

I have a interesting question that is related to calculating the shortest distance. However, rather than choosing a path for any given graph (vertices and edges), in this problem, the path is given and it requires generating a graph under certain constraint.

So let’s go through this problem in a bit more detail:


There is a fixed amount of positions (N) into which we need to place N different objects. The traveling distance between all the positions are pre-determined and fixed. The task includes a sequence of objects to pick in an order that cannot be changed and the same objects can be picked multiple times. The question is how to place those objects in an order so that the total traveling distance can be minimized.


Now, let’s give an example.


Assuming that we have 4 positions in a 1D dimension (list). And we can easily calculate the pair wise distance between those positions.

P1 012
P2  01
P3   0


Now let’s go through our task. Our task might be any sequence of any length of different combinations of 4 objects. For example, it could be 1232213. It means first to pick up object1, and then pick up object2, then 3 so on and so forth. To calculate the total distance, it is as simple as adding up the distance between each objects. (for the first element and last element, we can simply by assuming a situation which all picks needs to originates from a starting point and ending point that doesn’t belong to any of these positions, but they take equal amount of travel to go to all the positions). In that case, our total distance will be:

Total Distance (1212103) = Const + D(1,2) + D(2,1) + D(1,2) + D(2,1) + D(1,0) + D(0,3) + Const

Distance is a function of positions and to pick objects, objects will be mapped to the position.

Distance between two objects x, y = D(p(x), p(y)) where p is one mapping


Keep in mind that the numbers above are not indices to positions, they are objects. But if we put all the objects in the same order as the positions, the calculation is as straight forward as:

Total Distance (1212103) = Const + D(1,2) + D(2,1) + D(1,2) + D(2,1) + D(1,0) + D(0,3) + Const
= … + 1 + 1 + 1 + 1 + 1 + 3 … = 8

Simple right? in this case, for this type of positioning, we achieve a total cost of 10.


Now let’s try a different positioning, let’s swap the position of object0 and object1.

 P0 (O1)P1(O0)P2 (O2)P3 (O3)
P0 (O1)0123
P1 (O0) 012
P2 (O2)  01
P3 (O3)   0

Total Distance (1212103) = Const + D(1,2) + D(2,1) + D(1,2) + D(2,1) + D(1,0) + D(0,3) + Cons
= D(0,2) + D(2,0) + D(0,2) + D(2,0) + D(0,1) + D(1,3)
= 2 + 2 + 2 + 2 + 1 + 2 = 11

So far, we have two different positioning and each yields a different total traveling distance. And the default one is considered better than the second one.

The goal is to find the most optimized positioning that yields the shortest total distance.

Bruce Force

Here is a small demo of the bruce force approach given 4 objects at 4 positions. There are in total 4! = 24 different permutations and there are two positioning that yield that shortest distance. Luckily, one of the examples that we covered above happens to the the most optimal solution.

I have thought about some of the previous models that I have worked with but none seems to be a good match right off the bat. For example, the Dijkstra is to find the shortest path or sequence given the graph, the Knapsack’s is to find the subset of the goods to maximize the weight.

Now the question is can we find an existing algorithm that solve this problem in an efficient way, if so, how efficient?