Today I was chatting with my coworker about what happens when you oversaturate a server with containers, each with a small CPU quota. In k8s, since it is container-based, you can assign CPU quotas of less than 1 CPU, like 0.5 CPU (500 millicores) or 0.001 CPU (1 millicore). That opens up a whole lot of possibilities.
One interesting scenario that we covered is below. Assume we have a CPU-intensive request that takes a single core 1 second to finish, a machine with two cores, and 4 requests arriving at the same time. In scenario 1, we allocate the first two requests to the two cores, one each, and assign the next two once those finish. Clearly, the first two requests take 1 second each, and because of queuing, the second two take 2 seconds to respond (1 second waiting plus 1 second computing). The average is (1 + 1 + 2 + 2) / 4 = 1.5 seconds. In scenario 2, assume that we can slice the tasks into smaller pieces and that compute time is shared fairly among all the tasks. Surprisingly, all four requests then take close to 2 seconds to finish, so the average is (2 + 2 + 2 + 2) / 4 = 2 seconds.
In another interesting scenario, we still split tasks into smaller pieces, but this time we focus all the horsepower on one request at a time. Now the average is (0.5 + 1 + 1.5 + 2) / 4 = 1.25 seconds.
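The three averages can be double-checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes work is perfectly divisible, context switches are free, and the function names are made up for illustration.

```python
def run_to_completion(n_requests=4, cores=2, work=1.0):
    """Scenario 1: each core runs one request to completion; the rest queue."""
    free_at = [0.0] * cores              # time at which each core becomes free
    finish = []
    for _ in range(n_requests):
        core = free_at.index(min(free_at))   # pick the earliest free core
        free_at[core] += work
        finish.append(free_at[core])
    return finish


def fair_share(n_requests=4, cores=2, work=1.0):
    """Scenario 2: every request gets an equal slice, so all finish together."""
    return [n_requests * work / cores] * n_requests


def one_at_a_time(n_requests=4, cores=2, work=1.0):
    """Scenario 3: all cores gang up on a single request at a time."""
    per_request = work / cores
    return [per_request * (i + 1) for i in range(n_requests)]


def average(times):
    return sum(times) / len(times)


print(average(run_to_completion()))  # 1.5
print(average(fair_share()))         # 2.0
print(average(one_at_a_time()))      # 1.25
```

All three schedules finish the last request at 2 seconds; only the order in which the work is done changes the average.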
Just some food for thought. In the next chapter, it will be interesting to implement a scheduler and reproduce this in a containerized environment.
In the previous post, we showed that flask.run can do more than one thing at a time using the `threaded` and `processes` arguments. In this article, let’s dig a little deeper into how Flask does it.
That magic must be attributed to two interesting implementations, ForkingMixIn and ThreadingMixIn, in the socketserver module. socketserver is a Python built-in library for writing network servers. If you have never used the library yourself, it is because most people use higher-level frameworks that use socketserver behind the scenes. For example, Flask uses Werkzeug, Werkzeug uses http.server for serving, and http.server uses socketserver. Looking into the source code of socketserver, it only uses very low-level libraries like socket, os and threading. Hence, knowing how socketserver works will definitely help with understanding how Python web serving works.
The implementation of ThreadingMixIn is very straightforward. Every time `process_request` is called, a new Python thread is created and started to handle that request.
I was not familiar with the concept of forking when I first read it. I thought it must be calling a library like multiprocessing, just as ThreadingMixIn calls the threading library. After some research, it turned out that forking is such an important concept in operating systems that it is probably the key concept behind process management. Even multiprocessing itself is probably using os.fork behind the scenes.
If you have never used fork before, the if statement based on pid might be confusing. os.fork is a function whose return value is a process id. The magical part is that after os.fork is called, a brand new process is forked / cloned. Previously you had one process, and now you have two cloned processes. In order to distinguish the old one from the new one, pid is set to 0 in the new process, so it knows it is newly created. In the old process, instead of 0, the value of pid is the process id of the newly created process.
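To see the effect from the outside, here is a minimal sketch that mixes ThreadingMixIn into a plain TCPServer and has the handler report which thread served it. The class and function names are made up for this example, and the accept-loop thread is named "acceptor" only so it is easy to tell apart.

```python
import socket
import socketserver
import threading


class ReportThreadName(socketserver.BaseRequestHandler):
    def handle(self):
        # Reply with the name of the thread that is serving this connection.
        self.request.sendall(threading.current_thread().name.encode())


class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    daemon_threads = True  # worker threads should not block interpreter exit


def ask_which_thread_served():
    # Port 0 lets the OS pick a free port for this demo.
    with ThreadedTCPServer(("127.0.0.1", 0), ReportThreadName) as server:
        acceptor = threading.Thread(
            target=server.serve_forever, name="acceptor", daemon=True
        )
        acceptor.start()
        with socket.create_connection(server.server_address) as conn:
            name = conn.recv(1024).decode()
        server.shutdown()
    return name


print(ask_which_thread_served())
```

The name that comes back is neither "MainThread" nor "acceptor": the request was handed to a freshly created worker thread, which is the whole job of the mixin.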
Knowing that, we know to put the instructions for the parent process in the if branch, and the instructions for the newly created child process in the else branch. That is exactly how the code is written: the parent process only keeps track of the child process id, while the child process finishes the request and exits with the proper status.
To make sure I fully understood how os.fork works, I wrote the following script. In it, I ask a process to calculate the factorial of some big number, making sure it completely ties up a CPU for a good period of time. If it were just multithreading sharing the same CPU, nothing would be accelerated.
I ran four factorial calculations in total, with an os.fork operation between each.
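The if/else pattern can be tried in isolation. The sketch below is not the socketserver source, just a minimal illustration for Unix-like systems; the pipe is only there so the child can report back what it saw.

```python
import os


def fork_and_report():
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid:
        # Parent: os.fork() returned the child's process id.
        os.close(write_fd)
        with os.fdopen(read_fd) as reader:
            message = reader.read()          # blocks until the child is done writing
        _, status = os.waitpid(pid, 0)       # reap the child, collect its exit status
        return pid, message, os.WEXITSTATUS(status)
    else:
        # Child: os.fork() returned 0.
        os.close(read_fd)
        text = f"child pid={os.getpid()} saw fork() return {pid}"
        os.write(write_fd, text.encode())
        os.close(write_fd)
        os._exit(0)                          # leave immediately with a clean status


child_pid, message, exit_status = fork_and_report()
print(child_pid, message, exit_status)
```

Only the parent ever returns from `fork_and_report`; the child exits inside the else branch, which mirrors how the child in ForkingMixIn finishes its request and exits.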
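A minimal sketch of such an experiment is below. It is not necessarily the exact script behind the logs discussed next: the log format, the default n (much smaller than 200K so it runs quickly) and the pipe used to count calculations are all assumptions made for this example.

```python
import math
import os
import time


def run_rounds(n=20_000, rounds=4):
    """Compute n! once per round, forking after every round except the last."""
    root = os.getpid()
    read_fd, write_fd = os.pipe()    # each process writes one byte per calculation
    children = []
    for i in range(rounds):
        start = time.perf_counter()
        math.factorial(n)            # CPU-bound work
        took = time.perf_counter() - start
        print(f"round={i} parent={os.getppid()} pid={os.getpid()} took={took:.3f}s",
              flush=True)
        os.write(write_fd, b"x")
        if i < rounds - 1:
            pid = os.fork()
            if pid:
                children.append(pid)     # parent keeps track of the new child
            else:
                children = []            # the child has no children of its own yet
    for pid in children:
        os.waitpid(pid, 0)               # wait for every child we created
    os.close(write_fd)
    if os.getpid() != root:
        os._exit(0)                      # forked processes stop here
    with os.fdopen(read_fd, "rb") as reader:
        return len(reader.read())        # total number of calculations performed


print(run_rounds())  # 15
```

The process count doubles at every fork (1, 2, 4, 8), so four rounds perform 1 + 2 + 4 + 8 = 15 calculations in total.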
At first glance, the logs might be a bit hard to read, so I formatted them into this spreadsheet. 81(85) is the main process, and it took 9.6 seconds to calculate the factorial of 200K. Once fork is called, you can see that 81(86) got spawned and 85 immediately started calculating. The very next line shows that the newly spawned process recognized itself and also started calculating less than a millisecond later (569 µs = 671651 - 671082). Now we have two processes running in parallel, 85 and 86.
In the next round (separated by the red line), 85 finished first and spawned 88, and both of them got to work within a ms (513 µs = 671850 - 671337). Right after that, 86 finished and spawned 89, and both of them got to work within a ms (571 µs = 685073 - 684502).
The same story repeats itself in the last round, with 8 processes existing at the same time, all busy calculating factorials. One interesting observation is that each of the first 3 rounds finishes in a similar time of about 9.6 seconds, although the cumulative timestamps drift slightly: 19.68 > 19.2 = 2 * 9.6, and 29.24 > 28.8 = 3 * 9.6. The biggest difference is in the last round, which took 13.97 seconds (43.206446 - 29.239630), about 45.5% longer than 9.6 seconds (13.97 / 9.6 ≈ 1.455). Why?
That is because the CPUs were busy. I have a 6-core i7 in my MacBook. With 8 processes, 6 cores can finish 6 factorial calculations in the first 9.6 seconds. If the remaining 2 were then each assigned to a core, they would take another 9.6 seconds, for a total of 19.2 seconds, which is not what happened. If everything were perfectly optimized and all 6 cores contributed to the calculation the whole time, it would take 8 * 9.6 / 6 = 12.8 seconds. So the theoretical minimum is 12.8 seconds, the naive maximum is 19.2 seconds, and the reality is 13.97 seconds. I have to say that suggests there is some cross-CPU collaboration, with the 8 processes sharing all 6 cores, at the cost of some overhead.
Well, that is enough fun for today. In conclusion, we had a good look at the two mixin implementations, one thread-based and one process-based. We looked into os.fork in more detail and demonstrated how forking can be used to distribute computation-heavy jobs.
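The two bounds are simple enough to recompute. The sketch below only restates the arithmetic with the measured numbers from this post; the variable names are made up.

```python
import math

single = 9.6        # seconds for one factorial on an otherwise idle core (measured)
tasks, cores = 8, 6
observed = 13.97    # seconds the last round actually took (measured)

# Best case: the total work is spread perfectly evenly over all cores.
best = tasks * single / cores
# Naive worst case: run 6 tasks, then the remaining 2, in two full waves.
worst = math.ceil(tasks / cores) * single

print(f"best={best:.1f}s worst={worst:.1f}s observed={observed}s")
print(f"overhead over best case: {observed / best - 1:.1%}")
```

With these numbers the observed time sits about 9% above the perfectly shared case and far below the two-wave case.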
Flask is often described as synchronous; getting async to work is possible but takes extra work. So, let’s see how a naive web server handles requests synchronously. We expect that when a bunch of requests is received, the server processes them one at a time, meaning that while one request is being served, the other requests cannot be processed and have to wait.
Let’s see it in action. We first define an endpoint where a simple 1 second delay is applied every time a request is received.
L9~L21 define the endpoint. We capture several interesting data points like the process id, thread id and execution time for demonstration purposes. At L26, we have to specify `threaded=False` in order for the server to be single-threaded.
By looking into the source code of Flask.run, we can see that it runs a development server by leveraging the werkzeug.serving.run_simple method. Even Flask’s own documentation repeats several times that its built-in server is not for production. In production, Flask should only be used as the web application framework, and we do need a separate WSGI server.
Below is a diagram demonstrating the underlying dependencies behind flask.run, and how the development server is built from some of the most foundational built-in pieces like os.fork, threading, etc.
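For readers who want something to run, here is a minimal sketch of such an endpoint. It is not the original listing, so the line numbers mentioned above do not apply to it; the route, the field names and the `serve` helper are assumptions made for this example.

```python
import os
import threading
import time

from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/")
def slow():
    start = time.time()
    time.sleep(1)  # simulate one second of work per request
    return jsonify(
        pid=os.getpid(),                 # which process served the request
        thread=threading.get_ident(),    # which thread served the request
        seconds=round(time.time() - start, 3),
    )


def serve(threaded=False):
    # threaded=False keeps the development server single-threaded.
    app.run(threaded=threaded)
```

Calling `serve()` starts the single-threaded development server, and `serve(threaded=True)` switches to one thread per request.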
If we put aside the question of how the server got built for a second, let’s spawn some requests and test the performance.
The code above will spawn 10 requests simultaneously. There are several other ways of doing it, and the easiest way to make asynchronous requests is grequests, where the code is as simple as above. There is a whole discussion about imap versus map, and whether they truly issue 10 requests at a time, that is worth noting, but the code above works for me.
There is a clear difference between threaded=True and threaded=False on the server side. When flask.run is called with threaded=False, even if all the requests are issued simultaneously, they are processed one by one.
After we change app.run to threaded=True, this is what the responses look like.
All the requests come back at once, after about 1 second (1.457s).
This is the difference that using threads on the server side makes. We can also test it using separate processes. The end result looks similar, as all the requests are processed quickly, but the difference is that the server creates many new processes, one for each new request.
These two approaches have their own pros and cons. There is a great comparison between threads and processes.
After seeing what just happened, next time people say Flask is synchronous and can handle only one request at a time, you will know for sure that by configuring threads or processes at app.run, it can handle more than one request at a time. Should we use it? Clearly not, since even app.run itself is not supposed to be called in production. But can/cannot is different from should/should not.
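The same contrast can be reproduced without Flask, using the standard-library server classes that sit further down the stack. This is only a sketch of the idea and not the Flask setup itself: it uses a 0.2 second delay and 5 requests to keep it quick, and a thread pool with http.client stands in for grequests on the client side. All names are made up for this example.

```python
import http.client
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, HTTPServer, ThreadingHTTPServer

DELAY = 0.2  # seconds of simulated work per request


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(DELAY)
        body = threading.current_thread().name.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch(port):
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    finally:
        conn.close()


def total_time(server_cls, n_requests=5):
    """Fire n_requests at once and return (elapsed seconds, status codes)."""
    server = server_cls(("127.0.0.1", 0), SlowHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        statuses = list(pool.map(fetch, [server.server_port] * n_requests))
    elapsed = time.perf_counter() - start
    server.shutdown()
    server.server_close()
    return elapsed, statuses


sequential, _ = total_time(HTTPServer)
threaded, _ = total_time(ThreadingHTTPServer)
print(f"one at a time: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

The plain HTTPServer needs at least 5 × 0.2 = 1 second because the requests are served one by one, while the threaded variant should come back in roughly the time of a single request.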
Have a bet with your coworkers and win yourself a beer!