Python Code Profiling

It is very important to understand how fast is your Python code so that your ml endpoint is fast and efficient. In the modern ML application deployment, endpoints are usually deployed inside a container, the speed of your “def predict(input)” function usually directly determines your throughput and cost. To put it in plain words, your inference service might cost you $100K a year at the latency of 200ms. If the performance reduced to be 400ms, expect your cloud cost will simply double. (there could be opportunities where io bounded and CPU bounded steps are decoupled into different microservices to improve the efficiency)

Here is a list of ways to profile:

  1. print(
  2. Profiler: CProfile, yappi, vmprof

First, we need to have a simple block of code to profile, in the following block of code, we created a function myfunc that will call three different implementations of summations, between each call, there will also be a short period of sleep that varies between 0 and 3. And myfunc will be called in total 5 times. The expectation is that after the script got profiled, we will discover the func_slow will take a significant amount of the execution and by how much.

After the cProfile command is executed, the output will be dumped into a separate file profile.output which can be fed to other tools to analyze or visualize.

Sometimes the output is a bit hard to interpret and we will like to visualize. There is a lightweight tool called snakeviz that we can visualize.

If you are using an IDE like Pycharm, they have the functionality integrated.

Youtube 360 Video Download

Above is a Youtube video uploaded by J Utah captured using Insta360 Pro back in 2018. If you have never seen a 360 video before, it is basically an immersive experience where all angles are captured. Since traditional display (monitor, cellphone) has limited real-estate, they only display the portion that fit into your display, users can pan/zoom/rotate to look in all directions. If you have a VR device (like Oculus, Google cardboard), these videos can create the viewing best experience.

Due to my internet speed, I downloaded this video locally using a site called Y2mate. After it is downloaded, it is just a typical mp4 file that looks … a bit weird.

It is easy to notice this video got divided into two halves, the top half includes left, front and right from the point of view of a driver. The lower half is like someone sitting on top of the car facing backwards that incorporated the up, back and down.

It may be hard to picture how all angles stitch together the 360 view if this is the first time you see a layout like that, a quick way to help understand is to take a piece of paper and cut out two strips (size of 1X3) each representing the upper and lower half. Then it is very easy to understand how all those 6 views fit together into a cube.

The cube mapping is a very popular environment mapping method to represent the 360 space, after rearranging the tiles, you get something like this. I marked C90 to represent clockwise 90-degree rotation and CC90 for counter clockwise 90 degree. (Note: the bottom view is basically the car hard top which some reflection of the top view, my rotation may be wrong as it is hard to read, all buildings look similar)

Now we know the views are all captured and arranged in an organized way. Can we process the video and tinker with the output format so it can be recognized as a 360 video that you can view in VLC?

The mapping and math behind it will be discussed in the next post.

KITTI dataset

KITTI is one of the most popular public datasets and industry benchmark when it comes to autonomous driving research. This article will include data explorations for the published dataset so readers will have a more intuitive understanding of how the it is captured and how it should be used.

Let’s work backwards by diving straight into the dataset. There is a section of raw data where you can download. In the bottom of the page, you can find different scenaries and linkage to the downloads.

For example, here is one.

After data is downloaded, we can see there synced data is of size 458MB which is likely the 0.4GB that appears in the title. The extract file is ~50% bigger and probably contains all the raw data points before any processing. The calibration and tracklet are small files.


Calibration is the process where you reset your sensors so it is benchmarked or calibrated to a known measurements for accuracy purpose. Just like you have a scale, the calibration is to make sure your scale says 0 without any weight. It is the same for the different sensors in an AV.

There are 3 small plain text files.





Extract folder contains 6 folders where 4 contains the all the images captured by the cameras, one from lidar and one from the motion sensor.

By putting the first image of all the cameras together right next to each other, it is pretty easy to tell the first two are greyscales and next two are color images. Then we can tell the left cameras from the right cameras by comparing its relative distance to some benchmarks like the light rail or road lanes.

Inside each of the camera folder, there also exists a timestamp text file that stores all the timestamp of when the images were captured. The frequency for each camera were about 10Hz and all cameras seem started capturing “at the same time”, the earliest being 445ms and the latest being 454ms, a nominal difference of 9ms, however, we are not sure if the 4 cameras all share the same clock, and if not, if the 4 clocks are perfectly synced.

IMU (Inertial measurement unit)

Interestingly, we can see the car is driving north at negative -6.94 and east at negative -11.36, so basically it is driving towards southwest more to the west.

It also matches the forward velocity of 13.32184 captured by the sensor itself, which is about 29 mph or 46 kmph.

Then it showed various types of accelerations.

LIDAR / Cloud points

The lidar was also capturing at a 10Hz frequency.

Does DIVIDE&Conquer Help?

Today I was chatting with my coworker what happens when you over saturate a server with containers, each with a small number of CPU quotas. In k8s, since it is container-based, you can literally assign CPU quotas of less than 1 CPU, like 0.5CPU 500milicore or 0.001 CPU 1 millicore. That opened up a whole lot of possibilities.

One interesting scenario that we covered is below.
Assuming we have a CPU intensive request that a single core will take 1 second to finish. Assuming that we have 4 requests arrive at the same time, one can decide to allocate two requests to the two cores, one each, and assign the next two once finished. Clearly, the first two requests will cost 1 second each and because of queuing up, the second two requests will take 2 seconds (1 sec waiting) to respond. And the average is 1.5 second. In scenario2, assuming that we can slice and dice the tasks into smaller pieces, and the computing is fairly assigned to all the tasks. Surprisingly, it will all four requests close to 2 seconds to be finished, with the average being (2+2+2+2)/4 = 2.

Another interesting scenario is we still need to split tasks into smaller ones, but this time, we will focus all the horse power on one requests at a time. This time our average will be (0.5 + 1 + 1.5 + 2) / 4 = 1.25.

Just some food for thought. In the next chapter, it will be interesting to implement a scheduler and reproduce this in a containerized environment.

Python socketserver ForkingMixIn and ThreadingMixIn

In the previous post, we showed that can do more than one thing at a time using threaded and process arguments. In this article, let’s dig a little deeper into how Flask did it.

That magic must be attributed to two interesting implementations ForkingMixIn and ThreadingMixIn under the socketserver module. socketserver is a Python built-in library for networking and web server. If you have never used the library yourself, it is because most people use higher-level frameworks that actually use socketserver behind the scene. For example, Flask is using Werkzeug, Werkzeug is using http.server for serving, http.server uses socketserver. By looking into the source code of socketserver, it only uses very low level libraries like socket, os, threading, hence, having a knowledge how socketserver works definitely will help with the understanding how Python web serving works.


The implementation of ThreadingMixIn is very straightforward. Every time when `process_request` is called, a new Python thread is created and starts the thread.


I was not familiar with the concept of forking when I first read it. I thought it must be calling library like multiprocessing just as it did in ThreadingMixIn by calling threading library. After some research, it turned out forking is such an important concept in operating system that it is probably the most key concept behind process management. Even multiprocessing itself is probably using os.fork behind the scene.

If you have never used fork before the code of a if statement based on pid might be confusing. os.fork is function that the return value will be the process id. The magical part is after os.fork got called, a branded new process will be forked / cloned. Previously you had one, and now you have two cloned processes. In order to distinguish the old one from the new one, pid will be assigned to 0 in the new process so you know it is newly created. In the old process, instead of 0, the value of pid will be a process id of the newly created process.

Knowing that, we know to put the instructions for the parent process in the if statement, and put the instructions for the newly created child process into the else statement. That is exactly how the code was written, the parent process only keep track of child process id. The child process will finish the request and exit it with the proper status.


To make sure I fully understand how os.fork works. I wrote the following script. In fact, I ask a process to calculate factorial for some big number making sure it completely tie a CPU for a good period of time. If it is just multithreading sharing the same CPU, nothing will be accelerated.

I ran four factorial calculations in total, with a os.fork operation between each.

At first glance, the logs might be a bit hard to read so formatted them a bit into this spreadsheet. The 81(85) is the main process. And it took 9.6 seconds to calculate the factorial of 200K. Then once the fork is called, you can see the 81(86) got spawned and 85 immediately start calculating, right next line indicated the new spawn process recognized itself and also started calculating just < a us later (569 us = 671651-671082). Now we are two processes running in parallel, 85 and 86.

In the next round (separated by the red line), 85 finished first, spawned 88, and both of them got to work also within a ms (513us = 671850 – 671337). Then right after it, 86 finished, spawned 89 and both of got to work within a ms (571us = 685073-684502).

The same story repeat itself in the last round with 8 processes existing at the same time busy calculating factorial. One interesting observation is that all the first 3 rounds finish in similar time with about 9.6 seconds, then 19.68 > 19.2=2*9.6, and 29.24 > 28.8=3*9.6. The biggest difference is at the last round with (13.97=43.206446 – 29.239630), that is 45.5% (13.97/9.6) performance deterioration, why?

That is because CPUs were busy, I have a 6core i7 on my MacBook. If all 6 cores were busy doing work, that will finish 6 factorial calculation, if the rest 2 got assigned to two cores, it will take another 9.6 seconds so the total should be 19.2 seconds, which it did not. If everything is perfectly optimized, and all 6 cores all contribute to the calculation, it should take 8 * 9.6 / 6 = 12.8 seconds. So the theoretical minimum is 12.8 and maximum is 19.2 and the reality is 13.97. I have to say that is the proof that there is some cross CPU collaboration with some overhead.

Well, that is enough fun for today. In conclusion, we had a good view of the two implementations for threading based and process based mixin. We looked into the os.fork in more detail and demonstrated how forking got used to distribute computation heavy jobs.

Is Flask synchronous – threaded vs processes

Flask has been claimed as synchronous on many occasions, yet still possible to get async working but takes extra work. So, let’s see how a naive web server synchronously handle requests. We expect a bunch of requests were received, and the server will process them one at a time, meaning that when a request is being served, other requests cannot be processed and will have to wait.

Let’s see it in action. We first define an endpoint where a simple 1 second delay is applied everytime a request is received.

L9~L21 defined the endpoint. We captured several interesting data points like process id, thread id and execution time for demonstration purpose. At L26, we have to specify `threaded=False` in order for it to be single threaded.

By looking into the source code of, it is running a development server leveraging werkzeug.serving.run_simple method. Even Flask’s own documentation has repeated several times its own built-in documentation is not for production. In production, Flask can only be used as the web application framework and we do need a WSGI server.

Below is a diagram demonstrating the underlying dependencies behind, how the development server got built from some of the most foundationational built-in libraries like os.forking, threading, etc.

If we put aside the question of how the server got built for a second, let’s spawn some requests and test the performance.

The code above will spawn 10 requests simultaneously. There are several other ways of doing it and the easiest way to make asynchronous request is grequset, and the code is as easy as above. There is a whole discussion between imap and map about are they truly issuing 10 requests at a time that worth noting, but the code above works for me.

There is a clear difference between the threaded=True and False on the server side. When threaded=False, even if all the requests were issued simultaneously, the requests were processed one by one.

After we changed the to be threaded=True, this is how the responses look like now.

All the requests come back at 1 second (1.457s) at once.

This is the difference between using threads on the server side. We can also test it using different processes. The end outcome looks similar as all the requests are processed quickly but the difference is the server is creating many new processes, issuing one process for a given new requests.

These two different approaches has their own pros can cons. There is a great comparison between thread and process.

After seeing what just happened, next time people say flask is synchronous and it can handle only one request at a time. Knowing that by configuring thread, process at, you know for sure we will be able to handle more than a request at a time. Should we use it, clearly not, even the is not supposed to be called but can/cannot is different from should/should not.

Have a bet with your coworkers and win yourself a beer!

a good stack(deque&list) and a good queue (deque) in Python

Stack and queue are probably the mostly commonly used data structures, for example, a stack is required in the depth first search and a queue is required in a breath first search. However, when it comes to the explicit implementation, one usually settle with the old friend – list, since it is so powerful and have native operations micmic both structures behavior. Here is a quick post about the performance comparison.


the stack data structure should support Last In First Out (LIFO), and basic operations should include quick put and get. Both list and deque are good for a stack given their performance are very comparable. However, each of them has their own advantages over the other in an obvious way, so pick wisely based on the use case.

Here is a performance comparison demonstrating both list and deque can be used for stack, both constant O(1) for one side put and get.

This is another performance test that shows deque’s random access O(n) is much slower (300x slower) than a list O(1).


the queue data structure should support the concept of First In First Out (FIFO), and basic operations include put and get too, but the put and get should happen at different ends.

You can use list append to push objects which is fast, but in order to get the first element, you will need to use the pop(index=0) which requires shifting all rest of elements.

You can find that using the deque data structure achieves the best performance of 152ms to add 1M objects and only 120ms to retrieve all those objects. For the queue.Queue, it took 2.118s (14x slower) to store and 2.214s (18x slower) to get all objects. Last for the list, it is pretty quick to add all the objects 162ms but the pop(0) took close to 3 minutes (1450x slower)!

Lesson learned, use deque if you need a queue.

choose wisely - County EM

gzip is append”able”

gzip is a commonly used data compression file format. There is also an application gzip that works with this file format. Recently I came across a use case where I need to apply some data transformation to a large dataset stored in the cloud. The data is so large that I cannot fit all the raw input data to the disk, not even the output data, but compression of the output, I should be able to based on estimation. This article is to discuss some of the characteristics of gzip file in general by walking through my use case.

I thought compression is not divisible. You need to first have ALL the raw input, then you can apply the compression, the compression will analyze the global frequency and build the dictionary to encode all the data. In that case, I might be able to process the raw data little by little, but still won’t be able to fit the uncompressed output, let alone compression. However, if the compression can be done piecewise, I can process a small fraction of the input, transform it, compress it, and iterate through this process until finish and put everything together in the end. Most importantly, this means I can even parallelize.

Now the whole question depends on whether a compression can be divided. The short answer is Yes. It can be.

Gzip fileformat is defined in RFC1952.

2.2. File format

      A gzip file consists of a series of "members" (compressed data
      sets).  The format of each member is specified in the following
      section.  The members simply appear one after another in the file,
      with no additional information before, between, or after them.

My previous impression of analyzing frequency could be true within a member, but based on the documentation, a gzip file could be a simple concatenation of multiple gzip file where each “member” is independent. By reading through the deflate documentation, the compression is actually happening block by block, the block size can vary but when it compress, it process about 64KB input at a time.

gzip -> members -> blocks

[114]. we first write the Titanic dataframe to a csv compressed in gzip format, p.s., if the file suffix is gz, pandas will autoapply compression to it. Then we created a loop to repetitively append the compressed datastream to the same gzip file.

[115]. is the same process but it is uncompressed, basically how to rbind csv files.

This code snippet demonstrated that we can keep “concatenating gzip files”. We still need to be aware of what underlying data format we are using. In this case, I am using a CSV which is very easy. The only tricky part is to make sure all the columns are consistent and you don’t mistakenly include the headers when you append data.

In the end, you have one gz file, after you uncompress it, you still have one large CSV file.


GZIP file format specification version 4.3
Python gzip, zlib
DEFLATE Compressed Data Format Specification version 1.3

Python CST and AST

There are two key concepts in parsing any Python code, one being the concrete syntax tree and the other being abstract syntax tree. This post provide some introductory information about what they are, their roles in the parsing process, commonalities and differences between them.

In fact, as this post is written in 2021, it is a possibility, or probability that the term CST might be rendered obsolete because of the new parser implementation parser in newer versions of Python. parser is a built-in python language that has been there for a long time that works with CST. However, given the latest implementation, it won’t be a two stage process CODE -> CST -> AST anymore. As CST is still a very important concept in understanding any programming language, I think it will still be a good idea to learn about it since it will be a while for us all to move into 3.9 and CST and AST still serves its general educational purpose when it comes programming language.

Python Software Foundation News: Replacing CPython's parser - Python  Language Summit 2020

CST is the exact parsed tree representation of your code conforming to the language grammar, while the AST is a simplified CST keeping only its gist. Based on this Stackoverflow post, depending on how the developer writes the code, eg. a = 1 + 2 * 3, there could be different variations like a = 1 + (2*3) or a = (1 + (2*3)), the CST for each variation is different as the code literally are different, the CST will contain all the different placement of parenthesis. After all the operation precedence got taken into considerations, the CSTs will be simplified and end up being the same AST. Of course, for the same CST, if the underlying executions are different, the same code can lead to different AST.

In the following example, two implementations have different CST but all share the same AST in the end.


book: CPython-Internals