QA against a DataFrame using LangChain

Analysts have always been in high demand. Business users and stakeholders need facts and insights to guide their day-to-day decision making and future planning, and getting those usually requires some technical capability to manipulate the data. This can be a time-consuming and sometimes frustrating experience: everything is urgent, everything is wrong, some of it is helpful, and you are lucky if the exercise actually makes a positive impact in the end.

In the old world, you had your top guy/gal who was a factotum and knew a variety of toolkits: Excel, SQL, Python, some BI experience and maybe, above and beyond, some big data background. To be a good analyst, this person had better be business friendly, with an easygoing personality and a quiet, curious mindset, able to work hand in hand with stakeholders, sift through all their random thoughts, and turn them into queries that fit the data schema.

In the past few years, computing frameworks have improved to the point that much data-related work is democratized: anyone with a year or two of engineering background can work with an almost infinite amount of data with ease (cloud computing, Spark, Snowflake, even just pandas). Tools like Power BI, Tableau and Excel let people query well-formatted data themselves, and innovations like ThoughtSpot, Alteryx and DataRobot aim to make it a turn-key solution.

Recently I came across a demo using LangChain. Without any training, the solution can generate interesting thoughts and turn natural language into working queries, with admittedly mixed outcomes.

Set Up

The setup is extremely easy: sign up for an OpenAI account, install LangChain, and that is it.

Once you create an agent, you can simply ask it any question you may have in plain language.

To confirm everything is working properly, we asked “how many rows are there” and got the answer “891” for the Titanic dataset.
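The setup described above might look like the sketch below. The agent helper has moved between LangChain releases (at the time of writing it lives in langchain_experimental), so the import paths and the model name are assumptions worth checking against your installed version; running the agent also requires an OPENAI_API_KEY in the environment.

```python
import pandas as pd

# Stand-in dataframe; the post uses the Titanic dataset (891 rows).
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],
})

if __name__ == "__main__":
    # Requires `pip install langchain-experimental langchain-openai`
    # and an OPENAI_API_KEY in the environment.
    from langchain_openai import ChatOpenAI
    from langchain_experimental.agents import create_pandas_dataframe_agent

    agent = create_pandas_dataframe_agent(
        ChatOpenAI(temperature=0),
        df,
        verbose=True,
        allow_dangerous_code=True,  # the agent executes generated Python
    )
    print(agent.invoke("how many rows are there"))
```

Under the hood the agent writes small pandas snippets (like `len(df)`) and executes them, which is why the row-count question works without any schema hints.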

Interesting Questions to Ask:


To get started, you can ask about the schema; typically people would run df.describe() or df.dtypes.

As the number of columns increases, it can be time consuming to navigate and find, say, the date columns; now you can just ask the dataset yourself.


Thanks to OpenAI, it even supports multiple languages. If you ask the question in Chinese, it will reply back in Chinese.




Machine Learning?

Not Perfect

One interesting observation: I asked “how many women survived” and came across 3 different answers for the exact same cell I had executed before. The first time, the answer was wrong because it only returned how many females were aboard without taking survivorship into account; then, after several other questions, the same question somehow returned 0. When I restarted the prompt by recreating the LLM, I got the correct answer.

Again, like any other solution: “trust but verify.”

Marriage Penalty or Blessing

The US tax code has a few options for people in a relationship to file taxes. For example, they can file separately as single individuals prior to marriage, and once married, they can file jointly or even separately as a married couple. After some analysis, it turns out the tax code really incentivizes marriage in many ways, so to all the singles out there, go get married (just kidding). 🙂

  1. Tax Schedule

You can easily find tax schedules for different filing statuses on the internet. Attached are two screenshots of the 2022 schedules from Forbes. The more you earn, the more tax you have to pay on the excess income: “$xxx plus a higher percentage of the excess over…”

If we look at the tax visually: under a flat tax rate, you would expect a straight line shooting from the bottom left to the top right. The fact that the lines have different segments and curl toward the top left indicates that the marginal tax rate keeps getting higher.

2. Calculate Single and Married Couple Tax

Now that we have a good understanding of the tax schedule, we can wrap all the different tax logic behind a function, so that for any given income we can calculate the total tax owed.

If we imagine a couple where both partners have income, we can simulate how much tax they would pay filing separately as singles (don’t get married for tax reasons, stay single) versus getting married. We can even iterate through every possible income combination for a couple.
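A sketch of those two steps, a bracket function plus the full grid of income pairs, is below. It uses the 2022 federal rate schedule on taxable income and ignores deductions, credits and state taxes; the bracket numbers are worth double-checking against the IRS tables, and the column names mirror the ones used in this post.

```python
import numpy as np
import pandas as pd

# 2022 federal brackets on taxable income: (lower bound, marginal rate).
BRACKETS = {
    "single": [(0, .10), (10_275, .12), (41_775, .22), (89_075, .24),
               (170_050, .32), (215_950, .35), (539_900, .37)],
    "married_joint": [(0, .10), (20_550, .12), (83_550, .22), (178_150, .24),
                      (340_100, .32), (431_900, .35), (647_850, .37)],
}

def tax(income, status="single"):
    """Total tax owed on `income` under a progressive schedule."""
    owed = 0.0
    bounds = [lo for lo, _ in BRACKETS[status]] + [float("inf")]
    for (lo, rate), hi in zip(BRACKETS[status], bounds[1:]):
        if income <= lo:
            break
        owed += (min(income, hi) - lo) * rate
    return owed

# Every combination of two incomes, in $10K steps up to $500K each.
incomes = np.arange(0, 500_001, 10_000)
grid = pd.DataFrame(
    [(p1, p2) for p1 in incomes for p2 in incomes], columns=["p1", "p2"]
)
grid["tax_p1"] = grid["p1"].map(tax)
grid["tax_p2"] = grid["p2"].map(tax)
grid["tax_p1p2_single"] = grid["tax_p1"] + grid["tax_p2"]
grid["tax_p1p2_couple"] = (grid["p1"] + grid["p2"]).map(
    lambda x: tax(x, "married_joint")
)
# positive = marriage penalty, negative = marriage bonus
grid["marriage_tax"] = grid["tax_p1p2_couple"] - grid["tax_p1p2_single"]
grid["percentage"] = grid["marriage_tax"] / grid["tax_p1p2_single"].replace(0, np.nan)
```

As a spot check, a single earner on $100K owes $17,835.50 filing single but $13,234 filing jointly with a zero-income spouse, a marriage bonus of about $4,600.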

In the code above, the p1 and p2 columns represent each person’s income, and tax_p1 and tax_p2 are the taxes they would pay filing as singles. tax_p1p2_single and tax_p1p2_couple are the total taxes they owe before and after marriage. To understand the numerical differences, we created two new columns: marriage_tax is the additional tax the couple would have to pay; if you get the newly-wedded tax penalty, the value is positive, and if you save money, it is negative. The percentage column represents the proportional difference.

3. Plot and Understand the Difference

After the difference columns are populated, we can plot the heatmap using seaborn.

Each axis represents one individual’s income. If the two of you make the same amount of money, you sit on the diagonal from top left to bottom right. If there is one big earner and the other makes significantly less (like a stay-at-home dad or mom), you are pushed away from the diagonal; the top and left-most edges are the extreme scenarios where there is only one earner in the relationship.

In the color map below, blue means the percentage is very low, which indicates that marriage saves a lot of money tax-wise. I created a mask to hide the areas where there is no tax difference, which is the empty area mostly on the diagonal and bottom right.
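The plotting step can be sketched as below, assuming the simulation produced a dataframe with p1, p2 and percentage columns; a tiny randomized stand-in grid is generated here so the snippet runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in for the simulated results: percentage difference per income pair.
rng = np.random.default_rng(0)
incomes = np.arange(0, 200_001, 50_000)
grid = pd.DataFrame(
    [(p1, p2, rng.uniform(-0.3, 0.05)) for p1 in incomes for p2 in incomes],
    columns=["p1", "p2", "percentage"],
)

pivot = grid.pivot(index="p1", columns="p2", values="percentage")
mask = pivot.abs() < 1e-9  # hide cells with no tax difference
ax = sns.heatmap(pivot, mask=mask, cmap="coolwarm", center=0)
ax.set(xlabel="person 2 income", ylabel="person 1 income")
plt.savefig("marriage_tax_heatmap.png")
```

Centering the colormap at 0 keeps the penalty/bonus boundary visually obvious: red cells are a marriage penalty, blue cells a bonus.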

Some takeaways:

  • a couple can save up to 30% in tax by getting married instead of staying single. This happens when there is a big income gap between the two.
  • if there are two earners making roughly the same amount, marriage doesn’t make a big difference
  • very likely, it doesn’t “hurt” to get married (except for the small red triangle where both partners earn around 300K)

Disclaimer, this article is absolutely NOT meant to provide any official tax, financial or relationship guidance.

Needle in a haystack

The Cambridge dictionary defines the phrase “find a needle in a haystack” as “something that is impossible or extremely difficult to find, especially because the area that you have to search is too large”. If we are being really serious: how difficult is it to find a needle in a haystack? Let’s be specific.


There are many needles for sale on Amazon. Here is one of the most popular ones: there are 23 needles in the pack, with 0.317oz of weight. Converting to metric first, 0.317oz = 8.98g. That means each needle weighs on average 0.4g. Depending on how much the packaging itself weighs, the real weight should be even less than 0.4g.

When it comes to needle size, just as different clothes manufacturers have different size charts, different needle manufacturers have their own manuals explaining their size systems. For example, John James is a popular UK brand and they have a size chart here. The smallest needle I can find is their JJ195 for tapestry and cross stitch, size 28, with a length of 32.5mm and a diameter of 0.53mm. The volume of a cylinder of that shape is pi * (d/2)^2 * l = 7.17mm^3.

Plain steel’s density about 7.85 g/cm3 (reference). The inferred weight is about 0.06g. This is a significantly smaller number compared to the 0.4g previously guessed. However, if we take another look, the previously estimated 0.4g was based on relatively medium size needles. Even in JJ’s manual, you can find many needles with a larger diameter size, ranging from 0.53 to 2.34, if we moderately increase the needle size by only a factor of 2, that will increase the volume by a factor of 8. 0.06g * 8 ~ 0.48g which doesn’t sound that crazy at all.

To summarize, a small needle can weigh 0.06g with a volume of about 7mm^3.


For lack of a better term, a haystack is merely a stack of grass. They exist in many forms. On modern farms, hay is usually handled by industrial machines and packed into square or round bales of various sizes.

John Deere is one of the largest manufacturers of this kind of industrial machine. The machine that packs loose hay into compact bales is called a baler, for example the 450M round baler.

Interestingly, you can find the baler specifications in the user manual. The largest bale is probably packed by the 560M, with a weight of 1089kg and a volume of 4.13 m^3.


By Weight

needle ~ 0.06g, haystack ~ 1089kg.

haystack / needle = 1,089,000g / 0.06g ≈ 18,150,000 ≈ 18M

By Volume

needle ~ 7.17mm^3, haystack ~ 4.13m^3

haystack / needle = 4.13 × 10^9 mm^3 / 7.17 mm^3 ≈ 576,000,000 ≈ 576M (since 1 m^3 = 10^9 mm^3)


Fun fact: 576M is roughly the population of North America, so by volume, finding a needle in a haystack is about the same odds as picking one specific person out of an entire continent. Long story short, it is hard to find a needle in a haystack. Hold on to your needle and don’t drop it.
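The back-of-the-envelope numbers above can be sanity-checked in a few lines:

```python
import math

# Needle: John James JJ195 size 28, 32.5mm long, 0.53mm diameter, steel.
needle_volume_mm3 = math.pi * (0.53 / 2) ** 2 * 32.5   # ~7.17 mm^3
needle_weight_g = needle_volume_mm3 / 1000 * 7.85      # steel ~7.85 g/cm^3

# Haystack: John Deere 560M round bale.
bale_weight_g = 1089 * 1000          # 1089 kg
bale_volume_mm3 = 4.13 * 1000 ** 3   # 4.13 m^3, and 1 m^3 = 10^9 mm^3

print(f"needle  ~ {needle_weight_g:.3f} g, {needle_volume_mm3:.2f} mm^3")
print(f"by weight: {bale_weight_g / needle_weight_g:,.0f}")
print(f"by volume: {bale_volume_mm3 / needle_volume_mm3:,.0f}")
```

The weight ratio lands around 18–19M (depending on whether you round the needle to 0.06g) and the volume ratio around 576M.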

OBS live streaming

In this post, we will set up a live stream that broadcasts a live camera, a Chrome browser window playing YouTube, and a video game, arranged in one view. The stream will be created by OBS and served through an Nginx RTMP server. We will “watch our live stream” via VLC on another machine within the same private network.

A screenshot of desktop station with a lot of applications running.

OBS studio running with the three windows created.

First, configure the OBS stream output to point at the RTMP server.

Start the docker container running Nginx with RTMP plugin.
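A minimal sketch of the server and client commands. The exact image is an assumption (tiangolo/nginx-rtmp is a commonly used community image), 1935 is the default RTMP port, and the “live” application name and stream key are illustrative, so adjust to your image’s nginx.conf.

```shell
# Run Nginx with the RTMP module on the default RTMP port.
docker run -d --name rtmp -p 1935:1935 tiangolo/nginx-rtmp

# OBS stream settings (Settings > Stream > Custom):
#   Server:     rtmp://<server-ip>/live
#   Stream key: mystream
#
# On the watcher machine, open a network stream in VLC:
#   rtmp://<server-ip>/live/mystream
```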

Point the VLC stream to open RTMP stream

Side by side of the streamer’s view and watcher’s view.



Python Code Profiling

It is very important to understand how fast your Python code is so that your ML endpoint is fast and efficient. In modern ML application deployment, endpoints are usually deployed inside a container, and the speed of your “def predict(input)” function usually directly determines your throughput and cost. In plain words: if your inference service costs you $100K a year at a latency of 200ms and performance degrades to 400ms, expect your cloud cost to simply double. (There can be opportunities to decouple IO-bound and CPU-bound steps into different microservices to improve efficiency.)

Here is a list of ways to profile:

  1. print statements with manual timers (e.g. time.perf_counter())
  2. Profilers: cProfile, yappi, vmprof

First, we need a simple block of code to profile. In the following block, we create a function myfunc that calls three different implementations of a summation; between calls there is a short sleep that varies between 0 and 3 seconds, and myfunc is called 5 times in total. The expectation is that after the script is profiled, we will discover that func_slow takes a significant share of the execution time, and by how much.
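The workload can be sketched as below; the function names and the 0–3 second sleep range follow the description above, while the summation size is illustrative.

```python
import random
import time

def func_fast(n):
    return n * (n - 1) // 2          # closed-form sum

def func_builtin(n):
    return sum(range(n))             # C-level loop

def func_slow(n):
    total = 0
    for i in range(n):               # naive Python loop, the expected hotspot
        total += i
    return total

def myfunc(n=200_000, max_sleep=3.0):
    """Three implementations of the same summation, separated by random sleeps."""
    results = [func_fast(n)]
    time.sleep(random.uniform(0, max_sleep))
    results.append(func_builtin(n))
    time.sleep(random.uniform(0, max_sleep))
    results.append(func_slow(n))
    return results

if __name__ == "__main__":
    for _ in range(5):
        myfunc()
```

Saving this as, say, profile_me.py, it can be profiled from the command line with `python -m cProfile -o profile.output profile_me.py`.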

After the cProfile command is executed, the output is dumped into a separate file, profile.output, which can be fed to other tools for analysis or visualization.

Sometimes the output is a bit hard to interpret and we would like to visualize it. There is a lightweight tool called snakeviz that does exactly that.

If you are using an IDE like PyCharm, it has this functionality integrated.

YouTube 360 Video Download

Above is a YouTube video uploaded by J Utah, captured using an Insta360 Pro back in 2018. If you have never seen a 360 video before, it is basically an immersive experience where all angles are captured. Since traditional displays (monitor, cellphone) have limited real estate, they only show the portion that fits, and users can pan/zoom/rotate to look in all directions. If you have a VR device (like an Oculus or Google Cardboard), these videos create the best viewing experience.

Due to my internet speed, I downloaded this video locally using a site called Y2mate. After it is downloaded, it is just a typical mp4 file that looks … a bit weird.

It is easy to notice the video is divided into two halves: the top half includes the left, front and right views from the point of view of a driver, and the lower half is like someone sitting on top of the car facing backwards, incorporating the up, back and down views.

It may be hard to picture how all the angles stitch together into the 360 view if this is the first time you have seen a layout like this. A quick way to understand it is to take a piece of paper and cut out two strips (1x3 each), representing the upper and lower halves. Then it is very easy to see how those 6 views fit together into a cube.

Cube mapping is a very popular environment-mapping method for representing 360 space; after rearranging the tiles, you get something like this. I marked C90 for a clockwise 90-degree rotation and CC90 for counter-clockwise 90 degrees. (Note: the bottom view is basically the car’s hard top with some reflection of the top view; my rotation may be wrong as it is hard to read, all the buildings look similar.)

Now we know the views are all captured and arranged in an organized way. Can we process the video and tinker with the output format so it can be recognized as a 360 video that you can view in VLC?
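As a teaser, ffmpeg’s v360 filter can remap between projections. Whether this particular download matches ffmpeg’s “eac” layout definition is an assumption worth verifying against your own file, and the filenames are placeholders.

```shell
# Remap a cube-layout (equi-angular cubemap) video to equirectangular,
# assuming the layout matches ffmpeg's "eac" definition.
ffmpeg -i downloaded.mp4 -vf "v360=input=eac:output=equirect" equirect.mp4

# VLC recognizes a file as 360 video via spherical metadata; Google's
# spatial-media "Spatial Media Metadata Injector" tool can add it.
```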

The mapping and math behind it will be discussed in the next post.

KITTI dataset

KITTI is one of the most popular public datasets and industry benchmarks when it comes to autonomous driving research. This article includes data explorations of the published dataset so readers can build a more intuitive understanding of how it is captured and how it should be used.

Let’s work backwards by diving straight into the dataset. There is a section of raw data you can download. At the bottom of the page, you can find different scenes and links to the downloads.

For example, here is one.

After the data is downloaded, we can see the synced data is 458MB, which is likely the 0.4GB that appears in the title. The extracted file is ~50% bigger and probably contains all the raw data points before any processing. The calibration and tracklet files are small.


Calibration is the process of resetting your sensors so they are benchmarked against known measurements for accuracy. Just like with a scale, calibration makes sure it reads 0 without any weight. It is the same for the different sensors in an AV.

There are 3 small plain text files.





The extract folder contains 6 subfolders: 4 contain the images captured by the cameras, one the lidar data, and one the motion-sensor data.

Putting the first image from each camera side by side, it is pretty easy to tell that the first two are greyscale and the next two are color images. We can then tell the left cameras from the right by comparing relative distances to benchmarks like the light rail or road lanes.

Inside each camera folder there is also a timestamps text file storing when each image was captured. The frequency for each camera is about 10Hz, and all cameras seem to have started capturing “at the same time”: the earliest at 445ms and the latest at 454ms, a nominal difference of 9ms. However, we are not sure whether the 4 cameras share the same clock, and if not, whether the 4 clocks are perfectly synced.
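The parsing can be sketched like this. The file layout follows the KITTI raw devkit: one stamp per line with nanosecond precision, which Python’s datetime can only handle after truncating to microseconds.

```python
from datetime import datetime
from statistics import mean

def load_timestamps(path):
    """Parse a KITTI timestamps.txt file.

    Each line looks like '2011-09-26 13:02:25.964389445'; the stamps carry
    nanosecond precision, truncated here to microseconds for datetime.
    """
    stamps = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                stamps.append(
                    datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")
                )
    return stamps

def capture_frequency_hz(stamps):
    """Mean frequency implied by the gaps between consecutive stamps."""
    gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return 1.0 / mean(gaps)
```

Running capture_frequency_hz over a camera’s timestamps file should land close to the 10Hz mentioned above.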

IMU (Inertial measurement unit)

Interestingly, the velocity components show the car moving at -6.94 m/s north and -11.36 m/s east, so it is heading southwest, leaning more to the west. The magnitude, sqrt(6.94^2 + 11.36^2) ≈ 13.32 m/s, matches the forward velocity of 13.32184 m/s captured by the sensor itself, which is about 30 mph or 48 km/h.

It then shows various types of acceleration.

LIDAR / Cloud points

The lidar also captures at a 10Hz frequency.
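Each lidar sweep is stored as a flat binary file of float32 values, 4 per point (x, y, z, reflectance), per the KITTI raw devkit, so loading one is nearly a one-liner with numpy; the path in the demo is illustrative.

```python
import numpy as np

def load_velodyne_scan(path):
    """Load one KITTI velodyne .bin sweep as an (N, 4) float32 array:
    x, y, z in meters plus a reflectance value per point."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)

if __name__ == "__main__":
    # Illustrative path into an extracted KITTI raw drive.
    scan = load_velodyne_scan("velodyne_points/data/0000000000.bin")
    print(scan.shape)
```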

Does Divide & Conquer Help?

Today I was chatting with my coworker about what happens when you over-saturate a server with containers, each with a small CPU quota. In k8s, since it is container based, you can literally assign quotas of less than 1 CPU, like 0.5 CPU (500 millicores) or 0.001 CPU (1 millicore). That opens up a whole lot of possibilities.

One interesting scenario that we covered is below.
Assume a CPU-intensive request that takes a single core 1 second to finish, and that 4 such requests arrive at the same time on a 2-core machine. One option is to allocate two requests to the two cores, one each, and assign the next two once those finish. The first two requests cost 1 second each, and because of queuing, the second two take 2 seconds (1 second of waiting) to respond, so the average is 1.5 seconds. In scenario 2, assume we can slice the tasks into smaller pieces and share the computing fairly among all of them. Surprisingly, all four requests now take close to 2 seconds to finish, with the average being (2+2+2+2)/4 = 2.

Another interesting scenario: we still split tasks into smaller pieces, but this time we focus all the horsepower on one request at a time. Now the average is (0.5 + 1 + 1.5 + 2) / 4 = 1.25 seconds.
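The three policies can be checked with a small arithmetic simulation (2 cores, 4 identical 1-second requests, assuming tasks split perfectly with no scheduling overhead):

```python
def fcfs(n_tasks=4, cores=2, work=1.0):
    """Run-to-completion: each request occupies one core exclusively."""
    core_free = [0.0] * cores
    finish = []
    for _ in range(n_tasks):
        i = core_free.index(min(core_free))   # earliest available core
        core_free[i] += work
        finish.append(core_free[i])
    return finish

def processor_sharing(n_tasks=4, cores=2, work=1.0):
    """Fair sharing: all requests progress together at rate cores/n."""
    t = n_tasks * work / cores                # everyone finishes at once
    return [t] * n_tasks

def one_at_a_time(n_tasks=4, cores=2, work=1.0):
    """All cores focus on a single request before moving to the next."""
    return [(k + 1) * work / cores for k in range(n_tasks)]

for policy in (fcfs, processor_sharing, one_at_a_time):
    times = policy()
    print(policy.__name__, times, "avg =", sum(times) / len(times))
```

This prints completion times of [1, 1, 2, 2] (avg 1.5), [2, 2, 2, 2] (avg 2), and [0.5, 1, 1.5, 2] (avg 1.25), matching the three scenarios above.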

Just some food for thought. In the next chapter, it will be interesting to implement a scheduler and reproduce this in a containerized environment.