This post is meant to familiarize ourselves with the functions we are going to use to calculate the cross correlation of stock prices. We will create two dummy time series, one of which is a leading indicator of the other, and then try to detect that lead, plot it, and understand how it all works in Python.
1. time series
Time series data is a natural representation of signals such as temperature history, pricing history, inventory history, balance history, and pretty much any other kind of history used in day-to-day life. We could use a pandas DataFrame, but in this case a Series with the datetime field as its index is enough.
In this case, we generated a series, s_a, of 8 elements starting at 2018/01/01. Next, we are going to generate another series, s_b, which is a leading indicator running 2 days ahead of s_a.
Before we hard-code another series that is, say, one day ahead of the first series, like [0,0,1,2,3,2,1,0], let’s check whether pd.Series has a method we can use. There are a whole lot of functions for working with time series data, and the closest candidates for our purpose look like shift, tshift, and slice_shift. (Note that tshift and slice_shift have since been deprecated and were removed in pandas 2.0, so shift is the one to rely on.)
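A minimal sketch of building such a series. The values are an assumption, inferred from the walkthrough in section 2, and the name s_a follows the rest of the post.

```python
import pandas as pd

# Values are assumed (inferred from the cross correlation walkthrough below).
values = [0, 0, 0, 1, 2, 3, 2, 1]

# Daily DatetimeIndex starting at 2018-01-01, one entry per value.
index = pd.date_range(start="2018-01-01", periods=len(values), freq="D")

s_a = pd.Series(values, index=index, name="s_a")
print(s_a)
```

Using the dates as the index (rather than as a separate column) is what lets pandas align and shift the series by date later on.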
The shift method is indeed very powerful: it can not only keep the datetime window fixed and shift the values, filling the vacated slots with NaN, but also, if required, shift the window itself by a specified frequency. Shifting the window back by a frequency of two days is a neat way to generate s_b, a leading indicator that runs two days ahead of s_a.
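The two behaviors of shift can be sketched as follows. The series values are assumed as above; periods=-2 and freq="D" are chosen to match the two-day lead described in this post.

```python
import pandas as pd

s_a = pd.Series(
    [0, 0, 0, 1, 2, 3, 2, 1],  # assumed values
    index=pd.date_range("2018-01-01", periods=8, freq="D"),
    name="s_a",
)

# 1) Shift the values only: the index stays put and the last two slots become NaN.
shifted_values = s_a.shift(-2)
print(shifted_values)

# 2) Shift the index by a frequency: values are untouched, every date moves 2 days earlier.
s_b = s_a.shift(-2, freq="D").rename("s_b")
print(s_b)
```

The second form keeps all eight values, which is why it is the more convenient way to build the leading series.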
After generating the leading indicator, we can put the two series side by side so that the relationship is obvious. pd.concat is a really powerful function that deserves a whole article of its own, but for now it serves the purpose of doing a full outer join of the two time series by date.
As the cherry on top, here is the visualization of the two signals, one running 2 days ahead of the other.
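A short sketch of that side-by-side view, assuming the same s_a and s_b as before. With axis=1, pd.concat lines the series up as columns, and its default join is "outer".

```python
import pandas as pd

s_a = pd.Series(
    [0, 0, 0, 1, 2, 3, 2, 1],  # assumed values
    index=pd.date_range("2018-01-01", periods=8, freq="D"),
    name="s_a",
)
s_b = s_a.shift(-2, freq="D").rename("s_b")

# Outer join on the date index: dates present in only one series get NaN in the other column.
df = pd.concat([s_a, s_b], axis=1)
print(df)
```

The result covers the union of both date ranges, so s_a is NaN on the first two dates and s_b is NaN on the last two.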
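A plot along these lines can be produced with the pandas plotting wrapper around matplotlib. This is only a sketch: the marker, title, and axis labels are illustrative choices, not taken from the original figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

s_a = pd.Series(
    [0, 0, 0, 1, 2, 3, 2, 1],  # assumed values
    index=pd.date_range("2018-01-01", periods=8, freq="D"),
    name="s_a",
)
s_b = s_a.shift(-2, freq="D").rename("s_b")
df = pd.concat([s_a, s_b], axis=1)

# One line per column; NaN entries are simply left out of each line.
ax = df.plot(marker="o", title="s_b runs 2 days ahead of s_a")
ax.set_xlabel("date")
ax.set_ylabel("value")

# Call plt.show() in an interactive session, or ax.figure.savefig("signals.png") to save it.
```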
2. cross correlation
Cross correlation calculates the dot product of two series for every possible shift. For example, let’s fix s_a and slide s_b from left to right. At the beginning, s_b is far away to the left and the two do not overlap at all.
- First intersection: as we move s_b to the right, the first overlap is the rightmost element of s_b against the leftmost element of s_a, in this case [1] from s_b and [0] from s_a. The dot product is 0, hence the first 0 in the corr variable.
- Second intersection: the two rightmost elements of s_b, [2,1], overlap the two leftmost elements of s_a, [0,0], which still gives 0.
- …
- In fact, it is not until four elements overlap, [0,0,0,1] from s_a and [2,3,2,1] from s_b, that the dot product becomes 1.
- And so on, until the leftmost element of s_b meets the rightmost element of s_a.
- After that, s_b keeps moving to the right, away from s_a, and the two never overlap again.
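The sliding procedure above can be sketched in code. This assumes the corr variable comes from NumPy's np.correlate in "full" mode, which the post does not state explicitly; sliding_dot is a hypothetical helper written only to mirror the bullet points.

```python
import numpy as np

a = np.array([0, 0, 0, 1, 2, 3, 2, 1])  # values of s_a (assumed)
b = a.copy()                             # s_b has the same values, only its dates differ

corr = np.correlate(a, b, mode="full")
print(corr)  # values: 0 0 0 1 4 10 16 19 16 10 4 1 0 0 0


def sliding_dot(a, b):
    """Keep a fixed, slide b from left to right, and take the dot product of the overlap."""
    n, m = len(a), len(b)
    out = []
    for k in range(-(m - 1), n):            # k = position of b's first element relative to a
        lo, hi = max(0, k), min(n, k + m)   # overlapping positions, in a's coordinates
        out.append(int(np.dot(a[lo:hi], b[lo - k:hi - k])))
    return np.array(out)


print(np.array_equal(sliding_dot(a, b), corr))  # True
```

The output has 8 + 8 - 1 = 15 entries, one per overlapping shift, and the fourth entry is the first nonzero one, matching the four-element overlap described above.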
As you can see, in our dummy example the dot product is maximized when the two lists are perfectly aligned with each other, element by element. However, so far we have only aligned the values; let’s take a look at the index. We can pick the element at the same position in either list: the first 0 from s_a represents 2018-01-01 and the first 0 from s_b represents 2017-12-30. So we now know that s_b is 2 days ahead of s_a purely by analyzing the cross correlation, and that is exactly how we constructed s_b in the first place, isn’t it?
Note that here we are simply calculating a sliding dot product, which is not necessarily a traditional correlation such as the Pearson correlation. After all, how could a correlation be greater than 1? There is a good Stack Overflow question that more or less addresses this problem.
We can see that the cross correlation is maximized at the 8th position, and both s_a and s_b have length 8. Sliding one across the other gives 8 + 8 - 1 = 15 positions, and the 8th is the middle one, where the two series overlap completely, so the maximum occurs when they are perfectly aligned. Let’s take a look at another example, where the two series have different patterns and lengths.
In this case, the cross correlation is maximized when s_b is shifted to the right by 7, which is exactly where the maximum of s_b lines up with the maximum of s_a.
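To make the step from positions to dates concrete, here is a hedged sketch. It assumes corr is computed with np.correlate in "full" mode; the names k, pos_a, pos_b, and lead are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

s_a = pd.Series(
    [0, 0, 0, 1, 2, 3, 2, 1],  # assumed values
    index=pd.date_range("2018-01-01", periods=8, freq="D"),
)
s_b = s_a.shift(-2, freq="D")

corr = np.correlate(s_a.to_numpy(), s_b.to_numpy(), mode="full")

# Positional offset at the peak; 0 means the values line up without any sliding.
k = int(corr.argmax()) - (len(s_b) - 1)

# At the peak, position n of s_b sits under position n + k of s_a.
pos_a, pos_b = (k, 0) if k >= 0 else (0, -k)

# Compare the dates of one aligned pair to get the lead in calendar time.
lead = s_a.index[pos_a] - s_b.index[pos_b]
print(k, lead)  # 0 and a Timedelta of 2 days
```

The values alone only say that the shapes match at offset 0; it is the difference between the two indexes at that offset that reveals the two-day lead.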
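A small sketch of the difference, using the same assumed values. The normalization shown (standardize both series, then divide by the length) is one common convention and an assumption here, not necessarily what the linked question recommends.

```python
import numpy as np

a = np.array([0, 0, 0, 1, 2, 3, 2, 1], dtype=float)  # assumed values of s_a
b = a.copy()                                          # s_b carries the same values

# Raw sliding dot product: the peak is the sum of squares, well above 1.
raw_peak = np.correlate(a, b, mode="full").max()
print(raw_peak)  # 19.0

# Pearson correlation of the aligned values is bounded by [-1, 1].
pearson = np.corrcoef(a, b)[0, 1]

# Standardizing first and dividing by the length gives a Pearson-like value at zero lag.
za = (a - a.mean()) / a.std()
zb = (b - b.mean()) / b.std()
norm = np.correlate(za, zb, mode="full") / len(a)
print(pearson, norm.max())
```

For locating the lag, the raw and normalized versions of this example peak at the same position; normalization mostly matters when the peak height itself needs to be interpreted.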
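The original data for this second example is not reproduced here, so the sketch below uses made-up values, chosen only so that the series differ in length and the best shift also comes out as 7. It again assumes np.correlate in "full" mode.

```python
import numpy as np

# Hypothetical data, not the series from the original example.
s_a = np.array([0, 0, 0, 0, 0, 0, 0, 1, 5, 1, 0, 0])  # length 12, peak at position 8
s_b = np.array([1, 5, 1])                              # length 3, peak at position 1

corr = np.correlate(s_a, s_b, mode="full")

# Convert the argmax of the full output into "how far s_b is shifted to the right".
shift = int(corr.argmax()) - (len(s_b) - 1)
print(shift)  # 7

# With this shift, the peak of s_b lands on the peak of s_a.
print(int(s_b.argmax()) + shift == int(s_a.argmax()))  # True
```

Subtracting len(s_b) - 1 is what turns a position in the full output into a shift, since the first output entry corresponds to s_b overlapping s_a by a single element on the left.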
3. summary
Cross correlation is useful when you want to find the offset (lag or lead) at which two time series best match, and the two series do not necessarily have to share the same length.
(Note: don’t confuse it with the Pearson correlation; a cross correlation does not necessarily fall between -1 and 1.)