Mac – Some notes installing tsoutliers

I am trying to use the tsoutliers package from R which could only be installed from source code. The installation is not friendly at all based on my experience. To guarantee the success, you have to make sure that you have the necessary dependencies ready, like the proper compiler.

There are indeed many different versions of compilers available across all the platforms. I am using a MAC, and when I first started programming. My friends told me a easy way to get a lot necessary developers tools is to download XCode and you can download the “Developer Command Line Tool”.

xcode_cmdlinetools
I did some research, seems like there are two commonly used compilers available for mac users. GCC which is the one from GNU, and Clang from Apple. The simple reason that why Apple rebuilt the GNU hippo is because of license where GCC is GPL based which means whichever code uses GPL licenced code need to be open source too… however, Clang is BSD based which allows the the code to be implemented into proprietary software.

clang

R – tapply

Somehow I have read a post long time ago showing that the importance of understanding all the different types of *pply, like apply, lapply, tapply, vapply..etc. Today, I watched the awesomeness of tapply ‘LIVE’! Also, I really think the people who understand R writes code in an absolute different style from people’s “intuition”. Here is the Stackoverflow post that I got totally blown away.

tapply_vector

R – How dare you use “==” in if statement

There are usually so many places in each language that just don’t make that much sense but so easy for an inexperienced programmer to adopt and take for granted.

“==” in R is such a thing. I came from a Python programming background and I take for granted that to use “==” in an if statement whenever comparing two objects based on value.

However, in R, I just noticed that even in the documentation of “==”, they clearly write down a misunderstanding of “==” will lead to a disaster.

r_equal

R – ifelse as a function

Today I was working on a problem analyzing inventory movement. Like calculating inventory turns, inventory replenishment cycle, average inventory value ..etc. However, if you are not familiar with those supply chain terms. You can simply think it as a time series problem where you are supposed to calculate the sum of the total drop and the sum of the total increase.

Usually you are supposed to see a saw shape time series problem, where the seller bought a fair amount of inventory and put on the shelf, then as days go by, people buy product from the seller at random quantity, which lead the inventory value decrease. Then in the perfect scenario, seller noticed that the on-hand quantity is below a certain threshold(safety stock), then the seller will send out another big order to the supplier to buy more of those products before it run out of stock. Sometimes, the supply chain is not well optimized which end up with a situation where there is not enough supply to meet the demand. ie. leaving money on the table.

However from the perspective of a data scientist, you need to get your hands dirty and deal directly with numbers. The data is not clean, sometimes, you get a customer product return which should not be counted as a transaction plus a inventory replenishment, sometimes, you might get a inventory typo in the data which is an absolute outlier.

In this case, you probably need to first, remove some outliers and then figure out a way dealing with missing values and in the end, calculate those business inventory metrics.

Today, I came across a very handy function in R which makes it so easy to mutate a column value based on certain condition, which doesn’t require a complete loop with nested if statements.

x <- rep(10:2, 4)
> x
[1] 10 9 8 7 6 5 4 3 2 1000 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 10
[29] 9 8 7 6 5 4 3 2
x[10] <- 1000 # outlier
x_limit <- 5 * median(x)
x_new <- ifelse(x < x_limit, x, NA)

# Here you successfully identify outliers which has the definition of being greater than 10 times the median..
# From now on, you can use some built-in methods to deal with missing values like, locf (last observation carrier forward)
# or nocb (next observation carrier backward) or interpolate the missing values based on both sides (na.approx)..etc

> x_new
[1] 10 9 8 7 6 5 4 3 2 NA 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2
> zoo::na.locf(x_new)
[1] 10 9 8 7 6 5 4 3 2 2 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2