Today I was working on a problem analyzing inventory movement. Like calculating inventory turns, inventory replenishment cycle, average inventory value ..etc. However, if you are not familiar with those supply chain terms. You can simply think it as a time series problem where you are supposed to calculate the sum of the total drop and the sum of the total increase.
Usually you are supposed to see a saw shape time series problem, where the seller bought a fair amount of inventory and put on the shelf, then as days go by, people buy product from the seller at random quantity, which lead the inventory value decrease. Then in the perfect scenario, seller noticed that the on-hand quantity is below a certain threshold(safety stock), then the seller will send out another big order to the supplier to buy more of those products before it run out of stock. Sometimes, the supply chain is not well optimized which end up with a situation where there is not enough supply to meet the demand. ie. leaving money on the table.
However from the perspective of a data scientist, you need to get your hands dirty and deal directly with numbers. The data is not clean, sometimes, you get a customer product return which should not be counted as a transaction plus a inventory replenishment, sometimes, you might get a inventory typo in the data which is an absolute outlier.
In this case, you probably need to first, remove some outliers and then figure out a way dealing with missing values and in the end, calculate those business inventory metrics.
Today, I came across a very handy function in R which makes it so easy to mutate a column value based on certain condition, which doesn’t require a complete loop with nested if statements.
x <- rep(10:2, 4)
> x
[1] 10 9 8 7 6 5 4 3 2 1000 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 10
[29] 9 8 7 6 5 4 3 2
x[10] <- 1000 # outlier
x_limit <- 5 * median(x)
x_new <- ifelse(x < x_limit, x, NA)
# Here you successfully identify outliers which has the definition of being greater than 10 times the median..
# From now on, you can use some built-in methods to deal with missing values like, locf (last observation carrier forward)
# or nocb (next observation carrier backward) or interpolate the missing values based on both sides (na.approx)..etc
> x_new
[1] 10 9 8 7 6 5 4 3 2 NA 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2
> zoo::na.locf(x_new)
[1] 10 9 8 7 6 5 4 3 2 2 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2