R – ifelse as a function

Today I was working on a problem analyzing inventory movement: calculating inventory turns, the inventory replenishment cycle, average inventory value, etc. If you are not familiar with those supply chain terms, you can simply think of it as a time series problem where you calculate the sum of all the drops and the sum of all the increases.

Usually you expect to see a sawtooth-shaped time series: the seller buys a fair amount of inventory and puts it on the shelf, then as days go by, people buy product from the seller in random quantities, which makes the inventory value decrease. In the perfect scenario, the seller notices that the on-hand quantity has fallen below a certain threshold (safety stock) and sends another big order to the supplier to buy more of those products before they run out of stock. Sometimes the supply chain is not well optimized, and there is not enough supply to meet the demand, i.e. money is left on the table.

However, from the perspective of a data scientist, you need to get your hands dirty and deal directly with the numbers. The data is not clean: sometimes you get a customer product return, which should not be counted as a transaction plus an inventory replenishment; sometimes you get an inventory typo in the data, which is an absolute outlier.

In this case, you probably need to first remove the outliers, then figure out a way of dealing with missing values, and in the end calculate those business inventory metrics.

Today I came across a very handy function in R, ifelse, which makes it easy to change a column’s values based on a condition without writing a loop with nested if statements.

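x <- rep(10:2, 4)
x[10] <- 1000 # plant an outlier at position 10
> x
[1] 10 9 8 7 6 5 4 3 2 1000 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 10
[29] 9 8 7 6 5 4 3 2
x_limit <- 5 * median(x) # median(x) is 6, so the limit is 30
x_new <- ifelse(x < x_limit, x, NA) # keep values below the limit, replace the rest with NA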

# Here you successfully identify the outliers, defined as values that are 5 times the median or more.
# From now on, you can use some built-in methods to deal with the missing values, like LOCF (last observation carried forward)
# or NOCB (next observation carried backward), or interpolate the missing values from both sides (na.approx), etc.

> x_new
[1] 10 9 8 7 6 5 4 3 2 NA 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2
> zoo::na.locf(x_new)
[1] 10 9 8 7 6 5 4 3 2 2 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2 10 9 8 7 6 5 4 3 2

After Effects: Andrew Kramer Keynote Speech at an Adobe Conference


Data scientists need to communicate really well, literally and figuratively. You should pay attention to how you frame your words and to how you present information beyond words: a picture, a chart, or even a video. The reason people usually go for plots instead of videos is simply that plots are easy to make! You can use ggplot2 to make awesome static plots. Then people get picky and want interactive plots, and in come rCharts, highcharts or even d3.js.

HOWEVER, suppose you have a team developing a data product and the output of the project really goes beyond people’s intuition. Say you are trying to make better marketing decisions based on historical data. Then you might think about bringing in some expertise to summarize your project in a kickass video of a few minutes that really catches people’s eyes. Also, the output of a data science group usually goes directly to executives, who have almost zero experience with data or advanced math. You really need an attractive way to infuse your understanding into their new-year roadmap and decision making. Don’t leave it to them to figure out what the takeaway should be; repeat to them again and again what they should take away. Touch, Convince and Infuse.

Andrew Kramer is the founder of “Video Copilot”, and the quality of the videos on his website really blew me away. What is more, he has some decent tutorials showing you how to make awesome videos.

Python – More About Multiprocessing – BigFile

One of my colleagues doesn’t know MapReduce, since he thinks, “why would I need MapReduce when I know multiprocessing and multithreading?” On the other side, I think: why would you need multiprocessing when you can use MapReduce? Clearly, there is some overlap in functionality between the two. MapReduce probably has the advantage that it can not only run in parallel on a single server but also easily run on multiple physical machines in parallel. In other words, MapReduce can do some work that multiprocessing cannot handle.

However, suppose we have a relatively big file that takes a long time for a single thread to process but is still small enough to fit on our server, or even into memory (64 GB for a server is very common). Which approach will be faster, not only from the execution perspective but also from the development/coding time perspective?

Here is some code that I have written in Python using multiprocessing (I just want to sidestep the GIL for now because I am a newbie 🙂 ).


The goal is to read one file line by line, do something with each line, and then write the results line by line to a single output file, leveraging multiple cores, multithreading, or whatever it takes to fully utilize the power of the whole computer.
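A minimal sketch of that goal using only the standard library, not the original script. The names process_line and process_file, the upper-casing "work", and the workers, chunksize and start_method parameters are all hypothetical placeholders chosen for illustration. The sketch assumes each line can be processed independently and that only the parent process writes to the output file.

```python
import multiprocessing as mp


def process_line(line):
    """Hypothetical per-line work: strip the newline and upper-case the text."""
    return line.rstrip("\n").upper()


def process_file(in_path, out_path, workers=None, chunksize=1000, start_method=None):
    """Apply process_line to every line of in_path and write the results to out_path.

    Worker processes only compute; the parent process is the single writer,
    so output lines never interleave.
    """
    # start_method=None selects the platform's default way of starting processes
    ctx = mp.get_context(start_method)
    with ctx.Pool(processes=workers) as pool:
        with open(in_path, encoding="utf-8") as src, \
                open(out_path, "w", encoding="utf-8") as dst:
            # imap hands lines to the workers in chunks and yields the
            # results in the same order as the input lines
            for result in pool.imap(process_line, src, chunksize=chunksize):
                dst.write(result + "\n")
```

On platforms that start workers by spawning a fresh interpreter, the call to process_file should sit under an `if __name__ == "__main__":` guard. A larger chunksize reduces the inter-process communication overhead when the per-line work is cheap, and pool.imap_unordered could be swapped in if the output order does not matter.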