After Effects: Andrew Kramer Keynote Speech at Adobe Conference

Data scientists need to communicate really well, both literally and figuratively. You should pay attention to how you frame your words and to how you present information beyond words: a picture, a chart, or even a video. The reason people usually prefer plots to videos is simply that plots are easy to make! You can use ggplot2 to make awesome static plots. Then people got picky and wanted interactive plots, and along came rCharts, Highcharts, and even d3.js. HOWEVER, suppose you have a team developing a data product whose output is really beyond people’s intuition; say you are trying to make better marketing decisions based on historical data. Then you might think about bringing in some expertise to summarize your project in a kickass video of a few minutes that really catches people’s eyes. Also, the output of a data science group usually goes directly to executives, who have almost zero experience with data or advanced math. You really need an attractive way to infuse your understanding into their new-year roadmap and decision making. Don’t leave them to figure out the takeaway on their own; repeat to them again and again what they should take away. Touch, Convince, and Infuse.

Andrew Kramer is the founder of “Video Copilot”, and the quality of the videos on his website really blew me away when I saw them. What is more, he has some decent tutorials showing you how to make awesome videos.

Path of a data scientist

This is a super fun picture that I came across on R-bloggers.

Python – More About Multiprocessing – BigFile

One of my colleagues doesn’t know MapReduce because he thinks, “Why would I need MapReduce when I know multiprocessing and multithreading?” On the other side, I think, why would you need multiprocessing when you can use MapReduce? Clearly, the two overlap in functionality. MapReduce probably has the advantage that it not only runs in parallel on a single server but can also easily run on multiple physical machines. In other words, MapReduce can handle some work that multiprocessing cannot.
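To make the overlap concrete, here is a minimal sketch of a word count written as a map step plus a reduce step. It is only an illustration under my own assumptions: the function names (count_words, merge_counts, word_count) and the sample lines are hypothetical, and this is plain Python on one machine, not a real MapReduce framework.

```python
from collections import Counter
from functools import reduce


def count_words(line):
    """Map step: turn one line into a Counter of its words."""
    return Counter(line.split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into a new Counter."""
    return left + right


def word_count(lines, map_fn=map):
    """Run the map step, then fold the partial counts together.

    map_fn is an assumption of this sketch: it can be the built-in map
    or anything with the same (function, iterable) calling convention.
    """
    partial_counts = map_fn(count_words, lines)
    return reduce(merge_counts, partial_counts, Counter())


# Hypothetical sample input, just to exercise the two steps.
lines = ["to be or not to be", "to map or to reduce"]
totals = word_count(lines)
print(totals.most_common(3))
```

Because the map step is an ordinary top-level function, passing the map method of a multiprocessing.Pool as map_fn should spread that step across processes on one server, while a MapReduce cluster spreads the same two steps across machines.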

However, suppose we have a relatively big file: one that takes a long time for a single thread to process but is still small enough to fit on our server, or even in memory (64 GB for a server is very common). Which approach will be faster, not only from the execution perspective but also from the development/coding-time perspective?

Here is some code that I have written in Python using multiprocessing (I just want to sidestep the GIL for now because I am a newbie 🙂 ).

The goal is to read one file line by line, do something with each line, and write the results line by line to a single shared output file, using multiple cores (or threads, whatever it takes) to fully utilize the power of the whole computer.
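A minimal sketch of that goal, not necessarily the code from the original post: worker processes transform the lines while only the parent process writes, so output lines never interleave. The names transform_line and process_file, the upper-casing "work", and the default chunk size are all assumptions made for illustration.

```python
import multiprocessing as mp


def transform_line(line):
    """Hypothetical per-line work: drop the newline and upper-case the text."""
    return line.rstrip("\n").upper()


def process_file(in_path, out_path, workers=None, chunksize=1000,
                 start_method=None):
    """Transform each line of in_path in parallel and write it to out_path.

    workers=None lets Pool use one process per CPU core; start_method=None
    keeps the platform's default way of starting processes.
    Returns the number of lines written.
    """
    ctx = mp.get_context(start_method)
    count = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst, \
            ctx.Pool(processes=workers) as pool:
        # imap hands lines to the workers in batches of `chunksize` and
        # yields the results back in the original line order.
        for result in pool.imap(transform_line, src, chunksize=chunksize):
            dst.write(result + "\n")
            count += 1
    return count
```

To try it, call something like process_file("big_input.txt", "output.txt") from under an if __name__ == "__main__": guard (both file names are placeholders). The guard matters on platforms where child processes re-import the main script. A larger chunksize cuts inter-process overhead when the per-line work is cheap.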

Python – Subprocess

This is a fantastic tutorial from RootOfTheNull. He has done several series of tutorials on commonly used built-in Python packages. Here is the first tutorial in the subprocess series.
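For a quick taste of the package before watching, here is a minimal sketch of my own, not taken from the tutorial. It runs a child process and captures what it prints; the helper name run_and_capture is hypothetical, and the child is just another Python interpreter so the example does not depend on any particular shell command.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command given as a list of strings and return its stdout as text.

    check=True raises CalledProcessError if the command exits with a
    non-zero status, so failures are not silently ignored.
    """
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout


# Launch a second Python interpreter as the child process.
output = run_and_capture(
    [sys.executable, "-c", "print('hello from a child process')"]
)
print(output.strip())
```

Passing the command as a list rather than a single string avoids going through the shell, which sidesteps quoting problems.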