Stack Overflow – Happy Thanksgiving

It has been 1 year and 10 months since I joined Stack Overflow, the Q&A platform where programmers ask coding questions. The people there are helpful (I avoid the word friendly, since they are professional and critical). I have been fairly active in a few tags such as r, python and those related to data harvesting. Today I hit the milestone of 3000, which is not only an acknowledgement of my activity in the community (174 questions and 130 answers) but also evidence of how the community has helped me grow in my professional career. I don’t keep a diary, but whenever I review my questions on Stack Overflow, they remind me of every project I have worked on, and sometimes I can even recall where I was while working on it, and with whom.

Anyway, big thanks to the community and big thanks to myself. I will keep using Stack Overflow as a tool to hone my skills!


R – something about stdin and stdout

You are probably working with R inside RStudio most of the time, and most of your code is either interactive or disposable. In that case, you might not feel the importance of understanding R’s stdin and stdout that much. However, once I started running Rscript on a Linux server and running Hadoop Streaming with R as the mapper or reducer, I was forced to understand how stdin, stdout and stderr work.

As you can see from tldp, there are always three default files open in Linux: stdin, stdout and stderr. Let’s start with stdout.

STDOUT: 

In Hadoop Streaming, information is passed from stage to stage through stdin and stdout. If you don’t have proper logic to keep your output clean, stray output from third-party functions can end up in standard output and totally screw up the next stage. sink() is definitely a function that you have to learn. It will “direct the output to a file”; the file can be a txt or csv file, or it can be “/dev/null”, and calling sink() with NULL (the default) ends the diversion so that output goes back to stdout. So, to make sure your Hadoop Streaming job outputs only the content you want, use sink to suppress everything by diverting it to /dev/null, switch the output back on right before you want to write, and remember to switch it off again afterwards. Here is a tutorial from weblogs.java.net that walks through a hands-on example, which I found super helpful.

 

r_sink (a screenshot of a very short example demonstrating how to switch output on and off)
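The on/off pattern is not specific to R. As a rough analogue (this is Python, not R's sink(), and `noisy_third_party_call` and `run_quietly` are made-up names for illustration), a streaming script can divert stdout to the null device while chatty code runs and only write the record it actually wants to emit:

```python
import contextlib
import os


def noisy_third_party_call():
    # Hypothetical stand-in for a library function that prints unwanted chatter.
    print("loading model... done")
    return 42


def run_quietly(func):
    """Call func with stdout diverted to the null device, then restore stdout."""
    with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
        return func()


# The chatter is discarded; only the tab-separated record reaches stdout.
value = run_quietly(noisy_third_party_call)
print(f"key\t{value}")
```

The context manager restores stdout automatically, which plays the role of the "remember to switch back" step.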

STDERR:


Error handling in R is super important when writing a robust Hadoop Streaming job. However, R users might find that error handling, or rather its documentation, is not that straightforward. Here is a great tutorial from WorkingWithData showing the ins and outs of the tryCatch function in R.
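To make the idea concrete without reproducing the tutorial, here is a minimal Python sketch of the same discipline (an analogue of tryCatch, not R code). The record format `key<TAB>number` and the names `parse_record` and `run_mapper` are assumptions for illustration. Good records go to stdout, diagnostics go to stderr, and one bad record does not kill the job:

```python
import sys


def parse_record(line):
    """Parse 'key<TAB>number'; raises ValueError on malformed input."""
    key, raw_value = line.rstrip("\n").split("\t")
    return key, float(raw_value)


def run_mapper(lines, out=None, err=None):
    """Emit parsed records on out; report bad records on err and keep going."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    skipped = 0
    for line in lines:
        try:
            key, value = parse_record(line)
        except ValueError as exc:
            # Diagnostics go to stderr so they never pollute the data stream.
            print(f"skipping bad record {line!r}: {exc}", file=err)
            skipped += 1
            continue
        print(f"{key}\t{value}", file=out)
    return skipped
```

In a real streaming job this would be called as `run_mapper(sys.stdin)`.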

 

 

tsoutliers – How to implement a paper

I am trying to understand how tsoutliers is implemented. You can easily access the source from a CRAN mirror here, and the package is mostly based on the paper “Joint Estimation of Model Parameters and Outlier Effects in Time Series”. Let’s start with the function `locate.outliers`.

There are a few parameters here:

  1. resid: residuals of the ARIMA model fitted to the real data
  2. pars: parameters of the AR (auto-regressive) and MA (moving average) parts of the ARIMA model
  3. cval: the critical value (threshold) that a test statistic must exceed for a point to count as an outlier
  4. types: a list of the types of outliers, like AO (additive outlier), LS (level shift) and TC (transient change)
  5. delta:
  6. n.start

Before computing the test statistics of the outliers, we first have to estimate the residual standard deviation, since a plain estimate is easily contaminated by the outliers themselves, as indicated in section 1.4 of the paper (Estimation of Residual Standard Deviation). They mention three approaches for getting a better estimate:

  1. MAD: the median absolute deviation (scaled by factor of 1.483)
  2. a% trimmed method
  3. the omit-one method
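The first approach is easy to sketch. The snippet below is a minimal Python illustration of the MAD estimate with the 1.483 factor mentioned above; the function name `mad_sigma` and the toy residuals are made up, and it is not the package's own code:

```python
import statistics


def mad_sigma(residuals, scale=1.483):
    """Robust sd estimate: scale * median(|r - median(r)|)."""
    center = statistics.median(residuals)
    abs_dev = [abs(r - center) for r in residuals]
    return scale * statistics.median(abs_dev)


# Toy residuals; 8.0 plays the role of an outlier.
residuals = [0.1, -0.2, 0.05, -0.1, 0.15, 8.0, -0.05]
print(mad_sigma(residuals))         # robust estimate, barely moved by 8.0
print(statistics.stdev(residuals))  # plain estimate, inflated by 8.0
```

Comparing the two printed values shows why a robust estimate is preferred: the single large residual dominates the ordinary standard deviation but hardly changes the median-based one.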

In the code, they first calculate sigma and then call the function `outliers.tstatistics` to calculate the test statistics. `outliers.tstatistics` will be explained in another post, but let’s assume that we have the metrics ready for every single data point in the time series, namely “type”, “indices”, “coefhat” (the least squares estimate of the effect of a single outlier) and “tstat” (the maximum value of the standardized statistics of the outlier effects).

Then they remove the rows whose abs(tstat) is lower than cval (the threshold, 3.5 by default).

They also handle the scenario where consecutive LS outliers are found, in which case only the one with the highest abs(tstat) is kept.

Also, a point might be flagged as several types of outlier at once, in which case they choose the type that exceeds cval and has the highest abs(tstat).
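The three filtering rules can be sketched in a few lines of Python. This follows the description in this post rather than the package source: the function name `filter_candidates`, the dict layout, and the order in which the rules are applied are all assumptions for illustration.

```python
def filter_candidates(candidates, cval=3.5):
    """Apply the three filtering rules to a list of candidate outliers.

    Each candidate is a dict with keys "type", "ind", "coefhat" and "tstat"
    (a hypothetical layout mirroring the columns described in the post).
    """
    # Rule 1: drop candidates whose |tstat| does not exceed the critical value.
    kept = [c for c in candidates if abs(c["tstat"]) > cval]

    # Rule 2: within each run of consecutive LS indices, keep the strongest.
    ls = sorted((c for c in kept if c["type"] == "LS"), key=lambda c: c["ind"])
    runs = []
    for c in ls:
        if runs and c["ind"] - runs[-1][-1]["ind"] == 1:
            runs[-1].append(c)
        else:
            runs.append([c])
    best_ls = [max(run, key=lambda c: abs(c["tstat"])) for run in runs]
    kept = [c for c in kept if c["type"] != "LS"] + best_ls

    # Rule 3: if one index carries several types, keep the strongest.
    best_by_ind = {}
    for c in kept:
        current = best_by_ind.get(c["ind"])
        if current is None or abs(c["tstat"]) > abs(current["tstat"]):
            best_by_ind[c["ind"]] = c
    return sorted(best_by_ind.values(), key=lambda c: c["ind"])


candidates = [
    {"type": "AO", "ind": 5, "coefhat": 2.0, "tstat": 3.0},
    {"type": "LS", "ind": 10, "coefhat": 1.5, "tstat": 4.0},
    {"type": "LS", "ind": 11, "coefhat": 1.7, "tstat": -4.8},
    {"type": "LS", "ind": 12, "coefhat": 1.6, "tstat": 4.2},
    {"type": "AO", "ind": 20, "coefhat": 3.0, "tstat": 5.1},
    {"type": "TC", "ind": 20, "coefhat": 2.5, "tstat": -6.3},
]
for c in filter_candidates(candidates):
    print(c["type"], c["ind"], c["tstat"])
```

With the toy data, the AO at index 5 falls below the threshold, the three consecutive LS candidates collapse into one, and index 20 keeps only its strongest type.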

Two big for loops then follow, iloop and oloop.

 

 

python – pyhook – a python library to monitor Windows keyboard/mouse activity

This is a Python library described as a “Python wrapper for global input hooks in Windows. The package provides callbacks for mouse and keyboard events; events can be monitored and filtered.” It is hosted on SourceForge, and you probably also need to download pythoncom (which comes with pywin32) just to make the example work. After you start the Python job, all your mouse and keyboard activity is logged! You can even see in the console that your activity is printed to stdout, as shown in the example. I changed the function to show only the keyboard activity and to convert each captured event into a readable character. I guess this is a double-edged sword: it could be used as a hacking tool, or it could be used to capture your own keyboard activity as a source of tons of data for interesting data mining.

pyhook (my activity of searching for the keyword pyhook in Google has been logged)

R – Rcpp

Again, it started from a Stack Overflow question. I had heard of the package Rcpp before, and I learned C++ for a few semesters while I was in college. However, I have never really used C++ since I graduated, nor did I ever think of connecting it with R. I guess it will be an interesting weekend project to do some research on how Rcpp works.

This is an article from JSS (Journal of Statistical Software) which basically covers the ins and outs of Rcpp since it was first started in 2004.

First, you have to make sure you are using the right compiler if you are trying to compile any code or package from source. There are a few configurations you can tweak to choose between compiler flavors like clang, gcc, etc.


R – Tsoutliers

tsoutliers is a package developed by Javier López-de-Lacalle, who also maintains other packages like KFKSDS (Kalman Filter, Smoother and Disturbance Smoother), meboot (Maximum Entropy Bootstrap for Time Series) and stsm (Structural Time Series Models).

In the tsoutliers package itself, there are four categories that all outliers fall into. You can either dive into these two papers (paper1, paper2) or take a quick look at this IBM knowledge page for a one-sentence description of each term.

  1. IO (innovational outlier)
  2. AO (additive outlier)
  3. LS (level shift)
  4. TC (transient change)

In short, AO is a type of outlier that affects only one observation, while the other three all have an impact on the observations that follow the outlier. LS leads to a permanent shift. IO and TC look very similar in a plot, i.e., the initial impact dies out gradually over time. To figure out the difference between IO and TC you might need to read the paper, but as the author mentions, “on a time series, the effect of an IO is more intricate than the effects of other types of outliers.”
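The three simpler shapes can be generated directly. The sketch below is a Python illustration with a made-up helper, `outlier_effect`; the decay rate `delta=0.7` is an assumed value for the TC case, and IO is left out because its shape depends on the fitted ARIMA model:

```python
def outlier_effect(kind, n, t0, delta=0.7):
    """Effect of a unit-size outlier of the given kind at index t0 on n points."""
    if kind not in ("AO", "LS", "TC"):
        raise ValueError(f"unsupported outlier type: {kind}")
    effect = [0.0] * n
    for t in range(t0, n):
        if kind == "AO":
            # A single spike at t0, nothing afterwards.
            effect[t] = 1.0 if t == t0 else 0.0
        elif kind == "LS":
            # A permanent step starting at t0.
            effect[t] = 1.0
        else:
            # TC: a spike at t0 that decays geometrically at rate delta.
            effect[t] = delta ** (t - t0)
    return effect


for kind in ("AO", "LS", "TC"):
    print(kind, outlier_effect(kind, n=6, t0=2))
```

Printing the three vectors side by side makes the distinction visible: one spike, one step, and one spike with a fading tail.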

Here is the mathematical representation of the four types of outliers.
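The original figure is not reproduced here, so the following is a reconstruction in the notation commonly used for this framework and may differ in symbols from the figure. Here $B$ is the backshift operator, $a_t$ is white noise, $\theta(B)$, $\phi(B)$ and $\alpha(B)$ are the moving average, autoregressive and differencing polynomials of the ARIMA model, $\omega$ is the size of the outlier, $I_t(t_1)$ is an indicator that equals 1 at $t = t_1$ and 0 otherwise, and $\delta$ is the decay rate of the TC:

```latex
y_t^{*} = \frac{\theta(B)}{\alpha(B)\,\phi(B)}\, a_t + \omega\, L(B)\, I_t(t_1),
\qquad
L(B) =
\begin{cases}
\dfrac{\theta(B)}{\alpha(B)\,\phi(B)} & \text{IO} \\[6pt]
1 & \text{AO} \\[6pt]
\dfrac{1}{1 - B} & \text{LS} \\[6pt]
\dfrac{1}{1 - \delta B} & \text{TC}
\end{cases}
```

Read this way, AO touches a single observation, LS accumulates into a permanent step, TC decays geometrically, and IO passes through the same filter as the noise, which is why its shape depends on the model.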

 

Mac – Some notes installing tsoutliers

I am trying to use the tsoutliers package for R, which I could only install from source. In my experience the installation is not friendly at all. To guarantee success, you have to make sure the necessary dependencies are ready, like a proper compiler.

There are indeed many different compilers available across platforms. I am using a Mac, and when I first started programming, my friends told me that an easy way to get a lot of the necessary developer tools is to download Xcode, from which you can download the “Developer Command Line Tool”.

I did some research, and it seems there are two commonly used compilers available to Mac users: GCC, which is the one from GNU, and Clang, from Apple. The simple reason Apple built its own compiler instead of relying on GCC is licensing. GCC is GPL licensed, which means that code incorporating GPL-licensed code needs to be open source too, whereas Clang is under a BSD-style license, which allows the code to be built into proprietary software.
