R – Memory Management

R_memory

  1. m1 <- 1:(10^9)
  2. m2 <- 1:(10^9): this exceeded the 12GB of physical memory and the machine started swapping, so I terminated the R process.
  3. restart R
  4. m1 <- 1:(10^8)
  5. m2 <- 1:(10^8)
  6. m3 <- 1:(10^9): I then ran `remove(list=ls())`, which removed the objects from the environment, but the memory did not get released.
  7. gc() cleared the memory using garbage collection in R (see the consolidated sketch below).
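For reference, a rough replay of the steps above as one script. Note this is my reconstruction, not the original session; also, on R >= 3.5, ALTREP may represent `1:n` compactly, so a modern build may not show the same memory pressure.

```r
# Assumes 1:n is fully materialized as integers,
# i.e. about 4 bytes * 10^9 = ~4 GB per 10^9-element vector.
m1 <- 1:(10^8)
m2 <- 1:(10^8)
print(object.size(m1), units = "MB")  # size of one vector

remove(list = ls())  # drops the bindings; memory is not returned yet
gc()                 # garbage collect; prints how much memory is in use
```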


R_memory_gc

R – Should I go C++, or should I ditch the for loop in R?

I was following the “High Performance Computing using Rcpp” chapter in Hadley Wickham’s Advanced R. I ran an experiment in R with two functions: one, myRowSum, written in C++ via Rcpp, and the other written in plain R. As you can see from the code, they are very similar: same variable names, same for loop, same logic, etc. However, I was totally blown away by the difference in total time.

I created a dummy matrix with 100,000 rows and 9 columns.
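The screenshot below carries the actual code and timings; a minimal sketch of the pair, assuming `myRowSum` computes per-row sums (the plain R function name and benchmark details are my reconstruction), could look like:

```r
library(Rcpp)

# Plain R version: explicit double for loop over rows and columns
myRowSumR <- function(x) {
  out <- numeric(nrow(x))
  for (i in seq_len(nrow(x))) {
    total <- 0
    for (j in seq_len(ncol(x))) {
      total <- total + x[i, j]
    }
    out[i] <- total
  }
  out
}

# C++ version, compiled on the fly with Rcpp: same names, same loops
cppFunction("
NumericVector myRowSum(NumericMatrix x) {
  int n = x.nrow(), k = x.ncol();
  NumericVector out(n);
  for (int i = 0; i < n; i++) {
    double total = 0;
    for (int j = 0; j < k; j++) total += x(i, j);
    out[i] = total;
  }
  return out;
}")

m <- matrix(rnorm(1e5 * 9), nrow = 1e5, ncol = 9)
system.time(r_plain <- myRowSumR(m))
system.time(r_cpp   <- myRowSum(m))
all.equal(r_plain, r_cpp)  # same answer, very different time
```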

rcpp_eat_100K

 

Then I thought I should try some `vectorized` functions in R and see how they compare with Rcpp. Again, Rcpp beat the apply function after I scaled the data up to 1 million rows with 10 columns each. After 10+ minutes the plain R for loop was still running, and I had to stop it because I had no idea how long it was going to take.
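For this second round, the comparison presumably looked something like the sketch below (again my reconstruction; the `rowSums` line is my addition as a reference point and is not part of the original timings):

```r
m <- matrix(rnorm(1e6 * 10), nrow = 1e6, ncol = 10)

system.time(a <- apply(m, 1, sum))  # apply: convenient, but a loop in disguise
system.time(b <- myRowSum(m))       # Rcpp version from above
system.time(d <- rowSums(m))        # built-in vectorized baseline
# myRowSumR(m) is the plain R for loop that never finished at this scale
```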

rcpp_eat_1m

This experiment completely changed some of my impressions, and I started to understand why people really hate the for loop in R. And all these interesting stories happened inside R: I never had to jump out of the R environment, since Rcpp makes it easy to boost R’s performance to the C++ level.

If you are using R and find it slow, don’t blame R, blame yourself!

Hadoop Streaming – Overhead

I am writing a very basic Hadoop streaming job where the mapper simply splits each line into a key and a value, and the reducer echoes back the output from the mappers. This is a little different from a mapper-only job: the shuffle phase groups records by key, so the result will be ordered by key (the shuffle’s default), not by value.
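The actual scripts aren’t shown here; a hedged sketch of such a pair, written in R for consistency (the delimiter, key position, and file names are all assumptions), might be:

```r
#!/usr/bin/env Rscript
# mapper.R (hypothetical): emit key<TAB>value per line, assuming
# comma-delimited input with the key in the first field
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  cat(fields[1], paste(fields[-1], collapse = ","), sep = "\t")
  cat("\n")
}
close(con)
```

```r
#!/usr/bin/env Rscript
# reducer.R (hypothetical): identity reducer, echo key-sorted input back
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  cat(line, "\n", sep = "")
}
close(con)
```

Submitted roughly like this (the streaming jar path varies by distribution, and both scripts need the executable bit set):

```
hadoop jar /path/to/hadoop-streaming.jar \
  -input  /user/me/input \
  -output /user/me/output \
  -mapper mapper.R -reducer reducer.R \
  -file mapper.R  -file reducer.R
```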

However, the progress reported on the command line is really confusing, and it reminds me of all the criticism of the Windows installation progress bar, i.e., it is not linear at all!

hadoop_streaming_overhead

As you can see, the map phase finished in about 10 minutes, and the reduce phase reached “67%” very quickly. However, it then took about 20 minutes to get to 68%, another 30 minutes to get to 69%, and then roughly 10 minutes for each further percentage point. The whole job finished in about 2 hours, which is acceptable, but I am really confused and curious about what is really going on during that time. (One likely explanation: in classic MapReduce, the reduce progress bar spans three phases, with the shuffle counting for the first third, the sort for the second, and the actual reduce for the last, so the bar can race to 67% while all the real reduce work still lies in the final third.)

hadoop_streaming_cloudera_monitor

I pulled up the cluster performance charts for that window, and I can see the CPU was busy most of the time, which at least proves the cluster was not idle.

I also

Hadoop Streaming – Track Map Reduce Job Progress

After you have unleashed your MapReduce job, how can you monitor its progress beyond the standard output from Hadoop itself?

hadoop_streaming_progress

An easy way to monitor any type of MapReduce job running on the cluster is to go to the URL http://<namenode>:8088, which is the application dashboard for YARN. From there, you can dive into the nuts and bolts of each job, like how many mappers/reducers there are, how much progress they have made, etc.

hadoop_streaming_progress_8088

If you are in the command line and hard-core enough to prefer it, you can use the `hadoop job` command to inspect and manipulate MapReduce jobs.

hadoop_job
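For instance (the job id below is made up for illustration):

```
hadoop job -list                          # list running jobs and their ids
hadoop job -status job_201401011234_0001  # progress, state, and counters
hadoop job -kill   job_201401011234_0001  # kill a job that has gone sideways
```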