Nutch Hadoop

Nutch Hadoop Tutorial – This is a tutorial that shows you how to set up Apache Nutch on a running hadoop cluster and won’t dive into the architect detail too much, which is a perfect tutorial for me. 

A few assumptions before following this tutorial:

1. root 2. ssh 3. cluster 4. maillist for Q&A 5. Java programming background

Hadoop Cluster Setup:

Download Hadoop and Nutch:

Setup the Deployment Architecture

Deploy Nutch to a Single Machine

Deploy Nutch to multiple Machines

Performing a Crawl

Testing the Crawl 

Performing a Search

 

MNIST databse – the Have-To for Image Recognition

I randomly came across a post from Kaggle, which is actually part of a tutorial competition showing people how to get started with machine learning.

More information about the famous MNIST dataset, which is used in this competition, could be found here. I remembered that Andrew Ng’s online class has demonstrated how to do image recognition, using different types of algorithms. However, while I was taking his class from Coursera, the software the class used was Octave. I am mostly using R and I want to give it a try with R.

After I downloaded those MNIST dataset files, again, I realized it is not that easy as I expected. All the files are in binary format and I have never dealt with binary files in R. After a quick good, I know there is a file named after me :), “readBin”. And fortunately, I found a paragraph of R code in git written by brendano, which works out of box.

However, 知其然知其所以然(we should know the hows and also the whys). Here is a very useful post from IDRE – Institution of Digital Research and Education from UCLA.

If you think binary data set is  faraway from your life, you are wrong. The `save` command in R, actually store the data in binary format. “Saved R objects are binary files, even those saved with ascii = TRUE, so ensure that they are transferred without conversion of end of line markers and of 8-bit characters. The lines are delimited by LF on all platforms.”