Nutch Hadoop Tutorial – This tutorial shows you how to set up Apache Nutch on a running Hadoop cluster without diving too deeply into architectural detail, which makes it a perfect tutorial for me.
A few assumptions before following this tutorial:
1. root access 2. SSH 3. a cluster 4. the mailing list for Q&A 5. a Java programming background
Hadoop Cluster Setup:
Download Hadoop and Nutch:
Setup the Deployment Architecture
Deploy Nutch to a Single Machine
Deploy Nutch to multiple Machines
Performing a Crawl
Testing the Crawl
Performing a Search
I randomly came across a post from Kaggle, which is actually part of a tutorial competition showing people how to get started with machine learning.
More information about the famous MNIST dataset, which is used in this competition, can be found here. I remembered that Andrew Ng’s online class demonstrated how to do image recognition using different types of algorithms. However, when I took his class on Coursera, the software the class used was Octave. I mostly use R, so I wanted to give it a try with R.
After I downloaded the MNIST dataset files, I realized, again, that it was not as easy as I had expected. All the files are in binary format, and I had never dealt with binary files in R. After a quick Google search, I learned there is a function named after me :), “readBin”. Fortunately, I also found a snippet of R code on GitHub written by brendano, which works out of the box.
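To make the binary layout less mysterious, here is a minimal sketch of reading an IDX-style image file with `readBin`. It is not brendano's code. It assumes the usual MNIST image layout (four big-endian 32-bit integers for the magic number 2051, image count, rows, and columns, followed by one unsigned byte per pixel), and it writes a tiny synthetic file first so it runs without any download. The function name `read_idx_images` is made up for this example.

```r
# Build a tiny synthetic file in the IDX image layout (assumption: same layout as MNIST images).
path <- tempfile(fileext = ".idx")
con <- file(path, "wb")
writeBin(c(2051L, 2L, 2L, 3L), con, size = 4, endian = "big")  # magic, count, rows, cols
writeBin(as.raw(0:11), con)                                     # 2 images x 6 pixels each
close(con)

# Hypothetical helper: returns one image per row, one pixel per column.
read_idx_images <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  # The header is four 4-byte integers stored big-endian.
  header <- readBin(con, what = "integer", n = 4, size = 4, endian = "big")
  stopifnot(header[1] == 2051L)
  count <- header[2]
  rows <- header[3]
  cols <- header[4]
  # Pixels are single unsigned bytes, so read them with size = 1 and signed = FALSE.
  pixels <- readBin(con, what = "integer", n = count * rows * cols,
                    size = 1, signed = FALSE)
  matrix(pixels, nrow = count, ncol = rows * cols, byrow = TRUE)
}

images <- read_idx_images(path)
print(dim(images))
```

With the real files, the same function should work once you point `path` at the uncompressed image file; the label files use a different magic number and a shorter header, so they would need their own small reader.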
However, as the Chinese saying goes, we should know not only the how but also the why. Here is a very useful post from IDRE (the Institute for Digital Research and Education) at UCLA.
If you think binary data sets are far away from your life, you are wrong. The `save` command in R actually stores data in binary format: “Saved R objects are binary files, even those saved with ascii = TRUE, so ensure that they are transferred without conversion of end of line markers and of 8-bit characters. The lines are delimited by LF on all platforms.”
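A quick way to see this for yourself is to save a small object and peek at the raw bytes. This is only a sketch: the variable names are made up, and the exact leading bytes depend on your R version and compression settings, so no particular output is shown here.

```r
x <- c(1.5, 2.5)

# Default save: a binary file (compressed by default in current R versions).
bin_path <- tempfile(fileext = ".RData")
save(x, file = bin_path)

# Peek at the first few bytes as raw values instead of text.
con <- file(bin_path, "rb")
first_bytes <- readBin(con, what = "raw", n = 4)
close(con)
print(first_bytes)

# Even with ascii = TRUE the file should be moved around as binary, per the docs quoted above.
ascii_path <- tempfile(fileext = ".RData")
save(x, file = ascii_path, ascii = TRUE)

# Load both files into separate environments to confirm they round-trip.
e_bin <- new.env()
e_ascii <- new.env()
load(bin_path, envir = e_bin)
load(ascii_path, envir = e_ascii)
print(identical(e_bin$x, x))
print(identical(e_ascii$x, x))
```

Loading into a fresh environment keeps the restored `x` from silently overwriting the one in your workspace.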