Kerberos – Installation and configuration of a Kerberos 5 KDC

There is a fantastic tutorial here from spinlocksolutions that walks you through how to set up a Kerberos server and a Kerberized service on a Debian-based box.

Kerberos is critical for building a robust and secure environment, but it has been giving me a hard time since day one. I am planning to follow the tutorial as closely as I can on a few Ubuntu VMs on AWS, and hopefully, by the end, understand all the Kerberos and security jargon and solve my problem.

HBase – Contribution – mvn site

In the previous post, I created a patch without really testing it as required by the recommended workflow. In fact, I had run the mvn site command as suggested and the build failed without my knowing why. Today, I spent some time on it and noticed the failure was due to a lack of memory on the VirtualBox Ubuntu box: it had only 2GB, about 1GB of which went to the OS, leaving barely anything for the build. After I raised the memory allocation to 4GB, the site built successfully, and the general performance of the VM was much better too :). More details follow.

First off, I tried to reproduce the build failure by cloning a fresh copy of the HBase master branch. Without any modification, I ran mvn site in the project root, and it failed after about 10 minutes of building, with the error message clearly visible in the log.

Then I did a quick check with the ‘free -m’ command: only 2GB of memory in total was allocated to this VM, with about 1.3GB available at the time.

[Screenshots: the mvn site failure log and the ‘free -m’ output]

After shutting down the VM cleanly, I went into the Ubuntu VM’s settings and changed the memory from 2048MB to 4096MB.

[Screenshot: VirtualBox memory setting raised to 4096MB]

Then I booted the VM back up and double-checked that the memory had been upgraded. The rest was simply a matter of running the same ‘mvn site’ command in the HBase project root, and this time it finished without any problem; the complete site build took about 36 minutes.

After the site was built, one can go to the target/ folder in the HBase project root and see the complete site there. Since I had only modified the documentation, specifically a few places in the REST chapter of the HBase Book, I could verify that the typos had been successfully fixed in the web version.

[Screenshots: the rebuilt site showing the corrected REST chapter]

In the end, I updated my progress in the HBase ticket to let people know the typos had been fixed. Now it is just a matter of waiting for a committer to come in, review the change and apply the patch. 🙂

HBase – Contributing documentation through JIRA

Today I was following the Apache HBase documentation, nowadays rebranded as the Apache HBase Book. It is really good documentation, but as with any documentation, I ran into a few typos that I would love to see fixed.

Then I quickly noticed that there is actually a section in the documentation about how to report an issue in the JIRA system; at the same time, you can assign the ticket to yourself, fix it yourself and potentially get the fix integrated into the master branch.

Here is a ticket that I created about a few typos in the REST chapter of the book, where a few forward slashes are missing between the port and the table name. Then someone asked me whether I had a patch, so the first question became: “what on earth is a patch?” 🙂

After carefully reading the documentation, I realized that a patch is essentially a recorded set of changes against the Apache HBase source code. Then I started the journey of following the suggested workflow to produce my first patch.

The first step is to git clone the master branch of the HBase source code; it is nothing more complex than copying the repository link from GitHub (https://github.com/apache/hbase) and running git clone anywhere on your local file system. The process was pretty time-consuming the first time: it took around 5 minutes to download about 228MB of source code to the Ubuntu VM running on my Windows gaming machine.

[Screenshot: git clone of the HBase master branch]

After downloading all the files, I took a quick look at the source code and realized one can easily get lost in there; it is really hard to locate what you want, even the folder that contains the source of the HBase Book.

I did a quick Google search, and a grep command really helped me locate where the REST chapter lives in the source code. I took a fairly unique slice of text from the REST chapter and ran something like grep -rnw . -e “that text” to loop through all the files and find the one containing it, which should be the file I need to edit.

[Screenshot: grep -rnw output locating the REST chapter]

Now one can easily tell that the REST section lives under the hbase/src/main/asciidoc folder. Inside that folder there is a file for each chapter, and the content I am interested in is in the external_apis.adoc file.

Hooray! The moment I opened the file, I realized there was yet another thing to learn, because the documentation uses a somewhat different text format called “AsciiDoc”. I am amazed at how powerful, or how complex, the whole AsciiDoc syntax is. I pulled up the web version of the book and put it side by side with the AsciiDoc source, which should help you quickly get an idea of what AsciiDoc is.

[Screenshot: the REST chapter in AsciiDoc next to the rendered web version]

I do wonder why the project does not use something more mainstream or straightforward like Markdown or LaTeX, but that is a question for another day; let’s fix our problem first.

Adding six slashes in the right places is not exactly rocket science :). I quickly did that, followed by a git add and a git commit with a commit message.

Going through the checklist one by one, I reached this item: “If you have made documentation changes, be sure the documentation and website builds by running mvn clean site.”

I then switched back to the HBase project root and ran mvn clean site with a “.” (dot) appended. That gave me a build failure after two minutes, so I reran mvn clean site without the dot. I learned from Maven’s website that there are three built-in lifecycles for a project: default, clean and site. Running mvn clean site first removes the previous build output and then generates the project site, which I guess is just to make sure we are in good shape.

And then, once again, it took ONE hour of building before I ran into another problem:

[Screenshot: mvn clean site build failure]

I am surprised that I only changed a few slashes yet kept running into problems. My suspicion is that my build environment simply differs from the build system HBase uses. In that case, I will assume the fix itself is correct and go ahead and create a patch.

A patch is simply a delta/diff file that records what you have changed and how it differs from the original; in my case it can be generated with something like git diff or git format-patch against the master branch.

[Screenshot: the generated patch file for HBASE-15685]

After following a tutorial, I attached the patch file to the JIRA ticket, which changed its status to “Patch Available”; I assume people will now review it and let me know whether I got it right.

[Screenshot: JIRA ticket status changed to “Patch Available”]

Hadoop File Types

I came across two fantastic blog posts: one on HFile and one on SequenceFile + MapFile + SetFile + ArrayFile. This post will be a hands-on workshop working with each file type: writing some code to read and write them, maybe along with a few benchmarks.

SequenceFile

The Javadoc does a good job explaining the ins and outs of SequenceFile. I first came across SequenceFile when I started working with Apache Nutch. After Nutch crawls data, it is stored on the file system as some weird binary files; I later learned that much of Nutch’s code is implemented on top of Hadoop components (perhaps because both were written by Doug Cutting), and the crawled data is stored as SequenceFile by default.

I followed Nutch’s quick start tutorial and managed to crawl a few links under the domain nutch.apache.org. There are several “dbs” in the crawl folder, including crawldb (crawling status/progress), linkdb (links) and segments (all the detailed content). Let’s take a look at the current crawldb; here are a few screenshots of what the data looks like.

[Screenshots: raw binary view of the crawldb data and index files]

Quoting John Zuanich’s blog post, “The MapFile is a directory that contains two SequenceFile: the data file (“/data”) and the index file (“/index”)”, we know the crawldb folder itself is a MapFile, and both the data and index files are SequenceFiles themselves.

The good news is that even in raw binary form, we can tell that the header of each file matches the SequenceFile documentation (the snippet after this list double-checks the same fields programmatically):

  • version – 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)
  • keyClassName – key class
  • valueClassName – value class
  • compression – A boolean which specifies if compression is turned on for keys/values in this file.
  • blockCompression – A boolean which specifies if block-compression is turned on for keys/values in this file.
  • compression codec – CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
  • metadata – SequenceFile.Metadata for this file.
  • sync – A sync marker to denote end of the header.
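
Before deserializing any records, we can sanity-check those header fields from code, since SequenceFile.Reader exposes them directly. Here is a minimal sketch; the file path is an assumption (a crawldb data file copied off the cluster, passed on the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SeqFileHeader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    // args[0]: hypothetical local path to a SequenceFile, e.g. crawldb's data file
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      System.out.println("key class:        " + reader.getKeyClassName());
      System.out.println("value class:      " + reader.getValueClassName());
      System.out.println("compressed:       " + reader.isCompressed());
      System.out.println("block compressed: " + reader.isBlockCompressed());
      System.out.println("codec:            " + reader.getCompressionCodec()); // null if uncompressed
      System.out.println("metadata:         " + reader.getMetadata().getMetadata());
    } finally {
      reader.close();
    }
  }
}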

Now let’s write some code to read in the data file and deserialize it into a human-readable format, and verify that it matches exactly what the Javadoc told us.

Here is another blog post from Anagha, who shared code for working with MapFile. After I executed her code against the crawldb, I realized that although it looks like a MapFile, the values are actually of a Nutch-specific type called CrawlDatum. In the end, I decided to find another source of SequenceFiles: creating a table in Hive stored as SequenceFile.

I created a Hive table stored as SequenceFile (CREATE TABLE … STORED AS SEQUENCEFILE) and then copied the backing data file down to my laptop, where I read it with a small Java program.
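
A minimal sketch of such a reader, assuming the file has been copied locally and its path is passed on the command line; note the key/value classes are discovered from the file header, so nothing is hard-coded:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf); // the file was copied to my laptop
    Path path = new Path(args[0]);             // e.g. the Hive table's backing file

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      // Instantiate the key/value types recorded in the header reflectively.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}

Here are two screenshots of how that looked: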

[Screenshots: the SequenceFile contents printed in human-readable form]

Now we have seen what a SequenceFile looks like and how to access it from Java. The next question is why we use SequenceFile at all, considering it is nothing but a file format for storing key/value pairs that only supports appending. As for the benefits, this Stack Overflow question can probably answer some of them.
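
For completeness, here is a sketch of the write side as well, assuming a made-up local output path and toy key/value data; notice that append(key, value) is the only write operation the API offers, there is no update-in-place:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("demo.seq"); // hypothetical output file

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, Text.class, IntWritable.class);
    try {
      // Appending records one by one is the entire write API.
      writer.append(new Text("row1"), new IntWritable(1));
      writer.append(new Text("row2"), new IntWritable(2));
    } finally {
      writer.close();
    }
  }
}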

MapFile

Quoted from the Javadoc:

A file-based map from keys to values.

A map is a directory containing two files, the data file, containing all keys and values in the map, and a smaller index file, containing a fraction of the keys. The fraction is determined by MapFile.Writer.getIndexInterval().

The index file is read entirely into memory. Thus key implementations should try to keep themselves small.

Map files are created by adding entries in-order.

I think this explanation is pretty straightforward, but one thing worth pointing out is the “in-order” requirement: entries must be appended in sorted key order.
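
To see that constraint in action, here is a hedged sketch that writes a tiny MapFile and looks one key back up; the directory name and data are made up. Appending keys out of sorted order makes MapFile.Writer throw an IOException:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    String dir = "demo.map"; // a MapFile is a directory holding /data and /index

    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, dir, Text.class, IntWritable.class);
    try {
      // Entries MUST be appended in key-sorted order;
      // appending "b" before "a" would throw an IOException.
      writer.append(new Text("a"), new IntWritable(1));
      writer.append(new Text("b"), new IntWritable(2));
    } finally {
      writer.close();
    }

    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    try {
      IntWritable value = new IntWritable();
      // get() binary-searches the in-memory index, then seeks in the data file.
      if (reader.get(new Text("b"), value) != null) {
        System.out.println("b -> " + value);
      }
    } finally {
      reader.close();
    }
  }
}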