I came across two fantastic blog posts, HFile and SequenceFile+MapFile+SetFile+ArrayFile. This post is a hands-on workshop with each file type: I will write some code to read and write them, maybe along with a few benchmarks.

SequenceFile

The Javadoc does a good job explaining the ins and outs of SequenceFile. I first came across SequenceFile when I started working with Apache Nutch. After Nutch crawls the data, it is stored on the file system as some weird binary files. I later learned that much of Nutch’s code is built on Hadoop components (Doug Cutting was behind both projects) and that the crawled data is stored as SequenceFile by default.

I followed the Nutch quick start tutorial and managed to crawl a few links under the domain nutch.apache.org. There are several “dbs” in the crawl folder, including crawldb (crawling status/progress), linkdb (links) and segments (all the detailed content). Let’s take a look at the current crawldb. Here are a few screenshots of what the data looks like there.


As John Zuanich’s blog post puts it, “The MapFile is a directory that contains two SequenceFile: the data file (“/data”) and the index file (“/index”).” So the crawldb folder is itself a MapFile, and both data and index are SequenceFiles.

The good news is that even in the raw binary, we can tell that the header of each file matches the SequenceFile documentation.

SequenceFile Header

version – 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)

keyClassName – key class

valueClassName – value class

compression – A boolean which specifies if compression is turned on for keys/values in this file.

blockCompression – A boolean which specifies if block-compression is turned on for keys/values in this file.

compression codec – CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).

metadata – SequenceFile.Metadata for this file.

sync – A sync marker to denote end of the header.
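To make the header layout above concrete, here is a minimal sketch of a header parser that uses only the JDK, not the Hadoop library. The class name `SeqFileHeader` and its helper methods are my own hypothetical names. The sketch assumes a version 6 file, class names and metadata strings stored as a Hadoop-style variable-length int followed by UTF-8 bytes, and a 16-byte sync marker. That is my reading of the format, so treat it as an assumption rather than a specification.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses only the header of a SequenceFile (version 6 layout assumed). */
class SeqFileHeader {
    int version;
    String keyClassName;
    String valueClassName;
    boolean compressed;
    boolean blockCompressed;
    String codecClassName; // stays null when compression is off
    Map<String, String> metadata = new LinkedHashMap<>();
    byte[] sync = new byte[16];

    // Hadoop-style variable-length integer: a single byte for -112..127,
    // otherwise a length marker followed by big-endian payload bytes.
    static long readVLong(DataInputStream in) throws IOException {
        byte first = in.readByte();
        if (first >= -112) return first;
        boolean negative = first < -120;
        int len = negative ? -(first + 120) : -(first + 112);
        long v = 0;
        for (int i = 0; i < len; i++) v = (v << 8) | (in.readByte() & 0xFF);
        return negative ? ~v : v;
    }

    // A string is stored as a variable-length byte count followed by UTF-8 bytes.
    static String readText(DataInputStream in) throws IOException {
        byte[] buf = new byte[(int) readVLong(in)];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    static SeqFileHeader parse(DataInputStream in) throws IOException {
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') throw new IOException("not a SequenceFile");
        SeqFileHeader h = new SeqFileHeader();
        h.version = in.readByte();
        if (h.version != 6) throw new IOException("this sketch only handles version 6");
        h.keyClassName = readText(in);
        h.valueClassName = readText(in);
        h.compressed = in.readBoolean();
        h.blockCompressed = in.readBoolean();
        if (h.compressed) h.codecClassName = readText(in);
        int entries = in.readInt(); // number of metadata key/value pairs
        for (int i = 0; i < entries; i++) h.metadata.put(readText(in), readText(in));
        in.readFully(h.sync);
        return h;
    }

    // Builds a tiny uncompressed header in memory so the parser can be tried without a real file.
    static byte[] sampleHeader(String keyClass, String valueClass) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.write(new byte[] {'S', 'E', 'Q', 6});
        for (String s : new String[] {keyClass, valueClass}) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            out.writeByte(b.length); // names shorter than 128 bytes fit in a one-byte length
            out.write(b);
        }
        out.writeBoolean(false); // compression
        out.writeBoolean(false); // blockCompression
        out.writeInt(0);         // no metadata entries
        out.write(new byte[16]); // placeholder sync marker
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = sampleHeader("org.apache.hadoop.io.BytesWritable", "org.apache.hadoop.io.Text");
        SeqFileHeader h = parse(new DataInputStream(new ByteArrayInputStream(raw)));
        System.out.println("SEQ" + h.version + " key=" + h.keyClassName + " value=" + h.valueClassName
                + " compressed=" + h.compressed + " blockCompressed=" + h.blockCompressed);
    }
}
```

To try it on a real file, one could wrap a `FileInputStream` in a `DataInputStream` and pass it to `parse`. Older versions such as SEQ4 lay out the header differently, which is why the sketch refuses anything but version 6.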

Now let’s write some code to read in the data file and deserialize it into a human-readable format. Then we can verify whether it matches exactly what the Javadoc told us.

Here is another blog post, from Anagha, who shared some code for working with MapFile. After I ran her code against the crawldb, I realized that although the crawldb looks like a MapFile, its data is actually stored as a Nutch-specific type called CrawlDatum. In the end, I decided to find another source of SequenceFile data: creating a table in Hive stored as SequenceFile.

I created a Hive table stored as SequenceFile and copied its underlying data file to my laptop. I then used the following Java code to read the SequenceFile. Here are two screenshots of what that looked like:

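With Hadoop on the classpath, the usual way to read such a file is `SequenceFile.Reader` and its `next(key, value)` loop. As a rough illustration of what that loop does at the byte level, here is a minimal JDK-only sketch that walks the records of an uncompressed file. Everything in it is an assumption on my side: the class `RawSeqFileReader` is a hypothetical name, the record layout (record length, key length, key bytes, value bytes, with a length of -1 announcing a sync marker) is my reading of the format, and the sample data imitates a Hive-style file with an empty BytesWritable key and a Text value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: walks the records of an uncompressed, version 6 SequenceFile. */
class RawSeqFileReader {
    // Hadoop-style variable-length integer (one byte for -112..127, else marker plus payload).
    static long readVLong(DataInputStream in) throws IOException {
        byte first = in.readByte();
        if (first >= -112) return first;
        boolean negative = first < -120;
        int len = negative ? -(first + 120) : -(first + 112);
        long v = 0;
        for (int i = 0; i < len; i++) v = (v << 8) | (in.readByte() & 0xFF);
        return negative ? ~v : v;
    }

    static String readText(DataInputStream in) throws IOException {
        byte[] buf = new byte[(int) readVLong(in)];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    // Decodes a value that was serialized as Text (length prefix plus UTF-8 bytes).
    static String decodeText(byte[] value) throws IOException {
        return readText(new DataInputStream(new ByteArrayInputStream(value)));
    }

    /** Returns each record as {keyBytes, valueBytes}, still in serialized form. */
    static List<byte[][]> readRecords(DataInputStream in) throws IOException {
        byte[] magic = new byte[4];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q' || magic[3] != 6) {
            throw new IOException("expected a version 6 SequenceFile");
        }
        readText(in); // key class name (skipped)
        readText(in); // value class name (skipped)
        boolean compressed = in.readBoolean();
        boolean blockCompressed = in.readBoolean();
        if (compressed || blockCompressed) throw new IOException("compression is not handled in this sketch");
        int metadataEntries = in.readInt();
        for (int i = 0; i < 2 * metadataEntries; i++) readText(in); // skip metadata keys and values
        byte[] sync = new byte[16];
        in.readFully(sync);

        List<byte[][]> records = new ArrayList<>();
        while (true) {
            int recordLength;
            try {
                recordLength = in.readInt();
            } catch (EOFException endOfFile) {
                break;
            }
            if (recordLength == -1) { // sync escape: the 16-byte marker follows
                byte[] marker = new byte[16];
                in.readFully(marker);
                if (!Arrays.equals(marker, sync)) throw new IOException("sync marker mismatch");
                continue;
            }
            int keyLength = in.readInt();
            byte[] key = new byte[keyLength];
            in.readFully(key);
            byte[] value = new byte[recordLength - keyLength];
            in.readFully(value);
            records.add(new byte[][] {key, value});
        }
        return records;
    }

    // Builds a small in-memory file: empty BytesWritable keys, short Text values (under 128 bytes each).
    static byte[] sampleFile(String... rows) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.write(new byte[] {'S', 'E', 'Q', 6});
        for (String cls : new String[] {"org.apache.hadoop.io.BytesWritable", "org.apache.hadoop.io.Text"}) {
            out.writeByte(cls.length());
            out.writeBytes(cls);
        }
        out.writeBoolean(false);
        out.writeBoolean(false);
        out.writeInt(0);
        out.write(new byte[16]); // placeholder sync marker
        for (String row : rows) {
            byte[] text = row.getBytes(StandardCharsets.UTF_8);
            out.writeInt(4 + 1 + text.length); // record length = key bytes + value bytes
            out.writeInt(4);                   // key length
            out.writeInt(0);                   // empty BytesWritable: a 4-byte length of zero
            out.writeByte(text.length);        // Text length prefix (one byte for short rows)
            out.write(text);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] file = sampleFile("1,alice", "2,bob");
        for (byte[][] record : readRecords(new DataInputStream(new ByteArrayInputStream(file)))) {
            System.out.println(decodeText(record[1]));
        }
    }
}
```

The sample rows are made up for illustration and are not the contents of my Hive table. For anything beyond poking at the format, the Hadoop reader is the better choice, since it also handles compression and custom Writable types.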

We have now seen what a SequenceFile looks like and how to access one from Java. The next question is why we would use SequenceFile at all, considering it is nothing but a file format for storing key-value pairs that only supports appends. So what are the benefits? This Stack Overflow question can probably help answer that.

MapFile

Quoted from the Javadoc:

A file-based map from keys to values.

A map is a directory containing two files, the data file, containing all keys and values in the map, and a smaller index file, containing a fraction of the keys. The fraction is determined by MapFile.Writer.getIndexInterval().

The index file is read entirely into memory. Thus key implementations should try to keep themselves small.

Map files are created by adding entries in-order.

I think this explanation is pretty straightforward, but one thing worth pointing out is the phrase “in-order”: entries have to be appended in sorted key order.
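To see why the ordering matters, here is a minimal in-memory sketch of the idea, not the real MapFile implementation. The class `SparseIndexMap` and all of its members are hypothetical names of mine. It keeps every entry in a sorted "data" list and only every N-th key in a small "index" list. A lookup binary-searches the index and then scans forward in the data, which only works because the keys were appended in order.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of the MapFile idea: sorted data plus a sparse index. */
class SparseIndexMap {
    private final int indexInterval;
    private final List<String> keys = new ArrayList<>();        // stands in for the data file
    private final List<String> values = new ArrayList<>();
    private final List<String> indexKeys = new ArrayList<>();   // stands in for the index file
    private final List<Integer> indexPositions = new ArrayList<>();

    SparseIndexMap(int indexInterval) {
        this.indexInterval = indexInterval;
    }

    /** Appends an entry; a key smaller than the previous one is rejected. */
    void append(String key, String value) {
        if (!keys.isEmpty() && key.compareTo(keys.get(keys.size() - 1)) < 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (keys.size() % indexInterval == 0) { // only every N-th key goes into the index
            indexKeys.add(key);
            indexPositions.add(keys.size());
        }
        keys.add(key);
        values.add(value);
    }

    String get(String key) {
        // Binary search the small index for the last indexed key <= the requested key.
        int lo = 0, hi = indexKeys.size() - 1, slot = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                slot = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (slot < 0) return null; // smaller than the first key, so it cannot be present
        // Scan forward in the data from that position; sorted order lets us stop early.
        for (int i = indexPositions.get(slot); i < keys.size(); i++) {
            int cmp = keys.get(i).compareTo(key);
            if (cmp == 0) return values.get(i);
            if (cmp > 0) break;
        }
        return null;
    }

    int indexSize() {
        return indexKeys.size();
    }

    public static void main(String[] args) {
        SparseIndexMap map = new SparseIndexMap(4);
        for (int i = 0; i < 10; i++) map.append(String.format("key%02d", i), "value" + i);
        System.out.println("index entries: " + map.indexSize());
        System.out.println("key07 -> " + map.get("key07"));
    }
}
```

The trade-off is the one the Javadoc hints at: a larger interval means a smaller index in memory but a longer scan per lookup. The sketch throws an unchecked exception on an out-of-order key purely for brevity.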

datafireball

Hadoop File Type
