I came across two fantastic blog posts, one on HFile and one on SequenceFile, MapFile, SetFile and ArrayFile. This post is a hands-on workshop with each file type: writing some code to read and write them, maybe along with a few benchmarks.
The Javadoc does a good job explaining the ins and outs of SequenceFile. I first came across SequenceFile when I started working with Apache Nutch. After Nutch crawls the data, it stores it on the file system as some weird binary files. I later learned that much of Nutch’s code is built on Hadoop components (both projects were started by Doug Cutting) and that the crawled data is stored as SequenceFile by default.
I followed the Nutch quick start tutorial and managed to crawl a few links under the domain nutch.apache.org. There are several “dbs” in the crawl folder, including crawldb (crawling status/progress), linkdb (links) and segments (all the detailed content). Let’s take a look at the current crawldb; here are a few screenshots of what the data looks like there.
Quoting John Zuanich’s blog post: “The MapFile is a directory that contains two SequenceFile: the data file (“/data”) and the index file (“/index”).” So the crawldb folder looks like a MapFile, and both data and index are SequenceFiles themselves.
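The directory layout described in that quote can be checked with a few lines of plain Java. This is only a minimal sketch: the class and method names (`MapFileLayoutCheck`, `looksLikeMapFile`) are made up for illustration, and it only tests that `data` and `index` exist and start with the `SEQ` magic bytes, which is a necessary condition rather than proof that the directory is a valid MapFile.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Hadoop or Nutch.
final class MapFileLayoutCheck {

    /** Returns true if the file exists and its first three bytes are 'S', 'E', 'Q'. */
    static boolean startsWithSeqMagic(Path file) throws IOException {
        if (!Files.isRegularFile(file)) {
            return false;
        }
        byte[] magic = new byte[3];
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            in.readFully(magic);
        } catch (EOFException e) {
            return false; // file is shorter than three bytes
        }
        return magic[0] == 'S' && magic[1] == 'E' && magic[2] == 'Q';
    }

    /** Returns true if dir contains "data" and "index" files that both start with SEQ. */
    static boolean looksLikeMapFile(Path dir) throws IOException {
        return Files.isDirectory(dir)
                && startsWithSeqMagic(dir.resolve("data"))
                && startsWithSeqMagic(dir.resolve("index"));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MapFileLayoutCheck <directory>");
            return;
        }
        System.out.println(looksLikeMapFile(Paths.get(args[0])));
    }
}
```

Pointing it at the directory that holds the `data` and `index` files should print `true` if both files carry the SequenceFile magic header.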
The good news is that even in the plain binary format, we can tell the header part of the files matches the documentation of SequenceFile.
- version – 3 bytes of magic header SEQ, followed by 1 byte of actual version number (e.g. SEQ4 or SEQ6)
- keyClassName – key class
- valueClassName – value class
- compression – A boolean which specifies if compression is turned on for keys/values in this file.
- blockCompression – A boolean which specifies if block-compression is turned on for keys/values in this file.
- compression codec – CompressionCodec class which is used for compression of keys and/or values (if compression is enabled).
- metadata – SequenceFile.Metadata for this file.
- sync – A sync marker to denote end of the header.
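As a rough illustration of the header fields listed above, here is a minimal sketch that decodes them with nothing but the Java standard library. It rests on several assumptions: the class `SeqHeaderSketch` and its members are hypothetical names, it only accepts version 6 headers (older versions may omit some fields), it assumes class names are stored as a one-byte length followed by UTF-8 bytes (which only holds for names up to 127 bytes), and it stops before the metadata and sync marker.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stdlib-only header reader; real code would use Hadoop's SequenceFile.Reader.
final class SeqHeaderSketch {

    /** The header fields this sketch decodes (metadata and sync are not parsed). */
    static final class Header {
        int version;
        String keyClassName;
        String valueClassName;
        boolean compression;
        boolean blockCompression;
        String codecClassName; // null when compression is off
    }

    /** Reads a length-prefixed UTF-8 string; only one-byte lengths (0..127) are handled. */
    static String readShortString(DataInputStream in) throws IOException {
        int len = in.readByte();
        if (len < 0) {
            throw new IOException("multi-byte length prefix is not handled by this sketch");
        }
        byte[] buf = new byte[len];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    static Header parse(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("not a SequenceFile: missing SEQ magic");
        }
        Header h = new Header();
        h.version = in.readByte();
        if (h.version != 6) {
            // Assumption: only the SEQ6 layout is handled here.
            throw new IOException("this sketch only handles SEQ6, got version " + h.version);
        }
        h.keyClassName = readShortString(in);
        h.valueClassName = readShortString(in);
        h.compression = in.readBoolean();
        h.blockCompression = in.readBoolean();
        if (h.compression) {
            h.codecClassName = readShortString(in);
        }
        return h; // metadata and the sync marker follow in the stream, unparsed
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SeqHeaderSketch <sequence-file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            Header h = parse(in);
            System.out.println("version=" + h.version
                    + " key=" + h.keyClassName
                    + " value=" + h.valueClassName
                    + " compression=" + h.compression
                    + " blockCompression=" + h.blockCompression
                    + " codec=" + h.codecClassName);
        }
    }
}
```

This is only meant for poking at a header by hand; for reading actual records, Hadoop's own reader is the right tool.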
Now let’s write some code to read in the data file and deserialize it into a human-readable format. Then we can verify whether it matches exactly what the Javadoc told us.
Here is another blog post, from Anagha, who shared some code for working with MapFile. After I ran her code against the crawldb, I realized that although it looks like a MapFile, its values are a Nutch-specific type, CrawlDatum, rather than a standard Hadoop type. In the end, I decided to get a SequenceFile from another source: creating a table in Hive stored as SequenceFile.
I created a table stored as SequenceFile and then saved the backing data file to my laptop. I then used the following Java code to read the SequenceFile. Here are two screenshots of what that looked like:
Now we have seen what a SequenceFile looks like and how to access it from Java. The next question is why we would use SequenceFile at all, considering it is nothing but a file format for storing key-value pairs that only supports appending. So what are the benefits? This Stack Overflow question can probably help answer that.
Quoted from the Javadoc:
A file-based map from keys to values.
A map is a directory containing two files, the data file, containing all keys and values in the map, and a smaller index file, containing a fraction of the keys. The fraction is determined by
The index file is read entirely into memory. Thus key implementations should try to keep themselves small.
Map files are created by adding entries in-order.
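I think this explanation is pretty straightforward, but one thing worth pointing out is the “in-order” part: entries have to be added in sorted key order, which is what lets a small index over a fraction of the keys locate any entry.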
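To make the idea of a sorted data file plus a sparse index concrete, here is a toy in-memory model. It is not Hadoop's MapFile implementation: the class `SparseIndexSketch`, its `indexInterval` parameter and its methods are all hypothetical, and list positions stand in for the file offsets a real index would store. It only shows why sorted appends let a small index narrow a lookup down to a short scan.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "sorted data + sparse index"; names are illustrative only.
final class SparseIndexSketch {
    private final int indexInterval;
    private final List<String> keys = new ArrayList<>();      // stands in for the data file
    private final List<String> values = new ArrayList<>();
    private final List<String> indexKeys = new ArrayList<>(); // stands in for the index file
    private final List<Integer> indexPositions = new ArrayList<>();

    SparseIndexSketch(int indexInterval) {
        if (indexInterval < 1) {
            throw new IllegalArgumentException("indexInterval must be >= 1");
        }
        this.indexInterval = indexInterval;
    }

    /** Appends an entry; keys must arrive in non-decreasing order. */
    void append(String key, String value) {
        if (!keys.isEmpty() && key.compareTo(keys.get(keys.size() - 1)) < 0) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (keys.size() % indexInterval == 0) {
            // Record every indexInterval-th key together with its position.
            indexKeys.add(key);
            indexPositions.add(keys.size());
        }
        keys.add(key);
        values.add(value);
    }

    /** Looks up a key: binary search on the sparse index, then a short forward scan. */
    String get(String key) {
        int lo = 0;
        int hi = indexKeys.size() - 1;
        int start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                start = mid; // last indexed key <= target so far
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null; // empty map, or key sorts before the first entry
        }
        int pos = indexPositions.get(start);
        int end = Math.min(pos + indexInterval, keys.size());
        for (int i = pos; i < end; i++) {
            int cmp = keys.get(i).compareTo(key);
            if (cmp == 0) {
                return values.get(i);
            }
            if (cmp > 0) {
                break; // passed where the key would be
            }
        }
        return null;
    }

    int indexSize() {
        return indexKeys.size();
    }
}
```

With an interval of 2 and five entries, the index holds three keys, and any lookup scans at most two entries after the binary search. If entries could arrive out of order, the binary search over the index would no longer be valid, which is the reason the ordering requirement matters.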