Nutch – How regex-urlfilter.txt really works!?

In a short sentence: Nutch goes through regex-urlfilter.txt line by line and checks whether the regular expression on each line matches the URL. At the first match, it includes or excludes the URL depending on the line's sign (+ or -) and SKIPS ALL THE REGEXES BELOW THAT ONE. If it runs through every filter and not a single regular expression matches, the URL is excluded.

To be honest, my first assumption about how regex-urlfilter.txt works was totally “wrong”. What I was thinking was: “OK, every regex I put in there will be applied to the URL, and in the end the URL will be excluded or included depending on the results of all the regexes.” That is not at all how Nutch is implemented. In Nutch, a URL is actually decided by only one regular expression: the FIRST one that matches.

If you have a better time understanding code than my emotional description, here is a snippet of the source code for the filter part.


As you can see, while it loops through all the rules/regexes, the method returns as soon as the URL matches a rule: either the URL (include) or null (exclude), depending on the rule's “sign”. The source code of the Rule object is also attached below to help you understand the accept and match methods. It is nothing more than java.util.regex, whose usage is better explained here.
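For readers who just want the shape of that loop, here is a minimal stand-alone sketch of the first-match behaviour described above. It is not the Nutch source: the class name FirstMatchFilter, the example.com URLs, and the choice of find() (a rule matches if the regex occurs anywhere in the URL) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified model of a first-match URL filter; hypothetical, not the actual Nutch code. */
class FirstMatchFilter {

    /** One line of the rule file: a sign (+ include, - exclude) plus a regex. */
    static final class Rule {
        private final boolean accept;
        private final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }

        boolean accept() {
            return accept;
        }

        /** Assumption: the regex only has to be found somewhere in the URL. */
        boolean match(String url) {
            return pattern.matcher(url).find();
        }
    }

    /** Returns the URL if it is included, or null if it is excluded. */
    static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.match(url)) {
                // The first matching rule decides; the rules below it are never consulted.
                return rule.accept() ? url : null;
            }
        }
        // No rule matched at all, so the URL is excluded.
        return null;
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
                new Rule(false, "[?]"),                    // like "-[?]"
                new Rule(true, "^http://example\\.com/")); // like "+^http://example\.com/"
        System.out.println(filter(rules, "http://example.com/page"));
        System.out.println(filter(rules, "http://example.com/page?id=1"));
        System.out.println(filter(rules, "http://other.org/"));
    }
}
```

With these two hypothetical rules, the first URL comes back unchanged, the second is rejected by the “-[?]” rule, and the third is rejected because nothing matches it.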


Here are a few examples to let you get started without going through the source code:

Say you want to crawl the URLs that belong to the /browse directory of a website, but you don’t want the URLs that contain a question mark ‘?’, i.e. dynamic pages. If you put:

# regex-urlfilter.txt

Then your crawler will not filter out those dynamic pages, because when Nutch starts filtering after normalizing, it first checks whether the URL starts with the /browse prefix of the site, and the answer is yes, so it stops filtering right there and categorizes the URL as included. Of course, your regex “-[?]” is totally ignored in this case. To make this work, you can just switch the order, like:

# regex-urlfilter.txt

Then it will be perfect. Also, theoretical analysis always feels weak when you ask yourself “I have a big file of regular expressions, will it work or not?”, and you don’t want to start crawling only to realize it is wrong after unleashing the monster. You can set up Nutch in Eclipse and test it easily with literally a few lines of code:
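If you don't have a Nutch checkout at hand, the effect of rule order can also be checked with a stand-alone sketch that uses only java.util.regex. Everything here is an assumption for illustration: the class RuleOrderCheck is hypothetical, www.example.com stands in for the real site, and the rule parsing (one rule per line, a leading + or -, lines starting with # skipped) is a simplified model rather than Nutch's own parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Stand-alone model for trying out rule order; it does not use any Nutch class. */
class RuleOrderCheck {
    private final List<Boolean> signs = new ArrayList<>();
    private final List<Pattern> patterns = new ArrayList<>();

    /** Parses rule text: '+' or '-' followed by a regex; blank lines and '#' comments are skipped. */
    RuleOrderCheck(String ruleText) {
        for (String raw : ruleText.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            signs.add(sign == '+');
            patterns.add(Pattern.compile(line.substring(1)));
        }
    }

    /** First matching rule wins; a URL that matches no rule is rejected. */
    boolean accepts(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) {
                return signs.get(i);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical rule files: the same two rules in the two possible orders.
        String includeFirst = "# regex-urlfilter.txt\n+^http://www\\.example\\.com/browse\n-[?]\n";
        String excludeFirst = "# regex-urlfilter.txt\n-[?]\n+^http://www\\.example\\.com/browse\n";
        String[] urls = {
            "http://www.example.com/browse/list",
            "http://www.example.com/browse/list?page=2",
            "http://www.example.com/about"
        };
        RuleOrderCheck a = new RuleOrderCheck(includeFirst);
        RuleOrderCheck b = new RuleOrderCheck(excludeFirst);
        for (String url : urls) {
            System.out.println(url + "  include-first=" + a.accepts(url)
                    + "  exclude-first=" + b.accepts(url));
        }
    }
}
```

Only the dynamic URL differs between the two orders: it is accepted when the + rule comes first and rejected when “-[?]” comes first, which is exactly the pitfall described above.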


Now good luck!

Virtualbox – My New Development Environment : Eclipse in Virtualbox

As a developer, I constantly need to install different kinds of software and packages, and sometimes I even need to download binaries or build from source. As the days go by, I end up in a situation where software that works on my machine won’t work on my colleagues’ machines, for all kinds of reasons. I also avoid trying software that needs a lot of setup because I am afraid of breaking my current working equilibrium. This weekend, I set up an Ubuntu 14.04 VirtualBox machine with Eclipse installed. After tweaking the 3D acceleration for a while, I have an environment that is ready for Apache Nutch development. And I can really rewind the clock whenever I think “shoot… which file did I change? Which environment variable did I change? It was working perfectly before…”.

You basically install Ubuntu on VirtualBox, then either install Eclipse from the market place or build it from source. Here are a few links that helped me a lot:

1. How to install Eclipse from source
2. Why the GUI is super slow and how to fix it.
3. Wonderful tutorial of setting up Nutch in Eclipse
4. Make subclipse stop complaining

Then you are good to go!


The snapshot feature in VirtualBox is already very handy. It is like the image versioning idea of Docker, but the GUI in VirtualBox is much more mature than Docker's.