Apache Nutch is a fairly old project that builds itself with Apache Ant. I wanted to develop a plugin for filtering the content of parsed pages, and because I am an IntelliJ IDEA user I wanted to do it in this IDE. This how-to helps you set up an IDEA project so you can develop and debug a Nutch (parse) plugin. I must mention the great article Precise data extraction with Apache Nutch by Emir Dizdarevic, which helped me a lot.
This article comes with a template project: https://github.com/vkuzel/Nutch-Plugin-Development-Template
Overview
When started from IDEA, the project is first built by Maven and then deployed to a Nutch binary installation. A Nutch task (in this case parse) is then started, and Nutch itself runs the plugin. IDEA attaches a debugger to the Nutch process so you can debug the code.
Maven project
Apache Nutch is available in the Maven repository, but unfortunately it pulls in some problematic dependencies. To compile the plugin you need to add a dependency on Nutch (group org.apache.nutch, artifact nutch, version 1.10) and on Hadoop (group org.apache.hadoop, artifact hadoop-core, version 1.2.0; Hadoop is a core library of Nutch). You also need to exclude the org.apache.cxf dependency from the Nutch library because Maven cannot resolve it. By the way, I found that different versions of Nutch require different libraries to be excluded.
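The dependency setup described above could look roughly like the following pom.xml fragment. This is a minimal sketch, not a copy of the template project's file: the wildcard artifact in the exclusion is an assumption (it excludes everything in the org.apache.cxf group and requires Maven 3.2.1 or newer), and other Nutch versions may need different exclusions.

```xml
<dependencies>
  <!-- Nutch itself, needed to compile against the plugin API -->
  <dependency>
    <groupId>org.apache.nutch</groupId>
    <artifactId>nutch</artifactId>
    <version>1.10</version>
    <exclusions>
      <!-- Assumption: exclude the whole org.apache.cxf group,
           which Maven cannot resolve. Wildcards need Maven 3.2.1+. -->
      <exclusion>
        <groupId>org.apache.cxf</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <!-- Hadoop core library used by Nutch -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.0</version>
  </dependency>
</dependencies>
```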
To let Nutch run the plugin, the plugin has to be built and deployed to the Nutch installation directory before every debug session. This is handled by an external shell script, deploy_plugin_to_nutch_for_debug.sh, which Maven executes in the install phase of the build. For this task I added the exec-maven-plugin to the pom.xml file.
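As an illustration, binding the deploy script to the install phase might be configured like this. It is a sketch under assumptions: the plugin version, the execution id, and the script's location in the project root are all illustrative choices and may differ from the template project.

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>exec-maven-plugin</artifactId>
      <!-- Assumption: any reasonably recent version should work -->
      <version>1.4.0</version>
      <executions>
        <execution>
          <!-- Hypothetical id, chosen for readability -->
          <id>deploy-plugin-to-nutch</id>
          <!-- Run the script once the artifact has been built -->
          <phase>install</phase>
          <goals>
            <goal>exec</goal>
          </goals>
          <configuration>
            <!-- Assumption: the script lives in the project root -->
            <executable>${project.basedir}/deploy_plugin_to_nutch_for_debug.sh</executable>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With a binding like this, running the install goal both builds the plugin and copies it into the Nutch installation, so a single Maven step is enough before each debug session.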
Nutch installation
Since this article covers the development of a Nutch plugin, there is no need to have the complete Nutch source code. To run the plugin you need a properly configured Nutch binary installation, which can be downloaded from its official site. The template project contains an empty directory, nutch-1.10, where Nutch should be copied. The only necessary change to the Nutch configuration (apart from the default installation process) is to add the plugin to the plugin.includes directive so that Nutch can recognize it.
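For example, the plugin.includes override is typically placed in conf/nutch-site.xml of the Nutch installation. The fragment below is only a sketch: my-parse-filter is a hypothetical plugin id standing in for your own, and the rest of the value is what I believe to be the Nutch 1.10 default, so copy the actual value from your conf/nutch-default.xml rather than relying on it.

```xml
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumed default value for Nutch 1.10 (verify against
         conf/nutch-default.xml), followed by the hypothetical
         plugin id my-parse-filter appended at the end -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|my-parse-filter</value>
  </property>
</configuration>
```

The value is a regular expression matched against plugin ids, so the id appended here has to match the id declared in the plugin's own plugin.xml.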
There is also an archive, test_data.zip, with a pre-downloaded page that can be used to test the parser plugin. This archive is extracted to the nutch-1.10/test_data directory so the plugin always works with the same data.
Project (debug) configuration
Because the project uses an external application (Nutch), a custom Application run/debug configuration is needed in IDEA. Add a new debug configuration with the following parameters:
- Main class: org.apache.nutch.parse.ParseSegment. Nutch is started directly by calling its class, not by the usual shell script located in the bin directory of its installation.
- Program arguments: test_data/crawl/segments/20151010172800. This is the path to the test data extracted from the test_data.zip archive. The test data contains a single HTML page, just enough to test one pass through the plugin.
- Working directory: nutch-1.10. In addition to setting the working directory, it is also necessary to add Nutch's conf and lib directories to the classpath. Do so in the Project Settings -> Modules -> Dependencies menu.
- Before launch, run Maven goal: clean install. Before every execution the module has to be built and deployed to the Nutch installation, which is why the install goal of the Maven build has to run first.
Now by running the debug configuration you should be able to debug the plugin.