Apache Nutch is a fairly old project that uses Apache Ant to build itself. I wanted to develop a plugin for filtering the content of parsed pages, and because I am an IntelliJ IDEA user I wanted to do it in this IDE. This how-to helps you set up an IDEA project so you can develop and debug a Nutch (parse) plugin. I must mention the great article Precise data extraction with Apache Nutch by Emir Dizdarevic, which helped me a lot.
Part of the article is a template project: https://github.com/vkuzel/Nutch-Plugin-Development-Template
When started from IDEA, the project is first built by Maven and then deployed to a Nutch binary installation. A Nutch task (in this case parse) is then started and Nutch itself runs the plugin. IDEA attaches a debugger to the Nutch process so you can debug the code.
Apache Nutch is present in the Maven repository but unfortunately brings in some problematic dependencies. To compile the plugin you need to add a dependency on Nutch (org.apache.nutch:nutch, version 1.10) and on Hadoop (org.apache.hadoop:hadoop-core, version 1.2.0; Hadoop is a core library of Nutch). To compile the project you also need to exclude the org.apache.cxf dependency from the Nutch library because Maven cannot resolve it. By the way, I found that different versions of Nutch require different libraries to be excluded.
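The dependency setup described above might look roughly like this in pom.xml. It is a sketch, not necessarily the template project's exact file: the coordinates and versions come from the text, while the wildcard artifactId in the exclusion is an assumption (the text names only the org.apache.cxf group) and requires Maven 3.2.1 or newer.

```xml
<dependencies>
  <!-- Nutch itself; coordinates and version taken from the text -->
  <dependency>
    <groupId>org.apache.nutch</groupId>
    <artifactId>nutch</artifactId>
    <version>1.10</version>
    <exclusions>
      <!-- Assumption: exclude every artifact in the org.apache.cxf group.
           Wildcard exclusions need Maven 3.2.1 or newer. -->
      <exclusion>
        <groupId>org.apache.cxf</groupId>
        <artifactId>*</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <!-- Hadoop core library used by Nutch -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.0</version>
  </dependency>
</dependencies>
```

On older Maven versions each excluded artifact would have to be listed with its own artifactId instead of the wildcard.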
To allow Nutch to run the plugin, it has to be built and deployed to the Nutch installation directory for every debug session. This is handled by an external shell script, deploy_plugin_to_nutch_for_debug.sh, which Maven executes in the install phase of the build. For this task I added the exec-maven-plugin plugin to the pom.xml file.
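A minimal sketch of how the script could be bound to the install phase with exec-maven-plugin is shown here. The plugin version, the execution id and the script location (project root) are assumptions made for illustration; only the plugin name, the script name and the phase come from the text.

```xml
<build>
  <plugins>
    <plugin>
      <groupId>org.codehaus.mojo</groupId>
      <artifactId>exec-maven-plugin</artifactId>
      <!-- Assumption: version chosen for illustration only -->
      <version>1.4.0</version>
      <executions>
        <execution>
          <!-- Hypothetical execution id -->
          <id>deploy-plugin-to-nutch</id>
          <!-- Run the script once the plugin JAR has been built and installed -->
          <phase>install</phase>
          <goals>
            <goal>exec</goal>
          </goals>
          <configuration>
            <!-- Assumption: the script lives in the project root and is executable -->
            <executable>${basedir}/deploy_plugin_to_nutch_for_debug.sh</executable>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

With a binding like this, running mvn clean install both builds the plugin and copies it into the Nutch installation in one step.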
Since this article covers development of a Nutch plugin, there is no need to have the complete Nutch source code. To run the plugin you need a properly configured Nutch binary installation, which can be downloaded from its official site. The template project contains an empty directory, nutch-1.10, into which Nutch should be copied. The only necessary change to the Nutch configuration (apart from the default installation process) is to add the plugin to the plugin.includes directive so that Nutch can recognize it.
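As an illustration, the plugin.includes override could be placed in conf/nutch-site.xml of the Nutch installation along these lines. The plugin id my-parse-filter is hypothetical, and the rest of the value is only assumed to resemble a typical Nutch 1.x default; copy the actual default from conf/nutch-default.xml of your installation and append your own plugin id to it.

```xml
<!-- nutch-1.10/conf/nutch-site.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: the leading part stands in for the default value shipped
         in conf/nutch-default.xml; "my-parse-filter" is a hypothetical plugin id. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|my-parse-filter</value>
  </property>
</configuration>
```

The value is a regular expression matched against plugin ids, so the id appended here has to match the id declared in the plugin's own plugin.xml.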
There is also an archive, test_data.zip, with a pre-downloaded page that can be used to test the parser plugin. This archive is extracted to the nutch-1.10/test_data directory so the plugin always works with the same data.
Project (debug) configuration
Because the project uses an external application (Nutch), a custom Application run/debug configuration is needed in IDEA. Add a new debug configuration with the following parameters:
Main class: org.apache.nutch.parse.ParseSegment. Nutch is started directly by calling its class, not by the usual shell script located in the bin directory of its installation.
Program arguments: test_data/crawl/segments/20151010172800. This is the path to the test data extracted from the test_data.zip archive. The test data contains a single HTML page, just enough to test one pass through the plugin.
Working directory: nutch-1.10. Besides this working directory, it is also necessary to add Nutch's lib directories to the classpath. Do so in the Project Settings -> Modules -> Dependencies menu.
Before launch, run Maven goal: clean install. The module has to be built and deployed to the Nutch installation before every execution, which is why the install goal of the Maven build has to run first.
Now by running the debug configuration you should be able to debug the plugin.