Apache Solr is a search server based on the Apache Lucene search library that lets you index and search text content. This makes it a good base for a concordancer service. Solr itself cannot gather data from the internet; Apache Nutch was created to handle that job. This article describes the basic steps of connecting Nutch to Solr and configuring a concordancer service.
This text covers the configuration for Solr 5.2.1 and Nutch 1.10 on Linux, Unix, or OS X. The configuration may change in future releases.
First, download the binary versions of Solr and Nutch from the projects' websites. Then unpack both into one directory, let's say concordancer.
Go to the Solr directory (solr-5.2.1) and start Solr by calling:
bin/solr start
Then create a search core named "concordancer":
bin/solr create -c concordancer
A core is the database of indexed data together with the configuration of how to search that data. For example, you can have several cores: one for searching intranet sites, another for searching internet websites.
The newly created core concordancer uses a default configuration that needs to be changed. Without these changes Nutch can't interoperate with Solr. Open the configuration file server/solr/concordancer/conf/solrconfig.xml.
Almost at the end of the file you can see the directive <updateRequestProcessorChain name="add-unknown-fields-to-the-schema">. It tells Solr to define the index structure (de facto the database structure for indexed data) dynamically while indexing. This means that when data with a new structure is indexed (like a website with all its metadata), new fields are added to the index structure described in the generated managed-schema file. We don't need this feature, so remove <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"> and its content. Also remove <schemaFactory class="ManagedIndexSchemaFactory"> and its content, so that Solr reads a hand-edited schema.xml instead of a managed schema. Then find <initParams path="/update/**">; it should contain the add-unknown-fields-to-the-schema parameter, so remove this directive too.
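To double-check the edit, a small shell helper can scan the file for leftovers. This is only a sketch: the function name check_schemaless_removed is made up for this article, and it merely greps for the two strings mentioned above. It does not validate the XML, and a match inside an XML comment is reported too.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) only if none of the
# schemaless-mode markers named in the text remain in the given file.
check_schemaless_removed() {
    conf_file="$1"
    status=0
    for marker in add-unknown-fields-to-the-schema ManagedIndexSchemaFactory; do
        if grep -q -- "$marker" "$conf_file"; then
            echo "still present: $marker"
            status=1
        fi
    done
    return "$status"
}

# Example, run from the solr-5.2.1 directory:
#   check_schemaless_removed server/solr/concordancer/conf/solrconfig.xml
```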
Search for all occurrences of the string _text_ in the config file and change them to text. The schema file introduced below does not contain a field named _text_, which is why we have to change the name.
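The renaming can also be done in one pass with sed instead of by hand. A minimal sketch, assuming a sed that accepts the -i.bak form (both GNU and BSD sed do); the function name rename_text_field is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replace every literal "_text_" with "text" in one file.
# "-i.bak" edits the file in place and keeps a backup copy (<file>.bak)
# next to the original.
rename_text_field() {
    sed -i.bak 's/_text_/text/g' "$1"
}

# Example, run from the solr-5.2.1 directory:
#   rename_text_field server/solr/concordancer/conf/solrconfig.xml
```

Once the result looks right, the .bak copy can be deleted.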
Now remove the generated file managed-schema from the configuration directory and replace it with the schema.xml file prepared by the Nutch developers. This means copying apache-nutch-1.10/conf/schema.xml to the solr-5.2.1/server/solr/concordancer/conf directory.
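Expressed as shell commands, the swap is one rm and one cp. The helper below is a sketch with a made-up name (install_nutch_schema); it takes the two directories as arguments so that it does not depend on where you unpacked the projects.

```shell
#!/bin/sh
# Hypothetical helper: drop the generated managed-schema and install the
# schema.xml shipped with Nutch into a core's conf directory.
#   $1 = Nutch directory, $2 = conf directory of the Solr core
install_nutch_schema() {
    nutch_dir="$1"
    core_conf="$2"
    rm -f "$core_conf/managed-schema"
    cp "$nutch_dir/conf/schema.xml" "$core_conf/schema.xml"
}

# Example, run from the directory that holds both projects:
#   install_nutch_schema apache-nutch-1.10 solr-5.2.1/server/solr/concordancer/conf
```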
Open the copied schema.xml and find the directive that defines the content field. Change its stored parameter from false to true, so that the directive reads <field name="content" type="text" stored="true" indexed="true"/>. This means that you don't just index the page's content but also store its text, which the concordancer needs in order to display it.
Find the directive <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> and delete it, because this filter class is not available in Solr 5. Then close the file and restart Solr by calling:
bin/solr restart
Let's turn our attention to crawling websites with Nutch. Go to the apache-nutch-1.10 directory and run the program without parameters:
bin/nutch
You should see output like this:
nutch 1.10-SNAPSHOT
Usage: nutch COMMAND
where COMMAND is one of:
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
.
.
.
If there is a problem, consult the Nutch tutorial.
Open the Nutch configuration file conf/nutch-site.xml and add the configuration of the Nutch User-Agent HTTP header:
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
Then create a directory that will contain files with the URLs to be crawled. In this directory create a file seed.txt with one URL per line. You should end up with a structure like this:
apache-nutch-1.10/urls/seed.txt
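From the shell, the directory and the seed file can be created as sketched below. The URL is only a placeholder and not part of the original setup; list your own start pages instead, one per line.

```shell
#!/bin/sh
# Run from the apache-nutch-1.10 directory.
# http://example.com/ is a placeholder seed URL (an assumption).
mkdir -p urls
printf '%s\n' 'http://example.com/' > urls/seed.txt
cat urls/seed.txt
```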
Then crawl and index the pages by calling the crawl command. Its last three arguments are the seed directory, the crawl directory, and the number of crawl rounds:
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/concordancer urls/ crawl/ 1
Troubleshooting: If an error appears while crawling, check the log files solr-5.2.1/server/logs/solr.log and apache-nutch-1.10/logs/hadoop.log. The schema.xml file now used by Solr still contains some problems, such as missing field types. Solving these problems is up to you.
Open Solr's query page http://localhost:8983/solr/#/concordancer/query and try searching the newly indexed data.
Let's configure Solr's concordancer ability. Open the configuration file server/solr/concordancer/conf/solrconfig.xml and locate the highlighter configuration (the tag <searchComponent class="solr.HighlightComponent" name="highlight">). Change the default boundary scanner from the simple boundary scanner to the break iterator and set the break iterator's hl.bs.type to SENTENCE. The boundary scanner finds sentence boundaries, so from now on the concordancer can display whole sentences.
<!-- Make sure there's only one default boundary scanner in the configuration file. -->
<boundaryScanner name="breakIterator" default="true" class="solr.highlight.BreakIteratorBoundaryScanner">
<lst name="defaults">
<!-- type should be one of CHARACTER, WORD(default), LINE and SENTENCE -->
<str name="hl.bs.type">SENTENCE</str>
<!-- language and country are used when constructing Locale object. -->
<!-- And the Locale object will be used when getting instance of BreakIterator -->
<str name="hl.bs.language">en</str>
<str name="hl.bs.country">US</str>
</lst>
</boundaryScanner>
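Whether exactly one boundary scanner is marked as default can be checked from the shell. The helper below is a sketch with a hypothetical name (count_default_scanners). It joins the file into one line first, so opening tags whose attributes are spread over several lines are counted as well; scanners that are commented out are counted too.

```shell
#!/bin/sh
# Hypothetical helper: print how many <boundaryScanner> opening tags in the
# given file carry default="true".
count_default_scanners() {
    tr '\n' ' ' < "$1" |
        grep -o '<boundaryScanner[^>]*default="true"' |
        wc -l |
        tr -d ' '
}

# Example, run from the solr-5.2.1 directory (the result should be 1):
#   count_default_scanners server/solr/concordancer/conf/solrconfig.xml
```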
Finally, we need to modify a request handler so that it displays the sentences found. Go to the request handler you want to use and add the highlighter configuration. I chose to modify the /query handler, so its configuration looks like the following:
<!-- A request handler that returns indented JSON by default -->
<requestHandler name="/query" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">true</str>
<!-- Highlighting defaults -->
<str name="hl">true</str>
<str name="hl.useFastVectorHighlighter">true</str>
<!-- Get snippets from content field. -->
<str name="hl.fl">content</str>
<!-- How many snippets are displayed per document. -->
<str name="hl.snippets">10</str>
<!-- You can experiment with snippet size. -->
<str name="hl.fragsize">100</str>
</lst>
</requestHandler>
Open the Solr admin page http://localhost:8983/solr/#/concordancer/query and fire some queries at the /query request handler. You should see a result similar to this:
"highlighting": {
"http://vkuzel.blogspot.cz/2009/02/zmena-rozliseni-monitoru-v-xbuntu-810.html": {
"content": [
"Problémy: Změna rozlišení monitoru v Xbuntu 8.10 skip to main | skip to sidebar Problémy Některé mé",
" rozlišení monitoru v Xbuntu 8.10 Nainstaloval jsem Xubuntu na starší stroj. Šlo o celeron 1G4 MHz s",
"Problém byl v rozlišení obrazovky a obnovovací frekvenci monitoru, která byla 640x480@60 . V klikacím",
]
}
}
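The same kind of query can be sent from the command line instead of the admin page. The sketch below only builds the request URL; the function name concordance_url is hypothetical, and it assumes Solr on localhost:8983, the core name concordancer, and a plain single word that needs no URL encoding.

```shell
#!/bin/sh
# Hypothetical helper: print the URL that asks the /query handler of the
# "concordancer" core for documents whose content field contains one word.
concordance_url() {
    printf 'http://localhost:8983/solr/concordancer/query?q=content:%s\n' "$1"
}

# Example (needs a running Solr and an indexed crawl):
#   curl -s "$(concordance_url monitoru)"
```

The response is the indented JSON configured for the /query handler, with the snippets listed under the highlighting key.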