Lucene Nutch

I've been playing with Nutch recently, partly as a way of getting back into Java development. I've got it creating a crawl database, and I can run searches against the Lucene index through the web interface. It's really fast, which is great. I have this idea of providing specialised search for my own sites, which would involve a lot of customisation of the Nutch web interface. I have yet to get it to compile from source though!

What I'm not able to do yet is update an already crawled database. I can only see how you'd delete the existing database and re-crawl the lot, which can't be right.

For anyone interested, here's my little how-to. It mostly follows someone else's tutorial, with modifications for Nutch 0.9.

It's based on the most useful tutorial I've found so far:
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html

The commands assume a crawl database in a directory called "crawl-tinysite".

Crawl URLs into a new database:
bin/nutch crawl urls -dir crawl-tinysite -depth 3
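
The "urls" argument in the crawl command is where the seed URLs come from. As far as I can tell, in Nutch 0.9 it is a directory of plain-text files with one URL per line (the older tutorial uses a single file instead). A minimal sketch of setting that up, where the file name "seed.txt" and the example.com address are just placeholders of mine:

```shell
# Create the seed directory that "bin/nutch crawl urls ..." reads from.
# "seed.txt" and the example.com URL are placeholders, not anything Nutch requires.
mkdir -p urls
printf '%s\n' 'http://www.example.com/' > urls/seed.txt

# Show what will be used as the starting point of the crawl.
cat urls/seed.txt
```

You'd swap the placeholder for the front page of the site you actually want crawled.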

Show statistics on crawl:
bin/nutch readdb crawl-tinysite/crawldb -stats

(Some of the tutorial's commands are no longer valid in Nutch 0.9.)

Show the database segments created by Nutch:
bin/nutch readseg -list -dir crawl-tinysite/segments/
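
The three commands above can be strung together in a small wrapper script. This is only a rough sketch: the variable names (NUTCH, CRAWL_DIR, DEPTH, DRY_RUN) and the function names are my own inventions, not Nutch settings, and by default it just prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: run the crawl and the two inspection commands from this how-to in order.
# NUTCH, CRAWL_DIR, DEPTH and DRY_RUN are illustrative names, not Nutch options.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl-tinysite}"
DEPTH="${DEPTH:-3}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Crawl, then show crawl statistics, then list the segments.
# Stops early if the crawl or the stats step fails.
crawl_and_report() {
    run "$NUTCH" crawl urls -dir "$CRAWL_DIR" -depth "$DEPTH" || return 1
    run "$NUTCH" readdb "$CRAWL_DIR/crawldb" -stats || return 1
    run "$NUTCH" readseg -list -dir "$CRAWL_DIR/segments/"
}

crawl_and_report
```

Run it from the Nutch install directory with DRY_RUN=0 to execute the commands for real.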