Monday, October 30, 2006

Nutch 0.8.1 is now available


Nutch is open source web-search software. It builds on Lucene Java, adding web-specific components such as a crawler, a link-graph database, and parsers for HTML and other document formats.

For more information about Nutch, please see the Nutch wiki.




Download: http://www.apache.org/dyn/closer.cgi/lucene/nutch/


Changes: 0.8 to 0.8.1

1. Changed log4j configuration to log to stdout on command-line
tools (siren)

2. NUTCH-260 - Updated hadoop.jar to contain patch from HADOOP-387
(siren)

3. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren)

4. Optionally skip pages with abnormally large Crawl-Delay values
(Dennis Kubes via ab)

5. Fix incorrect calculation of max and min scores in readdb -stats
(Chris Schneider via ab)

6. NUTCH-348 - Fix Generator to select highest scoring pages (Chris
Schneider and Stefan Groschupf via ab)

7. NUTCH-338 - Remove the text parser as an option for parsing PDF files
in parse-plugins.xml (Chris A. Mattmann via siren)

8. NUTCH-105 - Network error during robots.txt fetch causes file to
be ignored (Greg Kim via siren)

9. Use a CombiningCollector when calculating readdb -stats. This
drastically reduces the size of intermediate data, resulting in
significant speed-ups for large databases (ab)

10. NUTCH-332 - Fix doubling score caused by links to self (Stefan
Groschupf via ab)

11. NUTCH-336 - Differentiate between newly discovered pages and newly
injected pages (Chris Schneider via ab) NOTE: this changes the
scoring API; filter implementations need to be updated.

12. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf
via ab)

13. NUTCH-350 - URLs blocked by http.max.delays incorrectly marked as GONE
(Stefan Groschupf via ab)
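
To illustrate why the combining step in item 9 shrinks intermediate data, here is a minimal sketch of the general idea: each input split collapses its per-page values into one partial summary before anything is handed to the reducer. This is not Nutch or Hadoop code. It is written in Python for brevity, and every name in it (map_scores, combine, reduce_partials, the example URLs and scores) is hypothetical.

```python
def map_scores(records):
    """Emit one ("score", value) pair per page, as a naive stats mapper might."""
    for _url, score in records:
        yield "score", score


def combine(pairs):
    """Collapse many (key, value) pairs into one partial summary per key."""
    partial = {}
    for key, value in pairs:
        summary = partial.get(key)
        if summary is None:
            # Seed min and max from the first value seen, not from a constant.
            partial[key] = {"count": 1, "total": value, "min": value, "max": value}
        else:
            summary["count"] += 1
            summary["total"] += value
            summary["min"] = min(summary["min"], value)
            summary["max"] = max(summary["max"], value)
    return partial


def reduce_partials(partials):
    """Merge the per-split summaries into the final statistics."""
    final = {}
    for partial in partials:
        for key, summary in partial.items():
            merged = final.get(key)
            if merged is None:
                final[key] = dict(summary)
            else:
                merged["count"] += summary["count"]
                merged["total"] += summary["total"]
                merged["min"] = min(merged["min"], summary["min"])
                merged["max"] = max(merged["max"], summary["max"])
    return final


# Two hypothetical input splits of (url, score) records.
splits = [
    [("http://a.example/", 1.0), ("http://b.example/", 0.25)],
    [("http://c.example/", 3.5), ("http://d.example/", 0.5), ("http://e.example/", 2.0)],
]

# Each split is summarized locally, so the reducer sees one record per split
# instead of one record per page.
partials = [combine(map_scores(split)) for split in splits]
stats = reduce_partials(partials)
```

With five pages in two splits, the reducer receives two summaries rather than five values. The saving grows with the number of pages per split, which is consistent with the speed-up reported for large databases.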
