Wednesday, October 11, 2006

Lucene: How to design your own Web search application ?

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Apache Lucene is an open source project available for free download. Please use the links on the left to access Lucene.


How to design your own Web search application ?:

"Beef up Web search applications with Lucene"

by Deng Peng Zhou

mplement advanced search with Lucene

"Lucene supports several kinds of advanced searches, which I'll discuss in this section. I'll then demonstrate how to implement these searches with Lucene's Application Programming Interfaces (APIs).

Boolean operators

Most search engines provide Boolean operators so users can compose queries. Typical Boolean operators are AND, OR, and NOT. Lucene provides five Boolean operators: AND, OR, NOT, plus (+), and minus (-). I'll describe each of these operators.

  • OR: If you want to search for documents that contain the words "A" or "B," use the OR operator. Keep in mind that if you don't put any Boolean operator between two search words, the OR operator will be added between them automatically. For example, "Java OR Lucene" and "Java Lucene" both search for the terms "Java" or "Lucene."
  • AND: If you want to search for documents that contain more than one word, use the AND operator. For example, "Java AND Lucene" returns all documents that contain both "Java" and "Lucene."
  • NOT: Documents that contain the search word immediately after the NOT operator won't be retrieved. For example, if you want to search for documents that contain "Java" but not "Lucene," you may use the query "Java NOT Lucene." You cannot use this operator with only one term. For example, the query "NOT Java" returns no results.
  • +: The function of this operator is similar to the AND operator, but it only applies to the word immediately following it. For example, if you want to search documents that must contain "Java" and may contain "Lucene," you can use the query "+Java Lucene."
  • -: The function of this operator is the same as the NOT operator. The query "Java -Lucene" returns all of the documents that contain "Java" but not "Lucene."

Now look at how to implement a query with Boolean operators using Lucene's API. Listing 1 shows the process of doing searches with Boolean operators.

Full Article : here

by Deng Peng Zhou is a graduate student from Shanghai Jiaotong University. He works as an intern software engineer in IBM Shanghai Globalization Lab and is interested in Java technology and modern information retrieval. You can contact him at zhoudengpeng@yahoo.com.cn.


No comments: