Different people are fighting the same problem—information overload—using different approaches. Some have been working on novel user interfaces, some on intelligent agents, and others on developing sophisticated search tools like Lucene. Before we jump into action with code samples later in this chapter, we’ll give you a high-level picture of what Lucene is, what it is not, and how it came to be.
Lucene is a high-performance, scalable Information Retrieval (IR) library. It lets you add indexing and searching capabilities to your applications. Lucene is a mature, free, open-source project implemented in Java; it’s a member of the popular Apache Jakarta family of projects, licensed under the liberal Apache Software License. Lucene is currently, and has been for a few years, the most popular free Java IR library.
As you’ll soon discover, Lucene provides a simple yet powerful core API that requires minimal understanding of full-text indexing and searching. You need to learn about only a handful of its classes in order to start integrating Lucene into an application. Because Lucene is a Java library, it doesn’t make assumptions about what it indexes and searches, which gives it an advantage over a number of other search applications.
People new to Lucene often mistake it for a ready-to-use application like a file-search program, a web crawler, or a web site search engine. That isn’t what Lucene is: Lucene is a software library, a toolkit if you will, not a full-featured search application. It concerns itself with text indexing and searching, and it does those things very well. Lucene lets your application deal with business rules specific to its problem domain while hiding the complexity of indexing and searching implementation behind a simple-to-use API. You can think of Lucene as a layer that applications sit on top of, as depicted in figure 1.5.
A number of full-featured search applications have been built on top of Lucene. If you’re looking for something prebuilt or a framework for crawling, document handling, and searching, consult the Lucene Wiki “powered by” page (http://wiki.apache.org/jakarta-lucene/PoweredBy) for many options: Zilverline, SearchBlox, Nutch, LARM, and jSearch, to name a few. Case studies of both Nutch and SearchBlox are included in chapter 10.

Figure 1.5 A typical application integration with Lucene
Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information.
Similarly, with Lucene’s help you can index data stored in your databases, giving your users full-text search capabilities that many databases don’t provide. Once you integrate Lucene, users of your applications can make searches such as +George +Rice -eat -pudding, Apple -pie +Tiger, animal:monkey AND food:banana, and so on. With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your Wiki pages … the list goes on.
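To make the `+` and `-` prefixes in those sample searches concrete, here is a minimal sketch of how required and prohibited terms decide whether a piece of text matches. It is not Lucene's query parser: the class `PrefixQueryMatcher` and its `matches` method are hypothetical names invented for this illustration, the tokenizer is deliberately crude, and fielded clauses such as animal:monkey and operators such as AND are not handled.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class PrefixQueryMatcher {

    // Returns true if the text satisfies every +term, contains no -term,
    // and, when the query has no +term at all, contains at least one plain term.
    static boolean matches(String query, String text) {
        // Crude tokenizer: lowercase, split on anything that is not a letter or digit.
        Set<String> words = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }

        boolean hasRequired = false;
        boolean optionalHit = false;
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            if (clause.startsWith("+")) {
                // Required term: the text must contain it.
                hasRequired = true;
                if (!words.contains(clause.substring(1).toLowerCase(Locale.ROOT))) {
                    return false;
                }
            } else if (clause.startsWith("-")) {
                // Prohibited term: the text must not contain it.
                if (words.contains(clause.substring(1).toLowerCase(Locale.ROOT))) {
                    return false;
                }
            } else if (words.contains(clause.toLowerCase(Locale.ROOT))) {
                // Optional term: only matters when nothing is required.
                optionalHit = true;
            }
        }
        return hasRequired || optionalHit;
    }

    public static void main(String[] args) {
        System.out.println(matches("+George +Rice -eat -pudding", "George Rice likes cake"));
        System.out.println(matches("+George +Rice -eat -pudding", "George Rice will eat cake"));
        System.out.println(matches("Apple -pie +Tiger", "Tiger and apple"));
    }
}
```

A real search library evaluates such clauses against an index rather than against raw text, and it also uses the optional terms to rank results; this sketch only answers the yes-or-no question.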
Lucene was originally written by Doug Cutting; it was initially available for download from its home at the SourceForge web site. It joined the Apache Software Foundation’s Jakarta family of high-quality open source Java products in September 2001. With each release since then, the project has enjoyed increased visibility, attracting more users and developers. Lucene version 1.4 was released in July 2004, followed by a bug-fix release, 1.4.2, in early October. Table 1.1 shows Lucene’s release history.

Doug Cutting remains the main force behind Lucene, but more bright minds have joined the project since Lucene’s move under the Apache Jakarta umbrella. At the time of this writing, Lucene’s core team includes about half a dozen active developers, two of whom are authors of this book. In addition to the official project developers, Lucene has a fairly large and active technical user community that frequently contributes patches, bug fixes, and new features.
Who uses Lucene? Who doesn’t? In addition to those organizations mentioned on the “powered by” page on Lucene’s Wiki, a number of other large, well-known, multinational organizations are using Lucene. It provides searching capabilities for the Eclipse IDE, the Encyclopedia Britannica CD-ROM/DVD, FedEx, the Mayo Clinic, Hewlett-Packard, New Scientist magazine, Epiphany, MIT’s OpenCourseware and DSpace, Akamai’s EdgeComputing platform, and so on. Your name will be on this list soon, too.
One way to judge the success of open source software is by the number of times it’s been ported to other programming languages. Using this metric, Lucene is quite a success! Although the original Lucene is written in Java, as of this writing Lucene has been ported to Perl, Python, C++, and .NET, and some groundwork has been done to port it to Ruby. This is excellent news for developers who need to access Lucene indices from applications written in different languages. You can learn more about some of these ports in chapter 9.
At the heart of all search engines is the concept of indexing: processing the original data into a highly efficient cross-reference lookup in order to facilitate rapid searching. Let’s take a quick high-level look at both the indexing and searching processes.
Suppose you needed to search a large number of files, and you wanted to be able to find files that contained a certain word or a phrase. How would you go about writing a program to do this? A naïve approach would be to sequentially scan each file for the given word or phrase. This approach has a number of flaws, the most obvious of which is that it doesn’t scale to larger file sets or cases where files are very large. This is where indexing comes in: To search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index.
You can think of an index as a data structure that allows fast random access to words stored inside it. The concept behind it is analogous to an index at the end of a book, which lets you quickly locate pages that discuss certain topics. In the case of Lucene, an index is a specially designed data structure, typically stored on the file system as a set of index files. We cover the structure of index files in detail in appendix B, but for now just think of a Lucene index as a tool that allows quick word lookup.
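The idea of an index as a word-lookup structure can be sketched in a few lines of Java. The class below is a hypothetical toy, not Lucene's index format or API: it assumes documents are short in-memory strings, uses a crude tokenizer, and keeps everything in a `HashMap` rather than in index files. What it does show is the key property described above: a search becomes a single map lookup instead of a sequential scan of every document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Lucene.
class TinyInvertedIndex {
    // Maps each lowercased word to the sorted set of ids of documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Indexes a document and returns the id assigned to it (0, 1, 2, ...).
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        // Crude tokenizer: split on anything that is not a letter or digit.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Looks the word up directly; no document text is scanned at search time.
    Set<Integer> search(String word) {
        Set<Integer> hits = postings.get(word.toLowerCase(Locale.ROOT));
        return hits == null
                ? Collections.<Integer>emptySet()
                : Collections.unmodifiableSet(hits);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Lucene is a search library");
        index.add("An index makes search fast");
        index.add("Sequential scanning is slow");
        System.out.println("search  -> " + index.search("search"));
        System.out.println("slow    -> " + index.search("slow"));
        System.out.println("missing -> " + index.search("missing"));
    }
}
```

The cost of this design is paid up front: every document is tokenized once when it is added, so that each later lookup avoids touching the original text.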
Searching is the process of looking up words in an index to find documents where they appear. The quality of a search is typically described using precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out the irrelevant documents. However, you must consider a number of other factors when thinking about searching. We already mentioned speed and the ability to quickly search large quantities of text. Support for single and multiterm queries, phrase queries, wildcards, result ranking, and sorting are also important, as is a friendly syntax for entering those queries. Lucene’s powerful software library offers a number of search features, bells, and whistles—so many that we had to spread our search coverage over three chapters (chapters 3, 5, and 6).
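Precision and recall are commonly computed as simple ratios: precision is the share of returned documents that are relevant, and recall is the share of relevant documents that were returned. The sketch below works through one example under stated assumptions: the class `SearchQuality` is hypothetical, documents are identified by made-up file names, the set of relevant documents is assumed to be known in advance, and an empty denominator is treated as a score of 0.0 purely as a convention for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class SearchQuality {

    // Fraction of the returned documents that are actually relevant.
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0; // convention assumed for this sketch
        }
        return (double) overlap(retrieved, relevant) / retrieved.size();
    }

    // Fraction of the relevant documents that the search managed to return.
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // convention assumed for this sketch
        }
        return (double) overlap(retrieved, relevant) / relevant.size();
    }

    // Number of documents present in both sets.
    private static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        // The search returned four files; three files are truly relevant; two are in both sets.
        Set<String> retrieved = new HashSet<>(Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt"));
        Set<String> relevant = new HashSet<>(Arrays.asList("a.txt", "b.txt", "e.txt"));
        System.out.println("precision = " + precision(retrieved, relevant)); // 2 of 4 returned
        System.out.println("recall    = " + recall(retrieved, relevant));    // 2 of 3 relevant
    }
}
```

In this example the search returns two irrelevant files, which lowers precision to 2/4, and misses one relevant file, which lowers recall to 2/3.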
Let’s see Lucene in action. To do that, recall the problem of indexing and searching files, which we described in section 1.3.1. Furthermore, suppose you need to index and search files stored in a directory tree, not just in a single directory. To show you Lucene’s indexing and searching capabilities, we’ll use a pair of command-line applications: Indexer and Searcher. First we’ll index a directory tree containing text files; then we’ll search the created index.
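Before any indexing can happen, a program has to find the files in the directory tree. The following is a rough sketch of that first step only, not the Indexer application itself and not Lucene code. The class name `TextFileCollector` is hypothetical, and treating "text files" as files whose names end in `.txt` is an assumption made for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; not part of Lucene.
class TextFileCollector {

    // Walks the whole tree under root and returns every regular file ending in ".txt".
    static List<Path> collect(Path root) throws IOException {
        // Files.walk visits subdirectories recursively; the stream must be closed after use.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt")) // assumed filter
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Uses the first command-line argument as the root, or the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : collect(root)) {
            System.out.println(file);
        }
    }
}
```

Each path returned by such a walk would then be handed to the indexing step, one document per file.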
These example applications will familiarize you with Lucene’s API, its ease of use, and its power. The code listings are complete, ready-to-use command-line programs. If file indexing/searching is the problem you need to solve, then you can copy the code listings and tweak them to suit your needs. In the chapters that follow, we’ll describe each aspect of Lucene’s use in much greater detail.
Before we can search with Lucene, we need to build an index, so we start with our Indexer application.
Created: March 27, 2003
Revised: January 24, 2005
URL: http://webreference.com/programming/lucene/1