"Full-Text" Search? High Performance Queries!
Apache Lucene is one of the leading full-text indexing frameworks. Actually, from my point of view, the term full-text index/search is somewhat misleading. At least from the "marketing" point of view. For the application developer, Lucene is a framework providing a powerful and very performant query API. For certain, indexing is the technical basis for that, and Lucene allows to index hugh amounts of data: Data is organised in chunks called "documents" and a document can have several fields (name/value pairs). But most appealing for developers, Lucene provides a variety of query options like:
- Search for keywords ("Java")
- Search with wildcards ("Java*" )
- Fuzzy search ("Java~" finds also "Lava")
- Search of terms located close together ("Java Applicationserver~4" both terms within four words)
- Range queries ("500-700")
And additionally these queries can be applied to various fields plus used in combination, i.e. a query like this would be feasible: "Find all Books from Author name is approximiately X where the word Y is in the title, published by publisher C or D and that was published between 1998 and 2003".
Besides the Query API the Lucene framework also offers a query parser, that can be used for user interfaces: The parser eventually uses the abovementioned query API.
Lucenes main application "the usual suspects" are problems like: indexing web-sites, indexing PDF, Office documents on a file server, indexing Wikis (Wikipedia) and so on. Yet, regarding these powerful query options, Lucene is recently used in some projects replacing traditional databases. This can be a good idea, when large amounts of data have to be accessed efficiently, access is mostly read only (few changes/writes) and transactions are not important. Objects could be serialised, e.g., in XML and stored on disk and added to the Lucene index. This can be an easier procedure then using databases. There are drawbacks, of course, like the fact, that data is not as highly structured as in relational databases.
Szabolcs wrote more about this topic in his diploma thesis (see below) and will probably add some thoughts here in the BLOG soon.
Lucene "Multilingual"The success of the Lucene project, that originated as Java project in the Apache Software pool, is meanwhile followed by ports to other languages like: Perl, Python, C++, Ruby and .net. However, it should be noted, that not all port yet show the same quality as the Java version.
Some ExperiencesLuke
One additional tip for work with Lucene is
Luke:
Luke is a very handy tool, that allows to inspect Lucene indices. As they are stored in binary form, they are hardly accessible with other editors. And in developing Lucene applications, Luke is very helpful in checking, whether the index really looks as expected.
Learning LuceneUnfortunately the documentation of the Lucene project is weak. Particularly the material for new users. This is unfortunate, because the project is excellent and this lack of documentation might give a wrong impression. However, there is a good book, that is recommended for every serious Lucene user:
Lucene in Action from Otis Gospodnetic and Erik Hatcher. On the website of the book, also two chapters and some examples can be downloaded.
Remark: Sorry for my previous mistake: First I forgot to mention one of the two authors, and then I wrote the name of the other one wrong...I wrote a brief introduction to Lucene in the current
Infoweek.ch magazine (german/swiss).