Tuesday, October 17, 2006

[Tech] High Performance Queries with Apache Lucene

"Full-Text" Search? High Performance Queries!

Apache Lucene is one of the leading full-text indexing frameworks. Actually, from my point of view, the term full-text index/search is somewhat misleading. At least from the "marketing" point of view. For the application developer, Lucene is a framework providing a powerful and very performant query API. For certain, indexing is the technical basis for that, and Lucene allows to index hugh amounts of data: Data is organised in chunks called "documents" and a document can have several fields (name/value pairs). But most appealing for developers, Lucene provides a variety of query options like:
  • Search for keywords ("Java")
  • Search with wildcards ("Java*" )
  • Fuzzy search ("Java~" finds also "Lava")
  • Search of terms located close together ("Java Applicationserver~4" both terms within four words)
  • Range queries ("500-700")
And additionally these queries can be applied to various fields plus used in combination, i.e. a query like this would be feasible: "Find all Books from Author name is approximiately X where the word Y is in the title, published by publisher C or D and that was published between 1998 and 2003".

Besides the Query API the Lucene framework also offers a query parser, that can be used for user interfaces: The parser eventually uses the abovementioned query API.

Lucenes main application "the usual suspects" are problems like: indexing web-sites, indexing PDF, Office documents on a file server, indexing Wikis (Wikipedia) and so on. Yet, regarding these powerful query options, Lucene is recently used in some projects replacing traditional databases. This can be a good idea, when large amounts of data have to be accessed efficiently, access is mostly read only (few changes/writes) and transactions are not important. Objects could be serialised, e.g., in XML and stored on disk and added to the Lucene index. This can be an easier procedure then using databases. There are drawbacks, of course, like the fact, that data is not as highly structured as in relational databases.

Szabolcs wrote more about this topic in his diploma thesis (see below) and will probably add some thoughts here in the BLOG soon.

Lucene "Multilingual"

The success of the Lucene project, that originated as Java project in the Apache Software pool, is meanwhile followed by ports to other languages like: Perl, Python, C++, Ruby and .net. However, it should be noted, that not all port yet show the same quality as the Java version.

Some Experiences

In our work, we used and still use Lucene in several projects with great success. E.g., Franz Inselkammer used Lucene for his diploma thesis "Question based Knowledge Management in a Groupware Environment". Szabolcs Rozsnyai's diploma thesis "Efficient indexing and searching in correlated business event streams" was the foundation for our ongoing research in Event-Mining strategies.


One additional tip for work with Lucene is Luke:

Luke is a very handy tool, that allows to inspect Lucene indices. As they are stored in binary form, they are hardly accessible with other editors. And in developing Lucene applications, Luke is very helpful in checking, whether the index really looks as expected.

Learning Lucene

Unfortunately the documentation of the Lucene project is weak. Particularly the material for new users. This is unfortunate, because the project is excellent and this lack of documentation might give a wrong impression. However, there is a good book, that is recommended for every serious Lucene user: Lucene in Action from Otis Gospodnetic and Erik Hatcher. On the website of the book, also two chapters and some examples can be downloaded.

Remark: Sorry for my previous mistake: First I forgot to mention one of the two authors, and then I wrote the name of the other one wrong...

I wrote a brief introduction to Lucene in the current Infoweek.ch magazine (german/swiss).


Anonymous said...

Hi. Lucene in Action has two authors. This is the other author of Lucene in Action - Otis Gospodnetic. Also, Eric -> Erik - don't let Erik see that typo! :)

Alexander Schatten said...

Thank you for this comment: I am very sorry for this mistake. I hope everything is correct now!

Member said...

I have my search currently on mysql.
Have been asked to use Lucene to improve the search responce.
Please help me by suggesting where i can find more info on:
1. Why lucene ?
2. What else other than lucene ?
Please note i want to search mainly on structured data
and too many hits on a large database.
Thanks & Regards,
- Ajay