Wednesday, September 13, 2006

[Tech] Derby's next round

Since the "Open Source" move from IBMs Cloudescape to Apache Derby, there were quite some discussions about which would be the Java Database of choice. Currently mainly two systems are prominent and in discussion: hsqldb and Derby.
Java Databases?

The foremost question for many developers/users is, why anyone would want a relational Database written in Java. Actually there are several reasons, mostly interesting for Java-based projects:
  • Databases like Derby are meanwhile very mature and can compete with non-Java systems
  • Java-based databases integrate smoothlessly into Java projects; also using O/R mapping tools like Castor or Hibernate: In this case, the complete application including persistance stack is in Java.
  • Different operation modes (Server, Embedded)
  • In the embedded mode, no network communication takes place, with respective consequences for performance and security (no connection to database from outside the application is possible)
  • No need to install or configure the database at the customer (in embedded mode)!
This makes systems like Derby or hsqldb a very good choice for development, testing and easy installation of the Java application (e.g., potential users can easily download a complete configured webapplication including database). This is helpful for customer evaluation-purposes. But systems like Derby are mature and powerful enough to also serve for many classes of production systems.

Performance

Performance considerations are always very difficult issues in evaluating (relational) databases. Simple tests typically give misleading results. Just an example:

hsqldb in default setup is compared to PostgreSQL (or Derby) in default setup, and indicates a far higher performance.

A second (and third look) shows: hsqldb uses (per default)
  • in-memory tables: all data is kept in memory
  • might be used embedded, hence no network overhead takes place
  • has enabled huge write delay, meaning, that even committed transactions are not persisted at the moment, but up to one minute later!
  • hsqldb has no proper implementation of the database ACID criteria, e.g., allows dirty read
This is just an indication of issues. Of course a system can be very fast, when onle once a minute I/O is taking place, transaction isolation is not implemented and all data is kept in memory and finally no network activity is involved. This is not to say, that hsqldb is a bad choice, one just has to be aware of the limits.

Additionally, the access frameworks have to be taken into consideration: nowadays developers hardly use JDBC directly, but use technologies like iBatis, Hibernate or Castor, and this again has significant impact on performace (using Caches or not, ...).

Generally spoken: Java databases are very fast when used in embedded mode and even faster when in-memory tables are used (these are not yet available for Derby though). As soon, as these databases are used in Server mode, they typically play in the same performance region as other known database systems. But significant differences can be observed depending on specific usage scenarios.

News in Derby 10.1.3.1

This release is mainly a maintenance release and offers and increases stability and reliability, better query performance as well as updated documentation.

News in Derby 10.2

The upcoming Version 10.2. will have some significant improvements (from the Derby Wiki):
  • Scrollable Updatable Result Sets
  • JDBC4
  • Grant/Revoke
  • Online Backup
  • Stronger Network Authentication
Very important is the implementation of the missing Grant/Revoke statements combined with better network authentication. This was a significant lack of functionality with recent Derby versions particularly important when using Derby in server mode.

This release should be out yet in September, so stay tuned!

References

To get information about Apache Derby, read the Website and the Wiki (!) for recent activities.

Check out the very interesting BLOG entry: David van Couvering's Blog: About ACIDity and Java Databases.

Additionally I might refer to my recent article (german) in the Swiss magazine: Infoweek.ch discussing Apache Derby in some more details as well as one of my articles in german iX magazine 2005/5: in this article I discuss and compare with my colleage Marco Zapletal hsqldb with Apache Derby. However, it should be noted, that since then some new versions of Derby were released and not all statement are up-to-date according to these new versions.

3 comments:

Alexander Schatten said...

What Rupert said about embedded and lightweight databases is absolutly correct. However, with the new Derby releases we came some significant steps ahead.

Derby is (for the Java community) the functional most complete open source Java database (particularly compared to hsqldb). And can even replace existing "full" DBMS systems in many occasions.

Also new features are on the way like the upcoming XML and XQuery support.

It's true, Derby is "just" a next step on the way, however, I feel and important one, particularly also with the broad commercial support (Sun, IBM).

Anonymous said...

embedded databases are in my opinion extremely valuable for so called "smart client" applications. it seems to be a more java focused blog, so fancy msdn terms might need an explanation: smart clients are client side applications that are normally connected to some server applications. when the connection fails (eg. while traveling or beeing at a clients place) the smart client swiches to an offline mode where basically the same but maybe a bit restricted functionality is available. to realize this online/offline capabilities a client side datastore is normally used.

in my opinion embedded databases fit perfectely in this smart client world, where temporary, slim and fast databases are needed. to support these scenarios embedded databases should not be too much try to copy the functionality of the big ones, but concentrate on data replication functionalities: think of a database layer that connects to your slim database, which passes the commands through to the super duper database if online, caches the results for faster further queries and serves the client if in offline mode. as soon as the connection is established again synchronization takes place.

--benedikt

Alexander Schatten said...

I want to add some recent experiences: I still like Derby in many aspects, but I also had some very unfortunate experiences, that are hard to believe: Derby has still significant lacks in some SQL expressions, particularly in the ALTER TABLE statement.

At the moment Derby does not support renaming of tables and fields and what is even worse, dropping of fields (sic)!!

The terrible thing is, that one never would believe that a database that is in many party very mature lacks such fundamental functionality.

Now one exampe: I needed to change the field type of one field: Changing via ALTER TABLE is not possible. Adding a new field, moving the data and dropping the old is not possible (see above). So the only solution (discussed in the mailing list) would be:

(1) creating a new table
(2) removing all constraints (foreign keys...) where the original table is involved (!!) hell a lot of issues
(3) moving the data from the old to the new table with the changed column
(4) Drop the old table
(5) Rename the new table
(6) Bring all constraints back in place

So 6 partly quite complex steps, where most other databases have one statement. This is clearly not acceptable.

Fazit: I am quite unhappy about this situation and really hope, that the Derby guys will fix this soon. From my pont of view this is currently a show stopper, and I cannot recommend to use Derby in many applications under this condition. I personally switched back the hsqldb with all issues this database has.