Tuesday, August 19, 2008

[Arch] Trends in Data-Management (aka Databases)

It is interesting to observe that relational databases have been challenged several times over the last decades, e.g. by object-oriented databases (gone) and XML databases (gone). Recently a new trend in data management seems to be emerging: databases, or rather data storage and management mechanisms, that follow a much looser paradigm than relational databases and often use a lightweight (frequently REST- or JSON-based) access strategy. This demand for new data-management strategies seems to have several reasons; some that come to mind:
  • Performance: in some cases complex queries are not required (or can be replaced by simple ones), so stores that are very fast at pure primary-key retrieval are sufficient
  • Complex data structures are not needed
  • ACID is not needed, i.e. mostly simple writes are performed but fast reads are necessary
  • Agile development seems to favor rather ad-hoc data structures over carefully planned ones (whether this is a good trend is a different story)
  • Distribution is important, and distributed relational databases are hard to get right
  • Access to rather document-oriented data structures is required
and probably many more. Even older tools like Apache Lucene (actually designed as a full-text search engine) are used in several projects as a kind of database replacement. This is particularly feasible when reading data is more important than writing it and no particular ACID requirements are in place. On top of that, Lucene provides a nice and rich query language.

Recently Amazon's EC2 platform has made a lot of waves as a distributed deployment platform for applications that have to scale significantly (there is, btw., an open-source project named Eucalyptus that implements part of the interfaces). Part of the Amazon toolset are two storage mechanisms: S3 and SimpleDB. For both, APIs are available to be used from applications. S3 is a mechanism for storing rather large chunks of data (like files or documents) and is organised in "buckets". SimpleDB, currently in beta, is a storage mechanism for more fine-grained data. With SimpleDB, chunks of data are stored as items, each identified by a primary key (item id) and consisting of a set of attribute/value pairs. To access SimpleDB, a WSDL interface description is available, as well as a sort of REST-style interface.
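To make the item/attribute model a bit more concrete, here is a minimal in-memory sketch in Python. It only illustrates the idea of "item id plus attribute/value pairs"; it is not Amazon's API, and the class name `ItemStore`, its methods and the sample data are made up for this example. Allowing several values per attribute is an assumption of the sketch.

```python
from collections import defaultdict


class ItemStore:
    """Toy model of an item/attribute store (hypothetical, not Amazon's API)."""

    def __init__(self):
        # item id -> attribute name -> set of values
        self._items = defaultdict(lambda: defaultdict(set))

    def put(self, item_id, **attributes):
        # Add attribute/value pairs to an item; existing values are kept.
        for name, value in attributes.items():
            self._items[item_id][name].add(value)

    def get(self, item_id):
        # Pure primary-key lookup: no joins, no query planner.
        if item_id not in self._items:
            return {}
        return {name: sorted(values)
                for name, values in self._items[item_id].items()}


store = ItemStore()
store.put("item-001", title="Trends in Data-Management", tag="databases")
store.put("item-001", tag="rest")
print(store.get("item-001"))
```

Note that there is no schema anywhere: two items in the same store may carry completely different attributes, which is exactly the "looser paradigm" mentioned above.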

The newest kid on the block (as it appears to me) is Apache's CouchDB, which is currently in the Apache Incubator. CouchDB seems to follow a strategy similar to Amazon's SimpleDB but focuses on REST/JSON-style access (here is a nice comparison between SimpleDB and CouchDB). CouchDB is (unfortunately, in my opinion) written in Erlang, which makes installation and usage rather difficult (at least in the Java environment that most Apache projects share). However, conceptually it seems quite interesting, and I suppose we will see more projects of that sort soon.
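To give an impression of what REST/JSON-style access looks like, here is a small Python sketch that only builds (and does not send) an HTTP request storing a JSON document under a key. The URL layout `base/database/document-id`, the host and port, the database name `blog` and the helper `build_put_document` are assumptions for illustration; check the CouchDB documentation for the actual API before relying on any of it.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_put_document(base_url, database, doc_id, document):
    """Build a PUT request carrying a JSON document (hypothetical helper)."""
    body = json.dumps(document).encode("utf-8")
    # Assumed URL layout: <base>/<database>/<document id>
    url = f"{base_url}/{quote(database, safe='')}/{quote(doc_id, safe='')}"
    return Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Host, port and names below are placeholders, not a real deployment.
req = build_put_document(
    "http://localhost:5984",
    "blog",
    "post-1",
    {"title": "Trends in Data-Management", "tags": ["databases", "rest"]},
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

The point is that the whole "database driver" shrinks to plain HTTP plus a JSON serializer, which every language already ships with. Passing `req` to `urllib.request.urlopen` would actually send it, provided a server is listening at that address.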

Ah, and speaking of marketing: projects like CouchDB explicitly state that they are not alternatives to relational databases :-) However, the first projects are appearing that provide RESTful interfaces for relational databases...

Btw.: does anyone know other projects in that domain that I have not seen yet?


Carl Rosenberger said...

You forgot a major trend in data management that is likely to have more impact than anything you mentioned: LINQ.
Eventually people want to write database code against a strong standard, so they can exchange implementations seamlessly.

Szabolcs Rozsnyai said...

In the context of the evolution of relational databases it is also interesting to observe Stonebraker's activities relating to Vertica (http://www.vertica.com/) and C-Store (http://db.lcs.mit.edu/projects/cstore/)