Download | Blog | Forums | Support | Contact us | Search  
Xcalia delivers Data Integration Software for SOA compliant data architecture data access software

 

Picking the Right Inheritance Strategy

March 2005, revised in 2007

In this paper, by Eric Samson, CTO of Xcalia, we'll go through several common strategies to map inheritance to relational databases, and the results might surprise you!


Overview

Most Java developers who use data access products and O/R mapping tools know that mapping inheritance is complex. A commonly held view is that products featuring the “per-hierarchy” mapping strategy are the most efficient and perform best. This paper will briefly review the different inheritance mapping strategies and present a realistic example that will allow everyone to better understand which mapping strategy is really best suited to which situation. We believe that most readers will be surprised by the results.

Inheritance in object models, benefits and limitations

Inheritance is one of the most advanced features of object models yet it is one of the most complicated to reproduce in a database. When object technologies were first being adopted, inheritance was overused. At that time, object programmers defined large, deep inheritance hierarchies.

Inheritance allows one to program systems by “generalization/specialization”:

  • you can put all common states and behaviors in a parent class,
  • you can refine the states and behavior by adding more specific details in the subclasses, and
  • classes can inherit from a single superclass (like in Smalltalk or Java) or from multiple superclasses (like in C++).

Since multiple inheritance can result in many programming drawbacks, it is less frequently used. In most current industrial object languages, specialization is limited to overloading existing behaviors or adding new ones, not removing existing ones or limiting them. For instance, you cannot change the type of a given attribute in a subclass. Usually, inheritance means class inheritance, where you can refine the behavior of one class in subclasses. Some research projects propose highly reflexive models where inheritance can even be used on objects, not just on classes.

Inheritance provides a powerful programming mechanism where the initial effort forces you to be more structured, which produces important benefits for all subsequent projects:

  • creating a new application from an existing one
  • sharing common information at the right level between several projects within the organization, to avoid reinventing the wheel at the beginning of each new project
  • structuring global information in business frameworks

However, designing class hierarchies that are too large can lead to some software issues because the defined model can become too static. Over time, engineers naturally reduce the size of class hierarchies, and use delegation instead of inheritance, which allows them to replace “is a kind of” relationships by “plays the role of” relationships. This leads to more dynamic systems where roles can evolve over time.

It is important to point out that inheritance tends to be in conflict with encapsulation within the hierarchy, as information from the superclasses is made directly available to the subclasses. It is the opinion of this author, that inheritance should take precedence over encapsulation, allowing subclasses to have direct access to data in superclasses. Nevertheless, Java allows this kind of decision to be made by the individual programmer with the protected and private keywords.

That said, when used correctly, inheritance is a very efficient mechanism and thus is massively present in most Java applications today.

Mapping inheritance

Aside from object databases, the notion of inheritance is not available in most data stores. For instance, there is no straightforward way to store inherited classes in a relational database. Indeed, managing inheritance in a RDBMS is one of the goals of a data access layer. As with most of the features in the data access layer, performance and flexibility are also important for inheritance management.

Let us now examine the three well-known inheritance mapping strategies in a relational model.

Per-class inheritance mapping

The first strategy, which we will refer to as “per-class”, tends to reproduce the object model in the database. In this strategy, there is one table per class in the hierarchy. Each table contains the attributes defined at each class level, and each table associated with a subclass contains a foreign key to the table associated with the superclass. There is one table per class even if the parent classes are abstract. This strategy is also known as “vertical mapping” or “pure vertical mapping”.

Given the following object model,

the associated database model could be as follows, depending on the various mapping options that are activated:

In this example, Joe is an employee and Jack is a student.

Note: XIC does not require that all subclasses to use the same column to establish the link to the parent class.

The benefits of this mapping strategy are that

  • it reflects the class hierarchy,
  • it is as normalized as possible, with no duplication of information but the keys,
  • maintenance is easy, as you simply add the equivalent column in the associated table in order to add a new attribute in a class, and
  • it simplifies the use of integrity constraints (foreign keys, uniqueness, etc.).

Conversely, it has some detriments:

  • each creation, update and deletion of a single object implies several database operations in multiple tables (except when you manipulate instances of the root class, if it is not abstract),
  • as each object is spread over multiple tables it uses more locks and database resources in general,
  • because database constraints are typically used to insure referential integrity of such schema, write operations involve more database resources to enforce these constraints,
  • each read of one instance of a subclass implies a query with multiple joins to reload the whole object with all its inherited attributes, and
  • each time one queries a non-leaf class, it requires several queries on all its subclasses, each of them with multiple joins.

For instance, finding all Person objects will generate a query to retrieve all persons, then all employees, and finally all students. When navigating from an object that has a reference to a Person object, we need to be able to determine the actual type of that person, as it could be a person, an employee or a student.

Identifying the actual subclass

This problem of identifying the actual subclass can be solved in four different ways. When the object identifiers (OIDs) are managed by the data access layer,

  • the object's type can be stored within the OID itself (the option we chose in the previous example), or
  • it could be stored in an intermediate typing table.

If you need to access a preexisting schema where primary keys have already been defined arbitrarily,

  • you can have an extra column in the PERSON table that will be used to store the actual class type of each object, or
  • you can use a depth-first algorithm to dynamically find the object's actual type.

The last algorithm is obviously the most resource-consuming as it implies queries on all subclasses; thus, it should only be used when imposed by an existing schema. A good data access product should be able to deal with all of these situations, as all of them could be necessary in a real application.

Per-hierarchy inheritance mapping

The second strategy, which we will refer to as “per-hierarchy”, tends to limit the number of statements needed to store and retrieve objects. With this strategy there is only one single table for the whole class hierarchy, this table contains all attributes defined in all classes of the hierarchy. This strategy is also known as “flat mapping” or “filtered mapping”.

Using the same object model, the associated database model could be as follows (depending on the various mapping options that are activated):

The benefits of this mapping strategy are that

  • only one table needs to be manipulated when objects are created, updated, deleted,
  • retrieving objects at any level in the hierarchy only implies one SELECT statement,
  • from the point of view of the data access layer vendor, it is the easiest to implement, as only one table is involved in queries on the hierarchy (see Scott Ambler's article),
  • maintenance is easy, as you simply add the equivalent column in the associated table in order to add an attribute to a class, and
  • it is easy to define referential integrity constraints (foreign keys).

Conversely, it has some detriments.

  • There is an enormous waste of space in the database, as many columns are left unused for each row.
    For instance, in our previous example, all columns used to store specific information about student are left empty for all persons but students.
  • It leads to the creation of huge tables with too many rows and too many columns.
    This has an important impact on
    • the space reclamation algorithm,
    • index maintenance,
    • the caching system, and
    • the log system in the database server.
  • It quickly saturates the network layer.
    Each row sent through the network contains too many values, most of them being null but nevertheless taking up space.
  • Each query of a non-leaf class implies once again one query with many “OR” logical operators in the filter.
    For instance, if you want to find all employees, a filter like WHERE TYPE = ‘EMPLOYEE’ OR TYPE = ‘MANAGER’ OR TYPE = … must be used to retrieve all instances of subclasses of the Employee class.
  • As seen with the per-class strategy, when navigating from an object that has a reference to a Person object, we need to be able to determine the actual type of that person, as it could be a person, an employee or a student.
    The same four methods allow us to deal with this problem; in the example above we chose the extra typing column.
  • It is not easy to define NOT NULL or UNIQUE columns (except at the root level).
  • Because many classes are stored in the same table, it could raise concurrency issues in situations with many users accessing the same objects.

Per-concrete-class inheritance mapping

The third and last strategy, which we will refer to as “per-concrete-class”, tends to limit both the space and the number of statements needed to store objects. With this strategy, there is one table for each concrete class in the hierarchy. This table contains all the attributes defined in the class itself plus all the attributes inherited from all the superclasses. This strategy is also known as “horizontal mapping”.

Given the previously introduced object model, the associated database model could be as follows (depending on the various mapping options that are activated):

Depending on whether Person is an abstract or a concrete class, it may or may not be associated to a table. Please note that a product like XIC lets you define different column names in tables for subclasses for the same inherited attributes. This kind of flexibility is mandatory when dealing with existing schemas.

The benefits of this mapping strategy are that

  • it is optimized in terms of space in the database,
  • only one table needs to be manipulated when objects are created, updated, deleted,
  • retrieving objects at any level in the hierarchy requires only one SELECT statement, and
  • actual object typing during navigation is easy and efficient as each class is only stored in one table.

Conversely, its detriments are significant.

  • From the point of view of the data access layer vendor, it is the most difficult to implement (thus less well-known and understood), as the generation of SQL statements for queries is more complicated.
  • Each query of a non-leaf or abstract class implies several queries each of them with multiple joins.
    For instance, finding all “Person” objects will generate a query to retrieve all persons, then all employees and all students.
  • You must decide how to manage relationships from a Person object to another class.
    You can define the relationship at each level, or you can define it at the root level, in which case you must be able to create tables even for abstract classes. This could raise some very complex issues when such relationships are managed through reverse foreign keys.
Note: XIC supports all these useful options.

Comparing the three mapping strategies

Anyone can easily understand that the per-class strategy shows very poor performance due to the multiple tables that must be used for object storage. Performance suffers in terms of object manipulation and queries. In any case, it is mandatory to support this kind of mapping because some existing models use it. I have even seen customers claiming it could be very interesting under the following circumstances:

  • when each subclass defines only a few additional attributes, or
  • when most of the business queries are done at the root level, and do not need the extra information defined in the subclasses.

The big question is which of the two remaining strategies makes the most sense to developers and architects. Most O/R mapping tools initially focused on the per-hierarchy strategy probably because it is the easiest to implement and because many developers think it is the most efficient in terms of queries.

To better illustrate the pros and cons of these strategies, let me introduce the following theoretical object model.

  • ClassA has 6 string attributes and 4 float attributes, named a0 to a9.
  • ClassA1 is a subclass of ClassA, and has 6 string attributes and 4 float attributes, named a10 to a19.
  • ClassA2 is a subclass of ClassA, and has 6 string attributes and 4 float attributes, named a20 to a29.
  • ClassA21 is a subclass of ClassA2, and has 6 string attributes and 4 float attributes, named a210 to a219.
  • ClassA22 is a subclass of ClassA2, and has 6 string attributes and 4 float attributes, named a220 to a229.

Comparing the space in the database

In order to stay independent from how different relational database engines are actually managing space allocation on disk for data storage, we just compare the number of cells that are created in the database with each inheritance strategy. A database cell is defined to be the intersection between a row and a column. The total number of cells of a table is its number of rows multiplied by its number of columns. A cell used is a cell that has a non-null value.


These figures are computed for 100,000 instances in each class. Here is a detail of the formulas used above:

  • Per-concrete-class cell count = instanceCount*(10+1) + 2*instanceCount *(20+1) + 2*instanceCount *(30+1)
    There is one additional column is the technical column for OID storage.

  • Per-hierarchy:
    • Cell count = 5*instanceCount*(50+2)
    • Cells used = instanceCount*(10+2) + 2* instanceCount*(20+2) + 2*instanceCount*(30+2)
      There are 2 additional technical columns for OID storage and typing.

  • Per-class cell count = 5*instanceCount*(10+2) + 4*instanceCount*(10+2) + 2*instanceCount*(10+2)
    There are 2 additional technical columns for OID storage and typing.

Even on a simple class hierarchy we can see the waste of space implied by the per-hierarchy strategy. Please note the per-concrete-class strategy is even more optimized than the per-class one.

Comparing the performance of manipulation operations

The population process fills the database. It creates instances in the database, based on the selected mapping strategy and on parameters. We ran it several times changing the mapping strategies and the number of instances.

Here are the results in milliseconds for a database progressively populated with 5000, 50000 and 500000 objects (results based on an average of three runs):


As expected, the per-class strategy is the least efficient. You can also see the per-hierarchy strategy is 50% slower than the per-concrete-class strategy. This becomes more evident as the volume or the number of subclasses increases, but is difficult to notice in a typical benchmark configuration when volumes are small and the number of subclasses is limited to three. This can lead to misleading conclusions at the all-important product evaluation stage.

The result demonstrated above would be exactly the same for UPDATE and DELETE operations.

Comparing the performance of queries

If we push the comparison even further than most benchmarks by querying the root class and then querying on the subclasses, we will see that the performance difference is further entrenched.

Queries on the root class (ClassA)

Here are the results in milliseconds for a database progressively populated with 5000, 50000 and 500000 objects (results based on an average of three runs):


Once again, as expected, the performance of the per-class strategy is very poor. More surprisingly, however, is the fact that the performance of the other two strategies is almost identical.

Remember that in the case of the per-hierarchy algorithm, there is only one query being executed while in the per-concrete-class, 5 queries are executed (one for each class in the hierarchy).

To explain these results, one should keep in mind the huge impact on the database server and the network layer imposed by the per-hierarchy strategy. The reason is that from the database point of view a single big query (with 50 attributes in the result set projection) on one huge table is not faster than 5 small queries (with 10, 20 or 30 attributes in the result set projection) on 5 smaller tables.

Please note that among vendor solutions, XIC best optimizes the per-hierarchy algorithm:

  • No typing filter is used as we retrieve all the instances in the hierarchy. As the query is defined at the root of the hierarchy, XIC detects that special situation and generates no WHERE clause to retrieve all rows from the table instead of specifying the 5 classes to retrieve in the WHERE clause.
  • XIC retrieves all fields of the currently active fetch groups of all classes, so that a single query can be used. Then, XIC populates each instance with the correct set of attributes based on the fetch group of each instance's class.

At this point most lab benchmarks will have stopped, but we are going to push the comparison even further.

Queries on one intermediate class (ClassA2)

Here are the results in milliseconds for a database progressively populated with 5000, 50000 and 500000 objects (results based on an average of three runs):


The performance of the per-class strategy is still the poorest, but, as you can see, the performance of the per-hierarchy strategy noticeably decreases when compared to the per-concrete strategy.

Remember that this is because the per-concrete-class case now requires only 3 simple queries, while the per-hierarchy strategy now requires one query with a complex filter containing 2 “OR” boolean operators. This is imposed because we need to filter the right subclasses (Class2, Class21 and Class22 in this particular case).

Please note once again that XIC fully optimizes the per-hierarchy algorithm:

  • The typing filter is optimized because instead of sending 3 criteria linked by 2 OR boolean operators we can use a LIKE% operator when hierarchical values are used in the discriminating “TYPE” column.
  • The discriminators for classes Class2, Class21, Class22 are "A2", "A21" and "A22". They mimic the hierarchical organization of their classes.
    Instead of generating a filter clause like "WHERE classtype = 'A2' OR classtype = 'A21' or classtype = 'A22'", XIC generates the more optimized "WHERE classtype LIKE 'A2%'" clause.

Queries on one leaf class (ClassA21)

Here are the results in milliseconds for a database progressively populated with 5000, 50000 and 500000 objects (results based on an average of three runs):


In this case, the performance of the per-hierarchy strategy becomes as poor as the performance of the per-class strategy. It would be even worse should you have more subclasses in the hierarchy or more instances per class.

Why not see for yourself?

To better illustrate the pros and cons of these strategies, we have created a project that allows you to discover the implications on your application and on your data store.

We encourage anyone interested in this study and the associated results to go to http://www.xcalia.com/downloads where you can download XIC. Once you've downloaded and installed XIC,

download the Java project.

We used a MySQL engine to run our tests, but you can set the configuration parameters and run it over your preferred database engine.

There are only two files that you need to edit in order to run the project on your system:

  • build.properties, containing settings for your local environment, and
  • params.properties, containing the number of instances to use in your tests.

The application is delivered with an ANT build file. Once you've configured the project for your local environment, simply check the values in params.properties, then run

  • ant perClass to run the per-class inheritance test,
  • ant perConcreteClass to run the per-concrete-class inheritance test, or
  • ant perHierarchy to run the per-hierarchy inheritance test.

Conclusion

Choosing the correct mapping strategy for inheritance is a key step in the design of most Java applications as it can have an enormous impact on performance when database size becomes critical. It is of primary importance to check that all flexibility options, as seen above are available in the mapping definition.

As opposed to widely held belief, the best choice by default is the per-concrete-class strategy, as it optimizes the disk usage and it reduces the load on the network layer and the database server. This same conclusion was reached by Scott Ambler a few years ago in his now famous article about "The design of a robust persistence layer for relational databases".

Because database resources are expensive, it is critical to check how each data persistence vendor implements each inheritance strategy. Products that focus on a per-hierarchy strategy might look good during simple benchmarks but can lead to disastrous situations in production.

Xcalia is a major SOA vendor addressing enterprise IT and business requirements regarding business oriented enterprise data access,