Rate this page del.icio.us  Digg slashdot StumbleUpon

Design Issues in High-Performance Transactional Applications using Java and Linux

by the editorial team

This tip article is edited and republished from the original source.

by Andrew Oliver

In Java Performance Tuning on Linux Servers, Andrew Oliver covered some general and network performance tuning issues for JEE apps running on Linux. He now looks at issues for applications in the data layer. Lower level database issues, the higher level server application persistence model, and the interactions between the two are discussed. As before, it is assumed the application is “transactional” in nature. Most web applications, including client server and business applications fall in this space. (Messaging, analytical and ETL-type applications require different strategies, and different choices).

Database Issues

Choosing a database

When designing a database, we need to be aware of differences in how databases handle concurrency. The complicated term multi-version concurrency control (MVCC) describes a feature that is needed for most J2EE, web, transactional, and operational type systems (as opposed to analytical-type systems). Another option is traditional row and page locking.

MVCC conceptually means that when data is SELECTed into a running transaction, a copy is made and subsequent calls involving that data work on the copy. Traditionally, databases offered pessimistic row and page locks, which means that concurrently running transactions attempting to work on the same data were blocked until the other transaction was completed. Most MVCC databases also allow specific pessimistic locking via “SELECT FOR UPDATE” statements.

A second, but less important issue, is whether the database supports native operating system threads (NPTL) or processes. The effects of this may be marginalized by the use of connection pooling, but may still have some impact on the overall scale of an application. Generally speaking, operating system threads scale better than processes. Processes are more resource intensive and the use of shared memory for inter-process communication limits performance and scalability by comparison to the “pass by reference” communication in threads. However, shared memory is certainly more efficient for IPC than sockets.

A third issue is Binary Large Object (BLOB) support. In general, high performance transactional applications do not require BLOBs, but if you use Java Messaging Service (JMS)-backed by a database store anywhere in your application, BLOBs may become an issue. PostgreSQL, and recent versions of Oracle Database, support BLOBs adequately. MySQL has very poor support for BLOBs as it reads them fully into memory, regardless of the use of getBinaryStream() (it passes a ByteArrayInputStream). The use of emulated SQL-locaters eliminates part of the problem, but at the cost of creating a severe performance problem (there are ways to mitigate this, but that is another topic).




Database

MVCC

Threading

PostgreSQL

Fully Supported

No (Multi-process/shared-memory)

Oracle

Fully Supported

Mixed (more or less processes)

MySQL (InnoDB)

Supported for Reads/Row-locking for writes

NPTL (native threads)

DB2

Not Supported (traditional row/page locks for both reads and writes)

No (Multi-process/shared-memory)

You may also want to look at other issues, such as the database’s support for balancing the data among multiple disks and drive controllers (another side-topic), and the database’s license. Both Oracle and DB2 are fairly expensive. MySQL is either free or fairly inexpensive depending on whether you use the community or enterprise version. PostgreSQL is free, and BSD-licensed (no reciprocation clause, like Apache HTTPD), although commercial editions are available with support through various vendors. Benchmarks for various databases say what the vendor wants them to say. For this type of software, the decision about the best approach to take is not always clear-cut. In the field, I’ve had customers who managed to use a less-well performing free database with an additional processor board- and save substantial sums on license fees, as well as maintenance. On the other hand, more expensive databases tend to have better high-availability and hot-backup options, and none of the popular free databases even claim to have full XA (the 2 phase commit/transaction recovery protocol) support.

Database Design

Database design is a hotly contended issue in many organizations. Schema designs tend to run the gamut from chaos, to congressional committees. From a performance perspective it is best if the database schema closely matches the operational system’s use of the data, and each table provides version columns. Generally, synthetic keys (sequences, auto-number) are preferred to business keys as primary keys. Business keys tend to be imperfect and often require composites that lead to rigid schemas that are unable to evolve, and are difficult to query across joins.

It is also best if analytical schemas (such as those used for monthly reporting systems) are kept separate from your operational system schemas. If “real-time” data reporting is required, then consider utilizing messaging or replication solutions. Note that real-time data often isn’t really required. “Not really old data” may be good enough, provided that you manage your load carefully when providing frequent feeds to data marts and warehouses.

Transactions

Many Java developers erroneously believe that transactions are only for writes. But transactions may also may govern the consistency of your reads - how they do this depends upon the above-discussed locking strategy, as well as the isolation level. Java Database Connectivity (JDBC) allows you to configure this on a per-connection basis, and application servers generally allow you to configure this for connection pools. The isolation levels are as follows:


Isolation Level

Definition

SERIALIZABLE

Transactions occur in a completely isolated fashion, or maintain the illusion of this. (Table/Range locks)

REPEATABLE READ

Read locks (row or page), but no range/table locks.

READ COMMITTED

Data selected in one transaction can be concurrently modified by another.

READ UNCOMMITTED

Dirty reads are allowed (transactions can see data changed in another transaction)

Consider the following problems:


Problem

Definition

Phantom Reads

The collection of rows returned by the same query executed twice in the same transaction is different.

Non-Repeatable Reads

Transaction “A” selects data after which Transaction “B” pdates that data and commits, following which Transaction “A” reissues the same select and receives a different value.

Dirty Reads

Transaction “A” updates a row but does not commit. Transaction “B” selects that data. (consider the implication if transaction A then rolls back!)

Not all databases support all of these ANSI/ISO isolation levels, and worse, their implementation may be different depending on whether MVCC is used/supported or not, as the ANSI/ISO isolation levels were standardized with traditional row/page locking in mind. The default commit level for most of J2EE computing is READ COMMITTED or REPEATABLE READ (depending on the database). It is important to understand the implications of this for statements such as: select count(*) from current_events where bucket=1 Perhaps you only want to allow 100 current_events. Such logic would require range or table locks (such as provided by the SERIALIZABLE isolation level) - however, it would be difficult to scale, and there are probably other ways to accomplish this without locking (such as using timestamps, and ignoring the extras).

Optimistic Locking

It is difficult to scale highly-concurrent systems with pessimistic locking. On the other hand, many Java systems have been written without thinking of the error effects of concurrency altogether!

Optimistic locking means that we allow two transactions to proceed under the assumption that they will likely work out okay and that if they do not, we have some way to resolve the resulting error.

Optimistic concurrency error



Transaction A

Transaction B

Result

update customer set state=’NC’, set city = ‘Durham’ where id =
1

city = ?, state = ?

update customer set city=’Birmingham’, set state= ‘AL’ where id
= 1

city = ? state = ?

commit

city = Durham, state = NC

commit

city = Birmingham, state=AL

Here, we have two separate transactions attempting to update the same customer for two separate places. In this case, the second transaction (”B”) succeeded, resulting in simple data loss. Perhaps customers send us address updates when they move via the mail. In this “outside” case we got two updates from the same customer at the same time. Ideally the second one would have failed and some other business logic would have resolved the conflict.

Optimistic Concurrency in Action



Transaction A

Transaction B

Result

update customer set version=2, set state=’NC’, set city = ‘Durham’ where id = 1 and version = 1

city = ?, state = ?, version = 1

update customer set version=2, set city=’Birmingham’, set state= ‘AL’ where id = 1 and version = 1

city = ?, state = ?, version = 1

commit

city = Durham, state = NC, version = 2

commit

city = Durham, state = NC, version = 2 (”0 rows updated” reported by JDBC)

In this second scenario, the first commit won and we can detect the failure of the second commit and either contact the customer to figure out which was correct, or use whatever business logic we have to determine this. This is even more likely to be a problem in an interactive application where two users have permission to update the same data. The good news is that in Java you do not need to do this by hand. Most OR mapping solutions support this automatically.

While optimistic locking is not dependent upon MVCC, it is easier to pull off with MVCC. Optimistic locking does not depend on version columns. Some databases support MVCC, detect collisions, and fail the second update automatically. This isn’t, however, the default policy in Oracle (and many others) at the “read committed” isolation level. It will allow both transactions to succeed (regardless of the conflict). Secondly, collisions that occur from data being passed OUTSIDE of a transaction are not automatically detected by any database MVCC implementation. With all of these considerations it isrecommended that operational systems include version columns in their schema. A proper version should be a simple numeric, for simplicity. It is acceptable for it to rollover by some policy so long as rollovers could not overlap the length of an application transaction.

Most highly interactive, B2B, and messaging applications have some kind of concept of an application transaction. The simplest way to conceive this is to consider multi-page forms in a standard web application (Web 1.0 that is). In a multi-page form interaction, data is retrieved from the database, modified by users, and eventually results in a series of inserts, updates and deletes. The application would not scale if transactions were kept open for the entire interaction. Application transactions may span multiple request-response cycles where database transactions never should.

Application Transactions with OR-Mapping

As seen above, it would be possible to implement this strategy using standard SQL and the JDBC APIs. However, it is usually preferable in larger scale business systems to use an Object Relational Mapping framework such as Hibernate. Using O-R mapping can greatly simplify our code, but does introduce its own layer of configuration and performance cost. That said, many of the things that are most costly in a persistence layer are things that a proper EE application would do anyhow.

Because it is the most popular, I will focus on Hibernate, though the same technique should be available in other similar frameworks - especially those supporting the new Java Persistance API (JPA) standard. Let’s say that we are writing an issue tracker for a power company’s equipment out in the field. The table structure might resemble this:

For the sake of this example, we’ll define two types of agents at the power company who interact with this data: personnel at the field repair office and phone agents who take customer calls. We need to handle the fact that someone is going to pull down an “issue” on their handheld device, update it (potentially off-line), and sync it later. We also need to handle that other agents might look at the same issue when handling a customer call at the same time.

At the field office we have a basic web application. In the field each agent has a Java(tm) application that uses Java RMI (Remote Method Invocation) running on their handheld device. How do we deal with concurrent modifications? Holding open an actual database transaction is totally infeasible as holding a database connection for days for each handheld device for each issue would not scale, or even work. Yet we still have transactions (and possibly more than one type of object enrolled). This long running transaction is our “application transaction” which spans multiple database transactions. We can implement this manually using the techniques above, or we can use a tool such as Hibernate.

Hibernate gives us two ways to handle “application transactions.” The first is a “detached session” (referring to Hibernate’s Session object) which allows us to disconnect our “persistent context” and store it somewhere (such as the HTTP Servlet “Session” object). This method has some minor advantages over the “detached object” method (below) for de-optimized schemas without version columns, but is not feasible for or handheld scenario. It also may not scale well in clustered web applications (as you must carefully evict objects from the hibernate session before storing it in the Servlet session). The other option is “detached objects” which just means Plain Old Java Objects (POJOs) which have the version field, but have escaped (such as a method return) the Hibernate/Database transaction context. These must be “merged” back in later, and conflicts must be resolved somehow (such as asking the user when a commit fails to update the row). The Hibernate documentation describes specific use of detached objects in some detail. This is not unique to Hibernate, the JPA Persistence Specification which is a part of EJB3 (JSR 220) describes a standardized version of detached objects. Specific support for serializing detached objects is not directly in the specification, other implementations may support it however.

Fetch Strategy with OR-Mapping

Consider a schema like this:

We have a customer object that allows us to retrieve the order object, and the order object allows us to get at the shipping location, as well as the order items. The order items allow us to get at the product that was ordered.

Consider if we allowed a user to edit the shipping location for an order from an order summary screen:

We need the Customer, its Order and Shipping Location objects. This means that the data for each of these relationships needs to be retrieved from the database efficiently. Hibernate annotations (which is an implementation of EJB3’s JPA annotations) allow you to specify which data is fetched, whether eagerly or lazily (meaning only if/when asked for), and how:

@Entity
public class Customer {
...
@OneToMany(cascade = CascadeType.ALL, fetch = FetchType.EAGER, mappedBy = "order")
public List<Order> getOrders() {...}
...
}

However this may not be sufficient or even ideal. First of all, it might be silly to fetch all orders for a given customer, you may not even want the relationship reflected at the entity level in the first place (thus an inverted query against the Order may be appropriate). Secondly, even if you want to fetch all of the orders in THIS case, in other cases you may not need them at all! JPA/Hibernate also allows you to specify fetching specifically in the JPA-QL/HQL query used to fetch the data:

from Customer fetch join Order where customer...

There is far more on this in the EJB3 specification and the Hibernate documentation. This “use case” driven fetch strategy may also drive you to specific structuring in your Data Access Object (DAO) and service objects. Both may also help clue your DBA in to where the indices should go.

Flush Mode

Another thing to watch out for with persistence solutions is when and where they “flush.” Most persistence solutions gain a performance advantage by scheduling updates and inserts to be sent to the database “lazily.” By default, in Hibernate, this happens just before a query is executed or just before a transaction actually commits at the database level. The data must be flushed to the database just before a query to assure accurate results.

Flushing is expensive. Most solutions do this generically without noticing which entities are actually retrieved. You can potentially gain a performance advantage by toggling the flush mode when you “know it’s safe” (e.g. not on large projects with lots of developers of various skill levels); or by ordering your operations smartly to execute non-dependent queries together before updates (although this may be difficult to muster in a proper object-oriented/service-oriented design). Playing with the flush mode and scheduling queries is also how you can potentially run less sophisticated databases which still use row locking in a lower isolation level (with a lot of specific work) to achieve higher concurrency. There is a great deal more on flush modes in the Hibernate Documentation and the EJB3 Specification.

Summary

We have scratched the surface of the “soft side” of the persistence layer in an enterprise transactional system. There is much more to consider here, including understanding modern hardware options and tuning options for specific databases. However, every article must have a scope and a beginning and an end.

When designing a database persistence layer for a transactional system, take care to:

  • Choose a database that supports Multi-version Concurrency Control
  • Be aware of the database’s threading/process design
  • Be aware of the database’s high availability and replication options
  • Design schemas appropriate for their uses, and create separate instances for analytical and reporting systems
  • Understand Transaction Isolation Levels and their implications
  • Make use of Optimistic Locking techniques
  • Design for Application Transactions (vs Database Transactions) where appropriate
  • Understand and make use of detached objects to support Application Transactions.
  • Tune your OR Mapping tool’s fetch strategies and make use of query support for fetch strategies where appropriate.
  • Understand the difference between flush mode and transaction commit, and the performance and integrity implications of flush mode.
  • Understand specific hardware options and tuning options for your database.

Thanks

Thanks to Max Rydahl Andersen from the Hibernate team for his help/suggestions for this article.

About the author

Andrew C. Oliver is the CTO of Bunisoft and a co-founder of the Buni.org open source software development community. He spends most of his time working with developers to create great communications software — in particular the Meldware Communications Suite scheduling and mail server product. You can read more thoughts from him in his blog.

Copyright (C) by 2006 Bunisoft LLC.This article is licensed under a Creative Commons Attribution 2.5 License (CC BY-SA). Attribution must include the copyright notice, link to Bunisoft.com and authors name.

One response to “Design Issues in High-Performance Transactional Applications using Java and Linux”

  1. Ian Fallon says:

    Nice article - thanks.

Leave a reply