On fixing deadlocks
Deadlocks in “oversynchronized” code and data races provoked by “undersynchronized” code are serious issues in parallel programming. Typically, they appear as volatile, hard-to-reproduce bugs in multi-threaded applications and often remain undiscovered during QA. Once the software ships, such issues bite end users and significantly increase maintenance costs.
The main reason for such “volatility” is that non-deterministic thread scheduling may produce billions of program execution states, only a few of which are erroneous. As a result, it is hard to create the conditions needed to reproduce the bugs reliably with stress tests.
Practice shows that altering the thread scheduling pattern, for example by improving code performance, may help reveal such latent issues. In the case of Java, this can be done by running the application not only on the standard JRE but also on other JVMs that may provide higher execution speed.
Charles O’Dale from Senomix Software Inc. kindly agreed to write up one such case, which occurred when testing a client-server application featuring heavy multi-threading. Senomix is a long-term Excelsior customer using Excelsior JET, a Java SE 6 JVM with an AOT compiler. Native pre-compilation of Java code often provides better performance than traditional JIT-based JVMs, especially on application startup. Here is the Senomix story.
By Charles O’Dale
Senomix Software Inc.
As part of the development process of Senomix Time Tracking, our networked software, a number of automated tests have been created to stress-test the system and ensure any one-in-a-million threading race conditions will be caught before release. In a typical test, test client applications are left to ‘attack’ our application’s server over the course of a few hours and simulate the amount of traffic the system could expect to see over a few centuries of real-time use. If the server continues to operate through that stress test without any difficulties, we can then conclude it will be able to smoothly run through any peak-period traffic an office will experience. The server program is then packaged up as an Excelsior JET executable and tested again before being distributed to our customers for installation.
Our problem occurred when performing this final set of tests for the latest version of our system with the executable compiled under Excelsior JET. Although the Java .jar version of our server program was able to handle any amount of network traffic our tests could throw at it when run against Sun’s JRE, the JET-compiled executable would freeze within a few minutes, with the application halting at a seemingly random location in the code on every interruption.
Under normal circumstances we would conclude that a new race condition had been discovered and go about correcting it. However, every test run against the pure Java version of our server would perform flawlessly when operated on Sun’s JRE — only tests run against the JET-compiled executable resulted in failure. After implementing every check we could think of to prevent deadlock in the executable program’s threads, we concluded there must be a problem with JET and set about informing Excelsior of the issue.
Our test environment made duplicating this problem a straightforward process, and the test applications and troublesome Java server jar were sent along by e-mail to Excelsior for review. Excelsior’s support team were then able to use their test environment to duplicate the problem we were seeing in the executable and set about identifying the underlying cause.
It turned out the problem was in our own code after all! A newly created thread involved in communication had its run method mistakenly set to be synchronized, with that declaration causing a deadlock in the faster code generated by the JET compiler.
After correcting that mistaken declaration from:
public synchronized void run()
to:
public void run()
the JET-compiled executable ran flawlessly, with the program demonstrating the same reliability as our standard Java jar file.
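To illustrate how a synchronized run method can freeze a program, here is a minimal sketch. It is not the Senomix code, which is not shown in this article: the Worker class, its stop flag and the stopsWithin helper are all hypothetical. The sketch assumes a worker whose other methods are synchronized on the same object. Once run() owns the monitor for its whole lifetime, any thread calling one of those methods blocks forever. If the caller happened to get in before run() took the monitor, nothing would go wrong, which is the kind of timing dependence described above.

```java
import java.util.concurrent.CountDownLatch;

public class SynchronizedRunDemo {

    /** Hypothetical worker; the flag selects the buggy or the corrected behaviour. */
    static class Worker implements Runnable {
        private final boolean synchronizedRun;
        final CountDownLatch started = new CountDownLatch(1);
        private boolean stopRequested; // guarded by this worker's monitor

        Worker(boolean synchronizedRun) {
            this.synchronizedRun = synchronizedRun;
        }

        @Override
        public void run() {
            if (synchronizedRun) {
                // Same effect as declaring "public synchronized void run()":
                // the monitor is held until the loop ends.
                synchronized (this) {
                    loop();
                }
            } else {
                loop();
            }
        }

        private void loop() {
            started.countDown();
            while (!isStopRequested()) {
                try {
                    Thread.sleep(1); // sleeping does not release the monitor
                } catch (InterruptedException e) {
                    return;
                }
            }
        }

        private synchronized boolean isStopRequested() {
            return stopRequested;
        }

        synchronized void requestStop() {
            stopRequested = true;
        }
    }

    /** Starts a worker, asks it to stop, and reports whether it stopped in time. */
    public static boolean stopsWithin(boolean synchronizedRun, long timeoutMillis) {
        Worker worker = new Worker(synchronizedRun);
        Thread workerThread = new Thread(worker, "worker");
        workerThread.setDaemon(true); // a stuck worker must not keep the JVM alive
        workerThread.start();
        try {
            worker.started.await(); // the worker is now inside its loop

            // With a synchronized run(), this call can never acquire the monitor.
            Thread controller = new Thread(worker::requestStop, "controller");
            controller.setDaemon(true);
            controller.start();

            workerThread.join(timeoutMillis);
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return !workerThread.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("synchronized run() stopped: " + stopsWithin(true, 500));
        System.out.println("plain run() stopped: " + stopsWithin(false, 500));
    }
}
```

With the synchronized variant the worker should still be alive when the timeout expires, while the plain variant should stop almost immediately. The sketch uses a Runnable rather than a Thread subclass on purpose: Thread.join is itself synchronized on the thread object, so a Thread subclass with a synchronized run() would also block anyone trying to join it.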
If our system’s server application only ever operated as a standard Java jar, it’s unlikely we would ever have encountered a problem with this code (as the conditions required to bring about the deadlock would truly be a one-in-a-billion event). However, the improved efficiency of the Excelsior JET executable increased thread speed just enough to bring this problem to light.
If another office experienced a similar condition, with a Java program operating fine but failing once compiled into a JET executable, we would recommend they find a way to duplicate the problem with a similar reproducible test. If internal examination of that problem code could not identify the source of the difficulty, forwarding that test to Excelsior’s support team for their review may provide helpful insights similar to those which Excelsior delivered to us.
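When a reproducible test does freeze, it can help to record which threads are blocked and on whose lock. The following is a rough sketch using the standard java.lang.management API. It is not something the story above says was used, the BlockedThreadReport class name is made up for illustration, and whether a particular runtime exposes this API is an assumption to verify.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class BlockedThreadReport {

    /** Returns one line per thread that is currently blocked on a monitor. */
    public static List<String> blockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                // getLockOwnerName() names the thread that holds the contested monitor
                report.add(info.getThreadName() + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }
}
```

Calling blockedThreads() from a watchdog thread once the test stops making progress would list, for a mistake like the one described above, the blocked caller together with the name of the thread that never releases the monitor.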
Excelsior JET JVM: product info
A Tale of Four JVMs and One App Server: yet another article on revealing latent bugs by testing with different JVMs