Unless otherwise stated, the content of this blog is licensed under Creative Commons.
Not long ago, I was confronted with a performance problem that could be described as interesting. In the mouth of a technical expert, that word generally tends to send a wave of panic through even the most seasoned managers.
Let me explain.
The program applies the same processing, many times over, to a large volume of data. In short, it is a batch. Operational objective: guarantee that the system can process 50,000 files per hour.
The execution architecture of this batch is fairly classic: a controller is responsible for obtaining a list of files to process from a business service, splitting this list into chunks, and then submitting the chunks to a pool of threads that perform the processing in parallel. Each chunk is processed in a separate transaction. Processing a single file is relatively slow, on the order of 2 seconds; chunks are therefore small (4 files) to limit the duration of transactions.
Beyond its role of splitting and distributing chunks, the controller handles the overall piloting of the batch, which has no real bearing on our problem but which I mention for information: progress tracking, pause, resume, computation of various statistics, error reports, etc.
Typically, with such an architecture, the overall throughput of the system is a function of the unit processing time and of the number of simultaneous threads:
Hourly throughput = (3600 / Unit processing time) * Number of threads
For my batch, the theoretical calculation is simple: since each file takes 2 seconds to process, one thread can process 1,800 files per hour. With a target throughput of 50,000 files per hour, a pool of roughly 30 threads is therefore theoretically required.
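The arithmetic above can be sketched in a few lines of C. This is only an illustration of the linear model; the function names `hourly_throughput` and `threads_needed` are hypothetical and do not come from the batch itself. Under this model the strict minimum is 28 threads, which the round figure of 30 covers with a small margin.

```c
#include <assert.h>

/* Hourly throughput under the linear model:
 * (3600 / unit processing time in seconds) * number of threads.
 * Hypothetical helper, for illustration only. */
double hourly_throughput(double unit_time_s, int threads) {
    assert(unit_time_s > 0.0);
    return (3600.0 / unit_time_s) * threads;
}

/* Smallest thread count whose modelled throughput reaches the target.
 * Assumes, like the model, that unit time does not depend on the thread count. */
int threads_needed(double target_per_hour, double unit_time_s) {
    double per_thread = hourly_throughput(unit_time_s, 1);
    int n = (int) (target_per_hour / per_thread);   /* truncated quotient */
    if (hourly_throughput(unit_time_s, n) < target_per_hour) {
        n++;                                        /* round up when truncation fell short */
    }
    return n;
}
```

For example, `threads_needed(50000.0, 2.0)` gives the minimum for the figures quoted above, and `hourly_throughput(2.0, 30)` gives the modelled throughput of a 30-thread pool.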
This calculation, however, rests on a bold hypothesis: it assumes that the unit processing time remains stable whatever the number of threads. In other words, that the system scales perfectly linearly.
And full-scale tests carried out in pre-production show a behaviour rather far removed from our Platonic equation:

Expressed as throughput, the same graph shows a worrying slope:

The hourly throughput drifts further and further from the ideal theoretical value and tends towards a value hopelessly far from the target throughput. Worse, it tends to degrade when the number of threads becomes too high. No panic, however: this behaviour cannot unsettle the performance veteran that I am. Such curves simply reveal that the system is hitting a limit that prevents it from scaling. It remains to determine which physical or software component is responsible.
The first reflex is naturally to verify that no physical component is reaching saturation. Ad hoc tools are used to monitor CPU consumption, GC overhead, paging, bandwidth, disk I/O, etc. I will spare you the details. Two interesting findings emerge from this analysis: first, no physical component comes close to its capacity limits; second, their load level follows almost perfectly the throughput curve shown in figure 2. In other words, the usage of physical resources, for example the average processor load, follows the same trend as the throughput: its growth slows down and drifts inexorably away from the theoretical linear progression.
Once again, these are fairly common characteristics of parallel systems: they mean that the threads of execution compete to obtain locks or to enter critical sections. The more the number of threads increases, the fiercer the competition becomes, and the longer it takes to acquire locks.
In our system, two components are likely to create these conditions: the database and the Java execution layers.
Anyone who has worked on a heavily parallel transactional system knows that database contention is the main pitfall. To guarantee transaction isolation, the DBMS places read and/or write locks on rows, pages or tables. Under very heavy load, the granularity of the locks is sometimes changed dynamically, escalating for example from the row (ROW LOCK) to the page (PAGE LOCK).
Our batch system was designed accordingly: the database is partitioned along strict functional lines, read and write operations were carefully sequenced, locks are placed at page level, and the controller can exploit certain characteristics of the physical storage to compute the segments and distribute them among the different threads. An analysis of database contention showed the effectiveness of these strategies: the level of contention remains remarkably low and in no way explains the sharp degradation observed under increasing load.
A single explanation remains: Java contention in the execution layers. And one question: how can we identify the origin of this contention and fix it?
A first test is carried out to validate the hypothesis of contention in the Java application code: it consists in starting 5 processes, each with a pool of 4 threads, for a total of 20 threads. This reaches a degree of parallelism that, in our previous tests, led to a sharp degradation of unit processing times. The test is conclusive: spread over 5 separate processes, the threads are no longer subjected to deadly competition, and the system scales satisfactorily.
One possible solution would be to adopt this architecture and distribute the processing over several processes. The controller, however, was not designed to coordinate several processes. The simplest evolution, which consists in separating the controller from the business components (for example by means of EJBs), implies introducing a distribution layer. This seems all the less desirable to us since the controller is in charge of driving the transactions: separating it from the processing means distributing transactions, with all the small worries that this decision generally entails. It also probably means introducing an application server into the batch architecture, with additional difficulties in terms of deployment and operation. Other, more massive evolutions are possible: for example, relying on technical tables to store the segments, and completely decoupling a pre-processing stage that computes the chunks from a processing stage that fetches the chunks from the database and processes them. One can even imagine more exotic solutions, based on JMS queues... But in short, we like the compactness of our initial architecture, and are not ready to abandon it without putting up a bit of a fight. The question therefore remains: how do we flush out the origin of our contention in order to eliminate it?
Static analysis of the code yielded nothing: the controller, like the business components, uses umpteen third-party libraries, more or less open source, whose execution flow is readily cryptic. Under these conditions it is impossible to determine the precise origin of the contention simply by reading the sources: a runtime analysis is necessary.
It remains to find the right tool in the plumber's toolbox. Even with the gracious help of Google, the harvest of tools remains thin...
As plentiful as the offering is when it comes to profiling the CPU or memory consumption of Java applications, contention analysis remains a poorly equipped discipline. Admittedly, most Java profilers provide a thread analysis module; but their overhead is such that it is difficult, if not impossible, to instrument a heavily multi-threaded process like our batch: beyond 3 or 4 threads, the information collected loses its relevance, and the unit time stretches so much that any attempt at analysis becomes pure acrobatics.
Let us quickly dispose of hprof, the profiler built into the JDK (ours, in this case, is the IBM JDK 1.4.2 SR8). The documentation mentions a "monitor=y" option which, I quote, tells hprof to "generate information on contention monitors used to synchronise the work of threads". Hurrah, we are saved! Not quite... Besides the fact that getting it to work required a JDK upgrade (the version we originally used produced a very pretty core dump when hprof was activated), the information obtained proved remarkably useless: essentially, a snapshot of the Java monitors at the moment the process exits. No interest!
Exit hprof, then.
Another approach: repeatedly generate thread dumps using the kill -3 command, then compare the different dumps using an ad hoc tool (in this case a utility available on the alphaWorks site, whose full name I give to underline the marketing flair of IBM's engineers: ). Although promising, this solution once again gave no conclusive result: it allowed us to identify a few candidate methods, but without making it possible to quantify their respective responsibilities.
One solution remains: roll up our sleeves and write a specialised profiler based on JVMPI.
JVMPI (for Java Virtual Machine Profiler Interface) is, as its name suggests, a native profiling API of the JVM (replaced by JVMTI from Java 5 onwards, but that is another story). The API works on a callback system: the agent (that is what a program relying on the API is called) registers itself with the JVM, specifying the types of events it wants to be notified of. You will probably have recognised the Observer pattern.
These events cover the lowest-level technical operations performed by the JVM: loading a class, allocating memory, executing a method, starting a thread, etc. Among these events, two particularly hold our attention: MONITOR_CONTENDED_ENTER and MONITOR_CONTENDED_ENTERED. The first indicates that a thread wants to execute a synchronised block and is trying to acquire a monitor that is already held by another thread (as a reminder, the notion of monitor in Java is very close to that of lock in the rest of the world); the second indicates that the thread has obtained the monitor in question and is starting to execute the synchronised block. The time interval between the two events therefore corresponds to the total waiting time, in other words to the duration of the contention.
Armed with these rudiments of JVMPI, we can formulate the exhaustive specification of our small profiling tool: the duration of every contention encountered by the program is measured. If this duration exceeds a certain threshold (say 5 milliseconds), the name of the corresponding method is logged to a file, along with a simplified call stack.
Good. All that remains is to code it.
The entry point of a JVMPI agent is the JVM_OnLoad function. It is in this function that we tell the JVMPI subsystem which events we want to listen to (EnableEvent), and which callback method will be invoked when these events occur (NotifyEvent):
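/* Pointer to the JVMPI function table, shared by all handlers */
static JVMPI_Interface *jvmpi;

JNIEXPORT jint JNICALL JVM_OnLoad(JavaVM *jvm, char *options, void *reserved) {
    /* Obtain the JVMPI interface from the JVM */
    if ((*jvm)->GetEnv(jvm, (void **) &jvmpi, JVMPI_VERSION_1) < 0) {
        return JNI_ERR;
    }
    /* Register callback method */
    jvmpi->NotifyEvent = jtprof_notify_event;
    /* Enable events */
    jvmpi->EnableEvent(JVMPI_EVENT_JVM_SHUT_DOWN, NULL);
    jvmpi->EnableEvent(JVMPI_EVENT_CLASS_LOAD, NULL);
    jvmpi->EnableEvent(JVMPI_EVENT_CLASS_UNLOAD, NULL);
    jvmpi->EnableEvent(JVMPI_EVENT_MONITOR_CONTENDED_ENTER, NULL);
    jvmpi->EnableEvent(JVMPI_EVENT_MONITOR_CONTENDED_ENTERED, NULL);
    jvmpi->EnableEvent(JVMPI_EVENT_THREAD_START, NULL);
    jvmpi->EnableEvent(JVMPI_EVENT_THREAD_END, NULL);
    ...
    return JNI_OK;
}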
The callback method, jtprof_notify_event, simply determines the type of the event and dispatches it to a specialised handler:
static void jtprof_notify_event(JVMPI_Event *event) {
    /* Dispatch on the event type reported by the JVM */
    switch (event->event_type) {
    case JVMPI_EVENT_THREAD_START:
        handle_thread_started_event(event);
        return;
    case JVMPI_EVENT_THREAD_END:
        handle_thread_end_event(event);
        return;
    case JVMPI_EVENT_MONITOR_CONTENDED_ENTER:
        handle_monitor_contented_enter(event);
        return;
    case JVMPI_EVENT_MONITOR_CONTENDED_ENTERED:
        handle_monitor_contented_entered(event);
        return;
    ...
    }
}
Things look relatively simple. The JVMPI API even provides a thread-local storage mechanism (GetThreadLocalStorage method), which makes it possible to keep an arbitrary data structure dedicated to each thread. I will spare you the details of creating and destroying this data structure; the curious can take a look at the source code. Just know that it is created on the THREAD_START event, destroyed on the THREAD_END event, and that it notably stores the time at which the thread requested the monitor.
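To make the life cycle of this per-thread structure more concrete, here is a minimal, JVM-independent sketch. It is not the agent's actual source: the single field and all function names (`context_create`, `context_destroy`, `context_begin_wait`, `context_end_wait`, `is_reportable`) are assumptions made for illustration. In the real agent, the pointer would be attached to the thread through the JVMPI thread-local storage on THREAD_START and released on THREAD_END.

```c
#include <assert.h>
#include <stdlib.h>

/* Per-thread state (assumed layout): the time, in milliseconds, at which
 * the thread started waiting for a monitor; 0 means "not waiting". */
typedef struct {
    long timer;
} thread_local_context;

/* Would be called when a thread starts. Returns NULL if allocation fails. */
thread_local_context *context_create(void) {
    return calloc(1, sizeof(thread_local_context));
}

/* Would be called when a thread ends. */
void context_destroy(thread_local_context *ctx) {
    free(ctx);
}

/* Records the moment the thread starts waiting for a contended monitor. */
void context_begin_wait(thread_local_context *ctx, long now_ms) {
    assert(ctx != NULL);
    ctx->timer = now_ms;
}

/* Returns how long the thread waited and resets the timer.
 * Returns 0 if no wait was in progress. */
long context_end_wait(thread_local_context *ctx, long now_ms) {
    assert(ctx != NULL);
    long waited = (ctx->timer != 0) ? now_ms - ctx->timer : 0;
    ctx->timer = 0;
    return waited;
}

/* A contention is worth logging only above the configured threshold. */
int is_reportable(long waited_ms, long threshold_ms) {
    return waited_ms > threshold_ms;
}
```

Keeping the timestamp in a structure owned by a single thread means the measurement itself needs no lock, which matters for a tool whose purpose is to observe lock contention.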
In the end, the code that measures the time taken to acquire a monitor is fairly basic:
static void handle_monitor_contented_enter(JVMPI_Event *event) {
    thread_local_context *ctx = (thread_local_context *) jvmpi->GetThreadLocalStorage(event->env_id);
    // Error handling code
    ....
    /* Remember when this thread started waiting for the monitor */
    ctx->timer = system_current_time_millis();
}
static void handle_monitor_contented_entered(JVMPI_Event *event) {
    long finish = system_current_time_millis();
    long total;
    thread_local_context *ctx = (thread_local_context *) jvmpi->GetThreadLocalStorage(event->env_id);
    // Error handling code
    ....
    /* Waiting time = acquisition time minus request time */
    total = finish - ctx->timer;
    ctx->timer = 0;
    if (total > __contention_threshold) {
        // Log contention details
        ....
    }
}
Things are, of course, a touch more complex in reality. We should not dream, after all.
For the sake of efficiency, the JVMPI_Event data structure contains only a minimal set of information for each event. In practice, this means that the name of the invoked method is not available in literal form, but as an internal identifier created when the corresponding class is loaded.
Logging this identifier would not be of much interest. We need to be able to link it to the qualified name of the method. To do so, the CLASS_LOAD event must be handled. This event provides the complete definition of the class and of all its methods, including their literal names. Our program therefore keeps this information in an associative array (whose source code is very largely inspired by that of hprof). This array can then be consulted while handling a contention to determine the name of the calling method from its identifier. A similar technique is used to resolve the call stack. I will let you consult the source code for details...
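The associative array described above could look roughly like the following chained hash table. This is a simplified sketch, not the agent's code nor hprof's: the names `method_table_put` and `method_table_get`, the bucket count, and the use of `unsigned long` as a stand-in for the JVM's method identifier type are all assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define METHOD_TABLE_BUCKETS 1024   /* arbitrary size chosen for the sketch */

typedef struct method_entry {
    unsigned long method_id;        /* stand-in for the JVM method identifier */
    char *qualified_name;           /* "ClassName.methodName" */
    struct method_entry *next;      /* next entry in the same bucket */
} method_entry;

static method_entry *method_table[METHOD_TABLE_BUCKETS];

static size_t bucket_of(unsigned long method_id) {
    return (size_t) (method_id % METHOD_TABLE_BUCKETS);
}

/* Would be called once per method while handling a class-load event.
 * Returns 0 on success, -1 if memory could not be allocated. */
int method_table_put(unsigned long method_id,
                     const char *class_name, const char *method_name) {
    assert(class_name != NULL && method_name != NULL);
    method_entry *entry = malloc(sizeof *entry);
    if (entry == NULL) {
        return -1;
    }
    /* Room for class name, a dot, method name and the terminator */
    entry->qualified_name = malloc(strlen(class_name) + strlen(method_name) + 2);
    if (entry->qualified_name == NULL) {
        free(entry);
        return -1;
    }
    strcpy(entry->qualified_name, class_name);
    strcat(entry->qualified_name, ".");
    strcat(entry->qualified_name, method_name);
    entry->method_id = method_id;
    /* Insert at the head of the bucket's chain */
    size_t bucket = bucket_of(method_id);
    entry->next = method_table[bucket];
    method_table[bucket] = entry;
    return 0;
}

/* Would be called while handling a contention.
 * Returns the qualified name, or NULL if the identifier is unknown. */
const char *method_table_get(unsigned long method_id) {
    method_entry *entry = method_table[bucket_of(method_id)];
    while (entry != NULL) {
        if (entry->method_id == method_id) {
            return entry->qualified_name;
        }
        entry = entry->next;
    }
    return NULL;
}
```

Two points are deliberately left out of this sketch: removing entries when a class is unloaded, and protecting the table with a lock, which a real agent would presumably need since classes can be loaded from several threads.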
The program must be compiled as a dynamic library (.dll extension on Windows, .so on Unix), placed in LIB_PATH (or its equivalent) and loaded into the JVM with the -Xrunjtprof option (jtprof is the name given to the library). Here is the kind of output obtained:
Once debugged, this tool proved very valuable. Its overhead is very low: the greediest events, such as memory allocation sites or the CPU consumption of methods, are not handled. It gave us a very precise picture of the level of contention and of where it occurred. A dozen methods were corrected (most contentions were very short, seldom more than 50 ms, but their frequency and duration tended to increase).
After these optimisations, the throughput curve looks as follows:

A slight dip remains beyond 25 threads, but this time its origin is not in doubt: our test machine is simply saturated, and the contention now concerns access to the processors.
Those who have followed from the beginning may remember the equation introduced earlier in this article:
Hourly throughput = (3600 / Unit processing time) * Number of threads
If they have not completely forgotten their middle-school maths, they will probably have understood that increasing the number of threads is not the only lever on throughput: reducing the unit processing time is another. But that is another story...
Very interesting analysis of a common problem. I am surprised that you got no results with the IBM profiler. I remember having used the Eclipse one successfully.
You mention technologies such as JMS as an alternative. Having implemented it on a large distributed system, I can tell you that it is an excellent solution, provided you do not forget to handle, as you say, the locking problems in the database.
Writing articles of this kind is a good initiative, congratulations!
A very good tool for diagnosing this kind of thing is Introscope from Wily Technology: http://www.wilytech.com.
It avoids any coding step.
This scenario reminds me of a recent experience...
Very interesting article, a gripping investigation...
However, regarding the use of JVMPI, I think it is less efficient than JVMTI (available since version 1.5 of the JDK).
As a thread profiling product, I recommend YourKit Profiler > v7.
Thank you, Yoann.
Hello,
Very interesting demonstration.
During development, I was able to test the Eclipse profiler and the NetBeans one. They are rather good.
But in production, it is true that an agent installed on the Java server is necessary.
Thank you
[...] of contention. In my case, nothing conclusive. I then thought back to the article published on our blog, Performance chronicles: About contention.... I retrieved the agent's sources and installed it on the application server, Weblogic 8.1 [...]