Pseudo events & JMX metrics
System metric charts
While analyzing performance problems, people often need to look at system metrics. Until recently, we believed this step of the analysis process could be, and was being, covered by other tools, many of which are available right at the OS level through simple command lines (vmstat, for example). In addition, many users already have a permanent monitoring solution in place for OS-level metrics.
However, we realized that switching between multiple tools during the analysis process can be tedious, so we decided to implement a simple charting tool to present these metrics to you. All you have to do is open the new “Metrics” tab in the lower pane to see the charted data matching the time window you’re currently analyzing (the timeline filter at the top of the main pane).
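If you’re curious about the kind of data points that end up in these charts, here is a minimal Java sketch that polls a couple of OS-level metrics through the JMX OperatingSystemMXBean. It assumes a HotSpot JVM and is purely illustrative; it does not reflect how the tool itself harvests its metrics.

```java
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

// Illustrative sketch: polls a couple of OS-level metrics once per second,
// similar in spirit to the data points shown in the "Metrics" tab.
public class SystemMetricsPoller {
    public static void main(String[] args) throws InterruptedException {
        // Cast assumes a HotSpot JVM exposing the com.sun.management interface
        OperatingSystemMXBean os = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
        for (int i = 0; i < 10; i++) {
            double cpu = os.getSystemCpuLoad();            // 0.0 - 1.0, or -1 if unavailable
            long freeMem = os.getFreePhysicalMemorySize(); // bytes
            System.out.printf("t=%ds cpu=%.2f freeMem=%dMB%n",
                    i, cpu, freeMem / (1024 * 1024));
            Thread.sleep(1000);
        }
    }
}
```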
Min bound counts
The “Min Bound Count” feature embodies our “cheap but effective” mantra. Based on the samples, we estimate how many times execution left a given node in the tree, and from that count and the sampling interval we can compute a theoretical upper bound for the average response time of that method, in that given path in the tree.
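To make the estimation idea more concrete, here is a hedged Java sketch of one way such a bound can be derived from sampling data. The run-counting heuristic and the method names are illustrative assumptions, not the product’s actual algorithm: each contiguous run of samples in which the node is present counts as at least one call, and the total sampled time divided by that minimum call count gives an upper bound on the average time per call.

```java
import java.util.List;

// Hedged sketch of the "min bound count" idea: given one boolean per sampling
// tick telling whether a node was present in the sampled stack, count the
// contiguous runs. Each run corresponds to at least one traversal of the node,
// so the run count is a lower bound on the number of calls, and
// (presentTicks * interval) / runCount is an upper bound on the average time
// spent per call in that path.
public class MinBoundCount {

    static int minCallCount(List<Boolean> presentPerSample) {
        int runs = 0;
        boolean inRun = false;
        for (boolean present : presentPerSample) {
            if (present && !inRun) runs++;
            inRun = present;
        }
        return runs;
    }

    static double maxAvgResponseTimeMs(List<Boolean> presentPerSample, long samplingIntervalMs) {
        long presentTicks = presentPerSample.stream().filter(p -> p).count();
        int minCalls = minCallCount(presentPerSample);
        return minCalls == 0 ? 0 : (double) (presentTicks * samplingIntervalMs) / minCalls;
    }

    public static void main(String[] args) {
        // Node seen in samples 1-3 and 6-7 => at least 2 calls, and at most
        // (5 * 10ms) / 2 = 25ms average per call in this path.
        List<Boolean> samples = List.of(false, true, true, true, false, false, true, true, false);
        System.out.println("min calls: " + minCallCount(samples));
        System.out.println("max avg response time: " + maxAvgResponseTimeMs(samples, 10) + " ms");
    }
}
```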
To turn this on, all you have to do is click the calculator icon:
The reason we’ve added this key feature is that, when analyzing sampling results, the question that comes immediately after “in which method(s) is my response time lost?” is “okay, but is this method called very frequently, does each individual call last a long time, or both?” In other words: “am I facing a 1 x n, an n x 1 or an n x m scenario?” This counter will help you figure that out very quickly, without having to move on to instrumentation mode. That is the key point here: we’re constantly working on delaying the moment when you have to draw more information from the system, and hence be more intrusive in the way you retrieve data from it.
But you have to be careful about how you interpret this information. Parallelism can strongly influence these counters, so make sure you understand the paradigm under which your threads are operating.
Pseudo events
Looking at thread stacks as an aggregated tree is very powerful, but we felt we needed to display the method call information in new ways. Not only does this shed a different light on the data, it will also serve as the first stepping stone toward a new family of views in the future.
As you probably know by now, we’re very dedicated to squeezing as much data as we can out of cheap data sources. Our work on thread stack aggregation is probably the most obvious piece of functionality derived from this philosophy. Now we’re taking things to the next level and experimenting with APM-style (or distributed-transaction-style) analysis applied to pseudo events, meaning we try to use the samples retrieved via discrete monitoring to draw conclusions about continuous phenomena.
While this might throw you off a bit at first, we believe this will be an essential part of our monitoring strategy in the future. As of this version, however, all you need to worry about is the fact that there’s a new way to look at the events which represent method calls (the pseudo-event table), and that these can come from our JMX harvest.
With pseudo events, you get access to analysis capabilities similar to those available when working with instrumentation-based events. For instance, you can further analyze the “transaction” to which a given pseudo event is attached. The word “transaction” is actually not quite appropriate here, since we’re just talking about the stack trace of a certain thread, taken from a single thread dump. But the idea is the same.
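As an illustration of what a pseudo event boils down to, here is a minimal Java sketch that turns a single thread dump into a list of records, one per sampled thread, each carrying a timestamp and a stack trace. The record and field names are purely illustrative and do not reflect the product’s internal model.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch: turns one thread dump into a list of "pseudo events",
// i.e. one record per sampled thread carrying its stack trace and the
// sampling timestamp. Names are illustrative, not the product's model.
public class PseudoEventSketch {

    record PseudoEvent(Instant timestamp, long threadId, String threadName,
                       StackTraceElement[] stack) {}

    static List<PseudoEvent> sampleOnce() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Instant now = Instant.now();
        List<PseudoEvent> events = new ArrayList<>();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            events.add(new PseudoEvent(now, info.getThreadId(),
                    info.getThreadName(), info.getStackTrace()));
        }
        return events;
    }

    public static void main(String[] args) {
        for (PseudoEvent e : sampleOnce()) {
            System.out.println(e.threadName() + " @ " + e.timestamp()
                    + " (" + e.stack().length + " frames)");
        }
    }
}
```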
You can find out more about events and subscriptions in the section “Subscriptions”. These require using the collector in combination with an agent.
Historical data
As explained in the section “Permanent monitoring”, being able to analyze a problem in real time (what we call “just in time”) may be nice, but it’s a luxury you can’t always afford, depending on the type of problem you’re analyzing.
That’s why we’ve invested so much effort into the collector. The collector allows you to go back and look at what happened to the system in a post-mortem fashion, or simply after most of the diagnostic essentials have faded away from the system. Thanks to the collector, even if the information is no longer available in the system, chances are it has been captured and saved for you.
While the main use of this functionality is to store thread dumps and transactional data, it also works for JMX metrics. This means you can go back, browse the values and chart the data that was collected by polling the beans at an earlier time. The data is simply saved in our MongoDB database, just like everything else we persist.
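For illustration, here is a hedged Java sketch of the general pattern: poll one JMX value and store it as a timestamped document in MongoDB using the official synchronous driver. The database, collection and field names are assumptions made for the example, not the collector’s actual schema.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.Date;
import org.bson.Document;

// Hedged sketch: polls one JMX value (heap usage) and stores it as a
// timestamped document in MongoDB. Database, collection and field names
// are illustrative only, not the collector's actual schema.
public class JmxMetricArchiver {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        long usedHeap = memory.getHeapMemoryUsage().getUsed();

        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> metrics =
                    client.getDatabase("monitoring").getCollection("jmxMetrics");
            metrics.insertOne(new Document("timestamp", new Date())
                    .append("metric", "heap.used")
                    .append("value", usedHeap));
        }
    }
}
```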