RSS Feed"I" in MAPRICOT – Improve Logging
By the time we get to step 5 of Mapricot (our eight step process for handling errors in production systems), we have already identified the error, addressed it, and now we are on the prowl for improving other parts of code, such as logging, the “I” phase of Mapricot.
In the analysis phase, something always catches the eye: tons of “unknown” exceptions, or other exceptions with inadequate details in the log. The most interesting, as well as common, mistake is some code that is absorbing an exception, and logging some generic message, such as the following masterpiece:
try {
// functionality
} catch (Exception e) {
log (“Exception in the functionality!”);
// No details of the exception e, and generic statement
}
This is a classic mistake, where the log statement is generic, and is actually losing out on all the details of the exception. Obviously, this code should be replaced by something as:
try {
// functionality
} catch (Exception e) {
log (“Exception in saving: ” + e, e); // Log both the message, and the exception trace
// throw e; // Rethrow the exception if suitable.
}
Another classic mistake is where the code throws a NEW exception, that is not a wrapper on the old exception.
try {
// functionality
} catch (Exception e) {
throw new Exception (“Exception in functionality”); // bad code – lost the original exception!
}
More appropriate is to do:
try {
// functionality
} catch (Exception e) {
throw new Exception (“Exception in functionality”, e); // Wrap the exception
// or
throw e; // Rethrow the original exception
}
A similar problem comes with SQLException, since that does not have a constructor that takes the inner exception as an argument. In that case, the initCause method can be used:
method throws SQLException {
try {
// try block
} catch (Exception e) {
SQLException sqle = new SQLException(e.toString()); sqle.initCause(e); throw sqle;
// those 3 statements on one line, so that line numbers in debug stmts are consistent
}
Now that we have established patterns of the problem code, we can scan the entire application and find the instances of problem code. Regular expression searches come in quite handy for this.
A second broad category of code instances where “I” phase is helpful is where supposedly significant logging is already in place, but yet we are not able to identify what is causing the error to happen. So, the idea is that while we will not be able to resolve the error in the upcoming build, at least we will improve the logging, and then reexamine the error, and fix in the next build after that. Following is an example of this:
if (matchingInvoiceCount == 0) {
throw new IllegalArgumentException (“Strange, no matching invoice found, for reference # ” + refNumber);
// The logging is fine, and we are struggling to identify why does the invoice not exist
}
One possible improvement that we may then be able to make is:
if (matchingInvoiceCount == 0) {
throw new IllegalArgumentException (“Strange, no matching invoice found, for reference # ” +
refNumber + “, modelPK: ” + invoiceModel.getPK() +
“, model XML: ” + invoiceModel.getXML());
// “Improved” logging – see if the model has a PK, and see if the XML shows any
// reasons why the save could have failed earlier.
// Perhaps we will identify the cause of the error in next release logs
}
All in all, the “I” phase of MAPRICOT should be taken as a way to piggyback some changes about errors that are in general happening, but we are not getting enough information about those errors. The problems that we are able to apply the “I” phase on, will likely be problems that we may be able to apply the prioritization phase in the next build.
"P" in Mapricot – Prioritization of Errors
The prioritization of errors is usually the most critical decision point. As we note in the analysis phase of Mapricot, we have already created some histograms and basic analytical structures that can be used in prioritization. However, it is possible that error counts play only a very small role in the prioritization, due to the following other aspects:
- Effect of an error: For example, if an error is simply causing an error to be logged when nothing bad really happened, that would have lesser priority than an error which actually causes some heartache to a user. Similarly, if an error actually overbills a user by 5$, that may be more important than an error which only loses that user’s default font. The effects of an error are defined by you, and it may not be worthwhile to try to assign any numerical weight to them.
- Effort required for a fix: When we identify an error, what can we do about it? If an error that happens 500 times is easily fixable, than it may have a higher priority than an error that happens 1000 times, and will require 3 days of time before we can have an estimate on the fix time.
- Possibility of a buildless fix: If we identify an error that can be fixed using a very simple change in the configuration and does not necessitate a build, that may be the highest priority action item (This can be considered a specific subcase under “effort required for a fix”).
It is of course easy to note that the “effect of the error” is the benefit analysis, and the “effort for the fix” is the cost analysis if considered using the standard cost benefit analysis for each of the possible action items.
Summing this up, the prioritization phase (“P”) takes the output of the analysis phase (“A”) and attempts to create a following table:
| Error ID | Description | Count | Effect | Effort |
|---|---|---|---|---|
| 1 | Unable to send hourly emails | 168 | No notifications | runtime property change (smtp host) |
| 1 | Unable to change password | 365 | Security issue + User Annoyance | 3 days + deploy cycle |
| 3 | Unable to save home page | 2041 | User Annoyance | DB Alter (no code change) |
"A" in Mapricot – Analysis of Errors
The analyze phase of Mapricot is intended as a quick phase, where all errors are categorized (typically using a group by query) and a frequency count/histogram generated. This phase is usually initiated automatically by using a commercial application such as FDmonitor, which can be configured to generate weekly histograms from the error tables. A typical output is as follows:
Using these histograms, it can be relatively quickly decided which errors require follow up actions (improved logging/offer urgent production/code improvements/training). The analysis phase thus serves as the router and creator for the follow up tasks in the Mapricot framework.
"R" in Mapricot – Release correlation with errors
Correlating errors with releases is the 4th step in Mapricot – our 8 step process in managing production systems. (Remember R for Release Correlation.)
One of good ways of doing it as maintain error counts per day running application (and make sure to account for daily variance if the number of days is too low).
An example is as follows:
Release R4.2.1, error frequency chart:
| Error Code | Count |
| E42320348 | 279 |
| E11239123 | 221 |
| E23927343 | 181 |
| E82302215 | 81 |
| E12914101 | 12 |
Release R4.2.2, error frequency chart:
| Error Code | Count |
| E11239123 | 1124 |
| E23927343 | 741 |
| E82302215 | 445 |
| E12914101 | 73 |
| E42320348 | 16 |
The "M" in Mapricot
Putting a monitoring system is an important step in Mapricot. The goal of this step is to store the errors in an analyzable space (such as a DB table), along with the relevant details, so that the analysis can be done at a later time.
This step consists of following substeps:
- Design a database table to hold all the errors
- Have a common code in the application that does the following substeps. The common code can be a utility/base method, etc, whatever suits the application. (Usually, I prefer the base method option, as it easily flows into the object oriented philosophy, but again, that is a detailed design consideration.)
- catches errors,
- extracts the error details, and
- logs the errors to the DB table
Within the step 2b (extraction of error details), there are two broad options:
- Manually maintain a database of error codes
- Programmatically create an extract of the error
We focus on the option 2, as it works fairly well, and of course it saves a *lot* (a lot a lot) of time, compared to option 1. When an exception happens, usually a trace is available. Any extraction (hashing) algorithm can be used to create an error code from the trace. Following enhancements can help:
- Ignore/prune any components of trace that are not in application domain
- First create a hash of the module names itself.
- Using steps 1 and 2, we reach a simplified stack trace that is of form:
- M1, F1, line 232
- M2, F2, line 230
- M3, F3, line 400
- etc etc.
- In case stack trace is longer than 10, prune trace elements except first 5 and bottom 5.
Apps
