March 17th, 2008
Putting a monitoring system in place is an important step in MAPRICOT. The goal of this step is to store the errors, along with the relevant details, in an analyzable place (such as a database table), so that the analysis can be done later.
This step consists of the following substeps:
- Design a database table to hold all the errors
- Have common code in the application that does the following substeps; a minimal sketch follows this list. The common code can be a utility method, a base method, etc., whatever suits the application. (Usually I prefer the base-method option, as it flows naturally into the object-oriented philosophy, but again, that is a detailed design consideration.)
- catches errors,
- extracts the error details, and
- logs the errors to the DB table
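As a minimal sketch of what such a base method might look like, here is a Python version; the `error_log` table name, its columns, and the use of SQLite are all illustrative assumptions, not details of the actual system. The naive hash of the whole trace is refined further below.

```python
import hashlib
import sqlite3
import traceback
from datetime import datetime

# Illustrative schema for the error table (all names are hypothetical)
ERROR_TABLE_DDL = """
CREATE TABLE IF NOT EXISTS error_log (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    occurred_at TEXT NOT NULL,
    error_code  TEXT NOT NULL,     -- code derived from the stack trace
    message     TEXT,
    stack_trace TEXT
)
"""

class ErrorLoggingBase:
    """Base class whose subclasses get catch/extract/log behavior for free."""

    def __init__(self, db_path="errors.db"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(ERROR_TABLE_DDL)

    def run_logged(self, func, *args, **kwargs):
        """Run func; on failure, extract the error details and log them."""
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            trace = traceback.format_exc()
            # Naive error code: hash of the raw trace (refined later)
            code = hashlib.md5(trace.encode("utf-8")).hexdigest()
            self.conn.execute(
                "INSERT INTO error_log (occurred_at, error_code, message, stack_trace) "
                "VALUES (?, ?, ?, ?)",
                (datetime.utcnow().isoformat(), code, str(exc), trace),
            )
            self.conn.commit()
            raise  # let the application's normal error handling continue
```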
Within step 2b (extraction of error details), there are two broad options:
- Manually maintain a database of error codes
- Programmatically create an extract of the error
We focus on option 2, as it works fairly well and, of course, saves a *lot* (a lot a lot) of time compared to option 1. When an exception happens, a stack trace is usually available. Any extraction (hashing) algorithm can be used to create an error code from the trace; a sketch follows the list below. The following enhancements can help:
- Ignore/prune any components of the trace that are not in the application domain
- First create a hash of the module names themselves.
- Using steps 1 and 2, we reach a simplified stack trace of the form:
- M1, F1, line 232
- M2, F2, line 230
- M3, F3, line 400
- etc etc.
- If the stack trace is longer than 10 frames, prune all trace elements except the first 5 and the last 5.
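Here is a sketch of how such an error code could be computed in Python. The `APP_PREFIX` used to decide which frames belong to the application domain, and the exact code format, are assumptions for illustration only.

```python
import hashlib

APP_PREFIX = "myapp"  # hypothetical: frames from our own modules start with this

def error_code_from_trace(frames, max_frames=10, keep=5):
    """frames: list of (module, function, line_no) tuples from the traceback."""
    # 1. Ignore/prune frames that are not in the application domain
    app_frames = [f for f in frames if f[0].startswith(APP_PREFIX)]

    # 2. Hash of the module names by themselves (a coarse error "family")
    module_hash = hashlib.md5(
        "|".join(m for m, _, _ in app_frames).encode("utf-8")
    ).hexdigest()[:8]

    # 3. If the simplified trace is longer than max_frames,
    #    keep only the first and last `keep` frames
    if len(app_frames) > max_frames:
        app_frames = app_frames[:keep] + app_frames[-keep:]

    # Final error code: module hash plus hash of the simplified trace
    simplified = "|".join(f"{m}:{fn}:{ln}" for m, fn, ln in app_frames)
    trace_hash = hashlib.md5(simplified.encode("utf-8")).hexdigest()[:8]
    return f"{module_hash}-{trace_hash}"
```

Two occurrences of the same failure then map to the same code, so later analysis can simply group by error code.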
March 17th, 2008
Kimball and Inmon are often presented as two different data warehouse schools of thought. In reality, however, the choice between the two may turn out to be more a matter of politics (office politics, that is) than a technical question. Let us consider an example scenario.
Consider a large cellphone carrier. The IT organization consists of different divisions: invoicing, marketing, etc. They each maintain different databases, with synchronization processes between them.
While it may make sense to consider an Inmon top-down approach, that may simply not be possible if initial data warehouses already exist in each division. (You may try to convince the two divisions of how they both stand to gain in the long run; good luck with that.) So you may just be stuck with a bottom-up approach, trying to get them to consistent metadata, but there are a few problems with that.
Consider a basic concept, “Customer”. The marketing division may treat a cell phone number (a user) as a customer. After all, each user represents a valid data point for marketing: demographics, usage patterns, advertising response, etc. Invoicing may treat a customer as an account, i.e., only the person who pays (in the case of family accounts). Invoicing may be most interested in how the person pays, when they pay, and their payment patterns. In such a case, if two different data warehouses exist with different interpretations of customer, building consistent metadata first requires a realignment of corporate terminology.
So, while it may be interesting to have a mentally stimulating discussion about Inmon's top-down and Kimball's bottom-up approaches, when you are hauled in as a consultant you will have to make a decision very much taking the internal dynamics into account.
March 6th, 2008
This morning, as I explored the automated filters in my Outlook, I realized that there were 2,720 emails from the last year about errors in one of the production systems. Phew! 2,720 errors! The system must be in an utter mess! Strangely, though, the users have never called about errors. When we do site visits, they have been very happy and have asked us for some refinements and enhancements, but haven't mentioned any errors.
So what are these errors? They are like the errors where, once in a while, Yahoo mail tells you it could not fetch the message; you try two seconds later, and it works. You don't particularly care. Or, once in a while, it tries to automatically save a draft and cannot. You may not even notice this error.
Still, 2,720 errors in a year, hmmm. For a system with 32 million page views a year, this translates to roughly one error per twelve thousand page views. While this may be low, in a production system the onus falls on the system team to analyze it and take some action.
Our eight-step plan is summarized as MAPRICOT:
- M (Monitor): Monitor the system for errors and collect the errors in an analyzable place
- A (Analyze): Analyze the errors, generating a frequency count of errors; nothing fancy, but enough to put things in perspective.
- P (Prioritize): Prioritize which errors seem to be having an impact.
- R (Relate errors with releases): Find if this number has any trend, and correlation with releases. Has a more recent release decreased the average number of errors per day? Increased it?
- I (Improve logging): If the errors are not clear when they are happening, improve logging so that in next release, more data can be collected.
- C (Code improvement): Change/improve any code that can help, for example by putting in another layer of checks (session expired), etc. Also, sometimes the error is simply in the logging; for example, are we logging a communication exception (which can happen at any time from a user's browser) as an application exception?
- O (Offer urgent help): If any errors are causing perceptible data loss, then a workaround needs to be offered right away, even before the error can be fixed in an upcoming release. For example, if one in a thousand customers is getting overcharged or undercharged, those customers need to be identified and their accounts manually corrected.
- T (Train): Assuming the errors are not critical, but are still a user annoyance (for example, a record cannot be saved when using a certain option), offer any user training that can alleviate the situation.
The application error handling mechanism must allow this by logging enough details in persistent storage, such as a database. Luckily, ours does. A sketch of the Analyze and Relate steps follows.
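As a rough illustration of the Analyze (A) and Relate (R) steps, the queries below count errors by code and bucket them by day so trends around releases can be eyeballed. They assume the hypothetical `error_log` table from the earlier sketch; the table and column names are illustrative, not the real system's.

```python
import sqlite3

conn = sqlite3.connect("errors.db")

# A (Analyze): frequency count per error code, most common first
top_errors = conn.execute(
    "SELECT error_code, COUNT(*) AS occurrences "
    "FROM error_log GROUP BY error_code "
    "ORDER BY occurrences DESC LIMIT 20"
).fetchall()

# R (Relate): errors per day, to compare against release dates
errors_per_day = conn.execute(
    "SELECT substr(occurred_at, 1, 10) AS day, COUNT(*) "
    "FROM error_log GROUP BY day ORDER BY day"
).fetchall()

for code, count in top_errors:
    print(code, count)
```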