
## Archive for ‘Analytics & BI’

February 27th, 2012

## Artificial Intelligence

View the entire Rum Raisin Toon book.

January 30th, 2012

## Comparing two confusion matrices

Comparing two confusion matrices is a standard approach for comparing the respective targeting systems, but by no means is it the only one. As we will discuss in the coming days, you can also compare two score based targeting systems by comparing their lists. But for now, let us focus on comparing the targeting systems by comparing their respective confusion matrices.

The standard approach is to use a single value metric to reduce each matrix into one value, and then to compare the metric values.  In other words, to compare M1 and M2, we simply compare f(M1) and f(M2), where function f is the single value metric.

Here are some single value metrics that can be considered as candidates:

1. Kappa Statistic
2. F1 measure
3. Matthews Correlation Coefficient
4. Reward/Cost based
5. Sensitivity (Recall)
6. Specificity (true negative rate) or Precision (positive predictive value), which are two distinct metrics

A related approach is to take the difference of the two matrices and then apply a dot product (or scalar product) with a reward/cost matrix C. It is easy to see that this reduces to the reward/cost based metric, because the dot product is linear: A.C - B.C = (A-B).C.
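As a rough illustration of the single value metrics listed above, here is a minimal Python sketch for a two-label confusion matrix. The `Confusion` type, the example counts in `m1` and `m2`, and the per-cell values in `rewards` are all made up for illustration; they are not taken from any real targeting system. The sketch does not guard against empty rows or columns, which would cause a division by zero.

```python
import math
from typing import NamedTuple


class Confusion(NamedTuple):
    """Cells of a 2x2 confusion matrix (also reused for per-cell rewards)."""
    tp: float  # targeted, and should have been
    fp: float  # targeted, but should not have been
    fn: float  # not targeted, but should have been
    tn: float  # not targeted, and rightly so


def sensitivity(m: Confusion) -> float:
    return m.tp / (m.tp + m.fn)  # recall, true positive rate


def specificity(m: Confusion) -> float:
    return m.tn / (m.tn + m.fp)  # true negative rate


def precision(m: Confusion) -> float:
    return m.tp / (m.tp + m.fp)  # positive predictive value


def f1(m: Confusion) -> float:
    return 2 * m.tp / (2 * m.tp + m.fp + m.fn)


def mcc(m: Confusion) -> float:
    """Matthews Correlation Coefficient."""
    numerator = m.tp * m.tn - m.fp * m.fn
    denominator = math.sqrt(
        (m.tp + m.fp) * (m.tp + m.fn) * (m.tn + m.fp) * (m.tn + m.fn)
    )
    return numerator / denominator


def kappa(m: Confusion) -> float:
    """Cohen's kappa: agreement beyond what chance alone would give."""
    n = sum(m)
    observed = (m.tp + m.tn) / n
    expected = (
        (m.tp + m.fp) * (m.tp + m.fn) + (m.fn + m.tn) * (m.fp + m.tn)
    ) / n**2
    return (observed - expected) / (1 - expected)


def reward(m: Confusion, c: Confusion) -> float:
    """Reward/cost metric: dot product of the counts with per-cell rewards."""
    return sum(count * value for count, value in zip(m, c))


# Hypothetical confusion matrices for two targeting systems.
m1 = Confusion(tp=40, fp=10, fn=20, tn=130)
m2 = Confusion(tp=50, fp=30, fn=10, tn=110)

# Hypothetical rewards (positive) and costs (negative) per cell.
rewards = Confusion(tp=10, fp=-2, fn=-5, tn=0)

# Comparing f(M1) with f(M2) for a few choices of f.
for metric in (kappa, f1, mcc, sensitivity, specificity, precision):
    print(f"{metric.__name__:12s} M1={metric(m1):.3f}  M2={metric(m2):.3f}")

# Linearity: A.C - B.C equals (A-B).C
difference = Confusion(*(a - b for a, b in zip(m1, m2)))
print(reward(m1, rewards) - reward(m2, rewards), reward(difference, rewards))
```

Note that the two systems can rank differently depending on which metric is chosen, which is why the choice of f matters.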

January 16th, 2012

## Targeting Systems vs. Classification Systems

It is generally said that a targeting system is a degenerate classification system with only two labels.  However, this is misleading.  For example, suppose you are trying to identify customers who are good prospects for an upselling opportunity.  So, you run the system and generate a “to call” list.  The system has then separated your universe of customers into “good candidate” and “not good candidate” sets.  Based on the traditional definition, the decision system is then a targeting system.  However, consider the scenario in which you decide to separate the list into three parts based on how strong the prospects are: “very likely”, “somewhat likely” and “not likely”.  Has the same system suddenly stopped being a targeting system, and become a broader classification system?  Of course not, and this example helps us refine the concept of targeting systems vis-à-vis classification systems.  A system can be a multi-level targeting system, which tries to assign one of possibly many levels of action.  It is even possible to have a targeting system with infinitely many levels, by simply requiring the system to output a “percent likelihood” instead of yes/no.

On the flip side, a two-label classification system, which tries to separate a given list of objects into two different labels, shouldn’t really be considered a targeting system.  For example, consider the case in which the classification system is trying to classify a given fossil as belonging to one of the many known dinosaur species.  As the age of the fossil and its other characteristics continue to be analyzed, the choice narrows down to either Struthiomimus or Ornithomimus.  Since it now has two labels, does it suddenly become a targeting system?  We argue that it doesn’t, since the goal is to assign one of the labels to the fossil.  The goal was never to take some action based on that decision.

Thus, the real difference between a classification system and a targeting system is the intent. If you intend to target some objects out of a given set in order to take an action (investigate, upsell, give a free upgrade, stop from boarding the plane), then that is a targeting system.  If you intend to assign one of the labels to the given object, then that is a classification system.

January 15th, 2012

## Defining the center – Mean, Average, Geometric Mean, Median, Mid Range, or something else?

Frequently, the need arises to define the central value of a given population.  While the problem is a bit harder in sociology (for example, what does an average American want?), the problem is easier when considering the scientific world, where the population is given by a sequence of numbers.

Here are some of the many concepts that are used to define the central value.  A central value tries to represent the population, and is almost never perfect.  It may be tricky to decide which of these concepts to use to define the central value.

• Arithmetic mean (also sometimes simply called mean or average) is the SUM of all numbers, divided by n, the count of numbers. For example, the arithmetic mean of 1, 2, 3, 4, 5 is 3.
• Median is the middle entry in the sorted sequence.  For example, the median of 1, 3, 4, 10, 17 is 4.  If there is an even number of values, then take the average of the two middle values.  For example, the median of 1, 3, 8, 10, 21, 25 is the average of 8 and 10, that is, 9.
• Geometric mean is the n-th root of the PRODUCT of all n numbers (that is, the product raised to the power 1/n). For example, the geometric mean of 1, 2, 3, 4, 5 = 5th root of (1 * 2 * 3 * 4 * 5), that is 5th root of (120). That comes to about 2.6.
• Mode is the number that appears most often.  For example, given 1, 1, 2, 2, 2, 3, 3, the mode is 2.
• Range is simply the interval from the minimum to the maximum, and therefore it is not really one central value.  For example, the range of 1, 4, 1000 is [1, 1000].
• Mid-range is the middle value of the range, that is, the average of the minimum and the maximum.  For example, for 1, 4, 1000, the mid-range is 500.5.
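The examples in the list above can be reproduced with Python's standard `statistics` module. This is only a small sketch; `mid_range` is a helper written here for illustration, since the module has no such function, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
import statistics


def mid_range(values):
    """Midpoint between the smallest and the largest value."""
    return (min(values) + max(values)) / 2


arithmetic = statistics.mean([1, 2, 3, 4, 5])
median_odd = statistics.median([1, 3, 4, 10, 17])
median_even = statistics.median([1, 3, 8, 10, 21, 25])
geometric = statistics.geometric_mean([1, 2, 3, 4, 5])
most_common = statistics.mode([1, 1, 2, 2, 2, 3, 3])
value_range = (min([1, 4, 1000]), max([1, 4, 1000]))
middle_of_range = mid_range([1, 4, 1000])

print("arithmetic mean:", arithmetic)
print("median (odd count):", median_odd)
print("median (even count):", median_even)
print("geometric mean:", round(geometric, 2))
print("mode:", most_common)
print("range:", value_range)
print("mid-range:", middle_of_range)
```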

Here are some observations on these central values:

• When you have numbers that are in an arithmetic series, like 1, 2, 3, 4, 5 (step size is 1) or 2, 5, 8, 11, 14 (here step size is 3), then the average is the middle number (or, for an even count, the midpoint of the two middle numbers).  For example, for population [1,2,3,...99,100], the average is 50.5.
• Similarly, for all numbers from 300 to 500, the average is 400. (Obviously).

### When to use Arithmetic Mean versus Geometric Mean

• The arithmetic mean is a good measure when numbers are of the same order of magnitude, like students' scores on a test.
• Geometric mean would be appropriate if the numbers are in different ranges (ballparks) entirely and you do not want one very large number to affect things that much.
• For example, if you have the following numbers: 1, 10, 100, 1000, 10000, the average is more than 2200. But a more appropriate “middle” number is 100 in this case. And 100 is the geometric mean here.
• Real world example: 0.98, 8.7, 121, 1400, 9000. For these 5 numbers, the arithmetic mean is about 2100. That number hardly means anything. The geometric mean is about 105, which represents more of a central point.
• Interestingly, the median is also a good measure in the previous case. But one problem with the median is that if the largest number (9000) were changed to 10,000 or 10 million, the median would still be 121, and would not change at all. The geometric mean, however, would increase moderately. The arithmetic mean would change TOO much.
• Yet another scenario where the geometric mean is more appropriate than the arithmetic mean is when the numbers are given as percentage increases or decreases, rather than absolute values.  For example, if the housing market rose 40% one year and dropped 40% the next year, then an appropriate representation of the average growth rate can be found by taking the geometric mean of 1.4 and 0.6, which comes to about 0.92, that is, about an 8% drop per year.  The arithmetic mean would have conveyed a 0% average change.
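To make the comparison above concrete, here is a short Python sketch that uses the numbers from these bullets. The `summarize` helper and the variable names are introduced here only for illustration.

```python
import statistics


def summarize(values):
    """Return three candidate central values for the same population."""
    return {
        "arithmetic": statistics.mean(values),
        "geometric": statistics.geometric_mean(values),
        "median": statistics.median(values),
    }


# Numbers spread across several orders of magnitude.
powers_of_ten = summarize([1, 10, 100, 1000, 10000])

# The "real world" example, before and after inflating the largest value.
before = summarize([0.98, 8.7, 121, 1400, 9000])
after = summarize([0.98, 8.7, 121, 1400, 10_000_000])

# Average yearly growth factor for +40% followed by -40%.
# For two values, the geometric mean is sqrt(1.4 * 0.6).
growth_factor = statistics.geometric_mean([1.4, 0.6])

for name, result in (("powers of ten", powers_of_ten),
                     ("before", before),
                     ("after", after)):
    print(name, {key: round(value, 1) for key, value in result.items()})
print("average growth factor:", round(growth_factor, 3))
```

Comparing `before` with `after` shows the three behaviors side by side: the median does not move, the geometric mean moves somewhat, and the arithmetic mean is dominated by the single large value.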


January 5th, 2012

## Ethical/Policy issues concerning Targeting Systems

Targeting systems can be used to target a wide variety of “things”: people, tax filings, medical claims, flights, customers, and many others. For example, a targeting system can be used to find which tax filings (or medical claims) should be reviewed (targeted) for errors (or fraud). It is just as easy to envision a targeting system that reviews customers arriving at a department store in order to provide individual assistance. Casinos have been doing this for years to find the “high rollers”. The travel industry does this to find travelers who may be given upgrades.

Frequently though, targeting systems target people. And in even more specific cases, the system doing the targeting belongs to the government. It is in those cases that ethical, policy and privacy issues most commonly emerge in the public domain. One obvious reason is that if the targeting system belongs to the government, then the government may be forced to declare more specifics of the system (hotels and airlines may have no obligation to disclose the systems they use to identify high profile travelers). As an example, the US Food and Drug Administration has been publishing import alerts and import bulletins for many years. (Incidentally, the FDA’s risk targeting system is called PREDICT, and the agency has a YouTube video about it.)

An interesting example of such a scenario is the Automated Targeting System (ATS), a US DHS system applied to every person who crosses the US borders. Although first implemented in the late 1990s, the system was first discovered by the public in November 2006, when a mention of it appeared in the Federal Register. Since then, the system has been subject to many lawsuits, primarily from the ACLU and citizens concerned about their privacy. Bruce Schneier, author of books on privacy and computer security (including “Liars and Outliers: Enabling the Trust that Society Needs to Thrive”), wrote about ATS:

There is something un-American about a government program that uses secret criteria to collect dossiers on innocent people and shares that information with various agencies, all without any oversight. It’s the sort of thing you’d expect from the former Soviet Union or East Germany or China. And it doesn’t make us any safer from terrorism.

Outside of ATS, the broader questions are:

• Is it acceptable for the government to maintain “risk” files on its innocent citizens, especially as those files contain the output from a targeting system that is not transparent? (the storage aspect of those files is the main concern here)
• Specifically, is it acceptable for the government to use an obscure and opaque risk targeting system? (the opaqueness of the system is the main concern here)

It is also worthwhile to consider one simple risk targeting system that everyone is largely OK with today – that system is the X-ray machine that we all use when going into secure facilities and airports etc. People have largely come to accept that machine, and are also comfortable with the simplicity and the transparency (no pun intended) of that specific risk targeting system.
