Monthly Archives: November 2010

ARIMA and Seasonality Adjustment Support in PMML 4.0

PMML 4.0 (released in 2009) supports Time Series Models, including placeholders for ARIMA and seasonality adjustment. This matters a great deal for applying PMML to forecasting unemployment and other economic indicators. A more subtle use may be applying social networking statistics to indicators that are seasonal in nature, such as book sales. A few firms currently provide social networking data for better prediction and for allocating support staff and advertising budgets. However, the websites of such providers are (justifiably) heavy on marketing information and light on statistical details of their data feeds. So one question is whether they already take seasonality into account before providing the data feed to their customers. Alternatively, customers can account for seasonality themselves, but that depends on the customer having a large enough data set to do the analysis.
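To make the customer-side option concrete, here is a minimal sketch of additive seasonal adjustment in Python: estimate an average offset for each position in the seasonal cycle and subtract it. The function names, the quarterly period, and the sales figures are all assumptions made up for illustration, and the sketch assumes the series covers whole cycles. Real seasonal adjustment is considerably more involved than this.

```python
from statistics import mean

def seasonal_indices(series, period):
    """Average deviation from the overall mean for each position in the cycle."""
    overall = mean(series)
    return [mean(series[i::period]) - overall for i in range(period)]

def seasonally_adjust(series, period):
    """Subtract the additive seasonal index from every observation."""
    idx = seasonal_indices(series, period)
    return [x - idx[i % period] for i, x in enumerate(series)]

# Hypothetical quarterly book sales over two years, with a spike every Q4.
sales = [100, 110, 105, 160, 104, 114, 109, 164]
adjusted = seasonally_adjust(sales, period=4)
print(adjusted)
```

With these made-up numbers, the adjusted series is flat within each year and only the year-over-year increase remains, which is the part a forecaster would usually want to see.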

The bigger question, though, is that PMML 4.0 only has placeholders for ARIMA and SeasonalTrendDecomposition, and I am not entirely clear on whether (and if so, how) PMML 4.0 consuming services will interpret these time series models. In either case, the standard specifically says that these are placeholders for the future.

In terms of XML structure, a time series model is simply one more model choice in the PMML XML, and as before, the consumer can choose to use either all models or only the first model. (Btw, this ambiguity in implementations is my pet peeve regarding the PMML structure.)
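As a rough illustration of that ambiguity, the following Python sketch lists the model elements in a PMML document and shows the two choices a consumer could make. The document here is a hypothetical, heavily trimmed stand-in (it would not validate against the schema), and the helper names are my own. I am also assuming that every top-level child other than the handful of non-model elements listed in the code is a model.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PMML document containing two models.
PMML_DOC = """<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header/>
  <DataDictionary/>
  <RegressionModel functionName="regression"/>
  <TimeSeriesModel functionName="timeSeries" bestFit="ARIMA"/>
</PMML>"""

# Top-level children that are not models (assumption: everything else is).
NON_MODEL = {"Header", "MiningBuildTask", "DataDictionary",
             "TransformationDictionary", "Extension"}

def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]

def model_elements(root):
    """Return the top-level model elements in document order."""
    return [child for child in root if local_name(child.tag) not in NON_MODEL]

root = ET.fromstring(PMML_DOC)
models = model_elements(root)

all_names = [local_name(m.tag) for m in models]  # consumer that uses all models
first_name = all_names[0]                        # consumer that uses only the first
print(all_names, first_name)
```

A consumer that only looks at the first model would never see the time series model in this document, which is exactly why the choice matters.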

Here is one example PMML Time Series Model XML element (for simplicity, the elements that are not specific to the time series model have been left out):

<!-- Header, Application, Timestamp, DataDictionary as before -->
<TimeSeriesModel functionName="timeSeries" bestFit="ARIMA">
<!-- MiningSchema -->

Inconsistencies in PMML 4.0 Spec

Perhaps this is as good a place as any to point out a few inconsistencies between the official PMML 4.0 spec and the official example in the same doc:

  1. The spec says that the “bestFit” attribute of TimeSeriesModel is required, but the example does not have that attribute.
  2. The spec places the TimeAnchor element inside TimeSeries (rather than TimeSeriesModel), but the example has this element directly inside TimeSeriesModel.
  3. The spec has some typos (such as dateTimeMillisecdondsSince) that are not repeated in the example.

I do not know for sure whether the spec or the example is correct in each of these cases, although I am confident that the typo is just that – a typo, and the example is correct in that case.
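The first two inconsistencies listed above can be checked mechanically. Here is a small Python sketch that reports whether a TimeSeriesModel element carries a bestFit attribute and which element is the parent of each TimeAnchor. The fragment is a hypothetical stand-in shaped like the situation described above, not the actual example from the spec, and the function name is my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: no bestFit attribute, and TimeAnchor placed
# directly under TimeSeriesModel rather than inside TimeSeries.
FRAGMENT = """<TimeSeriesModel functionName="timeSeries">
  <TimeSeries/>
  <TimeAnchor/>
</TimeSeriesModel>"""

def check_time_series_model(model):
    """Report bestFit presence and the parent tag of every TimeAnchor."""
    anchor_parents = [
        parent.tag
        for parent in model.iter()   # the model itself plus all descendants
        for child in parent          # direct children of each element
        if child.tag == "TimeAnchor"
    ]
    return {
        "has_bestFit": "bestFit" in model.attrib,
        "TimeAnchor_parents": anchor_parents,
    }

report = check_time_series_model(ET.fromstring(FRAGMENT))
print(report)
```

This does not settle which of the spec or the example is right. It only shows how a consumer could detect which of the two shapes a given document follows.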