To SSAS or Not to SSAS

This is a case where a data mining tool seemed like the obvious approach but was bested by standard algorithms and careful analysis of the data.

A local computer hardware retailer approached the Alexandra Institute with an interesting problem: they receive a lot of data files from their suppliers, but the data is not entirely consistent, nor is it necessarily correct (from a computer's narrow perspective). A supplier data file contains, amongst other things, the supplier's best guess at who manufactured the particular piece of hardware, but this is not always correct. The retailer wanted to automate this procedure and thought that data mining using Microsoft's SQL Server Analysis Services (SSAS) could be the answer.

I went into this project with a feeling that SSAS might not do the trick. And I was right. No matter how I tweaked the parameters and the data, the hit rate would not go higher than 87%. That is still decent, and the best hit rate came from a decision tree. By looking at that tree I found that it basically boiled down to one big if-statement which only looked at one particular input value:

if inputvalue = … then result = a
else if inputvalue = … then result = b
else if …

...you get the picture.

After analysing the data I found that in most cases the supplier's guess at a manufacturer was correct (in about 87% of the cases, what a coincidence), although it didn't quite match the list of "legal" manufacturer names that the retailer had ("Hewlett & Packard" versus "Hewlett-Packard", for example).
The data from the suppliers also contains a short description of the particular piece of hardware, most often in English, but German descriptions were also present. It turned out that by using algorithms like Levenshtein Distance, Longest Common Substring and Longest Common Subsequence on both the supplier's guess at a manufacturer and the description, I could find matches in the list of legal manufacturer names with a hit rate above 96%. The last few percent are down to noisy data, since the (manually maintained) list of legal manufacturers contains entries like "Hewlett Packard", "HP" and "Hewlett Packard Toner".
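To illustrate the idea, here is a minimal sketch of matching a supplier's manufacturer guess against a list of legal names by Levenshtein (edit) distance. The name list and the helper names are my own illustrative assumptions, not the retailer's actual data or code.

```python
# Sketch: pick the "legal" manufacturer name closest to the supplier's
# guess by Levenshtein (edit) distance. Illustrative data only.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ca != cb)))      # substitution
        prev = curr
    return prev[-1]

def best_match(guess: str, legal_names: list[str]) -> tuple[str, int]:
    """Return the legal name with the smallest edit distance to the guess."""
    return min(((name, levenshtein(guess.lower(), name.lower()))
                for name in legal_names),
               key=lambda pair: pair[1])

legal = ["Hewlett-Packard", "Samsung", "Kingston"]
print(best_match("Hewlett & Packard", legal))  # ('Hewlett-Packard', 3)
```

In practice one would also want a distance threshold (to refuse matches that are merely the least bad), and combining this with Longest Common Substring/Subsequence, as described above, helps where the strings differ in length rather than in spelling.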

So Far, So Good

That was the easy part of the problem. The retailer also wanted an automated process that could put each hardware item in a group that is searchable on their website. "Microsoft Office 2007" should be in the group "Applications", "HP 146 GB SAS 15K Universal Hard Drive" should be in the group "Hard drives - internal" et cetera.
No matter how much I tweaked, twisted and mistreated SSAS, I could not get a hit rate above 52%. One of the problems with tools like SSAS is that they only operate on discrete and continuous values; a neural network in SSAS, for example, does not look at a discrete string value and see that "236-72AX8G" looks rather similar to "23672AX8G".
So a model was built that allowed the algorithm to determine whether we have seen the particular item before, whether we have seen something similar, and how many times we have seen it. The algorithm then outputs its guess at a group along with a score indicating how confident that guess is. This is used to further refine the model: a user looks at the values and determines whether or not the guess was correct, and correct values are fed back into the model.
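The loop described above can be sketched roughly as follows. This is my own hedged reconstruction, not the actual implementation: the class name, the normalization rule, and the use of a similarity ratio as the confidence score are all illustrative assumptions.

```python
# Sketch of the feedback loop: a memory of (normalized item text -> group)
# observations, a nearest-neighbour guess with a crude confidence score,
# and user-confirmed answers fed back in. Illustrative assumptions only.
from collections import Counter
from difflib import SequenceMatcher

class GroupModel:
    def __init__(self):
        self.seen: dict[str, Counter] = {}  # normalized text -> group counts

    @staticmethod
    def normalize(text: str) -> str:
        # Drop case and punctuation so "236-72AX8G" and "23672AX8G" compare equal.
        return "".join(ch for ch in text.lower() if ch.isalnum())

    def guess(self, description: str):
        key = self.normalize(description)
        if key in self.seen:  # seen this exact (normalized) item before
            group, count = self.seen[key].most_common(1)[0]
            return group, count / sum(self.seen[key].values())
        # otherwise: fall back to the most similar item we have seen
        best, score = None, 0.0
        for other, groups in self.seen.items():
            sim = SequenceMatcher(None, key, other).ratio()
            if sim > score:
                best, score = groups.most_common(1)[0][0], sim
        return best, score

    def confirm(self, description: str, group: str) -> None:
        # A user-verified answer refines the model.
        self.seen.setdefault(self.normalize(description), Counter())[group] += 1

model = GroupModel()
model.confirm("HP 146 GB SAS 15K Universal Hard Drive", "Hard drives - internal")
print(model.guess("HP 146GB SAS 15K Hard Drive"))  # group plus a similarity score
```

The score gives the retailer a natural cut-off: high-confidence guesses can be accepted automatically, while low-confidence ones are routed to a human whose answer then improves the model.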


I tested it with a number of data set sizes used to build the initial model. The chart shows how well it performs (percent hit rate) as a function of the model size (percent of the full data set). Note that the chart measures performance against the full data set, which includes the cases used to build the model; the error rate will be higher for new, unseen data, but the hit rate will grow as the model is gradually refined.

The conclusion is: use the right tool for the job. Not all that looks like a duck and quacks like a duck is suitable for the dinner table. While tools like SSAS can be great for some problems, careful analysis and some old-fashioned code can solve problems unsuited for data mining tools.

