
The Danish Voter Classifier



A graphic of Danish election results produced by an unknowing statistician can be used to explain the fundamental concepts of machine learning.

For those new to the concepts of machine learning, it’s really instructive to study the figure below, which shows Danish voters’ tendency to vote for either the socialist (red) or liberal-conservative (blue) political bloc in the last parliamentary election. In fact, it constitutes what’s known as a classifier: Given the home address of a voter, it predicts which bloc he or she will vote for.


Danish voters’ tendency to vote for political blocs depending on
home address. Based on a graphic from Berlingske Tidende.

That’s what a classifier does: Given some information about something, the classifier predicts which class the something belongs to. And just like all classifiers based on machine learning, this classifier is produced from the statistical exploration of some training data, in this case the 2011 election data. I don’t know for sure, but presumably a statistician has applied some mathematical method to the election data to calculate where the straight line separating the red and the blue areas should be. It’s this process that’s called learning in machine learning. Notice that learning must take place before the classifier can be used, and that learning requires a set of training data to learn from. This is the case for all machine learning applications.
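To make the learning step concrete, here is a minimal sketch in Python. It is not the statistician's method, which I don't know; it assumes a simple perceptron as the learning rule, and the map coordinates, the labelling rule and all function names (`make_training_data`, `learn_line`, `classify`) are invented for illustration.

```python
import random

def make_training_data(n=200, seed=0):
    """Synthetic 'voters': (x, y) map coordinates with a bloc label.
    The labelling rule is invented for illustration, not real election data."""
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        x, y = rng.random(), rng.random()
        score = x + y - 1.0
        if abs(score) < 0.05:
            continue  # leave a small gap so that a straight line can separate the classes
        data.append(((x, y), "red" if score > 0 else "blue"))
    return data

def learn_line(data, max_epochs=5000):
    """Learning: nudge the line w0*x + w1*y + b = 0 after every misclassified voter."""
    w0 = w1 = b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x, y), label in data:
            target = 1 if label == "red" else -1
            if target * (w0 * x + w1 * y + b) <= 0:
                w0 += target * x
                w1 += target * y
                b += target
                mistakes += 1
        if mistakes == 0:
            break  # every training example is on the correct side of the line
    return w0, w1, b

def classify(line, point):
    """Using the classifier: which side of the learned line is the address on?"""
    w0, w1, b = line
    x, y = point
    return "red" if w0 * x + w1 * y + b > 0 else "blue"

training_data = make_training_data()
line = learn_line(training_data)  # learning happens first...
correct = sum(classify(line, p) == label for p, label in training_data)
training_accuracy = correct / len(training_data)
print(f"training accuracy: {training_accuracy:.2f}")
print("prediction for (0.2, 0.3):", classify(line, (0.2, 0.3)))  # ...then prediction
```

Note the two separate phases: `learn_line` needs the whole training set, while `classify` only needs the learned line and one address.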

The second point that one can make from this figure concerns the difficulties in achieving good accuracy. Frankly, that figure is a lousy classifier, because in reality, there are many socialist voters in the blue region and many liberal/conservative voters in the red region, so the classification accuracy is not very high. The problems here essentially boil down to:

  1. The home address of a voter alone doesn’t really say that much about his or her political standpoints. One needs to know more: Income, education, age, gender, IQ, favourite pet... any variables that might correlate with political views should be taken into account. In general, selecting these pieces of information about the object to be classified (they’re called features in machine learning jargon) is often both crucial and difficult and requires considerable domain knowledge to achieve good results.
  2. A single straight line is too simple a separator. It’s surely possible to define areas of Denmark where one or the other political bloc is dominant, but their borders would be curved, and it would be necessary to use several distinct lines to represent them. Fortunately (maybe), the machine learning literature is packed with different models for classification: Linear discriminants (that’s the single straight line), logistic regression models, neural networks, support vector machines, etc., and each of these models allows for different possibilities to fit separators to the data.
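Both problems can be seen in a toy example. The sketch below assumes an invented map with a "red" city centre surrounded by "blue" countryside, and again uses a simple perceptron as the learner; the data and the function names are hypothetical. A straight line on the raw coordinates cannot separate the two classes, but adding one well-chosen feature (the squared distance from the centre) lets the very same linear model get every training example right.

```python
def make_city_data():
    """Toy map on a grid: a 'red' city centre surrounded by 'blue' countryside.
    Invented for illustration; the ring between the two areas is left empty."""
    data = []
    for i in range(-10, 11):
        for j in range(-10, 11):
            d2 = i * i + j * j  # squared distance from the centre, in grid units
            if d2 <= 25:
                data.append(((i / 10, j / 10), "red"))
            elif d2 >= 64:
                data.append(((i / 10, j / 10), "blue"))
    return data

def train_perceptron(samples, max_epochs=1000):
    """samples: list of (feature_tuple, label). Returns the weights, bias last."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for features, label in samples:
            target = 1 if label == "red" else -1
            inputs = list(features) + [1.0]  # constant input for the bias term
            if target * sum(wi * xi for wi, xi in zip(w, inputs)) <= 0:
                w = [wi + target * xi for wi, xi in zip(w, inputs)]
                mistakes += 1
        if mistakes == 0:
            break  # all training examples are classified correctly
    return w

def accuracy(w, samples):
    """Fraction of samples for which the linear model predicts the right label."""
    correct = 0
    for features, label in samples:
        inputs = list(features) + [1.0]
        score = sum(wi * xi for wi, xi in zip(w, inputs))
        correct += ("red" if score > 0 else "blue") == label
    return correct / len(samples)

data = make_city_data()
# Same voters, one extra feature: squared distance from the city centre.
enriched = [((x, y, x * x + y * y), label) for (x, y), label in data]

accuracy_raw = accuracy(train_perceptron(data), data)
accuracy_enriched = accuracy(train_perceptron(enriched), enriched)
print(f"straight line on (x, y):       {accuracy_raw:.2f}")
print(f"with the distance feature too: {accuracy_enriched:.2f}")
```

In this toy case the right feature rescues a simple model. On real data one usually has to work on both the features and the model.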

Of course, I’m not blaming the unknowing statistician for bad machine learning craftsmanship, because the figure wasn’t made for that purpose. However, for real machine learning applications, proper feature selection and model selection are crucial for achieving high accuracy, and together with the data collection, they are the most important development activities. We’ll surely get back to these concepts repeatedly in the postings to come.



Comments

  1. Klaus Marius Hansen wrote:

    Good explanation. It would be nice to have a down-to-earth explanation of different (real) classifiers, possibly with applications/examples. Hopefully, you will get to that :-)
  2. Jerker Hammarberg wrote:

    Thanks, and your request has been noted! I've promised to stay rather non-technical on this blog, but that said, sometimes it's necessary to understand a bit how the classifiers work in order to understand why they work and why they don't for different applications. So yes, I will get to that - at least to some degree.




InfinIT is funded by a grant from Styrelsen for Forskning og Uddannelse (the Danish Agency for Science and Higher Education) and is run by a consortium consisting of:
Alexandra Instituttet . BrainsBusiness . CISS . Datalogisk Institut, Københavns Universitet . DELTA . DTU Compute, Danmarks Tekniske Universitet . Institut for Datalogi, Aarhus Universitet . IT-Universitetet . Knowledge Lab, Syddansk Universitet . Væksthus Hovedstadsregionen . Aalborg Universitet