The Danish Voter Classifier

A graphic of Danish election results produced by an unknowing statistician can be used to explain the fundamental concepts of machine learning.

For those new to machine learning, it’s instructive to study the figure below, which shows Danish voters’ tendency to vote for either the socialist (red) or the liberal-conservative (blue) political bloc in the last parliamentary election. In fact, it constitutes what’s known as a classifier: given the home address of a voter, it predicts which bloc he or she will vote for.


Danish voters’ tendency to vote for political blocs depending on
home address. Based on a graphic from Berlingske Tidende.

That’s what a classifier does: given some information about something, the classifier predicts which class that something belongs to. And just like all classifiers based on machine learning, this one is produced by statistical exploration of some training data - in this case the 2011 election data. I don’t know for sure, but presumably a statistician has applied some mathematical method to the election data to calculate where the straight line separating the red and the blue areas should be. It’s this process that’s called learning in machine learning. Notice that learning must take place before the classifier can be used, and that learning requires a set of training data to learn from. This is the case for all machine learning applications.
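To make the two steps concrete, here is a minimal sketch in Python of learning first and predicting afterwards. Everything in it is an assumption made for illustration: the coordinates are invented rather than taken from the 2011 election data, the names `learn` and `predict` are hypothetical, and the method (compare the distance to each bloc's average position) is just one simple way to end up with a straight-line separator, not necessarily what the statistician did.

```python
# Toy training data: (east_km, north_km) home coordinates with a known bloc.
# All numbers are invented for illustration; they are not election data.
TRAINING_DATA = [
    ((1.0, 2.0), "red"),
    ((2.0, 1.0), "red"),
    ((1.5, 1.5), "red"),
    ((6.0, 7.0), "blue"),
    ((7.0, 6.0), "blue"),
    ((6.5, 6.5), "blue"),
]


def learn(training_data):
    """Learning step: compute the average position (centroid) of each class."""
    sums = {}
    for (x, y), label in training_data:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, address):
    """Prediction step: pick the class whose centroid is closest.

    With two classes, the border between the two answers is a straight line.
    """
    x, y = address
    return min(
        centroids,
        key=lambda label: (x - centroids[label][0]) ** 2
        + (y - centroids[label][1]) ** 2,
    )


# Learning has to happen before the classifier can be used.
model = learn(TRAINING_DATA)
print(predict(model, (2.0, 2.0)))  # prints "red"
print(predict(model, (6.0, 6.0)))  # prints "blue"
```

Note that `predict` never looks at the training data itself, only at the two centroids that `learn` produced. That small summary is the "classifier" in this sketch.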

The second point that one can make from this figure concerns the difficulty of achieving good accuracy. Frankly, that figure is a lousy classifier, because in reality there are many socialist voters in the blue region and many liberal-conservative voters in the red region, so the classification accuracy is not very high. The problems here essentially boil down to:

  1. The home address of a voter alone doesn’t really say that much about his or her political standpoint. One needs to know more: income, education, age, gender, IQ, favourite pet... any variable that might correlate with political views should be taken into account. In general, selecting these pieces of information about the object to be classified (they’re called features in machine learning jargon) is often both crucial and difficult, and it requires considerable domain knowledge to achieve good results.
  2. A single straight line is too simple a separator. It’s surely possible to define areas of Denmark where one or the other political bloc is dominant, but their borders would be curved, and it would be necessary to use several distinct lines to represent them. Fortunately (maybe), the machine learning literature is packed with different models for classification: linear discriminants (that’s the single straight line), logistic regression models, neural networks, support vector machines, etc., and each of these models offers different possibilities for fitting separators to the data.
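The second problem can be illustrated with a small Python sketch. It assumes an invented map with a red city centre surrounded by blue countryside, so no single straight line can separate the two blocs well. The straight-line model is compared with a 1-nearest-neighbour classifier, which is not one of the models listed above but is used here because it fits in a few lines and can follow curved borders. All data points and function names are hypothetical.

```python
# Invented toy data: a "red" city centre surrounded by "blue" countryside.
TRAIN = [
    ((5.0, 5.0), "red"), ((4.5, 5.0), "red"), ((5.5, 5.0), "red"),
    ((5.0, 4.5), "red"), ((5.0, 5.5), "red"),
    ((1.0, 5.0), "blue"), ((5.0, 1.0), "blue"), ((5.0, 9.0), "blue"),
    ((9.0, 5.0), "blue"), ((2.0, 2.0), "blue"), ((2.0, 8.0), "blue"),
    ((8.0, 2.0), "blue"),
]
# Separate voters, not used for learning, to measure accuracy on.
TEST = [
    ((5.2, 5.2), "red"), ((5.2, 4.9), "red"),
    ((1.5, 4.5), "blue"), ((8.5, 5.5), "blue"),
    ((5.5, 8.5), "blue"), ((4.5, 1.5), "blue"),
]


def sq_dist(a, b):
    """Squared straight-line distance between two points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def fit_centroids(data):
    """Learn one average position per class (gives a straight-line border)."""
    sums = {}
    for (x, y), label in data:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict_linear(centroids, point):
    """Straight-line model: choose the class with the closest centroid."""
    return min(centroids, key=lambda label: sq_dist(point, centroids[label]))


def predict_1nn(data, point):
    """1-nearest-neighbour model: copy the label of the closest training voter."""
    _, nearest_label = min(data, key=lambda item: sq_dist(point, item[0]))
    return nearest_label


def count_correct(predict, test_data):
    """Number of test voters whose bloc is predicted correctly."""
    return sum(1 for point, label in test_data if predict(point) == label)


centroids = fit_centroids(TRAIN)
linear_correct = count_correct(lambda p: predict_linear(centroids, p), TEST)
knn_correct = count_correct(lambda p: predict_1nn(TRAIN, p), TEST)
print(f"straight line: {linear_correct} of {len(TEST)} correct")
print(f"1-nearest neighbour: {knn_correct} of {len(TEST)} correct")
```

On this toy map the straight line gets 4 of the 6 test voters right, while the more flexible model gets all 6. That says nothing about real election data; it only shows how the choice of model limits which borders can be represented.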

Of course, I’m not blaming the unknowing statistician for bad machine learning craftsmanship, because the figure wasn’t made for that purpose. However, for real machine learning applications, proper feature selection and model selection are crucial for achieving high accuracy, and together with data collection they are the most important development activities. We’ll surely get back to these concepts repeatedly in the posts to come.

Stories of Machine Learning Applications

The Alexandra Institute has seen an increasing interest in machine learning among our clients in recent years. That’s why we are starting this blog on machine learning applications. In this blog, we will tell stories about machine learning applications and projects that we have come across.

A few months ago, I got an email from my colleague over at the communications department, in which she asked if we, the tech nerds at the Copenhagen branch of the Alexandra Institute, would like to start a blog on machine learning. I did a quick search on the Internet for machine learning blogs, and there were already a few of them - at least of the highly technical sort, written by experts for other experts. So that’s been done.


Tech nerds: Christian, myself and Morten. 

On the other hand, we’ve seen an increasing interest in machine learning applications from our clients in the last few years, not least because of the booming smartphone market. Most of these clients don’t even know what machine learning is - they just have an idea for a cool application and a mysterious feeling that the idea might be difficult to implement, so they turn to us to sort things out for them. Often it turns out that their applications have research potential, and then we involve machine learning researchers from the universities in the project.

So, given this interest from our clients, given all the potentially cool applications, and given the remarkable research progress in the last few years (which has made machine learning applications so accurate that they’re actually usable and no longer frustrating), we decided to start a blog with stories about how machine learning can be used, plain and simple. Some of these stories will be inspirational, some will be more educational. Some stories will be about our own projects, some will be about other projects from all over the world. But all of them will focus on applications rather than technical details, and all of them will be written with the non-expert reader in mind. With these stories, we want to inform and inspire, hopefully contributing to the innovation of even more cool applications.

And if you have an idea for a cool application, or you just have anything machine learning-ish to say - don’t hesitate to comment on the posts or to contact me directly. You’ll find my contact information by following the link on the About page.

InfinIT is funded by a grant from Styrelsen for Forskning og Uddannelse and is run by a consortium consisting of:
Alexandra Instituttet . BrainsBusiness . CISS . Datalogisk Institut, Københavns Universitet . DELTA . DTU Compute, Danmarks Tekniske Universitet . Institut for Datalogi, Aarhus Universitet . IT-Universitetet . Knowledge Lab, Syddansk Universitet . Væksthus Hovedstadsregionen . Aalborg Universitet