What are supervised, unsupervised and reinforcement learning?

Read time: 6 mins (plus 10 minutes of cheesy song videos, optional)

Supervised learning. Unsupervised. Reinforcement. These are all terms that you might have come across, particularly as AI becomes more widespread in digital advertising. This is because programmatic advertising - the automated buying and selling of online ad space - is especially driven by algorithms that might use one of these modes of machine learning.

It was while developing a piece explaining where AI solutions may (or may not) be suitable for business challenges, that we realised we should actually explain what we mean by these terms. There are plenty more definitions online, but ours is, as far as we know, the only one that illustrates each type using late 20th century mainstream song titles.

Supervised learning: Somebody’s watching me

Supervised learning is when an algorithm is trained using data sets that are labelled - that is, where the answer is known.

A good example is email spam. Let’s say you have a large amount of email and you know whether each email is spam, or not spam (that is, you have labelled each email as ‘spam’ or ‘not spam’, or maybe ‘yes’ or ‘no’, or even ‘1’ or ‘0’). By feeding your data into an algorithm, you’re effectively showing it what the expected output is for each input - that is, you’ve shown it the kinds of emails that it should classify as spam, and the kinds it should not classify as spam.

The algorithm can then start to correlate those inputs with those outputs, and figure out for itself what kinds of emails tend to be classed as spam. It looks at factors such as the amount of text, certain specific words, the domain from where the email came, and so on. The output? Quite simply, a classification: spam/not spam, yes/no, 1/0.
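The spam example above can be sketched in a few lines of plain Python. This is a minimal, illustrative take on the idea - counting which words appear in labelled emails, then scoring a new email against each label (a naive Bayes approach, with the example emails invented for illustration):

```python
import math
from collections import Counter

def train(examples):
    """Count word frequencies per label from labelled (text, label) pairs."""
    counts, totals = {}, Counter()
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    """Pick the label with the higher (Laplace-smoothed) log posterior."""
    vocab = set().union(*counts.values())
    best, best_score = None, -math.inf
    for label, words in counts.items():
        score = math.log(totals[label] / sum(totals.values()))  # log prior
        n = sum(words.values())
        for w in text.lower().split():
            score += math.log((words[w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

emails = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("lunch on friday?", "not spam"),
]
counts, totals = train(emails)
print(classify("claim your free money", counts, totals))  # → spam
```

The key point is that the "learning" is entirely driven by the labels you supplied: the classifier never decides what spam *is*, only which label a new email most resembles.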

Another example is sentiment analysis, often used in social media analytics. By showing an algorithm the content of tweets, and labelling each of those tweets as positive, negative or neutral, the algorithm can then learn how to predict the sentiment of a tweet.

For ad tech, a good example is user response prediction. If you want to know whether a user is likely to engage with an ad, for example clicking the ad or installing the app, then you can train a system with lots of positive and negative examples. By showing it data relating to the ad - the creative, the time, the date and so on - and telling it whether a user engaged with it, the algorithm can then predict whether a user, when shown a new ad, will engage.

Unsupervised learning: Everybody needs good neighbours

Unsupervised learning is when an algorithm is trained using data sets that are not labelled - that is, where there is no real ‘right’ or ‘wrong’ answer.

Imagine you have data describing visitors to your website or app, but you really have no idea how to group those audiences. You can try to do this with rules - for example, if someone visits a car website three times in 30 days you could label them ‘car enthusiasts’.

However, these rules may not always apply. The day may come when there are no more car enthusiasts: all the kids are more interested in Fortnite (that day has already come), and cars will be driverless.

So instead of maintaining huge numbers of rules, trying to invent new groups or shoehorn visitors into existing groups, an unsupervised algorithm can identify groups for you.

It’s unsupervised because the algorithm isn’t taught whether an answer is right or wrong, as it is with supervised learning. This is because there really isn’t a ‘right’ or ‘wrong’ answer. All you know is, there will be groups. So the system looks for areas - clusters - where variables correlate closely (that is, they’re neighbours - see what we did there?).

So even in changing times, your unsupervised learning algorithm will still tell you what categories of behaviour your website or app visitors are displaying. You don’t know in advance what the categories will be, and neither does the AI. Once you do know, however, you can better serve those visitors’ needs - or, indeed, use the groups you’ve identified as input into other algorithms, such as the user response prediction algorithm described earlier, where you can now ask whether a particular group is more or less likely to engage with an ad.
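One common clustering technique is k-means, which can be sketched in plain Python. The "visitor" numbers below are invented for illustration - each point is, say, (visits per week, minutes per visit) - and the algorithm finds two groups without ever being told what they mean:

```python
import math

def kmeans(points, k, iters=20):
    """Minimal k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its cluster. Naive init:
    the first k points serve as starting centroids."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
              for p in points]
    return labels, centroids

# invented visitor data: (visits per week, minutes per visit)
visitors = [(1, 2), (12, 30), (2, 3), (11, 28), (1, 1), (13, 31)]
labels, centroids = kmeans(visitors, k=2)
print(labels)  # casual visitors and heavy visitors fall into separate groups
```

Note that the algorithm only outputs group numbers; deciding that one cluster is ‘casual browsers’ and another is ‘enthusiasts’ is still up to you.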

Reinforcement learning: I get knocked down, but I get up again

Reinforcement learning is great for situations where you want an algorithm not just to provide analysis, but to actively explore. You tell it what the end goal is and describe the environment in which it’s operating. Then, through trial and error - guided by feedback from that environment on whether a particular course of action was successful - it figures out a way to achieve that goal.

Games are a great example for understanding reinforcement learning. DeepMind’s AlphaGo system was told the valid moves it can make in the board game Go, and what it’s trying to achieve within that environment - that is, to end the game having surrounded more territory than its opponent. From that, it developed a strategy that beat the world champion.

Reinforcement learning was the best way to do this: brute force just wouldn’t cut it because, unlike chess, Go has far too many possible board positions to evaluate exhaustively.

Robots are another good example. They’re told about the world around them, and that their objective is to move in that space - to climb stairs, pour a cup of liquid, avoid obstacles and so on. They’re clumsy to begin with, they get knocked down, they get up again. But as they keep trying, and failing, and trying, and succeeding, and receiving positive or negative feedback from their sensors telling them whether they’ve failed or succeeded, they eventually work out a way to ‘win’.

So, looking at ad tech, if you’re on the buy side and you want to optimise your bids in Real-time Bidding (RTB) auctions, you can describe to the algorithm the environment it’s operating in - which exchanges it’s bidding on, what parameters it can use, what defines a positive reward such as a click, an install, a download, a completed view and so on - and it will set about learning what is the best strategy to maximise that reward.
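A drastically simplified version of this trial-and-error loop is the epsilon-greedy ‘bandit’ strategy sketched below. Real RTB optimisation involves far richer state than this, and the click-through rates here are invented for illustration - but the core idea is the same: mostly exploit the strategy that looks best so far, occasionally explore, and update your estimates from the environment’s feedback:

```python
import random

def bandit(true_rates, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: learn which arm (e.g. bid strategy)
    yields the most reward (e.g. clicks), purely from feedback."""
    rng = random.Random(seed)
    n = len(true_rates)
    counts = [0] * n
    estimates = [0.0] * n
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore: try a random strategy
        else:
            arm = max(range(n), key=lambda a: estimates[a])  # exploit
        # environment feedback: a click happens with the arm's true rate
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
    return estimates

# invented click-through rates for three hypothetical bid strategies
estimates = bandit([0.02, 0.05, 0.10])
print(max(range(3), key=lambda a: estimates[a]))  # index of the strategy it learned to prefer
```

Nobody ever tells the algorithm which strategy is best - it discovers that by acting, observing the reward, and getting up again after the failures.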

On the sell side, you can use reinforcement learning in a similar way, for example to set the reserve price in RTB auctions. This time, the reward is a successful auction, and the goal is to maximise revenue.

It’s a formidable challenge, with tens of millions of combinations of possible variables involved in each decision, billions of ads served each day, and all happening in real time.

The potential for reinforcement learning is enormous: simply state what the reward is, characterise the environment, and the AI will figure out, through many iterations, how to achieve that reward in that environment. It’s such a powerful model that it can apply to other sectors too.

The trick really lies in developing the best algorithms, driving them with robust systems engineering, and, of course, understanding exactly which model - if any - is suitable for addressing the business problem at hand. If you want help with this, get in touch!