Machine Learning 101: EM Algorithm
--
This story is about the Expectation Maximisation (EM) algorithm. I discuss the basic concepts and overall flow of this widely used algorithm, which sits behind many recommendation systems and data mining tasks, and I illustrate the big picture of collaborative filtering and the generative models that utilise EM.
When you open apps, you are almost sure to run into some sort of "recommendations" for you: movies on Netflix, songs on Spotify, products on Amazon, posts on Instagram, or games on your preferred gaming platforms. Recommendation systems are crucial as platforms all try to grab your attention and increase your engagement with them. Attention is the currency of platforms. I often listen to recommended songs and I gotta say, I discover many new artists and end up enjoying their songs, although I do hate product recommendations and intentionally ignore them.
Collaborative filtering is what happens behind these recommendation systems: a strategy for guessing users' preferences based on their own feedback and the feedback of other users who share similar characteristics with them. You could be part of many virtual clusters of users in the app database. Feedback could be movie ratings, how many times or for how long you listen to a particular genre of songs, likes, clicks and basically all engagement activities on the app. The Expectation Maximisation (EM) algorithm is one way to run collaborative filtering. In the following, I will use a simple movie recommendation example that relies on ratings alone (in reality, a lot more engagement activity is involved) to predict user preferences, and use it to illustrate the flow and concepts of the EM algorithm.
What should we watch tonight?
Imagine all the user data is stored in a giant spreadsheet as below, where there are n users X and d movies V. Each user can rate each movie from one to five, with five being the highest score. You will notice that some movies receive a score of zero, which means the specific user did not rate them. These are the missing entries, or unobserved entries in technical terms, that we are hoping to predict. For example, user X¹ rates movie V¹ one point, V² three points and V³ four points, but didn't rate V⁴.
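To make the setup concrete, here is a minimal sketch of such a spreadsheet as a matrix, using a made-up four-user by five-movie toy example (everything beyond X¹'s ratings quoted above is invented for illustration); a zero marks an unobserved rating.

```python
import numpy as np

# Toy ratings matrix: rows are users, columns are movies, 0 = not rated.
X = np.array([
    [1, 3, 4, 0, 2],   # user X1: V1=1, V2=3, V3=4, did not rate V4 (V5 is made up)
    [0, 5, 0, 2, 1],
    [4, 0, 3, 5, 0],
    [2, 2, 0, 0, 4],
])

observed = X != 0                # mask of observed (non-zero) entries
print(observed.sum(axis=1))      # how many movies each user actually rated
```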
A geometrical visualisation would be something like the picture below: all users are points in a d-dimensional space (drawn from my imagination and limited drawing capability). It is a high-dimensional space, where d denotes the number of movies available, and each user is represented by a vector that carries that user's ratings for all d movies. I use R⁵ for simplicity of illustration.
Now, what the recommendation system wants to do is predict ratings for the movies that users did not score. How could we make a good guess? Our task is to fill in the "0" entries in the spreadsheet above; in technical terms, this is matrix completion. One way is to use the average ratings of people who share similar tastes with you, based on the similarity of your movie selections. In this way, you can think of us all as being part of clusters whose members are similar in some way. Some clusters have more users, some fewer, as illustrated below.
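As a sketch of this "average of similar users" idea (before EM has found any clusters for us), here is one way to fill the zeros with per-movie cluster means; the fill_with_cluster_means function and the hard cluster labels are my own invention for illustration, using the toy matrix X from the earlier sketch.

```python
import numpy as np

def fill_with_cluster_means(X, clusters):
    """Replace zero (missing) ratings with the per-movie mean of the user's cluster."""
    X_filled = X.astype(float)
    for j in np.unique(clusters):
        rows = np.where(clusters == j)[0]
        members = X_filled[rows]
        observed = members != 0
        # per-movie mean over the observed ratings within this cluster
        counts = np.maximum(observed.sum(axis=0), 1)
        cluster_mean = (members * observed).sum(axis=0) / counts
        for r in rows:
            missing = X_filled[r] == 0
            # note: a movie nobody in the cluster rated keeps a fill of 0 in this sketch
            X_filled[r, missing] = cluster_mean[missing]
    return X_filled

# Hypothetical hard assignments for the toy matrix X above.
clusters = np.array([0, 0, 1, 1])
print(fill_with_cluster_means(X, clusters))
```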
This brings us to the bigger topic of clustering. "Hard" clustering algorithms like K-means or K-medoids give definite assignments, meaning that once the assignment process is done, each user sits in exactly one cluster. The EM algorithm, however, which is backed by generative models (usually a mixture of them), gives "soft" assignments: it outputs the probability that a certain user belongs to a certain cluster. Generative models (as opposed to discriminative models) are helpful when the boundaries in the data are blurry, that is, when we do not have enough information to classify points definitively, or when it is simply not necessary to do so. For example, some music lovers listen to a lot of classical music and a lot of contemporary rock, but there is no need to pin them down as either "classical lovers" or "rock lovers" in order to recommend music to them. By knowing their probability of belonging to each cluster, generative models stay more flexible for recommendation tasks.
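To make hard versus soft concrete, here is a tiny made-up example: the posterior matrix below is invented, with row i holding P(cluster j | user i) for K = 3 clusters.

```python
import numpy as np

posteriors = np.array([
    [0.70, 0.25, 0.05],
    [0.40, 0.35, 0.25],   # blurry case: no cluster clearly dominates
])

hard = posteriors.argmax(axis=1)   # K-means-style: one definite cluster per user
soft = posteriors                  # EM-style: keep the whole probability vector
print(hard)      # [0 0] -- both users forced into cluster 0
print(soft[1])   # the second user is genuinely spread across clusters
```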
Gaussian Mixture Model (GMM)
Generative models are widely used in AI. In simple terms, a generative model tells you the distribution of the data and how likely a given outcome is. For example, user X¹ is 70% a classical music lover, 50% a pop music lover and 1% a rock music lover. Generative models come in many forms, such as Gaussian models, which see the data as drawn from a Gaussian distribution, and multinomial models, which see the data as a set of finite discrete outcomes, e.g. generating the next word in a sequence based on its probability of appearing after the current word, or generating a sentence from a finite dictionary. For example, if your dictionary or pre-defined database contains {A, B, C, D, S}, then the probability of generating the word ABS is 1/5 x 1/5 x 1/5 (assuming each symbol is equally likely). So you know a definite probability for each outcome.
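A quick sanity check of the dictionary example, assuming each of the five symbols is equally likely:

```python
# Uniform multinomial over the five symbols {A, B, C, D, S}.
probs = {"A": 0.2, "B": 0.2, "C": 0.2, "D": 0.2, "S": 0.2}

p_abs = 1.0
for ch in "ABS":
    p_abs *= probs[ch]      # multiply the per-symbol probabilities
print(p_abs)                # 0.008 = 1/125 = 1/5 x 1/5 x 1/5
```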
The Gaussian Mixture Model (GMM) is a hybrid of the Gaussian and multinomial models. It contains several mixture components, each of which follows a Gaussian distribution. Each component also has a mixture weight (how likely that component is to be chosen), and this is what makes the model multinomial. I will use a GMM to illustrate how the EM algorithm performs matrix completion, that is, guessing the "0" entries.
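Here is a minimal sketch of what a GMM density looks like in code, assuming spherical Gaussian components; the names pi (mixture weights), mu (component means) and var (component variances) are my own shorthand, not anything from a library.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Spherical Gaussian density N(x; mu, var * I) in d dimensions."""
    d = x.shape[0]
    norm = (2 * np.pi * var) ** (-d / 2)
    return norm * np.exp(-np.sum((x - mu) ** 2) / (2 * var))

def gmm_density(x, pi, mu, var):
    """Mixture density: p(x) = sum_j pi[j] * N(x; mu[j], var[j] * I)."""
    return sum(pi[j] * gaussian_pdf(x, mu[j], var[j]) for j in range(len(pi)))
```

Real libraries such as scikit-learn's GaussianMixture implement full-covariance versions of this, but the spherical form is enough to follow the rest of the story.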
The thought process is as follows:
Expectation Maximisation (EM) algorithm
Here is the overall flow of the algorithm. As I summarise my notes and projects, I realise it is a rather daunting process for a beginner in machine learning, due to the intense math and statistics behind it, and the meticulous programming needed to make sure the dataset is correctly represented and numerical underflow is kept under control.
In order to estimate the "0" entries, we first need to group similar users together (the clustering task). The goal here is to optimise our clustering and use the cluster means to represent the "0" entries. The task of the EM algorithm is to estimate the parameters of our chosen model, the Gaussian Mixture Model (GMM), through iteration until convergence. The first step is initialisation, where we randomly assign the parameter set, denoted by theta.
Alternatively, we can use K-means/K-medoids to find the centres of our initial clusters (the red dots below) and use the global variance of the dataset as the initial variance for all clusters. The initial mixture weights could be the percentage of data points falling in each cluster.
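A possible initialisation sketch along these lines, assuming scikit-learn's KMeans is available; pi, mu and var are again my shorthand for the mixture weights, component means and (spherical) variances.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K, seed=0):
    """Initialise GMM parameters from a K-means run on the data."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
    mu = km.cluster_centers_                         # initial cluster centres (the red dots)
    labels = km.labels_
    pi = np.bincount(labels, minlength=K) / len(X)   # fraction of points in each cluster
    var = np.full(K, X.var())                        # global variance reused for every cluster
    return pi, mu, var
```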
After initialisation (our starting state), we need to calculate the initial likelihood (l_old) of generating all the data points with this set of parameters. Next, we want to maximise the likelihood. This is similar to other maximisation problems in machine learning, where we take derivatives of the function with respect to the target variables and set the equations to zero. Mathematically, it is very complicated to take derivatives of the original function l_old; therefore, a proxy function l_hat is used. l_hat takes the form below, and Bayesian statistics comes in here.
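Here is a sketch of the likelihood computation, assuming fully observed data points and the spherical-Gaussian setup above; working in log space with scipy's logsumexp is one way to keep the tiny per-point probabilities from underflowing.

```python
import numpy as np
from scipy.special import logsumexp

def log_gaussian(x, mu, var):
    """Log of the spherical Gaussian density N(x; mu, var * I)."""
    d = x.shape[0]
    return -0.5 * d * np.log(2 * np.pi * var) - np.sum((x - mu) ** 2) / (2 * var)

def log_likelihood(X, pi, mu, var):
    """l(theta) = sum_i log sum_j pi[j] * N(x_i; mu[j], var[j] * I)."""
    total = 0.0
    for x in X:
        log_terms = np.array([np.log(pi[j]) + log_gaussian(x, mu[j], var[j])
                              for j in range(len(pi))])
        total += logsumexp(log_terms)   # stable log of a sum of small numbers
    return total
```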
Detailed math tricks are beyond the scope of this story, but there are plenty of proofs on the internet. The E-step of the EM algorithm finds, for each data point i and cluster j, the posterior probability that point i is labelled as cluster j, denoted P(j|i); this uses the probability that the data point is generated by cluster j under the current parameter set theta, and only the observed ratings enter these estimation calculations. Next, in the M-step of the EM algorithm, we take derivatives of the proxy function with respect to the parameter variables and set them to zero to obtain new parameter values. Lastly, we calculate a new likelihood (l_new) of generating all the data points with this new set of parameters. We compare l_old and l_new and repeat until convergence. At that point we have the optimised clustering, and we can use the cluster means to represent our missing data. For example, if user X¹ has the highest probability in cluster j, we can use cluster j's mean ratings to fill in the missing ratings of user X¹.
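Below is a compact sketch of the E-step, M-step and convergence loop, again assuming fully observed data and the helpers defined above; in the real matrix-completion setting, each user's likelihood and the M-step sums would run over that user's observed ratings only.

```python
import numpy as np
from scipy.special import logsumexp

def e_step(X, pi, mu, var):
    """Posterior P(j | i) for every data point i and cluster j."""
    n, K = len(X), len(pi)
    log_post = np.zeros((n, K))
    for i, x in enumerate(X):
        log_terms = np.array([np.log(pi[j]) + log_gaussian(x, mu[j], var[j])
                              for j in range(K)])
        log_post[i] = log_terms - logsumexp(log_terms)   # normalise in log space
    return np.exp(log_post)

def m_step(X, post):
    """Re-estimate pi, mu, var from the soft counts."""
    n, d = X.shape
    n_j = post.sum(axis=0)                    # effective number of points per cluster
    pi = n_j / n
    mu = (post.T @ X) / n_j[:, None]
    var = np.array([
        (post[:, j] * ((X - mu[j]) ** 2).sum(axis=1)).sum() / (d * n_j[j])
        for j in range(len(n_j))
    ])
    return pi, mu, var

def run_em(X, pi, mu, var, tol=1e-6, max_iter=200):
    """Alternate E- and M-steps until the log-likelihood stops improving."""
    l_old = -np.inf
    for _ in range(max_iter):
        post = e_step(X, pi, mu, var)
        pi, mu, var = m_step(X, post)
        l_new = log_likelihood(X, pi, mu, var)
        if l_new - l_old < tol * abs(l_new):   # converged
            break
        l_old = l_new
    return pi, mu, var, post
```

After convergence, one simple fill for user i's missing entries is mu[post[i].argmax()], which is exactly the cluster-mean idea described above; a posterior-weighted combination of the cluster means is another option.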
We can of course tune the hyperparameter K, the number of clusters, based on what we know about the data structure, and obtain different results and performance until we are satisfied.
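One rough way to compare different values of K, assuming the helpers above and hypothetical X_train/X_val splits of the data (and treating the data as fully observed for simplicity); training likelihood alone tends to favour larger K, so held-out data is the fairer judge.

```python
# Compare candidate K values by held-out log-likelihood (X_train and X_val
# are placeholder splits of the ratings data, not defined in this story).
for K in (2, 3, 4, 6):
    pi, mu, var = init_from_kmeans(X_train, K)
    pi, mu, var, _ = run_em(X_train, pi, mu, var)
    print(K, log_likelihood(X_val, pi, mu, var))
```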
I have neglected a ton of details. This is a very conceptual illustration of what the EM algorithm can do and a rather brief overview of the complex world of generative models. I find it helpful to keep a map (the overall flow), as it is easy to lose sight of it when you get bogged down in endless debugging, array representations and numerical issues while programming. I also find it useful to get the idea right, as the logic of clustering applies to other tasks. Once you have your map, the rest is getting the nitty-gritty right and of course… debugging.