Machine Learning 101: Perceptron
In this story, I share a step-by-step dissection of perceptron algorithm that aims for completely machine learning beginners. I discuss how much linear algebra is needed to learn machine learning, demonstrate codes and the math behind.
Machine learning may seem mysterious to some (at least for me previously), as it seems to imply there is a physical machine that is learning something. In fact, it is a method to find patterns in chaos; specifically, it is a broad term for using algorithms to find patterns in data for modelling, prediction and control purpose.
While I am in the process of learning various machine learning algorithms myself, as a beginner, I'd like to share my fresh and probably naive perspectives on some algorithms, starting from Perceptron. I deeply understand that for beginners, every tiny detail could be a roadblock. This story thus aims to tediously show tiny details and share useful learning resources.
Machine Learning Jargons
Traditionally, you divide the dataset into the following two distinct subsets:
- training set subset of the dataset used to train the model, e.g., think of exercises with correct answers that you can check and correct yourself
- test set subset of the dataset used for testing the trained model, e.g., think of final exams where you can only know the answers/grade after the exam
Feature vectors an array of values as input into a model for training
Label the correct “answer” or “result” of a training feature vector
Classifier a model that turns raw inputs into a prediction. Perceptron is a linear classifier that turns inputs into a binary (0/1, -1/1) result. A classifier is usually controlled by mathematic equations. A linear classifier like perceptron operates according to a linear equation:
Weight a vector that basically tells you the importance of each feature vector components. If a weight is 0, then the corresponding feature vector component doesn’t contribute to the model. For example, in f(x), if w[1] is 0, then the value of x[1] is irrelevant to f(x). In another word, the result is determined by x[2] (provided that w[2] is not 0) and bias term b.
Bias a scalar (number) which decides if your linear classification line will go through the origin or not. Bias term denotes an intercept or offset from an origin. See illustration. b = 0.6 here. If b = 0, then the line will go through the origin.
Training error the percentage of mistakes your model makes during training operation. Often, we don't want a perfect 0 error, because this could mean your model is overfitting, which means a model matches the training data too well (like tailor-made) that the model fails to make correct predictions on new unseen data. Note that data always differs, like human body shapes. If a dress matches your body shape too well, others won't fit.
Test error the percentage of mistakes your model makes during test operation.
For more jargons, refer to Google's Machine Learning Glossary.
How much linear algebra is enough?
Yes, linear algebra is important for machine learning. I recommend linear algebra course from Prof. Gilbert Strang at MIT. These videos are dated but the knowledge is not. The quality of the videos should not impede your learning because Prof. Strang is a genius in stimulating thinking.
If you have no foundation in linear algebra, I recommend Lecture 1–21. It will take some time to digest but I do think linear algebra is indispensable once you start coding. Many things will just make sense automatically. If you have some foundation of linear algebra, i.e., knowing basic matrix operations and some sense of vector space, I recommend essential lectures. Once you have built understanding of linear algebra, you can easily learn additional theorems whenever you need. This is my strategy.
Definition
Now let's get to perceptron. Perception is a linear classifier or binary classifier, resulting in 0/1, -1/1, or any other designated binary labels. In below illustration, the classifier is the line f(x) that could best separate the blue and orange points.
I now realise that behind every machine learning algorithm there is a (or multiple) mathematic equation(s) that secretly dictates everything. Perceptron uses a rather simple and easy to understand linear function, in the form of:
w and x are in vector form. Their dot product is:
x is the target vector which you try to classify.
The general operational flow of perceptron algorithm is:
Code Demonstration
Three standard Python machine learning packages to use: NumPy, Scikit-learn and Matplotlib.
import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
First, let's generate a linear separable dataset using make_blobs() from sklearn.datasets. This function creates a bunch of isotropic Gaussian blobs (think of them as random data points) like below for us to play with. Our goal here is to find a separation line between the two clusters of data. I know, you can eye ball a rough line. Hello, human. Yet, this automation task is not so straightforward for computers.
Code to generate blobs:
X, y = make_blobs(n_samples=1200, n_features=2,centers=2, random_state=4)
fig, ax = plt.subplots()
colors = ['pink','brown']
for label in [0, 1]:
index = (y == label)
ax.scatter(X[index, 0], X[index, 1],c=colors[label])
Use make_blobs () to create 1200 (n_samples) blobs in 2 clusters (centers) and in 2 dimension (n_features) with random state 4. Random state dictates the distribution (shape) of your clusters. You are welcome to try other states and pick a distribution that you’d like to experiment. X is a matrix with dimension (1200, 2) ; y is an array made up of 0 and 1 labels since I created 2 centers. If you create 3 centers, meaing 3 clusters of blobs, y will be an array of 0, 1, 2. Below is a sample of X and y. Each x vector in matrix X corresponds to a y label. All these are randomly assigned by make_blobs ().
I then create a figure window using matplotlib’s subplots() function, which allows me to create image later. I want my blobs colors to be pink and brown. I then use a for loop to plot the blobs.
(y == label) is a logic test, which tests each element in y with current value of label (either 0 or 1 in this case) and returns a list (I called index) of boolean values such as [ True True True False True True True True False False…]. See visualisation example below:
Thus, when label == 0, X[index, 0], X[index, 1] equals to X[True row, 0], X[True row, 1], which will give [1,2] and [5,6].
Next, we need to split the dataset into training set and test set like this. Orange and purple denote training set. Magenta and green denote test set.
Use train_test_split() function from sklearn.model_selection. This function is pretty much self-explanatory. I am doing a 80/20 split. Again, I am using the same for loop as before to plot after-split blobs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
fig, ax = plt.subplots()
colorstrain=['orange', 'purple']
colorstest=['magenta','green']
for label in [0, 1]:
index = (y_train == label)
ax.scatter(X_train[index, 0], X_train[index, 1],c=colorstrain[label])
for label in [0, 1]:
index = (y_test == label)
ax.scatter(X_test[index, 0], X_test[index, 1], c=colorstest[label])
Now we are ready to train perceptron using our training set to find a w and use that w to classify our test set.
clf = Perceptron(max_iter=50, random_state=4)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
w = clf.coef_[0]
b = clf.intercept_
print('Test accuracy: %.4f' % accuracy_score(y_test, y_pred))
print("w: ",w,"b: ", b)
First, we define Perception() by deciding the maximum number of passes over the training data to be 50 times: max_iter = 50. Afterwards, we simply pass the training pair X_train, y_train to perceptron's method fit(). The fit() method fits the model with stochastic gradient descent and goes through training set 50 times. It returns itself, meaning that subsequent usage of clf would use a new w learnt from the training operation. I then call the method clf.predict(X_test) which returns a vector y_pred containing predicted labels for each test sample.
Lastly, I want to know the resulting w and b so I call perceptron attributes .coef_ and intercept_. Note that the accuracy and w, b values change every time you run Perceptron() because there are a number of different w and b could define the same line (linear algebra needed here). I call accuracy_score() to find out test accuracy rate.
Next, I want to draw a decision boundary line on the figure like this.
In fact, learning to draw this decision line is my roadblock! I somehow just can not see it intuitively.
x_bnd = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 500)
y_bnd = - x_bnd * (w[0] /w[1]) - (b / w[1])
ax.plot(x_bnd, y_bnd,color='black')
First, we need to know the bounds of horizontal axis (x_bnd) and vertical axis (y_bnd). NumPy linspace() function creates an array of 500 numbers between X[:, 0].min() — 1, X[:, 0].max() + 1 because the horizontal axis is determined by the first column of matrix X. Create 1 step less than minimum and 1 step more than maximum to ensure all points are included. Maybe I should also explain X[:, 0]. This means take all the rows of X in column 0. It essentially returns all values in column 0 of X. Then X[:, 0].min() finds the minimum value in column 0 of X.
The vertical bound y_bnd is determined by the second column of X. Remember this linear function that perceptron uses to classify and the expression of dot product?
The decision line is where this equation f(x) = 0. In our case, X has 2 features so w[0] * X[0]+w[1]* X[1] + b = 0, thus X[1] = — X[0] * (w[0] /w[1]) — (b / w[1]). This is single vector calculation. Translating into matrix operation, it is: y_bnd = — x_bnd * (w[0] /w[1]) — (b / w[1]). Done.
The Scikit-learn package is so convenient. In fact, you don't need to understand how Perceptron() actually operates to use it well for your purpose. However, I find it helpful to understand what could be happening behind. Understanding the logic of an algorithm helps me to apply it to much more complicated neural network algorithms.
Dissection
If we are NOT using the Scikit-learn package, we can write a Perceptron training algorithm based on this linear function:
Define perceptron_vector_update (feature_vector, label, current_w, current_b) to take in a single training vector, its correct label and the current values of w and b. The goal here is to check if current values of w and b can predict the label correctly. If there is a mistake, we should update the w and b values to correct the mistake.
def perceptron_vector_update(feature_vector, label, current_w, current_b):
w = current_w
b = current_b
z = label*(np.dot(current_w, feature_vector) + current_b)
if z <= 0:
w = current_w + label * feature_vector
b = current_b + label
return (w,b)
If prediction is correct, meaning label and f(x) corresponds, this translates into equation is: z = label*(np.dot(current_w, feature_vector) + current_b) > 0. No mistakes, thus no update. np.dot() function calculates the dot product.
If prediction is incorrect, meaning z ≤ 0, we need to update w and b to correct the mistakes. As to why we update this way, again linear algebra is in play here. See below lecture notes from Cornell's Machine Learning for Intelligent Systems course for a perfect illustration.
Now, we can iterate through the entire training matrix multiple times to find w and b. Note that each time we modify w and b based on a mistake, the new w and b could mess up previously correct prediction. Therefore, we need go through the training set a number of times to find the w and b that could predict most training data correctly.
Define perceptron(feature_matrix, labels, max_iter) to take in a training matrix, its correct labels and the number of times that you wish to go through this matrix. Initiate w to be a 0 vector in the same length as vectors in feature matrix. Initiate b to be 0. Then create a nested for loop to go through the matrix vector by vector and call perceptron_vector_update () to update vector by vector. Done.
def perceptron(feature_matrix, labels, max_iter):
w = np.zeros((len(feature_matrix[0]),))
b = 0
for t in range(max_iter):
for i in feature_matrix.shape[0]: #find out # of rows in feature_matrix
w,b = perceptron_vector_update(feature_matrix[i],labels[i], w, b)
return (w,b)
As you can see, perceptron is a simple linear classifier that does not need a learning rate, does not regulate and only updates when there is a mistake. Thus, it is often a fast algorithm and works well with linear separable problems. There are a number of variations on perceptron and it is frequently used as a part of neural networks.
The logic behind machine learning algorithm is truly beautiful. The coding process is largely a logic game and math representation. The computational thinking forces you to design a work flow that is logically sound (otherwise, it won't run or it will run forever! My Mac Pro hates me now) and mathematically rigorous (otherwise, the results are not what you expect).
While I progress in my own learning journey of machine learning, some of my codes may not be as efficient as they should be. Comments and knowledge sharing are always welcome!
Until next story…