# Machine Learning 101: Perceptron

*In this story, I share a step-by-step dissection of the perceptron algorithm aimed at complete machine learning beginners. I discuss how much linear algebra you need to learn machine learning, and demonstrate the code and the math behind it.*

**Machine learning** may seem mysterious to some (it certainly did to me), as the name seems to imply a physical machine that is learning something. In fact, it is a method to **find patterns in chaos**; specifically, it is a broad term for **using algorithms to find patterns in data for modelling, prediction and control purposes**.

While I am in the process of learning various machine learning algorithms myself, as a beginner, I'd like to share my fresh and probably naive perspectives on some algorithms, starting with the **Perceptron**. I deeply understand that for beginners, every tiny detail can be a roadblock. This story thus aims to spell out the tiny details, however tedious, and share useful learning resources.

# Machine Learning Jargon

Traditionally, you divide the dataset into the following two distinct subsets:

**Training set** subset of the dataset used to train the model, e.g., think of exercises with correct answers that you can check and correct yourself

**Test set** subset of the dataset used for testing the trained model, e.g., think of final exams where you can only know the answers/grade after the exam

**Feature vector** an array of values used as input into a model for training

**Label** the correct “answer” or “result” of a training feature vector

**Classifier** a model that turns raw inputs into a prediction. The perceptron is a linear classifier that turns inputs into a binary (0/1, -1/1) result. A classifier is usually governed by mathematical equations. A linear classifier like the perceptron operates according to a linear equation, which for two features is:

*f(x) = w[1]·x[1] + w[2]·x[2] + b*

**Weight** a vector that tells you the importance of each component of the feature vector. If a *weight* is 0, then the corresponding feature vector component doesn’t contribute to the model. For example, in *f(x)*, if *w[1]* is 0, then the value of *x[1]* is irrelevant to *f(x)*. In other words, the result is determined by *x[2]* (provided that *w[2]* is not 0) and the bias term *b*.

**Bias** a scalar (number) which decides whether your linear classification line goes through the origin. The bias term denotes an intercept, or offset from the origin. See illustration. *b = 0.6* here. If *b = 0*, then the line will go through the origin.

**Training error** the percentage of mistakes your model makes on the training set. Often, we don't want a perfect 0 error, because this could mean your model is **overfitting**: the model matches the training data so well (like a tailor-made dress) that it fails to make correct predictions on new, unseen data. Note that data always differs, like human body shapes. If a dress fits your body shape too well, it won't fit others.

**Test error** the percentage of mistakes your model makes on the test set.

*For more jargon, refer to Google's **Machine Learning Glossary**.*
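To make the weight and bias entries above concrete, here is a minimal sketch of a two-feature linear function. The helper name `f` and the numbers are my own made-up example, not values from the dataset used later. Note that Python counts from 0, so `w[0]` in the code plays the role of *w[1]* in the text.

```python
import numpy as np

def f(x, w, b):
    """Linear function f(x) = w . x + b (hypothetical helper for illustration)."""
    return np.dot(w, x) + b

w = np.array([0.0, 2.0])  # first weight is 0, so the first feature is ignored
b = 0.6                   # bias: shifts the line away from the origin

a = f(np.array([1.0, 3.0]), w, b)
c = f(np.array([100.0, 3.0]), w, b)  # only the first feature changed

print(a == c)  # the first feature has no effect on the result
```

Changing the second feature, or setting `b` to 0, does change the result, which is exactly what the definitions say.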

# How much linear algebra is enough?

Yes, linear algebra is important for machine learning. I recommend the linear algebra course from Prof. Gilbert Strang at MIT. The videos are dated but the knowledge is not. The quality of the videos should not impede your learning because Prof. Strang is a genius at stimulating thinking.

If you have no foundation in linear algebra, I recommend Lectures 1–21. They will take some time to digest, but I do think linear algebra is indispensable once you start coding. Many things will just make sense automatically. If you have some foundation in linear algebra, i.e., you know basic matrix operations and have some sense of vector spaces, I recommend the essential lectures. Once you have built an understanding of linear algebra, you can easily learn additional theorems whenever you need them. This is my strategy.

# Definition

Now let's get to the perceptron. The **perceptron** is a **linear classifier** and a **binary classifier**, resulting in 0/1, -1/1, or any other designated binary labels. In the illustration below, the classifier is the line *f(x)* that best separates the blue and orange points.

I now realise that behind every machine learning algorithm there is a mathematical equation (or several) that secretly dictates everything. The perceptron uses a rather simple and easy-to-understand linear function, in the form of:

*f(x) = w · x + b*

*w* and *x* are in vector form. Their dot product is:

*w · x = w[1]·x[1] + w[2]·x[2] + … + w[n]·x[n]*

*x* is the feature vector that you are trying to classify.
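Here is a small sketch of the dot product and the linear function, written once by hand and once with NumPy. The vectors and the bias are made-up numbers, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Made-up example values (assumptions for illustration only)
w = np.array([0.5, -1.0, 2.0])
x = np.array([2.0, 3.0, 1.0])
b = 0.6

# Dot product by hand: multiply matching components, then add them up
by_hand = sum(w_i * x_i for w_i, x_i in zip(w, x))  # 1.0 - 3.0 + 2.0

# The same thing with NumPy
with_numpy = np.dot(w, x)

f_x = with_numpy + b               # f(x) = w . x + b
predicted = 1 if f_x > 0 else -1   # which side of the line x falls on

print(by_hand, with_numpy, predicted)
```

The sign of *f(x)* is what the classifier turns into a binary label.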

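**The general operational flow of the perceptron algorithm is:**

1. Start with *w* as a zero vector and *b = 0*.
2. Take one training vector and its label, and check whether the current *w* and *b* classify it correctly.
3. If there is a mistake, update *w* and *b*; otherwise leave them unchanged.
4. Repeat for every training vector, and go through the whole training set a number of times.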

# Code Demonstration

We will use three standard Python packages: NumPy, Scikit-learn and Matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
```

First, let's generate a linearly separable dataset using *make_blobs()* from *sklearn.datasets*. This function creates a bunch of isotropic Gaussian blobs (think of them as random data points) like below for us to play with. Our goal here is to **find a separation line between the two clusters of data**. I know, you can eyeball a rough line. Hello, human. Yet this task is not so straightforward for a computer to automate.

Code to generate blobs:

```python
X, y = make_blobs(n_samples=1200, n_features=2, centers=2, random_state=4)

fig, ax = plt.subplots()
colors = ['pink', 'brown']
# Plot each cluster in its own colour
for label in [0, 1]:
    index = (y == label)  # boolean mask selecting the points of this cluster
    ax.scatter(X[index, 0], X[index, 1], c=colors[label])
```

Use *make_blobs()* to create 1200 points (*n_samples*) in 2 clusters (*centers*) and in 2 dimensions (*n_features*) with random state 4. The random state seeds the random number generator, so the same value always reproduces the same clusters, while a different value gives a different layout. You are welcome to try other states and pick a distribution that you’d like to experiment with. *X* is a matrix with dimension (1200, 2); *y* is an array made up of 0 and 1 labels since I created 2 centers. If you create 3 centers, meaning 3 clusters of blobs, *y* will be an array of 0, 1, 2. Below is a sample of *X* and *y*. Each *x* vector (row) in matrix *X* corresponds to a *y* label. All of these are randomly generated by *make_blobs()*.

I then create a figure window using *matplotlib’s subplots()* function, which lets me draw on it later. I want my blob colours to be pink and brown. I then use a *for* loop to plot the blobs.
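If you want to check these shapes and labels yourself, a quick sketch like the following should do. It reuses the same *make_blobs()* arguments as above; the names `X3`, `y3`, `X_again` and `y_again` are just ones I picked for this check.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1200, n_features=2, centers=2, random_state=4)
print(X.shape, y.shape)  # (1200, 2) (1200,)
print(np.unique(y))      # [0 1]

# With 3 centers, the labels become 0, 1 and 2
X3, y3 = make_blobs(n_samples=1200, n_features=2, centers=3, random_state=4)
print(np.unique(y3))     # [0 1 2]

# Same random_state, same data
X_again, y_again = make_blobs(n_samples=1200, n_features=2, centers=2, random_state=4)
print(np.array_equal(X, X_again))  # True
```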

*(y == label)* is a logical test, which compares each element in *y* with the current value of *label* (either 0 or 1 in this case) and returns an array of boolean values (which I called *index*) such as *[ True True True False True True True True False False…]*. See visualisation example below:

Thus, when *label == 0*, *X[index, 0]* and *X[index, 1]* are equivalent to *X[True rows, 0]* and *X[True rows, 1]*, i.e., the first and second columns of only those rows where *index* is True. In the example, this selects the rows [1,2] and [5,6].
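The boolean indexing described above can be tried on a tiny matrix. The toy `X` and `y` below are my own assumption, chosen so that the selected rows are [1,2] and [5,6]; they are not the blobs generated earlier.

```python
import numpy as np

# Hypothetical toy data: 3 points with 2 features each
X = np.array([[1, 2],
              [3, 4],
              [5, 6]])
y = np.array([0, 1, 0])

index = (y == 0)   # element-wise comparison gives a boolean array
print(index)       # [ True False  True]

rows = X[index]    # keeps only the rows where index is True
print(rows)        # rows [1 2] and [5 6]

x_coords = X[index, 0]  # first column of the selected rows
y_coords = X[index, 1]  # second column of the selected rows
print(x_coords, y_coords)  # [1 5] [2 6]
```

These two coordinate arrays are what *ax.scatter()* receives for each label.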

Next, we need to split the dataset into a training set and a test set like this. Orange and purple denote the training set. Magenta and green denote the test set.

Use the *train_test_split()* function from *sklearn.model_selection*. This function is pretty much self-explanatory. I am doing an 80/20 split. Again, I am using the same *for* loop as before to plot the blobs after the split.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

fig, ax = plt.subplots()
colorstrain = ['orange', 'purple']
colorstest = ['magenta', 'green']

# Training points
for label in [0, 1]:
    index = (y_train == label)
    ax.scatter(X_train[index, 0], X_train[index, 1], c=colorstrain[label])

# Test points
for label in [0, 1]:
    index = (y_test == label)
    ax.scatter(X_test[index, 0], X_test[index, 1], c=colorstest[label])
```
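The split above is shuffled randomly on every run. If you would like the same split each time, one option is to also pass a *random_state* to *train_test_split()*. The sketch below assumes the value 0, which is an arbitrary choice of mine, and checks the 80/20 sizes.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=1200, n_features=2, centers=2, random_state=4)

# random_state=0 is an arbitrary seed chosen for this sketch
split_a = train_test_split(X, y, test_size=0.2, random_state=0)
split_b = train_test_split(X, y, test_size=0.2, random_state=0)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)  # (960, 2) (240, 2)

# Both calls return exactly the same four arrays
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```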

Now we are ready to train the perceptron on our training set to find a *w*, and use that *w* to classify our test set.

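```python
clf = Perceptron(max_iter=50, random_state=4)
clf.fit(X_train, y_train)     # learn w and b from the training set
y_pred = clf.predict(X_test)  # predicted labels for the test set

w = clf.coef_[0]    # weight vector, one entry per feature
b = clf.intercept_  # bias term (an array holding a single value)

print('Test accuracy: %.4f' % accuracy_score(y_test, y_pred))
print("w: ", w, "b: ", b)
```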

First, we create a *Perceptron()* and set the maximum number of passes over the training data to 50: *max_iter = 50*. Afterwards, we simply pass the training pair *X_train, y_train* to the perceptron's *fit()* method. The *fit()* method fits the model with stochastic gradient descent and goes through the training set up to 50 times (with the default settings it may stop earlier once the training loss stops improving). It returns itself, meaning that subsequent usage of *clf* uses the new *w* learnt from training. I then call the method *clf.predict(X_test)*, which returns a vector *y_pred* containing the predicted label for each test sample.

Lastly, I want to know the resulting *w* and *b*, so I read the perceptron's attributes *coef_* and *intercept_*. Note that the accuracy and the *w*, *b* values can change every time you run the code, because *train_test_split()* shuffles the data randomly unless you also give it a *random_state*, so the perceptron sees a different training set each time. In addition, a number of different *w* and *b* can define the same line (any non-zero multiple of *w* and *b* gives the same line; linear algebra needed here). I call *accuracy_score()* to find out the test accuracy rate.
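To convince myself that *w* and *b* are really all the classifier needs, here is a sketch that redoes the prediction and the accuracy by hand and compares them with Scikit-learn's results. The *random_state=0* in the split is my own addition so the run is repeatable, and `manual_pred` and `manual_acc` are names I made up for this check.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=1200, n_features=2, centers=2, random_state=4)
# random_state=0 is an assumption added here to make the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = Perceptron(max_iter=50, random_state=4)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

w = clf.coef_[0]
b = clf.intercept_[0]

# f(x) = w . x + b for every test row; positive side is label 1, otherwise 0
scores = X_test @ w + b
manual_pred = np.where(scores > 0, 1, 0)

# Accuracy is simply the fraction of predictions that match the true labels
manual_acc = np.mean(manual_pred == y_test)

print(np.array_equal(manual_pred, y_pred))
print(manual_acc, accuracy_score(y_test, y_pred))
```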

Next, I want to draw a **decision boundary line** on the figure like this.

In fact, learning to draw this decision line was my roadblock! I somehow just could not see it intuitively.

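```python
# x-coordinates along the horizontal axis, with a margin of 1 on each side
x_bnd = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 500)
# matching y-coordinates on the line w[0]*x + w[1]*y + b = 0
y_bnd = - x_bnd * (w[0] / w[1]) - (b / w[1])
ax.plot(x_bnd, y_bnd, color='black')
```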

First, we need the coordinates of the points along the line: horizontal (*x_bnd*) and vertical (*y_bnd*). NumPy's *linspace()* function creates an array of 500 evenly spaced numbers between *X[:, 0].min() - 1* and *X[:, 0].max() + 1*, because the horizontal axis is determined by the first column of matrix *X*. I go 1 below the minimum and 1 above the maximum to ensure all points are included. Maybe I should also explain *X[:, 0]*. This means take all the rows of *X* in column 0. It essentially returns all values in column 0 of *X*. Then *X[:, 0].min()* finds the minimum value in column 0 of *X*.
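Column slicing and *linspace()* are easier to see on a tiny matrix. The numbers below are made up, and I ask for only 4 points instead of 500 so the output stays readable.

```python
import numpy as np

# Hypothetical 3x2 matrix standing in for X
X_small = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [5.0, 6.0]])

first_col = X_small[:, 0]  # all rows, column 0
print(first_col)           # [1. 3. 5.]
print(first_col.min(), first_col.max())  # 1.0 5.0

# 4 evenly spaced values from (min - 1) to (max + 1), both ends included
x_small_bnd = np.linspace(first_col.min() - 1, first_col.max() + 1, 4)
print(x_small_bnd)         # [0. 2. 4. 6.]
```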

The vertical coordinates *y_bnd* correspond to the second column of *X*, and follow from the line equation. Remember the linear function that the perceptron uses to classify, *f(x) = w · x + b*, and the expression of the dot product?

The decision line is where **f(x) = 0**. In our case, *X* has 2 features, so

`w[0] * X[0] + w[1] * X[1] + b = 0`, thus

`X[1] = - X[0] * (w[0] / w[1]) - (b / w[1])`.

This is the calculation for a single point. Applied to the whole array of horizontal coordinates at once, it is:

`y_bnd = - x_bnd * (w[0] / w[1]) - (b / w[1])`. Done.
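A quick sanity check of this rearrangement: every point on the computed line should give *f(x) = 0*. The weights and bias below are made-up values rather than the ones learnt above, and the formula assumes *w[1]* is not 0 (otherwise the line is vertical and we would be dividing by zero).

```python
import numpy as np

# Hypothetical weights and bias, chosen only for this check
w = np.array([2.0, -1.0])
b = 0.5

x_bnd = np.linspace(-3, 3, 7)
y_bnd = - x_bnd * (w[0] / w[1]) - (b / w[1])

# Plug each (x_bnd, y_bnd) point back into f(x) = w[0]*x + w[1]*y + b
f_on_line = w[0] * x_bnd + w[1] * y_bnd + b
print(np.allclose(f_on_line, 0))  # True: every point sits on the decision line
```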

The Scikit-learn package is so convenient. In fact, you don't need to understand how *Perceptron()* actually operates to use it well for your purpose. However, I find it helpful to understand what could be happening behind the scenes. Understanding the logic of an algorithm helps me apply it to much more complicated neural network algorithms.

# Dissection

If we are NOT using the Scikit-learn package, we can write a perceptron training algorithm based on this linear function:

*f(x) = w · x + b*

Define *perceptron_vector_update(feature_vector, label, current_w, current_b)* to take in **a single training vector**, its correct label and the current values of *w* and *b*. The goal here is to check whether the current values of *w* and *b* predict the label correctly. If there is a mistake, we should update the *w* and *b* values to correct the mistake.

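```python
def perceptron_vector_update(feature_vector, label, current_w, current_b):
    # label is expected to be -1 or +1 for this update rule
    w = current_w
    b = current_b
    # z > 0 only when the sign of f(x) agrees with the label
    z = label * (np.dot(current_w, feature_vector) + current_b)
    if z <= 0:
        # mistake: nudge w and b towards the correct side
        w = current_w + label * feature_vector
        b = current_b + label
    return (w, b)
```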

If the prediction is correct, meaning the *label* and the sign of *f(x)* agree (with labels of -1 and 1), this translates into: `z = label * (np.dot(current_w, feature_vector) + current_b) > 0`. No mistake, thus no update. The *np.dot()* function calculates the dot product.

If the prediction is incorrect, meaning *z ≤ 0*, we need to update *w* and *b* to correct the mistake. As to why we update this way, again linear algebra is in play here. See the lecture notes below from **Cornell's Machine Learning for Intelligent Systems** course for a perfect illustration.
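One way to see why the update helps: after replacing *w* with *w + label·x* and *b* with *b + label*, the value of *z* for that same vector grows by *x · x + 1*, because *label² = 1* when labels are -1 or 1. The sketch below checks this on made-up numbers. A single update does not always fix the mistake, but it always pushes *z* in the right direction.

```python
import numpy as np

# Made-up example: a point with label -1 that the current w, b get wrong
x = np.array([1.0, 2.0])
label = -1
w = np.array([1.0, 1.0])
b = 0.0

z_before = label * (np.dot(w, x) + b)  # -1 * (3 + 0) = -3, a mistake

# Perceptron update
w_new = w + label * x  # [0., -1.]
b_new = b + label      # -1.0

z_after = label * (np.dot(w_new, x) + b_new)  # -1 * (-2 - 1) = 3, now correct

# The gain equals x . x + 1 (here 5 + 1 = 6)
print(z_before, z_after, z_after - z_before)
```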

Now, we can iterate through the entire training matrix multiple times to find *w* and *b*. Note that each time we modify *w* and *b* based on a mistake, the new *w* and *b* could mess up a previously correct prediction. Therefore, we need to go through the training set a number of times to find the *w* and *b* that predict most of the training data correctly.

Define *perceptron(feature_matrix, labels, max_iter)* to take in **a training matrix**, its correct labels and the number of times that you wish to go through this matrix. Initialise *w* to be a zero vector of the same length as the vectors in the feature matrix. Initialise *b* to be 0. Then create a nested *for* loop to go through the matrix vector by vector and call *perceptron_vector_update()* to update *w* and *b* vector by vector. Done.

```python
def perceptron(feature_matrix, labels, max_iter):
    w = np.zeros((len(feature_matrix[0]),))  # one weight per feature
    b = 0
    for t in range(max_iter):  # passes over the training set
        for i in range(feature_matrix.shape[0]):  # loop over the rows of feature_matrix
            w, b = perceptron_vector_update(feature_matrix[i], labels[i], w, b)
    return (w, b)
```
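Here is a self-contained sketch that runs the two functions on a tiny hand-made dataset. The four points and the names `X_toy` and `y01` are my own assumptions. One detail to watch: the update rule multiplies by the label, so it expects labels of -1 and 1, whereas *make_blobs()* produces 0 and 1. The line `2 * y01 - 1` does the conversion.

```python
import numpy as np

def perceptron_vector_update(feature_vector, label, current_w, current_b):
    # Update w and b only when the current values misclassify the vector
    z = label * (np.dot(current_w, feature_vector) + current_b)
    if z <= 0:
        return current_w + label * feature_vector, current_b + label
    return current_w, current_b

def perceptron(feature_matrix, labels, max_iter):
    w = np.zeros(feature_matrix.shape[1])
    b = 0
    for _ in range(max_iter):
        for i in range(feature_matrix.shape[0]):
            w, b = perceptron_vector_update(feature_matrix[i], labels[i], w, b)
    return w, b

# Hypothetical toy data: two points per class, clearly separable
X_toy = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y01 = np.array([1, 1, 0, 0])

labels = 2 * y01 - 1  # convert 0/1 labels to -1/+1

w, b = perceptron(X_toy, labels, max_iter=10)
pred = np.where(X_toy @ w + b > 0, 1, -1)

print(w, b)                          # [2. 2.] 1
print(np.array_equal(pred, labels))  # True
```

On this toy set the very first vector triggers the only update, and every later check passes, so *w* and *b* stop changing after the first pass.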

As you can see, the **perceptron** is a simple linear classifier that does not need a learning rate, has no regularisation and only updates when there is a mistake. Thus, it is often a fast algorithm and works well on linearly separable problems. There are a number of variations on the perceptron, and it is frequently used as a building block of neural networks.

The logic behind machine learning algorithms is truly beautiful. The coding process is largely a logic game and math representation. Computational thinking forces you to design a workflow that is logically sound (otherwise, it won't run or it will run forever! My Mac Pro hates me now) and mathematically rigorous (otherwise, the results are not what you expect).

As I progress in my own machine learning journey, some of my code may not be as efficient as it should be. Comments and knowledge sharing are always welcome!

Until next story…