History of Machine Learning
for History Buffs
Arthur Samuel (IBM): described machine learning as the field of study that gives computers the ability to learn without being explicitly programmed.
Tom Mitchell: a program learns from experience E with respect to tasks T and performance measure P if its performance on T, as measured by P, improves with experience E.
Supervised learning: learning from labeled examples, where the "right answers" are given.
Example …
Unsupervised learning: looking for structure in unlabeled data, with no answers given.
Example …
To learn machine learning there is some structure.
Have some hypothesis.
Start with linear regression with one variable.
Linear Regression with 1 Variable
Start with the initial equation which we define as our prediction hypothesis.
$$ \hat y = h_\theta(x) = \theta_0 + \theta_1x$$
Then we take some data which we call a training set and we plot.
Data set:
House Price ($) | Square Footage |
---|---|
160,000 | 1,500 |
210,000 | 2,000 |
So we need to come up with estimates for theta_0 and theta_1 so that, if we know the square footage of a house, we can estimate its price. We hypothesize that the price of a house can be predicted from its square footage, based on our data about other houses.
This is usually referred to as linear regression with a single variable. In our data set the variable is square footage.
So how do we actually solve for those thetas and make this prediction? The common solution is something called linear regression.
Definition of regression re (back) and gredior (go). Going from coordinates “back” to a formula!
So how do we go back … that is the trick.
So we create a 2nd equation. The first was the hypothesis function (h). The second function is the cost function (J).
$$ J(\theta_0,\theta_1) = \frac 1 {2m} \sum_{i=1}^m (h_\theta(x_i) - y_i)^2 $$
Intuition on this formula.
- Taking the difference between the estimated point (our hypothesis) and the actual point (our data).
- Squaring the difference to make sure it’s positive.
- Summing the squared differences over the entire data set.
- Averaging them by dividing by the number of data examples.
- Multiplying by the constant 1/2 so the 2 that appears when we differentiate the square cancels out later.
Our best prediction comes from the thetas for which J is lowest. In other words, the hypothesis with the lowest cost (of errors or differences).
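To make this concrete, here is a minimal sketch of the cost function in Python with NumPy, using the two houses from the data set above (the function and variable names are my own):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared-error cost J(theta0, theta1) for one-variable linear regression."""
    m = len(y)                           # number of training examples
    predictions = theta0 + theta1 * x    # h_theta(x) for every example
    errors = predictions - y             # difference between prediction and actual
    return np.sum(errors ** 2) / (2 * m)

# The two houses from the data set above
x = np.array([1500.0, 2000.0])           # square footage
y = np.array([160000.0, 210000.0])       # price
print(compute_cost(x, y, 0.0, 100.0))    # cost for the guess theta0=0, theta1=100
```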
But this is about machine learning. That implies change and updating our beliefs. How do we do that? We turn to something called gradient descent for linear regression.
Gradient Descent algorithm
$$ \theta_j := \theta_j - \alpha \frac {\partial} {\partial \theta_j}J(\theta_0, \theta_1) $$
In our linear regression with one variable we have two thetas theta_0 and theta_1.
So we need to move these thetas to their most optimal position. Gradient descent: from Latin gradi (to walk). In this case we can think of a landscape where we are trying to walk downhill to the lowest point.
In reality think about sitting on a landscape and trying to figure out which way to go. Well obviously we look at the slope of the hill and look to proceed down the slope.
In math the slope is found by taking the derivative of the curve at that point. If we are to the west we might have to walk east. If we are to the east we might have to walk west. Or we might have to walk east by north east … you get the point.
In math these repeated steps look like:
$$ \theta_0 := \theta_0 - \alpha \frac 1 m \sum_{i=1}^m (h_\theta(x_i) - y_i) $$
$$ \theta_1 := \theta_1 - \alpha \frac 1 m \sum_{i=1}^m ((h_\theta(x_i) - y_i)x_i) $$
Do the updates simultaneously: compute both new thetas from the current values, then update them together. Like walking, it's left then right, then left then right; you can't take a bunch of lefts or rights in a row.
Now there was some trickery to get those convergence equations. Essentially we took the partial derivative of the cost function with respect to each parameter theta. Partial derivative just means we hold the other parameter constant.
Finding the derivatives means we are finding the slope with respect to each parameter at the same time and moving them both downhill. The slope will be either positive or negative (until we reach the local/global floor, where it's flat!).
We step perpendicular to the contour line through our current position, which is the direction of steepest descent. How big a step we take is the learning rate alpha. If we aren't sure, we take baby steps, but baby steps will take forever to get there. So we have to guess how far to go. Ever try to navigate a room in the dark? You start taking baby steps and shuffling, because you don't want to slam into the bed or the dresser drawers.
So in layman's terms: we are standing on a hill, and we figure out the direction to go by pointing ourselves down the slope of the landscape. Then we take a modest step, reassess, and repeat until we are at the bottom safely.
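Putting the update rules and the step size together, here is a minimal sketch in Python/NumPy of one simultaneous gradient descent step and a simple loop around it (the learning rate and iteration count are arbitrary choices for this tiny, unscaled data set):

```python
import numpy as np

x = np.array([1500.0, 2000.0])        # square footage
y = np.array([160000.0, 210000.0])    # price

def gradient_descent_step(theta0, theta1, alpha):
    """One simultaneous update of theta_0 and theta_1."""
    m = len(y)
    errors = (theta0 + theta1 * x) - y       # h_theta(x_i) - y_i for every example
    grad0 = np.sum(errors) / m               # partial derivative w.r.t. theta_0
    grad1 = np.sum(errors * x) / m           # partial derivative w.r.t. theta_1
    # Both gradients use the *current* thetas; only then are both thetas updated.
    return theta0 - alpha * grad0, theta1 - alpha * grad1

theta0, theta1 = 0.0, 0.0
for _ in range(1000):
    # a tiny alpha because the square-footage feature is not scaled yet
    theta0, theta1 = gradient_descent_step(theta0, theta1, alpha=1e-7)
```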
Now we can also learn to do this not with a list of equations but with matrices.
$$ \begin{bmatrix} a & b \cr c & d\end{bmatrix} $$
Our simple data set … just like a spreadsheet.
$$ \begin{bmatrix} \$160{,}000 & 1{,}500 \cr \$210{,}000 & 2{,}000 \end{bmatrix} $$
This is a simple 2 x 2 matrix: 2 rows and 2 columns. A vector is a matrix with one column. So if we looked only at house prices, that would be a vector, or 2 x 1.
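A quick illustration of the difference in NumPy (shapes only, nothing fancy):

```python
import numpy as np

A = np.array([[160000, 1500],
              [210000, 2000]])     # 2 x 2 matrix: price and square footage per row
prices = A[:, [0]]                 # keep only the first column -> a 2 x 1 vector
print(A.shape, prices.shape)       # (2, 2) (2, 1)
```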
And what is a scalar? A single number. The word comes from scalaris: pertaining to a ladder, steps, a flight of steps. (1846 history there.)
Link to Linear Algebra page!!!
Linear Regression with multiple variables. Let's say we want our price prediction to include more than just square footage. We can add bedrooms or bathrooms or floors. Each feature is a new variable.
Our hypothesis will now look like:
$$ h_\theta(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \cdots + \theta_nx_n $$
Using matrix notation:
$$ h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \theta_3 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \cr x_1 \cr x_2 \cr x_3 \cr \vdots \cr x_n \end{bmatrix} = \theta^Tx $$
What we have done in this equation is vectorize our hypothesis. We are making one assumption to make the math work:
$$ x_0^i = 1 \text{ for } i \in \{1,\dots,m\} $$
Now we add bedrooms and bathrooms to the data set.
$$ X = \begin{bmatrix} 1 & 1,500 & 3 & 1.5 \cr 1 & 2,000 & 4 & 2.5\end{bmatrix} $$
$$ y = \begin{bmatrix} \$160{,}000 \cr \$210{,}000 \end{bmatrix} $$
$$ \theta = \begin{bmatrix} \theta_0 \cr \theta_1 \cr \theta_2 \cr \theta_3 \end{bmatrix} $$
$$ h_\theta(X) = X\theta $$
We want to find the thetas that when plugging in a new data set we arrive at our new price.
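A minimal sketch of the vectorized hypothesis in NumPy; the theta values below are made up purely to show the shapes involved:

```python
import numpy as np

X = np.array([[1.0, 1500.0, 3.0, 1.5],       # each row: x0 = 1, sqft, bedrooms, bathrooms
              [1.0, 2000.0, 4.0, 2.5]])
theta = np.array([[10000.0],                 # made-up parameters, just for illustration
                  [95.0],
                  [1000.0],
                  [2000.0]])
predictions = X @ theta                      # h_theta(X) = X theta, shape (2, 1)
print(predictions)
```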
Back to the cost function, now in its vectorized version.
$$ J(\theta) = \frac 1 {2m} (X\theta-\overrightarrow y)^T(X\theta - \overrightarrow y)$$
The gradient descent update is the same as before; we just add a step for each theta. So in this case we would have 4 equations for the vector theta.
$$ \nabla J(\theta) = \begin{bmatrix} \frac {\partial J(\theta)} {\partial \theta_0} \cr \frac {\partial J(\theta)} {\partial \theta_1} \cr \frac {\partial J(\theta)} {\partial \theta_2} \cr \frac {\partial J(\theta)} {\partial \theta_3}\end{bmatrix} $$
$$ \nabla J(\theta) = \frac 1 m X^T(X\theta - \overrightarrow y)$$
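Both vectorized formulas translate almost line for line into NumPy. A sketch, assuming X already carries the leading column of ones and that y and theta are column vectors (names are mine):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * (X theta - y)^T (X theta - y)"""
    m = len(y)
    residual = X @ theta - y
    return (residual.T @ residual).item() / (2 * m)

def gradient(theta, X, y):
    """Gradient of J: 1/m * X^T (X theta - y)"""
    m = len(y)
    return X.T @ (X @ theta - y) / m
```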
Now let's take a side-step journey into feature normalization. What we want to do is rescale all of our features so that gradient descent has an easier landscape to walk down. We do that with feature scaling and mean normalization: subtract each feature's mean and divide by its range or standard deviation. The goal is to get every feature into roughly the range of negative 3 to 3.
Just remember that we have to normalize any new data set to get an accurate prediction.
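A sketch of feature scaling with mean normalization (here dividing by the standard deviation; dividing by the range also works). The helper names are mine; the key point from the reminder above is to reuse the training set's mu and sigma on any new example:

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize and scale each feature column: (x - mean) / std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

X_train = np.array([[1500.0, 3.0, 1.5],      # features only, no column of ones
                    [2000.0, 4.0, 2.5]])
X_norm, mu, sigma = normalize_features(X_train)

# Any new example must be normalized with the *training* mu and sigma
x_new = (np.array([1800.0, 3.0, 2.0]) - mu) / sigma
```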
The normal equation (derivation here with intuition)
$$ \theta = (X^TX)^{-1}X^Ty $$
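In NumPy the normal equation is essentially a one-liner. A sketch using the X and y from the house example; pinv is used instead of a plain inverse as a safeguard when X^T X is not invertible (which is the case here, with only two examples and four parameters):

```python
import numpy as np

X = np.array([[1.0, 1500.0, 3.0, 1.5],
              [1.0, 2000.0, 4.0, 2.5]])
y = np.array([[160000.0], [210000.0]])

# theta = (X^T X)^{-1} X^T y; pinv handles the non-invertible case gracefully
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```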
Logistic Regression
Moving from predicting a value in a continuous range to classifying.
Logistic is an adjective, from logic: the "branch of philosophy that treats of forms of thinking, of true and false reasoning" (Latin (ars) logica). So: reasoning about true versus false.
Binary classification. It is or it isn’t.
$$ y \in \{0,1\} $$
$$ 0 \le h_\theta(x) \le 1 $$
and so we have a new type of hypothesis, built on something called the sigmoid function.
Etymology of sigmoid: from the 1660s shaped like a c and then 1786 shaped like an S from sigma. Sigma is the 18th letter of the Greek alphabet corresponding to S in latin.
$$ h_\theta(x) = g(\theta^Tx) $$
This theta transpose times x is just one data example: the ith sample of a training set of m items.
$$ z = \theta^T x $$
$$ g(z) = \frac 1 {1 + e^{-z}} $$
When graphing this function, its values approach a maximum of 1 and a minimum of 0, and it crosses 0.5 when z = 0. The horizontal axis is z and the vertical axis is g(z).
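A minimal sketch of g(z) in NumPy, just to confirm those landmark values:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))       # exactly 0.5
print(sigmoid(10))      # close to 1
print(sigmoid(-10))     # close to 0
```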
Now we get to add a new element … probabilities.
$$ h_\theta(x) = P(y = 1|x;\theta) = 1 - P(y = 0|x;\theta)$$ $$ P(y = 0|x;\theta) + P(y = 1|x;\theta) = 1 $$
and we have a decision boundary where
$$ h_\theta(x) \ge 0.5 \rightarrow y = 1 $$ $$ h_\theta(x) \lt 0.5 \rightarrow y = 0 $$
and
$$ g(z) \ge 0.5\ when\ z \ge 0 $$
Example trying to determine if a tumor is benign or malignant. In this case we can say that we want our hypothesis to predict malignant (1) or benign (0).
If our hypothesis returns 0.7 then we would predict that there is a 70% chance of the tumor being malignant.
Example:
$$ h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2) $$
Predict “y = 1” if
$$ \theta^Tx = -3 + x_1 + x_2 $$
$$-3 + x_1 + x_2 \ge 0 $$
$$ x_1 + x_2 \ge 3 $$
So we get our 3 cases of x_1 plus x_2 being greater than, equal to, or less than 3. When greater than 3, y = 1; when equal to 3, the hypothesis is 0.5; when less than 3, y = 0.
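A quick numeric check of that boundary with theta = (-3, 1, 1); the test points are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-3.0, 1.0, 1.0])
for x1, x2 in [(2.0, 2.0), (1.5, 1.5), (1.0, 1.0)]:     # sums: 4, 3, 2
    h = sigmoid(theta @ np.array([1.0, x1, x2]))
    print(x1 + x2, h, int(h >= 0.5))    # > 3 -> y = 1, = 3 -> h = 0.5, < 3 -> y = 0
```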
Note that the hypothesis can include non-linear decision boundaries in the x_1 versus x_2 plot, forming some funky shapes.
Also, some things to remember:
$$ z = 0,\ e^0 = 1 \Rightarrow g(z) = 1/2$$ $$ z \rightarrow \infty,\ e^{-z} \rightarrow 0 \Rightarrow g(z) \rightarrow 1$$ $$ z \rightarrow -\infty,\ e^{-z} \rightarrow \infty \Rightarrow g(z) \rightarrow 0$$
So as with linear regression … we now need to come up with a way to calculate thetas for our prediction equation/hypothesis.
Cost Function for Logistic Regression
The problem with reusing the squared-error cost from linear regression is that, with the sigmoid hypothesis, it creates a non-convex function. We need a cost function that is convex.
$$ J(\theta) = \frac 1 m \sum_{i=1}^mCost(h_\theta(x^i),y^i)$$
So …
$$ Cost(h_\theta(x),y) = \cases {-\log(h_\theta(x)) & \text{if } y = 1 \cr -\log(1-h_\theta(x)) & \text{if } y = 0 }$$
Intuitively: if y = 1, the cost is zero when the hypothesis is 1, and as the hypothesis approaches 0 the cost approaches infinity. Vice versa, if y = 0, the cost is zero when the hypothesis is 0, and as the hypothesis approaches 1 the cost approaches infinity.
Gradient descent
$$ J(\theta) = - \frac 1 m \left[\sum_{i=1}^m y^i\log h_\theta(x^i) + (1 - y^i)\log(1 - h_\theta(x^i))\right]$$
where our goal is to find
$$ min_\theta J(\theta) $$
So we would repeat for all thetas:
$$ \theta_j := \theta_j - \alpha \frac {\partial} {\partial \theta_j} J(\theta)$$
where
$$ \frac {\partial J(\theta)} {\partial\theta_j} = \frac 1 m \sum_{i=1}^m(h_\theta(x^i) - y^i)x_j^i $$
which turns out to look just like our linear regression update.
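A sketch of the logistic regression cost and gradient in vectorized NumPy form (assuming y and theta are plain 1-D arrays; names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -1/m * sum[ y*log(h) + (1-y)*log(1-h) ], with h = g(X theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m

def logistic_gradient(theta, X, y):
    """Same shape as the linear regression gradient: 1/m * X^T (h - y)."""
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m
```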
Other optimization algorithms besides gradient descent include conjugate gradient, BFGS, and L-BFGS. They are often faster than gradient descent and pick alpha automatically, but they are also more complex.
Multiclass classification.
Instead of benign or malignant we can look at more options. Say email is tagged with Work, Friends, Family, Hobby. In this case we have multiple classes. Then we can create a hypothesis focused on one class at a time in a one-vs-all (one-vs-rest) scenario:
$$ h_\theta^i (x) = P(y = i|x;\theta) ~~~ (i = 1,2,3)$$
In a one-vs-all system we train the logistic regression (i.e. our hypothesis predictor function) once for each class i to predict the probability of that i. Then, to make a prediction, we pick the class with the highest probability:
$$ \max_i h_\theta^i (x) $$
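A sketch of the one-vs-all prediction step. Training each per-class classifier is not shown; Theta below is a made-up matrix with one row of (already trained) parameters per class:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(Theta, x):
    """Theta: (num_classes, n+1) matrix, one trained theta vector per class.
    Returns the class with the highest predicted probability, plus all probabilities."""
    probs = sigmoid(Theta @ x)          # h_theta^i(x) for every class i
    return int(np.argmax(probs)), probs

Theta = np.array([[1.0, -2.0, 0.5],     # made-up parameters for class 0
                  [-1.0, 1.0, 1.0],     # class 1
                  [0.5, 0.0, -1.0]])    # class 2
x = np.array([1.0, 2.0, 0.5])           # [x0 = 1, x1, x2]
print(predict_one_vs_all(Theta, x))
```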