Week 1

Regression

Regression is a fundamental concept in supervised learning, a subset of machine learning

Key Points:

Housing Price Prediction Example:

The following chart illustrates a simple regression model for predicting house prices based on house size:

In this example, the scatter points represent individual house prices, while the trend line shows the general relationship between house size and price. The model can be used to predict the price of a house given its size, such as estimating that a 750 sq ft house might be worth around $150,000 to $200,000.

Regression models can vary in complexity, from simple linear relationships to more complex curves, depending on the nature of the data and the specific problem being addressed.

Classification

Key Points:

Breast Cancer Detection Example:

A common example of classification is breast cancer detection, where the algorithm predicts whether a tumor is benign (0) or malignant (1) based on features like tumor size and patient age.

In this example, the scatter points represent individual tumors, with blue circles for benign tumors and red crosses for malignant tumors. The green line represents a possible decision boundary that the classification algorithm might learn to separate the two classes.

The classification model uses multiple input features (in this case, tumor size and patient age) to make its prediction. The algorithm learns to draw a boundary that best separates the two classes based on the training data.

Comparison with Regression:

While both regression and classification are types of supervised learning, they differ in their output:

Understanding the difference between regression and classification is crucial for choosing the right algorithm for a given problem in machine learning.

Unsupervised Learning

Unsupervised learning is a type of machine learning where algorithms find patterns or structures in unlabeled data. Unlike supervised learning, there are no predetermined output labels.

Key Points:

Clustering Example: Google News

Google News uses clustering algorithms to group related news articles. For example:

News clustering example

The algorithm automatically groups articles mentioning similar words like "panda," "twin," and "zoo" without being explicitly programmed to do so.

Clustering: Grouping Customers

Companies can use clustering to segment their customers based on various attributes. Here's an example of customer clustering:

This chart demonstrates how customers might be clustered based on their motivations for learning:

Clustering helps companies understand different customer segments and tailor their services accordingly.

Linear Regression Model

Key Points:

Model Representation

In Linear Regression, the function f is represented as a straight line:

fw,b(x) = wx + b

Where:

Key Terminology

Types of Linear Regression

Choosing a Linear Function

While it's possible to fit more complex non-linear functions (like curves or parabolas), a linear function is chosen because:

The Importance of the Cost Function

To make Linear Regression work effectively, it's crucial to construct a cost function. The cost function is:

Understanding these concepts forms the foundation for more complex models and is essential for anyone looking to delve deeper into machine learning and AI.

Cost Function

The cost function shows how well the model fits the data. By adjusting its parameters it is possible to reduce the cost and improve the model fit.

Understanding the cost function is crucial for implementing efficient algorithms like gradient descent and gaining insight into how linear regression and other machine learning models work.

Linear Regression Model

The linear regression model is used here to fit the house size/price training set data

fw,b(x) = wx + b

Where:

Cost Function Definition

The cost function measures the error between predictions and actual values across the entire training set:

J(w,b) = 1 2m m i=1 ( ŷ(i) - y(i) )2
m = number of training examples
error = ŷ(i) - y(i)

Now replacing y hat with the regression equation :

J(w,b) = 
1
2m
 * Σi=1 to m (fw,b(x(i)) - y(i))2

Cost function

When we plot the cost function J against both parameters w and b, we obtain a three-dimensional surface plot, often resembling a bowl-shaped curve, where the minimum point of this surface represents the optimal values for w and b that minimize the cost function. Additionally, if we create a contour plot by plotting b against w, we typically see concentric circles or ellipses, with the center representing the minimum of the cost function. This 2D representation provides another way to visualize the optimisation landscape, where the tightest circle or ellipse in the middle corresponds to the lowest point of the 3D surface.

Gradient Descent

Gradient descent is an algorithm used to minimise the cost function j(w, b), commonly used in machine learning for linear regression and deep learning models. The goal is to find values of w and b that minimise the cost function.

Gradient descent is implemented using specific derivative equations for the parameters w and b of the linear regression model. These derivatives are calculated using calculus and guide the updates to w and b during each step of gradient descent. The pre-calculated derivative equations are provided by the person implementing the algorithm or by using a machine learning library like TensorFlow or PyTorch, where the derivatives are already computed by the library. Once derived, the computer uses these equations to efficiently calculate the updates at each iteration.

By using the squared error cost function for linear regression, we ensure that the cost function has a single global minimum rather than multiple local minima. This is because the squared error cost function is convex (bowl-shaped), so gradient descent will always converge to this global minimum, provided the learning rate is chosen appropriately.

Linear regression model

\[f_{w,b}(x) = wx + b\]

Cost function

\[J(w,b) = \frac{1}{2m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})^2\]

Gradient descent algorithm

repeat until convergence {

\[w = w - \alpha \frac{\partial}{\partial w}J(w,b) \rightarrow \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)}\]

\[b = b - \alpha \frac{\partial}{\partial b}J(w,b) \rightarrow \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})\]

}

Partial Derivatives

\[\frac{\partial}{\partial w}J(w,b) = \frac{\partial}{\partial w} \frac{1}{2m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})^2 = \frac{\partial}{\partial w} \frac{1}{2m} \sum_{i=1}^m (wx^{(i)} + b - y^{(i)})^2\]

\[= \frac{1}{m} \sum_{i=1}^m (wx^{(i)} + b - y^{(i)}) x^{(i)} = \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)}\]

\[\frac{\partial}{\partial b}J(w,b) = \frac{\partial}{\partial b} \frac{1}{2m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})^2 = \frac{\partial}{\partial b} \frac{1}{2m} \sum_{i=1}^m (wx^{(i)} + b - y^{(i)})^2\]

\[= \frac{1}{m} \sum_{i=1}^m (wx^{(i)} + b - y^{(i)}) = \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})\]

repeat until convergence {

\[\frac{\partial}{\partial w}J(w,b)\]

\[w = w - \alpha \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)}) x^{(i)}\]

\[b = b - \alpha \frac{1}{m} \sum_{i=1}^m (f_{w,b}(x^{(i)}) - y^{(i)})\]

}

\[\frac{\partial}{\partial b}J(w,b)\]

\[f_{w,b}(x^{(i)}) = wx^{(i)} + b\]

How Gradient Descent Works

Visualising Gradient Descent

Imagine the cost function surface like a hilly landscape. Gradient descent helps you take steps downhill (towards the minimum) by choosing the direction of steepest descent. This is done iteratively:

Convergence to Local Minima

Implementing Gradient Descent

The parameters are updated using the following rules:

ww − α ×  J(w, b)
w
bb − α ×  J(w, b)
b

Here, the symbol "←" denotes assignment (new value of the parameter is assigned based on the calculation).

α is a small positive number (e.g., 0.01) known as the learning rate.
It controls the size of the steps taken towards the minimum:

The derivative terms are:

J(w, b)
w
  and   J(w, b)
b

These represent the partial derivatives of the cost function with respect to w and b, respectively.
They indicate the direction and rate of increase of the cost function; their negatives point towards the steepest descent.

Calculate the updates for both w and b before updating either parameter.

Steps:

  1. Compute temporary variables:
    • temp_w ← w − α ×  J(w, b)
      w
    • temp_b ← b − α ×  J(w, b)
      b
  2. Update parameters:
    • w ← temp_w
    • b ← temp_b

This ensures both updates use the original values of w and b from the same iteration.

Updating w before computing the update for b would cause the b update to use the new w value.
This would lead to inconsistencies and deviates from the true gradient descent algorithm.

Continue updating w and b simultaneously until the changes become negligible.
Convergence is achieved when the parameters no longer change significantly with further iterations.

Each update moves the parameters in the direction that most rapidly decreases the cost function.
Simultaneous updates ensure a consistent descent path towards the minimum.

Choose an appropriate α to balance convergence speed and stability.

Detailed computation of the derivative terms:

J(w, b)
w
  and   J(w, b)
b

Gradient descent works by updating the parameter w based on the product of the learning rate (α) and the derivative of the cost function with respect to w.

If the derivative is positive, gradient descent reduces w, moving it towards the minimum.

If the derivative is negative, gradient descent increases w, again moving it closer to the minimum.

This process continues iteratively until w reaches the value that minimises the cost function. The derivative helps guide the direction and size of the update, while the learning rate controls the size of each step.

gradient-descent

If the learning rate is too small, gradient descent will take very small steps and converge slowly, requiring many iterations to reach the minimum.

If the learning rate is too large, the algorithm may take overly large steps, causing it to overshoot the minimum, potentially resulting in divergence or failure to converge.

As gradient descent approaches the local minimum, it naturally takes smaller steps because the derivative decreases, meaning the updates become smaller, even with a fixed learning rate. This allows gradient descent to eventually settle at the minimum.

gradient-descent