Week 3


Logistic Regression

Logistic regression is used to solve binary classification problems where the output label \( y \) is either 0 or 1. Linear regression usually doesn't work well for classification tasks because it can predict values outside the range [0, 1]. Logistic regression addresses this limitation by using the sigmoid function to map predictions to probabilities.

The Sigmoid (Logistic) Function

Purpose: Maps any real-valued number into the range between 0 and 1, making it suitable for probability estimation.

Formula:

\[ g(z) = \frac{1}{1 + e^{-z}} \]

Properties:

Graph Shape: S-shaped curve (sigmoid curve) that approaches 0 for large negative inputs and 1 for large positive inputs.
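As a quick numerical check of these properties, here is a minimal Python sketch of the sigmoid. The function name `sigmoid` and the sample inputs are illustrative choices rather than part of the notes, and branching on the sign of z is just one common way to avoid floating-point overflow.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form e^z / (1 + e^z),
    # so math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative inputs give values near 0, large positive inputs near 1.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"g({z:+.0f}) = {sigmoid(z):.5f}")
```

Note that g(0) = 0.5 and g(-z) = 1 - g(z), which is the symmetry behind the S-shape.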

Logistic Regression Model Formulation

Linear Combination:

\[ z = w \cdot x + b \]

Applying the Sigmoid Function:

\[ f(x) = g(z) = \frac{1}{1 + e^{-(w \cdot x + b)}} \]

Output: \( f(x) \) produces a value between 0 and 1, representing the estimated probability that \( y = 1 \).
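The two steps above (linear combination, then sigmoid) can be sketched in plain Python as follows. The helper name `predict_proba` and the parameter values are assumptions chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Return f(x) = g(w . x + b), the estimated probability that y = 1."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b  # linear combination
    return sigmoid(z)                              # squash into (0, 1)

# Hypothetical parameters for a model with two features.
w = [1.0, 1.0]
b = -3.0

for x in ([1.0, 1.0], [1.5, 1.5], [2.0, 2.0]):
    print(x, round(predict_proba(x, w, b), 3))
```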

Interpreting the Output as Probability

Probability Estimation: \( f(x) \) estimates \( P(y = 1 \mid x) \), the probability that the label is 1 given input \( x \).

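Example: If \( f(x) = 0.7 \), the model estimates a 70% probability that \( y = 1 \) and therefore a 30% probability that \( y = 0 \), since \( P(y = 0 \mid x) + P(y = 1 \mid x) = 1 \).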

Making Predictions with Logistic Regression

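Thresholding: Predict \( \hat{y} = 1 \) if \( f(x) \geq 0.5 \) and \( \hat{y} = 0 \) otherwise. Because \( g(z) \geq 0.5 \) exactly when \( z \geq 0 \), this is the same as predicting 1 whenever \( w \cdot x + b \geq 0 \).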
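A thresholding step can be sketched as below, assuming the common default threshold of 0.5. The function name `classify` and its `threshold` parameter are hypothetical; a different threshold may be preferable when the two kinds of error have different costs.

```python
def classify(probability, threshold=0.5):
    """Turn an estimated probability P(y = 1 | x) into a hard 0/1 label."""
    return 1 if probability >= threshold else 0

for p in (0.2, 0.5, 0.7):
    print(p, "->", classify(p))
```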

Decision Boundary in Logistic Regression

Definition: The decision boundary is the set of points where the probability \( f(x) \) is exactly 0.5. In logistic regression, this corresponds to the points where the linear combination \( z = w \cdot x + b = 0 \).

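Purpose: It separates the input feature space into two regions: one where \( z \geq 0 \) and the model predicts \( y = 1 \), and one where \( z < 0 \) and the model predicts \( y = 0 \).

Visualisation: With two features, the boundary can be drawn as a curve in the \( (x_1, x_2) \) plane, with the two predicted classes on either side of it.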

Example:

Consider a logistic regression model with parameters \( w \) and \( b \), where:

\[ z = w_1 x_1 + w_2 x_2 + b \]

If we set \( w_1 = 1 \), \( w_2 = 1 \), and \( b = -3 \), the decision boundary equation becomes:

\[ z = x_1 + x_2 - 3 = 0 \quad \Rightarrow \quad x_1 + x_2 = 3 \]

This equation represents a straight line in the feature space that separates the predictions for \( y = 0 \) and \( y = 1 \).
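To make the linear example concrete, the sketch below evaluates z for a few points using the parameters from the text (w_1 = 1, w_2 = 1, b = -3). The function name and the test points are illustrative assumptions.

```python
def z_linear(x1, x2, w1=1.0, w2=1.0, b=-3.0):
    """z = w1*x1 + w2*x2 + b for the example parameters in the text."""
    return w1 * x1 + w2 * x2 + b

# z < 0 means predict y = 0, z > 0 means predict y = 1, z == 0 is the boundary.
for x1, x2 in [(1.0, 1.0), (1.0, 2.0), (3.0, 3.0)]:
    z = z_linear(x1, x2)
    if z == 0:
        side = "on the decision boundary"
    elif z > 0:
        side = "on the y = 1 side"
    else:
        side = "on the y = 0 side"
    print((x1, x2), "z =", z, side)
```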

Non-linear Decision Boundaries:

By incorporating polynomial features, logistic regression can model more complex decision boundaries:

\[ z = w_1 x_1^2 + w_2 x_2^2 + b \]

For example, with \( w_1 = 1 \), \( w_2 = 1 \), and \( b = -1 \), the decision boundary becomes:

\[ x_1^2 + x_2^2 = 1 \]

This represents a circle with radius 1 centered at the origin, allowing the model to separate data in a non-linear fashion.
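The circular boundary can be checked the same way. This is a small sketch using the parameters from the text (w_1 = 1, w_2 = 1, b = -1); the function names and sample points are assumptions for illustration.

```python
def z_circle(x1, x2, w1=1.0, w2=1.0, b=-1.0):
    """z = w1*x1^2 + w2*x2^2 + b, using squared (polynomial) features."""
    return w1 * x1 ** 2 + w2 * x2 ** 2 + b

def predict(x1, x2):
    """Predict 1 outside or on the unit circle (z >= 0), 0 inside it."""
    return 1 if z_circle(x1, x2) >= 0 else 0

for point in [(0.0, 0.0), (0.5, 0.5), (1.0, 0.0), (2.0, 0.0)]:
    print(point, "z =", z_circle(*point), "prediction =", predict(*point))
```

With these parameters, points inside the circle are predicted as y = 0 and points outside as y = 1. The model is still linear in its parameters; only the features are non-linear.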


Cost Function

The cost function gives you a way to measure how well a specific set of parameters fits the training data.
The loss measures the difference between the prediction for a single example and its target value, while the cost is the average of the losses over the whole training set.
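For linear regression, the key point is that the squared error cost is convex. Combined with the sigmoid it is no longer convex, which is why logistic regression uses the log loss below.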

The Logistic Regression Loss Function

\[J(\overrightarrow{w}, b) = \frac{1}{m} \sum_{i=1}^m L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}), y^{(i)})\]

Where the loss function L is defined as:

\[L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}), y^{(i)}) = \begin{cases} -\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) & \text{if } y^{(i)} = 1 \\[2ex] -\log(1 - f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) & \text{if } y^{(i)} = 0 \end{cases} \]
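The piecewise loss can be sketched directly in Python. The function name `loss` is an illustrative choice, and the sketch assumes the prediction f is strictly between 0 and 1 so that the logarithm is defined.

```python
import math

def loss(f, y):
    """Logistic loss for one example.

    f is the predicted probability that y = 1 (0 < f < 1); y is 0 or 1.
    """
    if y == 1:
        return -math.log(f)
    return -math.log(1.0 - f)

# The loss is small when the prediction agrees with the label
# and grows quickly when the model is confidently wrong.
for f in (0.1, 0.5, 0.9):
    print(f"f = {f}: loss if y = 1 is {loss(f, 1):.3f}, "
          f"loss if y = 0 is {loss(f, 0):.3f}")
```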


Summary of Gradient Descent for Logistic Regression

Use gradient descent to find the \( w \) and \( b \) parameters that minimise the cost function for a logistic regression model.
Once \( w \) and \( b \) are chosen, a new \( x \) can be put into the model, which then predicts the probability that \( y \) is 1 (rather than 0).

The cost function measures the discrepancy between the predicted probabilities and the actual binary labels (0 or 1).

Cost function:

\[J(\overrightarrow{w}, b) = -\frac{1}{m} \sum_{i=1}^m \left[y^{(i)}\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) + (1 - y^{(i)})\log(1 - f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))\right]\]
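The single-line form of the cost can be sketched as follows. The six training points, the function names, and the two parameter settings are made up for illustration; the sketch also assumes predictions never reach exactly 0 or 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(X, y, w, b):
    """Average log loss J(w, b) over the m training examples."""
    total = 0.0
    for xi, yi in zip(X, y):
        f = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        # Only one of the two terms is non-zero, because yi is 0 or 1.
        total += yi * math.log(f) + (1 - yi) * math.log(1.0 - f)
    return -total / len(X)

# Hypothetical toy data: label 1 when x1 + x2 is large.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y = [0, 0, 0, 1, 1, 1]

print("all-zero parameters:", cost(X, y, [0.0, 0.0], 0.0))
print("w = [1, 1], b = -3: ", cost(X, y, [1.0, 1.0], -3.0))
```

With all parameters at zero every prediction is 0.5, so the cost is log 2 regardless of the labels. Parameters that place the boundary between the two classes give a lower cost.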

The update equations for \( w_j \) and \( b \) look the same as for linear regression, but they are different because \( f(x) \) is defined differently for linear and logistic regression:

\( w_j \) update equation:

\[w_j = w_j - \alpha \left[\frac{1}{m} \sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)})x_j^{(i)}\right]\]

b update equation:

\[b = b - \alpha \left[\frac{1}{m} \sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)})\right]\]

Linear regression equation:

\[f_{\overrightarrow{w},b}(\overrightarrow{x}) = \overrightarrow{w} \cdot \overrightarrow{x} + b\]

Logistic regression equation:

\[f_{\overrightarrow{w},b}(\overrightarrow{x}) = \frac{1}{1 + e^{-(\overrightarrow{w} \cdot \overrightarrow{x} + b)}}\]
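The update equations above can be turned into a short gradient descent loop. This is a minimal, unvectorised sketch: the toy data, the learning rate of 0.1, the iteration count, and all function names are assumptions for illustration. Both parameters are updated simultaneously from the same set of prediction errors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cost(X, y, w, b):
    total = 0.0
    for xi, yi in zip(X, y):
        f = predict_proba(xi, w, b)
        total += yi * math.log(f) + (1 - yi) * math.log(1.0 - f)
    return -total / len(X)

def gradient_step(X, y, w, b, alpha):
    """One simultaneous update of every w_j and of b."""
    m = len(X)
    errors = [predict_proba(xi, w, b) - yi for xi, yi in zip(X, y)]
    dw = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(len(w))]
    db = sum(errors) / m
    new_w = [wj - alpha * dwj for wj, dwj in zip(w, dw)]
    new_b = b - alpha * db
    return new_w, new_b

# Hypothetical toy data: label 1 when x1 + x2 is large.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y = [0, 0, 0, 1, 1, 1]

w, b = [0.0, 0.0], 0.0
initial_cost = cost(X, y, w, b)
for _ in range(1000):
    w, b = gradient_step(X, y, w, b, alpha=0.1)
final_cost = cost(X, y, w, b)

print("cost before:", initial_cost)
print("cost after: ", final_cost)
print("learned w, b:", w, b)
```

The only difference from a linear regression implementation is `predict_proba`, which applies the sigmoid to the linear combination.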

Similar to linear regression:


Overfitting and Underfitting:

Overfitting occurs when a model fits the training data too well, capturing noise and irrelevant details, which leads to poor generalisation to new data. This often happens when using too many features or overly complex models, such as high-order polynomials.

Underfitting happens when the model is too simple and unable to capture the underlying patterns in the data, leading to poor performance on both training and new data.

Bias and Variance:

High bias (underfitting) means the model makes strong assumptions (e.g., linearity) that may not match the data, leading to systematic errors.

High variance (overfitting) means the model is too sensitive to small variations in the training data, producing highly variable predictions for new data.

Generalisation:

A model’s ability to perform well on unseen data is called generalisation. A good model should balance fitting the training data with generalising well to new examples.

Techniques to Address Overfitting:

Collect more data: With more training data, complex models are less likely to overfit as they better capture the true patterns.

Reduce the number of features: By selecting a smaller subset of relevant features (feature selection), the model can focus on the most important aspects and avoid overfitting. This also lowers the risk of having too little data for the number of features, but it increases the risk that useful features are discarded.

Regularisation: Keep all the features, but reduce their effect.

Regularisation techniques reduce the impact of large parameters by encouraging the model to shrink parameter values, preventing overly complex models from fitting noise. This method allows keeping all features without overfitting. Note that usually only the \( w \) parameters are shrunk and \( b \) is left alone.

Regularisation in Practice:

Regularisation works by limiting the size of the parameters (e.g., in linear or logistic regression) to prevent overfitting while still retaining the features.

Common regularisation methods include L1 (Lasso) and L2 (Ridge) regularisation, which control how much the model relies on each feature.
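As a rough illustration of how the two penalties differ, the sketch below computes an L1 and an L2 penalty for a parameter vector. The function names are hypothetical, and constant scaling factors vary between courses and libraries, so treat the exact scaling here as an assumption.

```python
def l1_penalty(w, lam):
    """L1 (Lasso-style) penalty: lam times the sum of absolute values."""
    return lam * sum(abs(wj) for wj in w)

def l2_penalty(w, lam):
    """L2 (Ridge-style) penalty: lam times the sum of squares."""
    return lam * sum(wj ** 2 for wj in w)

w = [3.0, -4.0]
print("L1:", l1_penalty(w, lam=1.0))
print("L2:", l2_penalty(w, lam=1.0))
```

The L2 penalty grows quadratically, so it penalises a few large parameters more heavily than many small ones. The L1 penalty grows linearly and tends to drive some parameters to exactly zero.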

Cost Function with Regularisation

Instead of removing a feature, you can shrink its parameter (e.g. adding a large multiple of \(w_3^2\) to the cost pushes \(w_3\) close to zero).

Because we don't know in advance which \(w_j\) to shrink, we penalise all of them, using a regularisation parameter \(\lambda\) (lambda) to control the strength of the penalty. This means a value for \(\lambda\) has to be chosen.

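The regularised cost function adds a penalty term to the original cost \( J(\overrightarrow{w}, b) \):

\[J_{\text{reg}}(\overrightarrow{w}, b) = J(\overrightarrow{w}, b) + \frac{\lambda}{2m} \sum_{j=1}^n w_j^2\]

Here, \(\lambda\) is the regularisation parameter, \(m\) is the number of training examples, and \(n\) is the number of features. This new term encourages the parameters \(w_j\) to stay small, helping reduce overfitting. Importantly, regularisation does not apply to the bias term \(b\), as regularising it has little practical effect.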
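A regularised cost can be sketched as below for the logistic regression case, assuming the L2 penalty \(\frac{\lambda}{2m} \sum_j w_j^2\) added to the unregularised cost. The toy data, parameter values, and function names are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(X, y, w, b):
    """Unregularised average log loss."""
    total = 0.0
    for xi, yi in zip(X, y):
        f = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total += yi * math.log(f) + (1 - yi) * math.log(1.0 - f)
    return -total / len(X)

def regularised_cost(X, y, w, b, lam):
    """Unregularised cost plus (lam / 2m) * sum of w_j^2; b is not penalised."""
    m = len(X)
    penalty = (lam / (2 * m)) * sum(wj ** 2 for wj in w)
    return cost(X, y, w, b) + penalty

# Hypothetical toy data and parameters.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y = [0, 0, 0, 1, 1, 1]
w, b = [1.0, 1.0], -3.0

print("unregularised:", cost(X, y, w, b))
print("lambda = 1:   ", regularised_cost(X, y, w, b, lam=1.0))
```

For the same parameters, the regularised cost is never lower than the unregularised one, because the penalty is non-negative.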

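Choosing \(\lambda\):

The regularised cost has two goals: fitting the training data (keeping the original cost small) and keeping the parameters \(w_j\) small (keeping the penalty small). \(\lambda\) sets the balance between them. If \(\lambda = 0\), there is no penalty and the model can overfit. If \(\lambda\) is very large, every \(w_j\) is pushed towards zero, so the prediction depends almost entirely on \(b\) and the model underfits.

The trade-off between these two goals is central to regularisation. Finding the right value of \(\lambda\) helps ensure the model generalises well without being overly complex or too simplistic.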

Regularised Logistic Regression

For linear regression, \( f(x) \) is a linear function; for logistic regression, \( f(x) \) is the sigmoid (logistic) function applied to that linear function.

When logistic regression is applied to a dataset with many features, especially high-order polynomial features, it can be prone to overfitting, leading to overly complex decision boundaries.

Regularisation helps to address this by adding a penalty term to the cost function, which discourages the parameters from becoming too large and leads to a simpler, more general decision boundary.

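When using gradient descent to minimise this regularised cost function, the derivative of the penalty term adds an extra term to the update rule. For each parameter \(w_j\), the update rule becomes:

\[w_j = w_j - \alpha \left[\frac{1}{m} \sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m} w_j\right]\]

where \(\alpha\) is the learning rate. Note that the bias term \(b\) is not regularised, so its update rule remains unchanged.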
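A regularised update step might look like the sketch below. It assumes the penalty \(\frac{\lambda}{2m} \sum_j w_j^2\), whose derivative contributes \(\frac{\lambda}{m} w_j\) to each \(w_j\) gradient. The toy data, starting parameters, and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def regularised_gradient_step(X, y, w, b, alpha, lam):
    """One gradient descent step on the L2-regularised logistic cost."""
    m = len(X)
    errors = [predict_proba(xi, w, b) - yi for xi, yi in zip(X, y)]
    new_w = []
    for j, wj in enumerate(w):
        # Data gradient plus the penalty gradient (lam / m) * w_j.
        dj = sum(e * xi[j] for e, xi in zip(errors, X)) / m + (lam / m) * wj
        new_w.append(wj - alpha * dj)
    # The bias update has no penalty term.
    new_b = b - alpha * sum(errors) / m
    return new_w, new_b

# Hypothetical toy data and starting parameters.
X = [[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]]
y = [0, 0, 0, 1, 1, 1]
w0, b0 = [1.0, 1.0], -3.0

plain_w, plain_b = regularised_gradient_step(X, y, w0, b0, alpha=0.1, lam=0.0)
reg_w, reg_b = regularised_gradient_step(X, y, w0, b0, alpha=0.1, lam=1.0)
print("lambda = 0:", plain_w, plain_b)
print("lambda = 1:", reg_w, reg_b)
```

Setting `lam=0.0` recovers the unregularised update, and the bias update is identical in both cases.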

In the lab, the cost is shown to be a bit higher with regularisation. This is expected, as it is the normal cost plus the regularisation penalty.