Logistic Regression

Logistic regression is used to solve binary classification problems where the output label \( y \) is either 0 or 1. Linear regression usually doesn't work well for classification tasks because it can predict values outside the range [0, 1]. Logistic regression addresses this limitation by using the sigmoid function to map predictions to probabilities.

The Sigmoid (Logistic) Function

Purpose: Maps any real-valued number into the range between 0 and 1, making it suitable for probability estimation.

Formula:

\[ g(z) = \frac{1}{1 + e^{-z}} \]

e is the mathematical constant approximately equal to 2.71828.

Properties:

When \( z \rightarrow +\infty \): \( g(z) \rightarrow 1 \)
When \( z = 0 \): \( g(0) = 0.5 \)
When \( z \rightarrow -\infty \): \( g(z) \rightarrow 0 \)

Graph Shape: S-shaped curve (sigmoid curve) that approaches 0 for large negative inputs and 1 for large positive inputs.

Logistic Regression Model Formulation

Linear Combination:

\[ z = w \cdot x + b \]

\( w \) is the weight vector.
\( x \) is the input feature vector.
\( b \) is the bias term.

Applying the Sigmoid Function:

\[ f(x) = g(z) = \frac{1}{1 + e^{-(w \cdot x + b)}} \]

Output: \( f(x) \) produces a value between 0 and 1, representing the estimated probability that \( y = 1 \).

Interpreting the Output as Probability

Probability Estimation: \( f(x) \) estimates \( P(y = 1 \mid x) \), the probability that the label is 1 given input \( x \).

Example:

If \( f(x) = 0.7 \), there is a 70% chance that \( y = 1 \) (e.g., tumor is malignant).
Consequently, there's a 30% chance that \( y = 0 \) (tumor is benign).

Making Predictions with Logistic Regression

Thresholding:

Decision Rule: If \( f(x) \geq 0.5 \), predict \( y = 1 \); otherwise, predict \( y = 0 \).

Decision Boundary in Logistic Regression

Definition: The decision boundary is the set of points where the probability \( f(x) \) is exactly 0.5. In logistic regression, this corresponds to the points where the linear combination \( z = w \cdot x + b = 0 \).

Purpose: It separates the input feature space into two regions:

Region 1: Where \( f(x) \geq 0.5 \), and the model predicts \( y = 1 \).
Region 0: Where \( f(x) \lt 0.5 \), and the model predicts \( y = 0 \).

Visualisation:

In two dimensions, the decision boundary is a line that divides the plane into two halves.
In higher dimensions, it becomes a hyperplane that separates the space.

Example:

Consider a logistic regression model with parameters \( w \) and \( b \), where:

\[ z = w_1 x_1 + w_2 x_2 + b \]

If we set \( w_1 = 1 \), \( w_2 = 1 \), and \( b = -3 \), the decision boundary equation becomes:

\[ z = x_1 + x_2 - 3 = 0 \] \[ \Rightarrow x_1 + x_2 = 3 \]

This equation represents a straight line in the feature space that separates the predictions for \( y = 0 \) and \( y = 1 \).

Non-linear Decision Boundaries:

By incorporating polynomial features, logistic regression can model more complex decision boundaries:

\[ z = w_1 x_1^2 + w_2 x_2^2 + b \]

For example, with \( w_1 = 1 \), \( w_2 = 1 \), and \( b = -1 \), the decision boundary becomes:

\[ x_1^2 + x_2^2 = 1 \]

This represents a circle with radius 1 centered at the origin, allowing the model to separate data in a non-linear fashion.

Cost Function

The cost function gives you a way to measure how well a specific set of parameters fits the training data.
Loss is a measure of the difference of a single example to its target value while the Cost is a measure of the losses over the training set
For Linear Regression key points :

Avoid the Squared Error in Classification: It leads to non-convex cost functions in logistic regression.
Use Logistic Loss Function: It provides a suitable measure of error for binary classification and results in a convex cost function.
Convex Cost Function Benefits: Guarantees that optimization algorithms can find the global minimum, leading to better model performance.
Implementation: The logistic regression cost function is calculated using the negative log-likelihood of the predictions, averaged over all training examples.

The Logistic Regression Loss Function

\[J(\overrightarrow{w}, b) = \frac{1}{m} \sum_{i=1}^m L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}), y^{(i)})\]

Where the loss function L is defined as:

\[L(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}), y^{(i)}) = \begin{cases} -\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) & \text{if } y^{(i)} = 1 \\[2ex] -\log(1 - f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) & \text{if } y^{(i)} = 0 \end{cases} \]

Summary of Gradient Descent for Logistic Regression

Use gradient descent to find the w and b parameters that minimse the cost function for a logistic regression model
Once w and b are chosen, put a new x into the model and it will predict the probability that y is 1 (not 0)

The cost function measures the discrepancy between the predicted probabilities and the actual binary labels (0 or 1).

Cost function:

\[J(\overrightarrow{w}, b) = -\frac{1}{m} \sum_{i=1}^m \left[y^{(i)}\log(f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)})) + (1 - y^{(i)})\log(1 - f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}))\right]\]

Equations for w_j and b look similar to linear regression but they are different because f(x) is different for linear and logistic regression:

w_j update equation:

\[w_j = w_j - \alpha \left[\frac{1}{m} \sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)})x_j^{(i)}\right]\]

b update equation:

\[b = b - \alpha \left[\frac{1}{m} \sum_{i=1}^m (f_{\overrightarrow{w},b}(\overrightarrow{x}^{(i)}) - y^{(i)})\right]\]

Linear regression equation:

\[f_{\overrightarrow{w},b}(\overrightarrow{x}) = \overrightarrow{w} \cdot \overrightarrow{x} + b\]

Logistic regression equation:

\[f_{\overrightarrow{w},b}(\overrightarrow{x}) = \frac{1}{1 + e^{-(\overrightarrow{w} \cdot \overrightarrow{x} + b)}}\]

Similar to linear regression :

Implementing gradient descent requires simultaneous updates of all parameters.
Feature scaling is critical
Vectorisation techniquesgreatly improve the computational efficiency of gradient descent, especially when dealing with large datasets.

Overfitting and Underfitting:

Overfitting occurs when a model fits the training data too well, capturing noise and irrelevant details, which leads to poor generalisation to new data. This often happens when using too many features or overly complex models, such as high-order polynomials.

Underfitting happens when the model is too simple and unable to capture the underlying patterns in the data, leading to poor performance on both training and new data.

Bias and Variance:

High bias (underfitting) means the model makes strong assumptions (e.g., linearity) that may not match the data, leading to systematic errors.

High variance (overfitting) means the model is too sensitive to small variations in the training data, producing highly variable predictions for new data.

Generalisation:

A model’s ability to perform well on unseen data is called generalisation. A good model should balance fitting the training data while generalizing well to new examples.

Techniques to Address Overfitting:

Collect more data: With more training data, complex models are less likely to overfit as they better capture the true patterns.

Reduce the number of features: By selecting a smaller subset of relevant features (feature selection), the model can focus on the most important aspects and avoid overfitting. Also removes risk there is insufficient data, but does increase risk useful features are lost.

Regularisation: Keep all the features, but reduce their effect.

Regularisation techniques reduce the impact of large parameters by encouraging the model to shrink parameter values, preventing overly complex models from fitting noise. This method allows keeping all features without overfitting. NB Usually just reduce w parameters and leave b.

Regularisation in Practice:

Regularisation works by limiting the size of the parameters (e.g., in linear or logistic regression) to prevent overfitting while still retaining the features.

Common regularisation methods include L1 (Lasso) and L2 (Ridge) regularisation, which control how much the model relies on each feature.

Cost Function with Regularisation

Instead of removing a feature, you minimise it (eg multiply w_3 will reduce w_3 close to zero.

Because we dont know which w to minimse, we minimse all the w using a regularisation parameter lamda to penalise all the values of w. So have to choose a value for lamda.

The regularised cost function is modified as follows:

Here, \(\lambda\) is the regularisation parameter, and \(m\) is the number of training examples. This new term encourages the parameters \(w_j\) to stay small, helping reduce overfitting. Importantly, regularisation does not apply to the bias term \(b\), as regularising it has little practical effect.

Choosing \(\lambda\):

If \(\lambda = 0\), the model is not regularised and is likely to overfit, producing a highly complex function.
If \(\lambda\) is very large, the parameters are forced to be very small, leading to underfitting, where the model is overly simplistic.
The key is to choose a value for \(\lambda\) that strikes a balance between fitting the training data well and keeping the parameters small to avoid overfitting.

The trade-off between these two goals is central to regularisation. Finding the right value of \(\lambda\) helps ensure the model generalises well without being overly complex or too simplistic.

Regularised Logistic Regression

For linear regression, f(x) is a linear function, for logistic regression, f(x) is the sigmoid (logistic) function

When logistic regression is applied to a dataset with many features, especially high-order polynomial features, it can be prone to overfitting, leading to overly complex decision boundaries.

Regularisation helps to address this by adding a penalty term to the cost function which discourages the parameters from becoming too large, leading to a simpler, more generalised decision boundary

When using gradient descent to minimize this regularised cost function, the derivative of the penalty term affects the update rule. For each parameter , the update rule becomes:

where is the learning rate. Note that the bias term b is not regularized, so its update rule remains unchanged.

In the lab, the cost function is shown to be a bit higher with the regularisation (as expected as its the normal cost plus regularisation cost)

Week 3