Week 2 Cheat Sheet

Multiple Features

Vectorisation is a technique used to make machine learning algorithms, like linear regression, more efficient by performing operations on entire vectors or matrices at once, instead of using loops to process one element at a time.

In the context of multiple linear regression, vectorisation allows you to compute the dot product of the feature vector and parameter vector in a single step, which simplifies the implementation and speeds up calculations.

By leveraging vectorised operations, you can take advantage of optimised linear algebra libraries, leading to faster computation and more scalable models.

The linear regression model with single feature:

fw,b(x) = wx + b

The linear regression model with multiple features:

fw,b(x) = w1x1 + w2x2 + ... + wnxn + b

w = [w1 w2 w3 ... wn] is the vector of model parameters (weights)

b is a single number (the bias)

x = [x1 x2 x3 ... xn] is the feature vector

fw,b(x) = w · x + b = w1x1 + w2x2 + w3x3 + ... + wnxn + b

The dot product w · x expands to w1x1 + w2x2 + w3x3 + ... + wnxn; a linear regression with multiple input features like this is called multiple linear regression.

Example in Python

The following Python code defines the weights, bias, and feature vector:

import numpy as np

w = np.array([1.0, 2.5, -3.3])   # weight vector (one weight per feature)
b = 4                            # bias term
x = np.array([10, 20, 30])       # feature vector for one training example

In this example, w is the weight vector, b is the bias term, and x is the feature vector. The values in w and x represent the parameters and features for a model with 3 features (n = 3).

Without vectorisation

When performing calculations without vectorisation, each weight is multiplied by the corresponding feature value, and then the bias is added:

fw,b(x) = w1x1 + w2x2 + w3x3 + b

f = (w[0] * x[0] +
     w[1] * x[1] +
     w[2] * x[2] + b)

Without vectorisation, each calculation is performed sequentially, which can be inefficient for large datasets.

If the number of features is large (e.g., n = 100,000), writing out every term by hand is impractical, so the sum is computed with a loop, which is still slow:

fw,b(x) = ∑ wjxj + b   (sum over j = 1 to n)

n = len(w)   # number of features
f = 0
for j in range(n):
    f = f + w[j] * x[j]
f = f + b

This method is computationally expensive when the number of features, n, is very large.

Vectorisation

Using vectorisation, the same calculation can be done in a single step.

Without vectorisation, the model and cost function look like this:

fw,b(x) = w1x1 + ... + wnxn + b

J(w1, ..., wn, b)

With vectorisation the model and cost function look like this:

fw,b(x) = w · x + b

J(w, b)

Vectorisation allows us to use efficient linear algebra operations, like the dot product, to compute the result much faster. For instance, np.dot(w, x) + b computes the whole prediction in one line.
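
As a small sketch of the difference, reusing the w, b, and x values defined earlier (the loop and the dot product produce the same prediction):

import numpy as np

w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])

# Loop-based prediction (what the for loop above computes)
f_loop = 0
for j in range(len(w)):
    f_loop = f_loop + w[j] * x[j]
f_loop = f_loop + b

# Vectorised prediction: one dot product plus the bias
f_vec = np.dot(w, x) + b

print(f_loop, f_vec)   # both equal -35.0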

Gradient Descent for multiple linear regression

The main difference in gradient descent for multiple features is that the parameters and features are vectors, but the update rule still follows the same principle: you adjust each parameter by subtracting the product of the learning rate and the derivative of the cost function with respect to that parameter.

In single-feature linear regression, gradient descent updates the parameter w and the bias b using their respective derivatives with respect to the cost function. When you move to multiple-feature linear regression, the key difference is that both the parameters w and the features x become vectors rather than single values.
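
A minimal sketch of one gradient descent update with NumPy, assuming a training matrix X of shape (m, n), targets y of shape (m,), and the squared-error cost; the function name gradient_descent_step and the variable names are illustrative, not taken from the course code:

import numpy as np

def gradient_descent_step(X, y, w, b, alpha):
    # X: (m, n) training examples, y: (m,) targets,
    # w: (n,) weights, b: scalar bias, alpha: learning rate
    m = X.shape[0]

    # Errors fw,b(x) - y for all m examples at once (vectorised)
    errors = X @ w + b - y            # shape (m,)

    # Partial derivatives of the squared-error cost J(w, b)
    dj_dw = (X.T @ errors) / m        # shape (n,)
    dj_db = np.sum(errors) / m        # scalar

    # Simultaneous update of every parameter
    w = w - alpha * dj_dw
    b = b - alpha * dj_db
    return w, b

Repeating this step until the cost J(w, b) stops decreasing gives the fitted parameters.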

Scaling features

When you have different features that take on very different ranges of values, it can cause gradient descent to run slowly. Rescaling the features so they all take on comparable ranges of values can significantly speed up gradient descent. Scaling features helps ensure that all features contribute equally to the learning process, preventing features with larger scales from dominating and thereby improving the efficiency and convergence of gradient descent.

Mean normalisation: x1 = (x1 - μ1) / (max - min)
Z-score normalisation: x1 = (x1 - μ1) / σ1
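
A quick sketch of both rescaling methods in NumPy; the house-size values are made-up sample numbers:

import numpy as np

x1 = np.array([2104.0, 1416.0, 1534.0, 852.0])   # e.g. house sizes (illustrative values)

# Mean normalisation: (x1 - mean) / (max - min)
x1_mean_norm = (x1 - x1.mean()) / (x1.max() - x1.min())

# Z-score normalisation: (x1 - mean) / standard deviation
x1_zscore = (x1 - x1.mean()) / x1.std()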

Tips on the ranges to aim for when rescaling features

aim for about -1 ≤ xj ≤ 1 for each feature xj
-3 ≤ xj ≤ 3 acceptable range
-0.3 ≤ xj ≤ 0.3 acceptable range
0 ≤ x1 ≤ 3 okay, no rescaling
-2 ≤ x2 ≤ 0.5 okay, no rescaling
-100 ≤ x3 ≤ 100 too large → rescale
-0.001 ≤ x4 ≤ 0.001 too small → rescale
98.6 ≤ x5 ≤ 105 too large → rescale

Feature Engineering

Feature engineering is the process of creating new features by transforming or combining existing ones, based on your knowledge or intuition about the problem. The goal is to make it easier for the learning algorithm to make accurate predictions by providing more relevant or insightful features, which can lead to a better-performing model than simply using the original features.

For example, feature engineering can involve creating a new feature, such as calculating the area of land (x3 = x1 × x2) from the width (x1) and depth (x2) of a house lot. This new feature is added to the model alongside the original features, allowing the algorithm to determine whether the area or the individual dimensions are more predictive for the target outcome, like house price.
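
A short sketch of that transformation in NumPy (the width and depth values are made-up examples):

import numpy as np

x1 = np.array([30.0, 40.0, 25.0])   # lot width (frontage), illustrative values
x2 = np.array([60.0, 50.0, 80.0])   # lot depth, illustrative values

# Engineered feature: lot area
x3 = x1 * x2

# Stack all three features into a design matrix, one row per training example
X = np.column_stack((x1, x2, x3))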

By incorporating polynomial terms into our regression models, we can move beyond straight lines and better capture the underlying trends in our data.

Polynomial Regression

Polynomial regression lets us fit non-linear relationships. Instead of modelling the target variable y solely as a linear function of x, polynomial regression includes terms such as x², x³, or √x, enabling the model to capture curves in the data.

Selecting the appropriate degree for the polynomial is crucial. Higher-degree polynomials can model more complex relationships but may introduce overfitting, where the model captures noise instead of the underlying pattern. It's essential to choose a degree that balances model complexity with the ability to generalize to new data.
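
One way to sketch this in NumPy is to build the polynomial terms as extra feature columns and keep the model linear in the parameters; the size values below are illustrative:

import numpy as np

x = np.arange(1, 21, dtype=float)    # e.g. house sizes (illustrative)

# Engineered polynomial and root features
X = np.column_stack((x, x**2, x**3, np.sqrt(x)))

# The model is still linear in the parameters:
# fw,b(x) = w1*x + w2*x**2 + w3*x**3 + w4*sqrt(x) + b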

Application to Housing Prices

Consider a dataset used to predict housing prices based on the size of the house in square feet. A simple linear regression might not fit the data well if the relationship between size and price is non-linear. By incorporating polynomial terms, we can model this non-linear relationship more effectively.

For instance, adding a quadratic term (x²) allows the model to fit a parabolic curve to the data. However, a quadratic model might eventually predict that prices decrease with increasing size, which may not make sense in this context since larger houses typically cost more. Alternatively, including a cubic term (x³) can model more complex curves that continue to increase with size, potentially providing a better fit to the data.

Using functions like the square root of x (√x) is another way to model relationships where the rate of increase in price slows down as size increases. This function becomes less steep as x increases, capturing scenarios where additional square footage adds less value beyond a certain point.

Importance of Feature Scaling

When using polynomial features, feature scaling becomes increasingly important. Polynomial terms can have vastly different scales: if x ranges from 1 to 1,000, then x² ranges from 1 to 1,000,000 and x³ from 1 to 1,000,000,000.

Without scaling, features with larger values can disproportionately influence the model, leading to slow convergence or instability in gradient descent optimization. Applying feature scaling ensures that all features contribute equally to the learning process, improving the efficiency and convergence of the algorithm.
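
A brief sketch of z-score scaling applied column by column to such polynomial features (illustrative values):

import numpy as np

x = np.arange(1, 1001, dtype=float)
X = np.column_stack((x, x**2, x**3))   # column ranges differ by many orders of magnitude

# Z-score scale each column so gradient descent sees comparable ranges
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)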