Normalizing and Standardizing

Normalizing and Standardizing#

Feature engineering is a crucial step in the machine learning pipeline, and normalizing and standardizing are two essential techniques for preparing data before feeding it into models. These processes help improve the performance and convergence speed of machine learning algorithms by ensuring that features have similar scales.

Normalizing#

Normalization is the process of scaling individual samples to have unit norm. This can be particularly useful when using machine learning algorithms that rely on distance calculations, such as k-nearest neighbors (KNN) or support vector machines (SVM).

Formula:

For a feature vector \( x \), normalization typically transforms it as follows:

\[ x' = \frac{x - \min(x)}{\max(x) - \min(x)} \]

This scales the values of \( x \) to a range of [0, 1].

Example:

Consider a dataset with a feature representing the age of individuals, ranging from 18 to 90 years. Without normalization, age values will dominate over other features like height (measured in centimeters) when calculating distances. Normalizing the age values to a [0, 1] range ensures that no single feature unduly influences the model.

Real-world Example:

Image Processing: In image processing tasks, pixel values are often normalized to a range of [0, 1] to ensure that the neural network treats all pixel intensities equally.

Standardizing#

Standardization transforms features to have a mean of zero and a standard deviation of one. This is particularly useful for algorithms like linear regression, logistic regression, and neural networks that assume normally distributed data.

Formula:

For a feature vector \( x \), standardization typically transforms it as follows:

\[ x' = \frac{x - \mu}{\sigma} \]

where \( \mu \) is the mean of the feature and \( \sigma \) is the standard deviation.