Box-Cox Transformation explained

The goal of the Box-Cox transformation is to reshape the data so that its distribution is as close to a normal distribution as possible, that is, so that the histogram looks like a bell curve.

This technique has its place in feature engineering because not all kinds of predictive models are robust to skewed data, so it is worth trying when experimenting. It probably won’t provide a spectacular improvement, although at the fine-tuning stage it can serve its purpose by improving our evaluation metric.

Box-Cox Equation in code

The transformation itself has the following formula:

x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln x, & \lambda = 0
\end{cases}

Let’s express it in code using the Python standard library:

import math

def box_cox(x, lmbda):
  # All values in x must be strictly positive
  if lmbda == 0:
    # The lambda = 0 case is defined as the natural logarithm
    return [math.log(v) for v in x]

  return [(math.pow(v, lmbda) - 1) / lmbda for v in x]
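
A quick sanity check of both branches; the values in the comments are what the calls above return, rounded to three decimals:

print(box_cox([1.0, 2.0, 3.0], 0))    # [0.0, 0.693, 1.099], natural logarithms
print(box_cox([1.0, 2.0, 3.0], 0.5))  # [0.0, 0.828, 1.464]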

Or using the NumPy package:

import numpy as np

def box_cox(x: list, lmbda: float) -> np.ndarray:
  # NumPy applies log and power element-wise, so no explicit loop is needed
  if lmbda == 0:
    return np.log(x)

  return (np.power(x, lmbda) - 1) / lmbda

I have the data, but how do I select the lambda?

The case is not complicated: we need a normality test, compare its results for several lambdas in the (customarily) interval [-5, 5], then choose the one whose test result is the best.
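
For illustration, here is a minimal sketch of that search. The choice of the Shapiro-Wilk test (scipy.stats.shapiro) and the 0.1-spaced grid are my assumptions; any normality test and any grid would do:

from scipy import stats

def find_best_lambda(x):
  # Candidate lambdas: a 0.1-spaced grid over the customary interval [-5, 5]
  lambdas = [l / 10 for l in range(-50, 51)]

  # Assumption: a higher Shapiro-Wilk p-value means the transformed
  # data looks more normal, so we treat it as the "best test result"
  best_lambda, best_p_value = None, -1.0
  for lmbda in lambdas:
    _, p_value = stats.shapiro(box_cox(x, lmbda))
    if p_value > best_p_value:
      best_lambda, best_p_value = lmbda, p_value

  return best_lambda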

An out-of-the-box solution is provided by the SciPy package: when the second argument (lambda) is not given to the boxcox function, it will be fitted and returned along with the transformed data.

Box-Cox in SciPy
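
A minimal sketch of that behavior; the input values below are made up, and under the hood SciPy picks the lambda that maximizes the Box-Cox log-likelihood:

from scipy import stats

data = [1.0, 2.5, 4.0, 7.5, 13.0, 20.0]

# With lambda omitted, boxcox fits it and returns a (transformed, lambda) pair
transformed, fitted_lambda = stats.boxcox(data)

# With lambda given, only the transformed array is returned
transformed_again = stats.boxcox(data, lmbda=fitted_lambda)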

The only problem we encounter when using this implementation is the requirement that all input values be strictly greater than zero. However, we can simply shift the values by the minimum of the dataset (plus one):

def shift_to_positive(x):
  min_value = np.min(x)
  if min_value > 0:
    # Already strictly positive, no shift needed
    return x, 0

  # Shift by |min| + 1 so that the smallest element becomes exactly 1
  shift_value = np.abs(min_value) + 1

  return x + shift_value, shift_value
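
Combined with SciPy, it could be used as below; a minimal sketch with made-up data. Keeping the returned shift_value around lets you undo the shift later, e.g. after an inverse transformation:

import numpy as np
from scipy import stats

data = np.array([-3.0, 0.0, 2.0, 5.0])

# shift_value = 4, shifted = [1.0, 4.0, 6.0, 9.0]
shifted, shift_value = shift_to_positive(data)
transformed, fitted_lambda = stats.boxcox(shifted)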

Example: population by state in 2007

The full version of the code can be found in this online notebook; here I will only comment on the results.

[Figure: Box-Cox example. Input data (left), log transform (middle), Box-Cox transform (right)]

On the left, we see the distribution of our input data. A keen eye will notice that applying the logarithm (middle column) already brings our data close to a normal distribution, but the best effect is achieved by the title transformation (right column).

Box-Cox as a Scikit-learn transformer

Let’s implement it as a ready-to-use scikit-learn transformer, so you can use it in a Pipeline or FeatureUnion. It also handles a train/test split correctly; remember, lambda has to be fitted using the train dataset only.

from __future__ import annotations

import numpy as np
from scipy import stats
from sklearn.base import (
    TransformerMixin,
    BaseEstimator,
)


class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    fitted_lambda: float

    def fit(self, x: np.ndarray, y=None) -> BoxCoxTransformer:
        # stats.boxcox expects a 1-D array of strictly positive values;
        # with lambda omitted it returns (transformed data, fitted lambda)
        _, self.fitted_lambda = stats.boxcox(x)
        return self

    def transform(self, x: np.ndarray) -> np.ndarray:
        # Note that for x of length 1, stats.boxcox will raise an error
        return stats.boxcox(x, lmbda=self.fitted_lambda)
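
Usage could look like this; feature_train and feature_test are placeholder arrays made up for illustration:

feature_train = np.array([1.0, 2.0, 5.0, 10.0, 20.0])
feature_test = np.array([3.0, 8.0])

# Lambda is estimated on the training data only...
transformer = BoxCoxTransformer().fit(feature_train)

# ...and reused as-is for the test data
train_transformed = transformer.transform(feature_train)
test_transformed = transformer.transform(feature_test)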
