The goal of the Box-Cox transformation is to transform the data so that its distribution is as close to a normal distribution as possible, i.e., so that its histogram looks like a bell curve.
This technique has its place in feature engineering because not all types of predictive models are robust to skewed data, so it is worth trying when experimenting. It probably won’t provide a spectacular improvement on its own, but at the fine-tuning stage it can earn its keep by improving our evaluation metric.
Box-Cox Equation in code
The transformation itself has the following formula:

y(λ) = (x^λ − 1) / λ  for λ ≠ 0
y(λ) = ln(x)          for λ = 0
Let’s express it in code using only the Python standard library:
```python
import math


def box_cox(x, lmbda):
    if lmbda == 0:
        return [math.log(v) for v in x]
    return [(math.pow(v, lmbda) - 1) / lmbda for v in x]
```
Or using the NumPy package:
```python
import numpy as np


def box_cox(x: list, lmbda: float):
    if lmbda == 0:
        return np.log(x)
    return (np.power(x, lmbda) - 1) / lmbda
```
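A quick sanity check of the standard-library version; the sample values below are my own, not from the article:

```python
import math


def box_cox(x, lmbda):
    if lmbda == 0:
        return [math.log(v) for v in x]
    return [(math.pow(v, lmbda) - 1) / lmbda for v in x]


# For lambda = 0.5 the formula reduces to 2 * (sqrt(v) - 1)
print(box_cox([1.0, 4.0, 9.0], 0.5))  # [0.0, 2.0, 4.0]
# For lambda = 0 the transform is simply the natural logarithm
print(box_cox([1.0, math.e], 0.0))    # [0.0, 1.0]
```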
I have the data, but how to select the lambda?
The case is not complicated: we need a normality measure, compare its results for a number of lambdas in the customary range [-5, 5], and then choose the lambda with the best score. An out-of-the-box solution is provided by the SciPy package, whose boxcox function picks lambda by maximizing the log-likelihood.
When the second argument (lambda) is not passed to the boxcox function, it is fitted automatically and returned together with the transformed data.
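A short sketch of the automatic fit; the lognormal sample is my own illustration, not data from the article:

```python
import numpy as np
from scipy import stats

# A hypothetical right-skewed sample: lognormal data should yield lambda near 0,
# since Box-Cox with lambda = 0 is exactly the logarithm
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# No lambda given: boxcox estimates it and returns it alongside the result
transformed, fitted_lambda = stats.boxcox(x)
print(fitted_lambda)  # close to 0 for lognormal input
```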
The only problem we encounter when using this implementation is the requirement that all input values be strictly positive. Fortunately, we can simply shift the data by its minimum.
```python
def shift_to_positive(x):
    min_value = np.min(x)
    if min_value > 0:
        return x, 0
    shift_value = np.abs(min_value) + 1
    return x + shift_value, shift_value
```
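A quick illustration with made-up values; keep the returned shift so the same offset can be applied to any new data before transforming it:

```python
import numpy as np


def shift_to_positive(x):
    min_value = np.min(x)
    if min_value > 0:
        return x, 0
    shift_value = np.abs(min_value) + 1
    return x + shift_value, shift_value


data = np.array([-3.0, 0.0, 2.0, 5.0])
shifted, shift = shift_to_positive(data)
print(shifted)  # [1. 4. 6. 9.]
print(shift)    # 4.0
```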
Example for population by state in 2007
The full version of the code can be found in this online notebook; here I will only comment on the results.
On the left, we see the distribution of our input data. A keen eye will notice that applying the logarithm (middle column) already brings the data close to a normal distribution, but the best effect is achieved by the title transformation (right column).
Box-Cox as a Scikit-learn transformer
Let’s implement it as a ready-to-use scikit-learn transformer, so it can be plugged into a Pipeline or FeatureUnion and applied consistently across a train/test split. Remember: lambda must be estimated on the training dataset only.
```python
from __future__ import annotations

import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin


class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    fitted_lambda: float

    def fit(self, x: np.ndarray, y=None) -> BoxCoxTransformer:
        # Estimate lambda from the (training) data
        _, self.fitted_lambda = stats.boxcox(x)
        return self

    def transform(self, x: np.ndarray) -> np.ndarray:
        # Note: stats.boxcox raises an error for x of length 1
        return stats.boxcox(x, self.fitted_lambda)
```
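A minimal end-to-end sketch of the train/test workflow; the sample data and split parameters are my own, not from the article (the class is repeated here so the snippet is self-contained):

```python
from __future__ import annotations

import numpy as np
from scipy import stats
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split


class BoxCoxTransformer(BaseEstimator, TransformerMixin):
    fitted_lambda: float

    def fit(self, x: np.ndarray, y=None) -> BoxCoxTransformer:
        _, self.fitted_lambda = stats.boxcox(x)
        return self

    def transform(self, x: np.ndarray) -> np.ndarray:
        return stats.boxcox(x, self.fitted_lambda)


# Hypothetical right-skewed feature
rng = np.random.default_rng(42)
x = rng.lognormal(size=500)

x_train, x_test = train_test_split(x, random_state=42)

transformer = BoxCoxTransformer()
transformer.fit(x_train)                 # lambda is estimated on the train split only
train_t = transformer.transform(x_train)
test_t = transformer.transform(x_test)   # the same lambda is reused for the test split
```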