Unsupervised Binning


Unsupervised binning methods transform numerical variables into categorical counterparts but do not use the target (class) information. Equal Width and Equal Frequency are two unsupervised binning methods.

1- Equal Width Binning

The algorithm divides the range of the data into k intervals of equal width. The width of each interval is:

w = (max-min)/k

And the interval boundaries are:

min+w, min+2w, ... , min+(k-1)w
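
The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name equal_width_bins, the sample values, and the choice to place the maximum value in the last interval are assumptions made here.

```python
def equal_width_bins(values, k):
    """Return (interior boundaries, bin index per value) for k equal-width bins."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k  # w = (max - min) / k
    # Interior boundaries: min + w, min + 2w, ..., min + (k-1)w
    boundaries = [lo + i * w for i in range(1, k)]
    if w == 0:
        # All values are identical, so everything falls into the first bin
        return boundaries, [0] * len(values)
    # Bin index is floor((v - min) / w); the maximum is clamped into the
    # last bin (index k-1), an assumption of this sketch
    bins = [min(int((v - lo) / w), k - 1) for v in values]
    return boundaries, bins


# Hypothetical sample data: min = 5, max = 35, k = 3, so w = 10
boundaries, bins = equal_width_bins([5, 7, 12, 18, 21, 29, 35], 3)
print(boundaries)  # [15.0, 25.0]
print(bins)        # [0, 0, 0, 1, 1, 2, 2]
```

Note that the bins hold 3, 2 and 2 values here: equal width does not imply equal counts.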

2- Equal Frequency Binning

The algorithm divides the data into k groups, each containing approximately the same number of values. For both methods, the best way to determine k is to look at the histogram and try different numbers of intervals or groups.
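
One possible way to implement equal frequency binning is to sort the values and cut the sorted list into k consecutive runs. The sketch below is only an illustration: the function name equal_frequency_bins and the sample values are assumptions, and this simple version may split tied values across two neighbouring groups.

```python
def equal_frequency_bins(values, k):
    """Return a group index (0..k-1) per value, with groups of near-equal size."""
    n = len(values)
    # Indices of the values, ordered by ascending value
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for position, i in enumerate(order):
        # The value at sorted position p goes to group floor(p * k / n)
        bins[i] = position * k // n
    return bins


# Hypothetical sample data: 9 values, k = 3, so 3 values per group
print(equal_frequency_bins([0, 4, 12, 16, 16, 18, 24, 26, 28], 3))
# [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

When the number of values is not a multiple of k, the group sizes differ by at most one.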


3- Other Methods

Rank: The rank of a number is its position relative to the other values of a numerical variable. First, we sort the list of values, then we assign the position of each value as its rank. Equal values receive the same rank, but duplicates affect the ranks of subsequent values (e.g., 1, 2, 3, 3, 5). Rank is a solid binning method with one major drawback: the same value can have different ranks in different lists.
Quantiles (median, quartiles, percentiles, ...): Quantiles are also very useful binning methods but, like rank, a value can fall into a different quantile if the list of values changes.
Math functions: For example, FLOOR(LOG(X)) is an effective binning method for numerical variables with a highly skewed distribution (e.g., income).
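
Two of the methods above, rank and FLOOR(LOG(X)), can be sketched as follows. The function names and sample values are hypothetical, and the logarithm is assumed to be base 10 because the text does not state a base.

```python
import math


def competition_rank(values):
    """Rank values so that ties share a rank and later ranks skip ahead."""
    first_pos = {}
    for pos, v in enumerate(sorted(values), start=1):
        # Keep only the first sorted position at which each value appears
        first_pos.setdefault(v, pos)
    return [first_pos[v] for v in values]


def floor_log_bin(x):
    """FLOOR(LOG(X)) with an assumed base of 10; x must be positive."""
    return math.floor(math.log10(x))


print(competition_rank([10, 20, 30, 30, 50]))  # [1, 2, 3, 3, 5]
# The drawback of rank: the value 30 has rank 3 above but rank 1 here
print(competition_rank([30, 40, 50]))          # [1, 2, 3]

# Hypothetical skewed incomes collapse into a handful of bins
print([floor_log_bin(x) for x in [950, 8500, 42000, 120000]])  # [2, 3, 4, 5]
```

Unlike rank and quantiles, the FLOOR(LOG(X)) bin of a value depends only on the value itself, so it does not change when the list changes.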

Exercise

Try to invent a real-time unsupervised binning method. The components of a real-time method are updated on the fly as new data arrive.