Unsupervised Binning

Unsupervised binning methods transform numerical variables into categorical counterparts but do not use the target (class) information. Equal Width and Equal Frequency are two unsupervised binning methods.

1- Equal Width Binning

The algorithm divides the range of the data into k intervals of equal width. With min and max denoting the smallest and largest observed values, the width of each interval is:

w = (max - min) / k

And the interval boundaries are:

min + w, min + 2w, ... , min + (k-1)w
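
As a minimal sketch of the formulas above, the following Python function computes w and the k-1 interior boundaries, then labels each value with a bin index from 0 to k-1. The function name `equal_width_bins`, the sample list `ages`, and the convention of left-closed intervals (with the maximum placed in the last bin) are assumptions made for illustration only.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (bin indices 0..k-1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical, so the width would be zero: use a single bin.
        return [0] * len(values), []
    w = (hi - lo) / k                                 # w = (max - min) / k
    boundaries = [lo + i * w for i in range(1, k)]    # min+w, ..., min+(k-1)w
    # Assumed convention: intervals are left-closed, and the maximum is
    # clamped into the last bin instead of opening a (k+1)-th one.
    labels = [min(int((v - lo) / w), k - 1) for v in values]
    return labels, boundaries


# Hypothetical sample data, chosen only for illustration.
ages = [0, 4, 12, 16, 16, 18, 24, 26, 30]
labels, boundaries = equal_width_bins(ages, 3)
print(boundaries)  # [10.0, 20.0]
print(labels)      # [0, 0, 1, 1, 1, 1, 2, 2, 2]
```

Note that the intervals have the same width but not the same number of values: here the three bins hold 2, 4, and 3 values.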
2- Equal Frequency Binning

The algorithm divides the data into k groups, each of which contains approximately the same number of values. For both methods, the best way of determining k is to look at the histogram and try different numbers of intervals or groups.
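
Equal frequency binning can be sketched in a few lines of Python: sort the values, then cut the sorted sequence into k consecutive groups of nearly equal size. The function name `equal_frequency_bins` and the sample list `ages` are hypothetical, and the rule used here (sorted position p goes to group floor(p * k / n)) is just one simple way to balance the groups.

```python
def equal_frequency_bins(values, k):
    """Assign each value to one of k groups of approximately equal size."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(values)
    # Indices of the values in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for position, index in enumerate(order):
        # Assumed rule: the value at sorted position p goes to group floor(p*k/n).
        labels[index] = position * k // n
    return labels


# Hypothetical sample data, chosen only for illustration.
ages = [0, 4, 12, 16, 16, 18, 24, 26, 30]
print(equal_frequency_bins(ages, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

In this sketch every group receives three values, but tied values can end up in different groups when a cut falls between them; whether that is acceptable depends on the application.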
3- Other Methods

- Rank: The rank of a number is its position relative to the other values of a numerical variable. First, we sort the list of values, then we assign each value its position in the sorted list as its rank. Equal values receive the same rank, but the presence of duplicate values affects the ranks of subsequent values (e.g., 1, 2, 3, 3, 5). Rank is a solid binning method with one major drawback: the same value can have different ranks in different lists.
- Quantiles (median, quartiles, percentiles, ...): Quantiles are also very useful binning methods but, like rank, the same value can fall into a different quantile if the list of values changes.
- Math functions: For example, FLOOR(LOG(X)) is an effective binning method for numerical variables with a highly skewed distribution (e.g., income).
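
The three methods above can be sketched in Python as follows. All function names and sample lists are hypothetical. The rank function counts strictly smaller values so that ties share a rank, the quantile function uses one simple rank-based convention among several possible ones, and the logarithm is assumed to be base 10, which the text does not specify.

```python
import math
from bisect import bisect_left


def competition_rank(values):
    """Rank = 1 + number of strictly smaller values, so equal values share a rank."""
    ordered = sorted(values)
    return [bisect_left(ordered, v) + 1 for v in values]


def quantile_bins(values, q):
    """Quantile group (0..q-1) of each value, from the share of strictly smaller values."""
    ordered = sorted(values)
    n = len(ordered)
    return [q * bisect_left(ordered, v) // n for v in values]


def log_bin(x):
    """FLOOR(LOG(X)) with an assumed base of 10; only defined for x > 0."""
    if x <= 0:
        raise ValueError("log binning needs positive values")
    return math.floor(math.log10(x))


# Hypothetical sample data, chosen only for illustration.
scores = [10, 20, 30, 30, 50]
print(competition_rank(scores))        # [1, 2, 3, 3, 5]
print(competition_rank([5, 30, 40]))   # [1, 2, 3]: 30 has a different rank in this list

print(quantile_bins([3, 1, 4, 1, 5, 9, 2, 6], 4))  # [1, 0, 2, 0, 2, 3, 1, 3]

incomes = [12000, 35000, 48000, 250000, 1200000]
print([log_bin(x) for x in incomes])   # [4, 4, 4, 5, 6]
```

Unlike rank and quantiles, the logarithmic bin of a value depends only on the value itself, so it stays the same when the list of values changes.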
Try to invent a real-time unsupervised binning method. The components of a real-time method are updated on the fly.