Finding V-Optimal histogram (part 2) - searching for the best bucket partitions

Finding the histogram with the minimum error - Part 2 best bucket boundaries

Previously:

The optimal value for h_r that minimizes the error in a bucket b_r was found using calculus:

h_r = Average(f_{s_r} , f_{s_r+1} , f_{s_r+2} , ... , f_{e_r})

The minimum error in a bucket b_r is:

S_r = (f_{s_r}² + f_{s_r+1}² + ... + f_{e_r}²) - (f_{s_r} + f_{s_r+1} + ... + f_{e_r})²/p
where p = e_r - s_r + 1 is the number of values in the bucket
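The bucket error formula can be checked numerically. Below is a minimal sketch in Java, assuming 0-based array indices; the names BucketErrorSketch, bucketError and directError are hypothetical, and the values 2 3 6 are only illustrative. It computes the error once with the closed form and once directly as the sum of squared deviations from the average:

```java
class BucketErrorSketch {
    // S_r via the closed form: (sum of squares) - (sum)^2 / p, for f[s..e] inclusive.
    static double bucketError(double[] f, int s, int e) {
        double sum = 0, sumSq = 0;
        for (int i = s; i <= e; i++) {
            sum += f[i];
            sumSq += f[i] * f[i];
        }
        int p = e - s + 1;                 // number of values in the bucket
        return sumSq - sum * sum / p;
    }

    // Same quantity, computed directly as the sum of squared deviations from h_r.
    static double directError(double[] f, int s, int e) {
        int p = e - s + 1;
        double sum = 0;
        for (int i = s; i <= e; i++) sum += f[i];
        double h = sum / p;                // h_r = average of the bucket
        double err = 0;
        for (int i = s; i <= e; i++) err += (f[i] - h) * (f[i] - h);
        return err;
    }

    public static void main(String[] args) {
        double[] f = {2, 3, 6};            // illustrative values
        System.out.println(bucketError(f, 0, 2));
        System.out.println(directError(f, 0, 2));
    }
}
```

Both calls should agree up to floating-point rounding; the closed form is the one that later allows the error of any bucket to be computed from running sums.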

The next problem that we must solve to find the V-optimal histogram is finding the best boundaries for the buckets.
This step requires computer science...
Jagadish et al. presented a dynamic programming approach to find the optimal bucket partitioning.

Searching for the best bucket assignment

Problem formulation:
This problem is solved by a brute force search
(A smart brute force search :-))

Basic idea for the search algorithm:

Meaning of the figure:

The red area represents an optimal histogram with k buckets for the data for a, a+1, ..., b
The green area represents an optimal histogram with k-1 buckets
The last bucket contains the data for x, x+1, ..., b

Suppose we need to construct a histogram using k-1 buckets for the data for a, a+1, ..., x-1

The optimal solution will be the histogram in the green area (otherwise we could replace the green part by a better one and improve the whole histogram)

Therefore:

Optimal Histogram for [a..b] using k buckets

= Optimal Histogram of [a..x-1] using k-1 buckets
+ Optimal Histogram of [x..b] using 1 bucket

Question:

How to determine x ???

Answer:

There is no mathematical formula that will tell us what is the "best" last bucket (x..b).
The only way is to try every single possible case:
- Last bucket is: x_N
- Last bucket is: x_{N-1} ... x_N
- Last bucket is: x_{N-2} ... x_N
- ...
- Last bucket is: x_1 ... x_N
One of them must have the smallest squared error
That is the optimal partition !!!

In other words, we have the following recursive relationship:

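Error of the Optimal Histogram for [a..b] using k buckets
= min over x = a+k-1 .. b of { Error of the Optimal Histogram of [a..x-1] using k-1 buckets + Error of the last bucket [x..b] }

(x starts at a+k-1 because [a..x-1] must contain at least k-1 values to fill k-1 buckets)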
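The recursive relationship can be written down almost literally. The following is a minimal sketch in Java, assuming 0-based indices with a = 0 and every bucket non-empty; the names RecursiveSketch, sse and best are hypothetical. It recomputes sub-problems many times, so it only illustrates the recursion and is not an efficient implementation:

```java
class RecursiveSketch {
    // Squared error of a single bucket holding f[s..e] (0-based, inclusive).
    static double sse(double[] f, int s, int e) {
        double sum = 0, sumSq = 0;
        for (int i = s; i <= e; i++) {
            sum += f[i];
            sumSq += f[i] * f[i];
        }
        return sumSq - sum * sum / (e - s + 1);
    }

    // Minimum total error for f[0..b] split into exactly k non-empty buckets.
    // Returns infinity when f[0..b] has fewer than k values.
    static double best(double[] f, int k, int b) {
        if (k == 1) return sse(f, 0, b);       // one bucket: nothing to choose
        double min = Double.POSITIVE_INFINITY;
        // x = first index of the last bucket; the prefix f[0..x-1] must
        // hold at least k-1 values, so x starts at k-1.
        for (int x = k - 1; x <= b; x++) {
            double candidate = best(f, k - 1, x - 1) + sse(f, x, b);
            if (candidate < min) min = candidate;
        }
        return min;
    }

    public static void main(String[] args) {
        double[] f = {4, 2, 3, 6};             // illustrative values
        System.out.println(best(f, 2, 3));     // best 2-bucket error for all 4 values
    }
}
```

Storing the result of each (k, b) pair in a table instead of recomputing it turns this recursion into a dynamic programming algorithm.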

Workout Example

Consider the following input: 4 2 3 6 5 6 12 16
Problem: construct a V-optimal histogram with B = 3 buckets

Step 1: construct V-optimal histogram with B = 1 bucket

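Note:

With 1 bucket there is nothing to search: the minimum squared error for the values 1..i is given directly by the bucket error formula S_r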

Histogram with 1 bucket:

Values:    | 1..1  | 1..2  | 1..3  | 1..4  | 1..5  | 1..6  | 1..7  | 1..8  |
-----------+-------+-------+-------+-------+-------+-------+-------+-------+
Min Error: | 0.0   | 2.0   | 2.0   | 8.75  | 10.0  | 13.3  | 63.7  | 161.5 |

Step 2: construct V-optimal histogram with B = 2 buckets

Initially:

Input: 4 2 3 6 5 6 12 16
Histogram with 2 buckets:

Values:    | 1..1  | 1..2  | 1..3  | 1..4  | 1..5  | 1..6  | 1..7  | 1..8  |
-----------+-------+-------+-------+-------+-------+-------+-------+-------+
Min Error: | x     | 0.0   | ??    | ??    | ??    | ??    | ??    | ??    |
                      ^
                      |
              each value in its own bucket

(x = not possible: 1 value cannot fill 2 buckets)

To find the best bucket partition for values 4 2 3, we try:

[ 4 2 ] [ 3 ]    ===>  MinError[1][2] + 0
[ 4 ] [ 2 3 ]    ===>  MinError[1][1] + (2 - 2.5)² + (3 - 2.5)²

(the first group [ ... ] is a 1 bucket optimal histogram)

Using the result from the 1 bucket optimal histogram:

[ 4 2 ] [ 3 ]    ===>  2.0 + 0   = 2.0
[ 4 ] [ 2 3 ]    ===>  0.0 + 0.5 = 0.5   <---- Min

Result: MinError[2][3] = 0.5, with buckets [ 4 ] [ 2 3 ]

To find the best bucket partition for values 4 2 3 6, we try:

[ 4 2 3 ] [ 6 ]    ===>  MinError[1][3] + 0
[ 4 2 ] [ 3 6 ]    ===>  MinError[1][2] + (3 - 4.5)² + (6 - 4.5)²
[ 4 ] [ 2 3 6 ]    ===>  MinError[1][1] + (2 - 3.66)² + (3 - 3.66)² + (6 - 3.66)²

(the first group [ ... ] is a 1 bucket optimal histogram)

Using the result from the 1 bucket optimal histogram:

[ 4 2 3 ] [ 6 ]    ===>  2.0 + 0     = 2.0     <--- Min
[ 4 2 ] [ 3 6 ]    ===>  2.0 + 4.5   = 6.5
[ 4 ] [ 2 3 6 ]    ===>  0.0 + 8.666 = 8.666

Result: MinError[2][4] = 2.0, with buckets [ 4 2 3 ] [ 6 ]

And so on... - Final result:

Input: 4 2 3 6 5 6 12 16
V-optimal Histogram with 2 buckets:

Values:    | 1..1  | 1..2  | 1..3  | 1..4  | 1..5  | 1..6  | 1..7  | 1..8  |
-----------+-------+-------+-------+-------+-------+-------+-------+-------+
Min Error: | x     | 0.0   | 0.5   | 2.0   | 2.5   | 2.66  | 13.3  | 21.3  |
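The step from the 1 bucket row to the 2 bucket row can be sketched as one function that turns the row for k-1 buckets into the row for k buckets. This is a minimal Java sketch with hypothetical names (RowStepSketch, sse, nextRow) and 0-based indices, so entry i holds the result for the values 1..i+1; impossible entries (marked x in the tables) are represented by infinity:

```java
class RowStepSketch {
    // Squared error of a single bucket holding f[s..e] (0-based, inclusive).
    static double sse(double[] f, int s, int e) {
        double sum = 0, sumSq = 0;
        for (int i = s; i <= e; i++) {
            sum += f[i];
            sumSq += f[i] * f[i];
        }
        return sumSq - sum * sum / (e - s + 1);
    }

    // prev[i] = min error for f[0..i] using k-1 buckets (infinity if impossible).
    // Returns the same quantity for k buckets.
    static double[] nextRow(double[] f, double[] prev) {
        double[] cur = new double[f.length];
        for (int i = 0; i < f.length; i++) {
            cur[i] = Double.POSITIVE_INFINITY;
            for (int x = 1; x <= i; x++) {     // try every last bucket f[x..i]
                double candidate = prev[x - 1] + sse(f, x, i);
                if (candidate < cur[i]) cur[i] = candidate;
            }
        }
        return cur;
    }

    public static void main(String[] args) {
        double[] f = {4, 2, 3, 6, 5, 6, 12, 16};
        double[] one = new double[f.length];
        for (int i = 0; i < f.length; i++) one[i] = sse(f, 0, i);   // 1 bucket row
        double[] two = nextRow(f, one);                              // 2 bucket row
        for (double v : two) System.out.print(v + " ");
        System.out.println();
    }
}
```

Calling nextRow again on the 2 bucket row gives the 3 bucket row in the same way.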

Step 3: construct V-optimal histogram with B = 3 buckets

Initially:

Input: 4 2 3 6 5 6 12 16
Histogram with 3 buckets:

Values:    | 1..1  | 1..2  | 1..3  | 1..4  | 1..5  | 1..6  | 1..7  | 1..8  |
-----------+-------+-------+-------+-------+-------+-------+-------+-------+
Min Error: | x     | x     | 0.0   | ??    | ??    | ??    | ??    | ??    |
                              ^
                              |
                      each value in its own bucket

To find the best bucket partition for values 4 2 3 6, we try:

{ 4 2 3 } [ 6 ]    ===>  MinError[2][3] + 0
{ 4 2 } [ 3 6 ]    ===>  MinError[2][2] + (3 - 4.5)² + (6 - 4.5)²
{ 4 } [ 2 3 6 ]    ===>  not possible: MinError[2][1] = x (1 value cannot fill 2 buckets)

(the first group { ... } is a 2 bucket optimal histogram)

Using the result from the 2 bucket optimal histogram:

{ 4 2 3 } [ 6 ]    ===>  0.5 + 0   = 0.5   <---- Min
{ 4 2 } [ 3 6 ]    ===>  0.0 + 4.5 = 4.5

Result: MinError[3][4] = 0.5, with buckets [ 4 ] [ 2 3 ] [ 6 ]

To find the best bucket partition for values 4 2 3 6 5, we try:

{ 4 2 3 6 } [ 5 ]    ===>  MinError[2][4] + 0
{ 4 2 3 } [ 6 5 ]    ===>  MinError[2][3] + (6 - 5.5)² + (5 - 5.5)²
{ 4 2 } [ 3 6 5 ]    ===>  MinError[2][2] + (3 - 4.66)² + (6 - 4.66)² + (5 - 4.66)²
{ 4 } [ 2 3 6 5 ]    ===>  not possible: MinError[2][1] = x (1 value cannot fill 2 buckets)

(the first group { ... } is a 2 bucket optimal histogram)

Using the result from the 2 bucket optimal histogram:

{ 4 2 3 6 } [ 5 ]    ===>  2.0 + 0     = 2.0
{ 4 2 3 } [ 6 5 ]    ===>  0.5 + 0.5   = 1.0     <--- Min
{ 4 2 } [ 3 6 5 ]    ===>  0.0 + 4.666 = 4.666

Result: MinError[3][5] = 1.0, with buckets [ 4 ] [ 2 3 ] [ 6 5 ]

And so on... - Final result:

Input: 4 2 3 6 5 6 12 16
V-optimal Histogram with 3 buckets:

Values:    | 1..1  | 1..2  | 1..3  | 1..4  | 1..5  | 1..6  | 1..7  | 1..8  |
-----------+-------+-------+-------+-------+-------+-------+-------+-------+
Min Error: | x     | x     | 0.0   | 0.5   | 1.0   | 1.16  | 2.66  | 10.6  |

The V-optimal Algorithm

Algorithm in pseudo code:

   /* ------------------------------------------------
      Help function to compute Error in ONE bucket
      holding the items a, a+1, ..., b
      ------------------------------------------------ */
   SqError(int a, int b)
   {
      s2 = PP[b] - PP[a-1];          // x_a² + ... + x_b²
      s1 = P[b]  - P[a-1];           // x_a  + ... + x_b

      return (s2 - s1*s1/(b-a+1));   // b-a+1 = # values in the bucket
   }

   /* ----------------------------------------------
      Prepare arrays to compute error efficiently
      ---------------------------------------------- */
   P[0] = 0;
   PP[0] = 0;

   for (i = 1; i <= N; i++)
   {
      P[i]  = P[i-1]  + x_i;         // Running sum
      PP[i] = PP[i-1] + x_i²;        // Running sum of squares
   }

   /* ---------------------------------------------
      Compute the best error for 1 bucket histogram
      --------------------------------------------- */
   for (i = 1; i <= N; i++)
   {
      // Single bucket: use error formula...
      BestError[1][i] = SqError(1,i);
   }

   /* ---------------------------------------------------------
      Now we compute the V-opt. histogram with B buckets

      Output: BestError[k][i] = best error of histogram
                                using k buckets on
                                data points (1..i)
      --------------------------------------------------------- */

   // The dynamic algorithm uses these variables:
   //
   //    k = # buckets
   //    i = current item - items processed are: (1..i)
   //    BestError[k][i] = min. error in histogram of k buckets for f1..fi

   for (k = 2; k <= B; k++)          // 1 bucket case is already done
   {
      // Find optimal histogram using k buckets

      for (i = 1; i <= N; i++)
      {
         // Multiple buckets: search

         BestError[k][i] = INFINITE;       // Start value (stays INFINITE if i < k)

         // Try every possible size for the last bucket

         for (j = k-1; j <= i-1; j++)      // Last bucket is [j+1..i]
         {
            if ( BestError[k-1][j] + SqError(j+1,i) < BestError[k][i] )
            {
               BestError[k][i] = BestError[k-1][j] + SqError(j+1,i);   // Better division found
            }
         }
      }
   }
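A runnable version of the algorithm could look like the following. This is a minimal sketch in Java, not the course's demo program: the class and method names (VOptimalSketch, bestErrors, sqError) are hypothetical, the input is passed as a 0-based array while the tables P, PP and best are 1-based as in the pseudo code, and the sum over the items a..b is taken as P[b] - P[a-1]. Entries that are not possible (fewer values than buckets) are left at infinity:

```java
class VOptimalSketch {
    // Squared error of one bucket holding items a..b (1-based, inclusive),
    // computed from the running sums P and PP.
    static double sqError(double[] P, double[] PP, int a, int b) {
        double s1 = P[b] - P[a - 1];       // sum of the items a..b
        double s2 = PP[b] - PP[a - 1];     // sum of squares of the items a..b
        return s2 - s1 * s1 / (b - a + 1);
    }

    // best[k][i] = min. error of a histogram with k buckets on items 1..i.
    // Row 0 and column 0 are unused.
    static double[][] bestErrors(double[] x, int B) {
        int n = x.length;
        double[] P = new double[n + 1];
        double[] PP = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            P[i] = P[i - 1] + x[i - 1];
            PP[i] = PP[i - 1] + x[i - 1] * x[i - 1];
        }

        double[][] best = new double[B + 1][n + 1];
        for (int k = 0; k <= B; k++)
            for (int i = 0; i <= n; i++)
                best[k][i] = Double.POSITIVE_INFINITY;

        for (int i = 1; i <= n; i++)
            best[1][i] = sqError(P, PP, 1, i);           // single bucket

        for (int k = 2; k <= B; k++)
            for (int i = k; i <= n; i++)
                for (int j = k - 1; j <= i - 1; j++) {   // last bucket is [j+1..i]
                    double c = best[k - 1][j] + sqError(P, PP, j + 1, i);
                    if (c < best[k][i]) best[k][i] = c;
                }
        return best;
    }

    public static void main(String[] args) {
        double[] x = {4, 2, 3, 6, 5, 6, 12, 16};
        double[][] best = bestErrors(x, 3);
        for (int i = 1; i <= x.length; i++)
            System.out.println("3 buckets, values 1.." + i + ": " + best[3][i]);
    }
}
```

The three nested loops give a running time proportional to B × N², and the running sums make each bucket error a constant-time computation.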

Example Program: (Demo above code)
- Jagadish's algorithm Prog file: click here
- A version of Jagadish's algorithm that only prints the min. errors: click here
- Sample input data file 1: click here
- Sample input data file 2: click here

Finding the buckets in the histogram

Insert the lines tagged with ********* into the above program to obtain the histogram buckets:

   /* ------------------------------------------------
      Help function to compute Error in ONE bucket
      holding the items a, a+1, ..., b
      ------------------------------------------------ */
   SqError(int a, int b)
   {
      s2 = PP[b] - PP[a-1];          // x_a² + ... + x_b²
      s1 = P[b]  - P[a-1];           // x_a  + ... + x_b

      return (s2 - s1*s1/(b-a+1));   // b-a+1 = # values in the bucket
   }

   /* ----------------------------------------------
      Prepare arrays to compute error efficiently
      ---------------------------------------------- */
   P[0] = 0;
   PP[0] = 0;

   for (i = 1; i <= N; i++)
   {
      P[i]  = P[i-1]  + x_i;
      PP[i] = PP[i-1] + x_i²;
   }

   /* ---------------------------------------------
      Compute the best error for 1 bucket histogram
      --------------------------------------------- */
   for (i = 1; i <= N; i++)
   {
      // Single bucket: use error formula...
      BestError[1][i] = SqError(1,i);

      min_index[1][i] = 1;           // The only bucket starts at item 1   *************
   }

   /* ---------------------------------------------------------
      Now we compute the V-opt. histogram with B buckets

      Output: BestError[k][i] = best error of histogram
                                using k buckets on
                                data points (1..i)
              min_index[k][i] = first item of the LAST bucket
                                in that histogram
      --------------------------------------------------------- */

   for (k = 2; k <= B; k++)
   {
      // Find optimal histogram using k buckets

      for (i = 1; i <= N; i++)
      {
         BestError[k][i] = INFINITE;       // Start value (stays INFINITE if i < k)

         // Try every possible size for the last bucket

         for (j = k-1; j <= i-1; j++)      // Last bucket is [j+1..i]
         {
            if ( BestError[k-1][j] + SqError(j+1,i) < BestError[k][i] )
            {
               BestError[k][i] = BestError[k-1][j] + SqError(j+1,i);   // Better division found

               min_index[k][i] = j+1;      // Remember start of last bucket   ********************
            }
         }
      }
   }

   /* ---------------------------------
      Print bucket boundaries
      (last bucket first; needs B <= N)
      --------------------------------- */
   i = B;                  // # buckets still to print
   j = N;                  // Last item of the current bucket

   while (i >= 2)
   {
      int end_point;

      end_point = j;
      j = min_index[i][j];          // First item of the last bucket

      System.out.println("[" + j + " .. " + end_point + "]");

      j--;                          // Remaining items are 1..j
      i--;                          // ... stored in i-1 buckets
   }

   System.out.println("[" + 1 + " .. " + j + "]");
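The boundary recovery can also be written as a self-contained program. The following is a minimal Java sketch with hypothetical names (VOptimalBucketsSketch, buckets, start); it assumes 1 <= B <= N and keeps one entry start[k][i] per (k, i) pair, so that the walk back from the last bucket can look up where the last bucket of each shorter prefix begins:

```java
class VOptimalBucketsSketch {
    // Squared error of one bucket holding items a..b (1-based, inclusive).
    static double sqError(double[] P, double[] PP, int a, int b) {
        double s1 = P[b] - P[a - 1];
        double s2 = PP[b] - PP[a - 1];
        return s2 - s1 * s1 / (b - a + 1);
    }

    // Returns the B buckets as {first, last} item numbers (1-based), left to right.
    // Assumes 1 <= B <= x.length.
    static int[][] buckets(double[] x, int B) {
        int n = x.length;
        double[] P = new double[n + 1];
        double[] PP = new double[n + 1];
        for (int i = 1; i <= n; i++) {
            P[i] = P[i - 1] + x[i - 1];
            PP[i] = PP[i - 1] + x[i - 1] * x[i - 1];
        }

        double[][] best = new double[B + 1][n + 1];
        int[][] start = new int[B + 1][n + 1];   // first item of the last bucket
        for (int i = 1; i <= n; i++) {
            best[1][i] = sqError(P, PP, 1, i);
            start[1][i] = 1;
        }
        for (int k = 2; k <= B; k++) {
            for (int i = 1; i <= n; i++) {
                best[k][i] = Double.POSITIVE_INFINITY;
                for (int j = k - 1; j <= i - 1; j++) {   // last bucket is [j+1..i]
                    double c = best[k - 1][j] + sqError(P, PP, j + 1, i);
                    if (c < best[k][i]) {
                        best[k][i] = c;
                        start[k][i] = j + 1;
                    }
                }
            }
        }

        // Walk back from the last bucket to the first one.
        int[][] out = new int[B][];
        int end = n;
        for (int k = B; k >= 1; k--) {
            int s = start[k][end];
            out[k - 1] = new int[] {s, end};
            end = s - 1;                         // remaining items are 1..s-1
        }
        return out;
    }

    public static void main(String[] args) {
        double[] x = {4, 2, 3, 6, 5, 6, 12, 16};
        for (int[] b : buckets(x, 3))
            System.out.println("[" + b[0] + " .. " + b[1] + "]");
    }
}
```

For the input 4 2 3 6 5 6 12 16 and B = 3, the buckets found this way are [1 .. 3], [4 .. 6] and [7 .. 8], which matches the minimum error of about 10.6 in the final table of the worked example.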