Joining using a sorted index: zig-zag join algorithm

Review: index
- Fact:
- Recall: B⁺-tree
  Notice that:
- Following the index in search-key order usually accesses the records of the relation/file in a random, jumping-around manner,
  because the records may not be stored in that order
- Unless the index is a clustering index, in which case the records are stored in search-key order
Review: join algorithm based on TPMMS
- Phase 1 of the join algorithm based on TPMMS sorts the input relations
- Phase 2 joins the sorted relations

The zig-zag join algorithm

Requirement (for the zig-zag join algorithm):

Both relations R and S must have a sorted index on the join attribute

The zig-zag join algorithm:

while ( R ≠ empty and S ≠ empty ) do
{
   Read the index entry of R with the smallest join key into buf(R);
   Read the index entry of S with the smallest join key into buf(S);

   if ( search keys are equal )
   {
      Use the indexes to access the tuples and join them;
      Move on to the next smallest value in both indexes;
   }
   else
   {
      Discard the smaller of the two search keys and
      move on to the next smallest value in that index;
   }
}
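
The loop above can be sketched in Python. This is a minimal in-memory illustration, not a disk-based implementation: `build_index` and `zigzag_join` are hypothetical names, and a sorted list of (key, row positions) pairs stands in for the sorted index. The tuples themselves are only touched in the branch where the two search keys match.

```python
from collections import defaultdict


def build_index(relation, key_pos):
    """Stand-in for a sorted index: a sorted list of (key, [row positions])."""
    index = defaultdict(list)
    for pos, row in enumerate(relation):
        index[row[key_pos]].append(pos)
    return sorted(index.items())


def zigzag_join(r, s, r_key, s_key):
    """Join r and s on r[r_key] == s[s_key] by walking both indexes in key order."""
    r_idx = build_index(r, r_key)
    s_idx = build_index(s, s_key)
    i = j = 0
    result = []
    while i < len(r_idx) and j < len(s_idx):
        rk, r_ptrs = r_idx[i]
        sk, s_ptrs = s_idx[j]
        if rk == sk:
            # Match at the index level: only now are the tuples fetched.
            for rp in r_ptrs:
                for sp in s_ptrs:
                    result.append(r[rp] + s[sp])
            i += 1
            j += 1
        elif rk < sk:
            i += 1   # R's key is smaller: advance the index of R
        else:
            j += 1   # S's key is smaller: advance the index of S
    return result


# R(X,Y) has join keys 1, 3, 4, 4 and S(Y,Z) has join keys 2, 4, 6
R = [("a", 1), ("b", 3), ("c", 4), ("d", 4)]
S = [(2, "p"), (4, "q"), (6, "r")]
print(zigzag_join(R, S, r_key=1, s_key=0))
```

With this sample data the comparisons are 1 vs 2, 3 vs 2, 3 vs 4 and 4 vs 4, so the first tuple access happens at the fourth comparison. The sketch keeps the join column from both relations in each output tuple.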

Example:

Step 1:
1 ≠ 2 ⇒ move on
Step 2:
3 ≠ 2 ⇒ move on
Step 3:
3 ≠ 4 ⇒ move on
Step 4:
4 == 4 ⇒ access tuples and join
And so on

Notice that:
The search moves back and forth between the index of R and the index of S, hence the name zig-zag join algorithm.

Notice also that:

The tuples themselves are not accessed until the index search keys have found a match
No tuples are accessed here:
Tuples are only accessed when there is a match at the index:

The zig-zag algorithm using a clustering index
- The behavior of the zig-zag algorithm is very different when the indexes are clustering indexes:
  (It does not zig-zag as long as the search key stays the same; it only zig-zags when it moves from one search key value to the next search key value)
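
The effect of clustering on the access pattern can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not part of the notes: records are identified by their position in the file, each block holds `records_per_block` records, and only the most recently read block stays in memory. The function name `block_fetches` is hypothetical.

```python
def block_fetches(keys_in_storage_order, records_per_block):
    """Count block reads when records are visited in sorted key order.

    Assumes only the most recently read block is kept in memory.
    """
    # Record positions in the order a sorted index would visit them.
    visit_order = sorted(range(len(keys_in_storage_order)),
                         key=lambda pos: keys_in_storage_order[pos])
    fetches = 0
    current_block = None
    for pos in visit_order:
        block = pos // records_per_block
        if block != current_block:
            fetches += 1
            current_block = block
    return fetches


clustered = [1, 1, 2, 2, 3, 3, 4, 4]      # file is stored in key order
unclustered = [3, 1, 4, 2, 1, 3, 2, 4]    # same keys, arbitrary storage order
print(block_fetches(clustered, 2), block_fetches(unclustered, 2))
```

In the clustered file every block is read once. In the unclustered file each record visit lands in a different block than the previous one, so the number of block reads equals the number of records.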

Comparing performance of different join algorithms --- scenario 1

Consider the following scenario 1:

relation R(X,Y)              relation S(Y,Z)
--------------------------------------------------------
B(R) = 1000 blocks           B(S) = 500 blocks
T(R) = 10000 tuples          T(S) = 5000 tuples
R is clustered               S is clustered
No index on R                Clustering sorted index on S
                             V(S,Y) = 100
Available memory: M = 101 buffers

One-pass Join Algorithm:
The one-pass join algorithm is not applicable, because the smaller relation does not fit in memory: B(S) = 500 > M − 1 = 100 !!!
2-pass Join Algorithm based on TPMMS:
The cost of the (version 2) join algorithm based on TPMMS is 3 × (B(R) + B(S)) = 3 × (1000 + 500) = 4500 block disk IOs

The zig-zag join algorithm:

Unfortunately, we cannot use the zig-zag join algorithm as is:
Since R does not have an index, there is no index on R to zig-zag through

However, we can adapt the zig-zag join algorithm

Adapting the zig-zag join algorithm when only one relation has a sorted index:

We sort the relation R that does not have an index
We can make use of the (sorted) index on S to scan S in a sorted manner

We proceed as follows:

We must sort relation R first, because we do not have a sorted index on R.
For simplicity (of calculation), we will use 100 buffers to sort R, which produces 1000/100 = 10 sorted chunks of 100 blocks each

We use 10 buffers to read the sorted chunks of R (1 buffer per chunk)

And use the remaining (91) buffers to read S in a sorted manner using the clustering index on S

Algorithm:

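Repeat until the sorted chunks of R or the relation S is exhausted:
   Find all tuples with the smallest join key value in the sorted chunks of R;
   Use the clustering index on S to find all tuples with the smallest join key value in S;
   Join these 2 sets of tuples if their join key values are equal;
   otherwise discard the set with the smaller join key value;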
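
The adapted algorithm can be sketched as follows. This is an in-memory illustration under stated assumptions: `sorted_chunks` and `adapted_join` are hypothetical names, the chunks are Python lists rather than disk runs, and the clustering index on S is modeled by passing S already sorted on Y. The standard-library `heapq.merge` plays the role of reading the sorted chunks of R with one buffer per chunk.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def sorted_chunks(r, chunk_size, key_pos):
    """Sort phase: cut R into chunks and sort each chunk on the join key."""
    return [sorted(r[i:i + chunk_size], key=itemgetter(key_pos))
            for i in range(0, len(r), chunk_size)]


def adapted_join(r, s_sorted, chunk_size=2):
    """Join R(X, Y) with S(Y, Z), where s_sorted is already ordered on Y."""
    chunks = sorted_chunks(r, chunk_size, key_pos=1)
    # Merging the chunks yields the tuples of R in join-key order.
    r_groups = groupby(heapq.merge(*chunks, key=itemgetter(1)), key=itemgetter(1))
    s_groups = groupby(s_sorted, key=itemgetter(0))
    result = []
    r_cur = next(r_groups, None)
    s_cur = next(s_groups, None)
    while r_cur is not None and s_cur is not None:
        (rk, r_tuples), (sk, s_tuples) = r_cur, s_cur
        if rk == sk:
            s_list = list(s_tuples)   # all S tuples with this join key
            for x, y in r_tuples:
                for _, z in s_list:
                    result.append((x, y, z))
            r_cur = next(r_groups, None)
            s_cur = next(s_groups, None)
        elif rk < sk:
            r_cur = next(r_groups, None)   # no partner in S for this key
        else:
            s_cur = next(s_groups, None)   # no partner in R for this key
    return result


R = [("a", 3), ("b", 1), ("c", 4), ("d", 4), ("e", 2)]
S = [(2, "p"), (4, "q"), (4, "r"), (6, "t")]
print(adapted_join(R, S))
```

Each group is consumed before the corresponding `groupby` iterator is advanced, which matters because advancing `groupby` invalidates the previous group.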

Cost of this Join algorithm:

We need to scan all sorted chunks of R once: we will use B(R) disk I/Os
Because the index on S is clustering, we will scan S once (with little skipping around): we will use B(S) disk I/Os
# disk IOs to join R and S = B(R) + B(S) blocks

Total cost of the Join algorithm:

First sort R and save the sorted chunks on disk:                          2 × B(R)
Join the sorted chunks of R and S, using the clustering index on S:      B(R) + B(S)
Total cost = 3 × B(R) + B(S) = 3 × 1000 + 500 = 3500 blocks
(We beat the TPMMS-based join, which has a cost of 4500 block disk IOs)
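
The arithmetic for scenario 1 can be checked with a short script. The function names are hypothetical. The TPMMS formula 3 × (B(R) + B(S)) is an assumption, chosen because it matches the 4500 figure quoted above. Index I/O is ignored, as in the notes.

```python
import math


def sorted_chunk_count(b_r, sort_buffers):
    """Number of sorted chunks produced when R is sorted sort_buffers blocks at a time."""
    return math.ceil(b_r / sort_buffers)


def adapted_join_cost(b_r, b_s):
    """Sort R into chunks, then join the chunks with S via S's clustering index."""
    sort_cost = 2 * b_r      # read R once, write the sorted chunks once
    join_cost = b_r + b_s    # read the chunks of R once, scan S once
    return sort_cost + join_cost


def tpmms_join_cost(b_r, b_s):
    """Assumed cost of the TPMMS-based join: 3 I/Os per block of R and S."""
    return 3 * (b_r + b_s)


B_R, B_S, M = 1000, 500, 101
chunks = sorted_chunk_count(B_R, 100)
print(chunks, M - chunks)              # chunks of R, buffers left for S
print(adapted_join_cost(B_R, B_S))     # 3500
print(tpmms_join_cost(B_R, B_S))       # 4500
```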

Note:

We have assumed that the IO cost of reading the index is negligible !!!

Comparing performance of different join algorithms --- scenario 2

Consider the following scenario 2:

relation R(X,Y)                  relation S(Y,Z)
------------------------------------------------------------
B(R) = 1000 blocks               B(S) = 500 blocks
T(R) = 10000 tuples              T(S) = 5000 tuples
R is clustered                   S is clustered
Clustering sorted index on R     Clustering sorted index on S
                                 V(S,Y) = 100
Available memory: M = 101 buffers

In this scenario, we can apply the zig-zag join algorithm because:

R has a (clustering) sorted index and
S has a (clustering) sorted index

We use:

1 buffer to read the index of R
1 buffer to read the index of S
The remaining (99) buffers to hold tuples with the smallest join values from R and S

The performance of the zig-zag join algorithm using clustering indexes is:

Use the clustering index of R and scan R once: we will use B(R) disk I/Os (there is little skipping around because the index is clustering)
Use the clustering index of S and scan S once: we will use B(S) disk I/Os (there is little skipping around because the index is clustering)
Total # disk IOs = B(R) + B(S) = 1000 + 500 = 1500 blocks
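
The scenario 2 numbers can be reproduced the same way. The function names are hypothetical, and index I/O is again treated as free.

```python
def zigzag_clustered_cost(b_r, b_s):
    """Zig-zag join with clustering indexes on both sides: each relation is scanned once."""
    return b_r + b_s


def buffers_left_for_tuples(m, index_buffers=2):
    """Buffers remaining after reserving one buffer per index."""
    return m - index_buffers


B_R, B_S, M = 1000, 500, 101
print(buffers_left_for_tuples(M))         # 99
print(zigzag_clustered_cost(B_R, B_S))    # 1500
```

Compared with the 3500 and 4500 block I/Os of scenario 1, the saving comes entirely from not having to sort either relation.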

Note:

We have assumed that the IO cost of reading the index is negligible !!!