Overview: hw5 program part, BWT.java
This is due 9pm Thursday, 4/26.

THE BURROWS-WHEELER TRANSFORM (BWT)

We saw this in class (share/0413/), this is a reading review.

The BWT is a reversible text transformation.  Suppose your input text
is a string of length n, say A="bananamanana" (n=12).  First append a
marker character that is not in the text.  For this example, we use
"$".  Build a table of all n+1 cyclic shifts, and sort them:

 SHIFTS:            SORTED:
 bananamanana$      $bananamanana
 $bananamanana	    a$bananamanan
 a$bananamanan	    amanana$banan
 na$bananamana	    ana$bananaman
 ana$bananaman	    anamanana$ban
 nana$bananama	    anana$bananam
 anana$bananam	    ananamanana$b
 manana$banana	    bananamanana$
 amanana$banan	    manana$banana
 namanana$bana	    na$bananamana
 anamanana$ban	    namanana$bana
 nanamanana$ba	    nana$bananama
 ananamanana$b	    nanamanana$ba

The BWT output is the LAST column of the sorted table, a string of
length n+1.  For this example, the output is "annnnmb$aaaaa".
We implement the forward BWT transform in BWT.java.  It uses O(n)
space, and its speed depends mostly on sorting the strings.


THE INVERSE BWT

Given the BWT output and the marker char, it turns out we can recover
the original string, in linear time!  We argued "why" in lecture, here
we just review "how".

Call the BWT output string L, for "last column" (with indices 0 to n).
If we sort the chars in L we get string F, the "first column".  For
each i, let T[i] be the index such that L[i] = F[T[i]], assuming the
chars are sorted stably. (That is, equal chars must stay in the same
order in L and F.)  For example:

  i   L[i]  T[i]  F[i]
  0    a     1     $
  1    n     9     a
  2    n    10     a
  3    n    11     a
  4    n    12     a
  5    m     8     a
  6    b     7     a
  7    $     0     b
  8    a     2     m
  9    a     3     n
 10    a     4     n
 11    a     5     n
 12    a     6     n

You can compute T from F, by a variant of counting sort.

Now we claim that for each char L[i], char L[T[i]] cyclically precedes
L[i] in the original text!  So just repeat this step n times, starting
at the mark, to recover the entire original input string, in reverse
order.  For example, starting at i=7:

   i=7,       L[7] ='$'
   i=T[7]=0,  L[0] ='a'
   i=T[0]=1,  L[1] ='n'
   i=T[1]=9,  L[9] ='a'
   i=T[9]=3,  L[3] ='n'
   i=T[3]=11, L[11]='a'
   i=T[11]=5, L[5] ='m'
   and so on.

Note we do not really need the F[i] column, just T[i].  There is a
variant of this approach, where you compute the "inverse" of T: if you
have that, then you can recover the original text in forward order.


REMARKS

The BWT is used in data compression, since the transform of natural
language text (which tends to have many repeated substrings) tends to
have more immediately repeating symbols, and so it is more easily
compressed by simple schemes.  See "man bzcat" and burrows-wheeler.pdf
(the original research paper).


HOMEWORK (hw5 program part):

You are given BWT.java, which already correctly computes the forward
BWT in the transform() method.  You may assume that all our input char
codes are in the range 0 to 255, the extended ASCII alphabet.  (What
would you do, if the entire 16-bit char range were possible?)

Basic TODO:

      Implement the method transformBack() in BWT.java, which should
      compute the inverse BWT.  Once you have this done, you should
      pass all these tests pretty quickly:
      	    make run
	    make ghost
	    make moby
      However, "make pipi" will be quite slow.

EC (extra credit) TODO:

      Our transform() method uses Arrays.sort(), which is slow (worst
      case quadratic) when the input has long repeated strings.  That
      is why "make pipi" is slow.  Modify transform() so it does not
      use Arrays.sort, but some faster method.  In particular, you may
      borrow code from Manber.java (copy and modify what you need into
      your BWT.java, we will not turnin Manber.java).  After this fix,
      "make pipi" should be much faster.

Let me know if you have attempted the extra credit.  It is actually
not too hard if you reuse Manber.java (it just needs some editing to
fit our purpose).  For a real challenge, try implementing some other
sorting approach, for example the recent "DC3" algorithm.