Overview: hw5 program part, BWT.java This is due 9pm Thursday, 4/26. THE BURROWS-WHEELER TRANSFORM (BWT) We saw this in class (share/0413/), this is a reading review. The BWT is a reversible text transformation. Suppose your input text is a string of length n, say A="bananamanana" (n=12). First append a marker character that is not in the text. For this example, we use "$". Build a table of all n+1 cyclic shifts, and sort them: SHIFTS: SORTED: bananamanana$ $bananamanana $bananamanana a$bananamanan a$bananamanan amanana$banan na$bananamana ana$bananaman ana$bananaman anamanana$ban nana$bananama anana$bananam anana$bananam ananamanana$b manana$banana bananamanana$ amanana$banan manana$banana namanana$bana na$bananamana anamanana$ban namanana$bana nanamanana$ba nana$bananama ananamanana$b nanamanana$ba The BWT output is the LAST column of the sorted table, a string of length n+1. For this example, the output is "annnnmb$aaaaa". We implement the forward BWT transform in BWT.java. It uses O(n) space, and its speed depends mostly on sorting the strings. THE INVERSE BWT Given the BWT output and the marker char, it turns out we can recover the original string, in linear time! We argued "why" in lecture, here we just review "how". Call the BWT output string L, for "last column" (with indices 0 to n). If we sort the chars in L we get string F, the "first column". For each i, let T[i] be the index such that L[i] = F[T[i]], assuming the chars are sorted stably. (That is, equal chars must stay in the same order in L and F.) For example: i L[i] T[i] F[i] 0 a 1 $ 1 n 9 a 2 n 10 a 3 n 11 a 4 n 12 a 5 m 8 a 6 b 7 a 7 $ 0 b 8 a 2 m 9 a 3 n 10 a 4 n 11 a 5 n 12 a 6 n You can compute T from F, by a variant of counting sort. Now we claim that for each char L[i], char L[T[i]] cyclically precedes L[i] in the original text! So just repeat this step n times, starting at the mark, to recover the entire original input string, in reverse order. For example, starting at i=7: i=7, L[7] ='$' i=T[7]=0, L[0] ='a' i=T[0]=1, L[1] ='n' i=T[1]=9, L[9] ='a' i=T[9]=3, L[3] ='n' i=T[3]=11, L[11]='a' i=T[11]=5, L[5] ='m' and so on. Note we do not really need the F[i] column, just T[i]. There is a variant of this approach, where you compute the "inverse" of T: if you have that, then you can recover the original text in forward order. REMARKS The BWT is used in data compression, since the transform of natural language text (which tends to have many repeated substrings) tends to have more immediately repeating symbols, and so it is more easily compressed by simple schemes. See "man bzcat" and burrows-wheeler.pdf (the original research paper). HOMEWORK (hw5 program part): You are given BWT.java, which already correctly computes the forward BWT in the transform() method. You may assume that all our input char codes are in the range 0 to 255, the extended ASCII alphabet. (What would you do, if the entire 16-bit char range were possible?) Basic TODO: Implement the method transformBack() in BWT.java, which should compute the inverse BWT. Once you have this done, you should pass all these tests pretty quickly: make run make ghost make moby However, "make pipi" will be quite slow. EC (extra credit) TODO: Our transform() method uses Arrays.sort(), which is slow (worst case quadratic) when the input has long repeated strings. That is why "make pipi" is slow. Modify transform() so it does not use Arrays.sort, but some faster method. In particular, you may borrow code from Manber.java (copy and modify what you need into your BWT.java, we will not turnin Manber.java). After this fix, "make pipi" should be much faster. Let me know if you have attempted the extra credit. It is actually not too hard if you reuse Manber.java (it just needs some editing to fit our purpose). For a real challenge, try implementing some other sorting approach, for example the recent "DC3" algorithm.