ADMIN: hw4 was due yesterday, late penalty is only 10/day (prorated hourly), up to three days. Nothing new to do, except keep up reading in Chapter 5. Today we continue 5.1, on string sorting (resume slides around p 35): MSD radix sort (started already, now code) 3-way radix quicksort (quicksort on one char at a time) These two sorting methods are potentially "sublinear": they may succeed in sorting the strings without even looking at all the chars in the strings (this depends on the data, of course). Note that 3-way quicksort uses less extra space, but is not stable. Application: sort the suffixes of mobydick.txt (a text of 1191463 ASCII chars), in order to find the longest repeated substring. Some benchmarks to try: export CLASSPATH='.:../book/jar/*' time java LRS < ../book/mobydick.txt # Arrays.sort, 1.2 sec time java LRSMSD < ../book/mobydick.txt # MSD sort, 0.9 sec time java LRSQuick3 < ../book/mobydick.txt # Quick3, 0.9 sec In these experiments MSD and Quick3 are close in performance (maybe dominated by JVM startup). MSD uses more "extra" memory, but we've got it. Next: If there is time (possibly not today), I'd like to introduce an idea from outside the textbook, a way to speed up the geometric graph construction for hw4: the QuadTree (with points inserted in random order). It is essentially like a naive BST, but a 4-ary tree using two dimensions, instead of a binary tree using just one. Using it, we can construct the "mona-20k.txt" graph with D=20, in about 11 seconds.