Today we start Chapter 5, on string algorithms. See the slides, 51StringSorts.pdf Also: http://www.youtube.com/watch?v=k4RRi_ntQc8 First we'll review some basic Java string properties: char: unsigned 16 bit int, enough for Unicode1, including ASCII. Remark: the "char" in C is only 8 bits, enough for ASCII. String: immutable char[] sequence, shares the array with substrings fast substring, slow append StringBuilder (or StringBuffer): mutable char[] array with doubling slow substring, fast append Alphabets: If an alphabet has <= 2^k symbols, then we can represent each symbol using a k-bit encoding. E.g: DNA in 2 bits per symbol. Then we'll get into the problem of sorting, and we'll see some ways to beat the "N lg N sorting lower bound". Recall the lower bound was for sorting by comparisons, we will get around it by using a new operation: we will use our keys (which are either small integers, or character codes taken from strings) as array indices. Sorts we'll see: key-indexed counting (sorting an array of small integers) LSD radix sort (for sorting short strings, all of same length) MSD radix sort (for sorting longer strings, of varying length) 3-way radix quicksort (quicksort variant, looks at one char at a time) The last two sorting methods are potentially "sublinear": they may succeed in sorting the strings without even looking at all the chars in the strings (this depends on the data, of course). The last algorithm uses less extra space, but is not stable. Examples: * Obama's question: to sort a million 32 bit integers, it might be reasonable to use 2-pass LSD radix sort, with R = 2^16. That is, you think of each 32-bit int as a sequence of two 16-bit chars. * To sort a million random 128-bit integers, it might be faster to use MSD radix sort (say with R=2^16 again), since it won't need to look past the second "char" for most of the keys. * Sort all the suffixes of a large text file, in order to find the longest repeated substring (it will be identified by two consecutive suffixes, with longest common prefix). It turns out we can do this for 'mobydick.txt' in less than two seconds, try this: cd ../book; time run LRS < mobydick.txt