For this assignment we had to implement a word frequency counter using tree-based and then hash-based methods.
I assumed that a "word" in a text file is a sequence of letters with non-letters on either side, i.e. a maximal matching of the regular expression [a-zA-Z]+
. Furthermore, I assumed that words are compared in a case-insensitive manner.
I used the King James Bible[warning: large file] as test input. By my count, there are 822,651 words in the file, with 12,604 distinct words. All of my implementations produce the following output for the 25 most frequently occurring words:
63919 the 51696 and 34618 of 13560 to 12915 that 12667 in 10420 he 9837 shall 8998 unto 8971 for 8853 i 8474 his 8179 a 7964 lord 7376 they 7012 be 6989 is 6661 him 6636 not 6429 them 6129 it 6012 with 5620 all 5474 thou 4603 thy
My programs are in plain C and are each a couple hundred lines long. I think the most interesting thing about them is how few language features they use—no dynamic memory, no structs, just arrays and pointers. It was actually a fun challenge doing everything from scratch for a change and trying to be as minimalistic as possible.
Some very rough timings of these programs, with an attempt to account for I/O time, suggest that the "hash" method is about 7 times as fast as the "tree" method.