Complexity: Finding the Most Frequent Words in the Bible

For this assignment we had to implement a word-frequency counter, first using a tree-based method and then a hash-based one.

I assumed that a "word" in a text file is a sequence of letters with non-letters on either side, i.e. a maximal match of the regular expression [a-zA-Z]+. Furthermore, I assumed that words are compared case-insensitively.

I used the King James Bible (warning: large file) as test input. By my count, there are 822,651 words in the file, with 12,604 distinct words. All of my implementations produce the following output for the 25 most frequently occurring words:

63919 the
51696 and
34618 of
13560 to
12915 that
12667 in
10420 he
9837 shall
8998 unto
8971 for
8853 i
8474 his
8179 a
7964 lord
7376 they
7012 be
6989 is
6661 him
6636 not
6429 them
6129 it
6012 with
5620 all
5474 thou
4603 thy

My programs are in plain C, each a couple hundred lines long. I think the most interesting thing about them is how few language features they use: no dynamic memory, no structs, just arrays and pointers. It was actually a fun challenge doing everything from scratch for a change and trying to be as minimalistic as possible.

Some very rough timings of these programs, with an attempt to account for I/O time, suggest that the "hash" method is about 7 times as fast as the "tree" method.