Haskell - 366 351 344 337 333 characters
(One line break in main added for readability; no line break is needed at the end of the last line.)
import Data.List
import Data.Char
l=length
t=filter
m=map
f c|isAlpha c=toLower c|0<1=' '
h w=(-l w,head w)
x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
q?(g,w)=q*(77-l w)`div`g
b x=m(x!)x
a(l:r)=(' ':t(=='_')l):l:r
main=interact$unlines.a.b.take 22.sort.m h.group.sort
.t(`notElem`words"the and of to a i it in or is").words.m f
How it works is best seen by reading the argument to interact backwards:
map f
lowercases alphabetics, replaces everything else with spaces.
words
produces a list of words, dropping the separating whitespace.
filter (`notElem` words "the and of to a i it in or is")
discards every word on the forbidden list.
group . sort
sorts the words, and groups identical ones into lists.
map h
maps each list of identical words to a tuple of the form (-frequency, word).
take 22 . sort
sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.
b
maps tuples to bars (see below).
a
prepends the first line of underscores, to complete the topmost bar.
unlines
joins all these lines together with newlines.
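The counting stages of this pipeline can be written out in ungolfed form. This is only a sketch for readability, not the submitted program: the names stopWords, normalise and topWords are my own labels, and the sample sentence is made up.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (group, sort)

-- The words the golfed program filters out.
stopWords :: [String]
stopWords = words "the and of to a i it in or is"

-- Lowercase letters; turn every other character into a space (the golfed f).
normalise :: Char -> Char
normalise c
  | isAlpha c = toLower c
  | otherwise = ' '

-- Up to 22 of the most frequent words, as (-frequency, word) pairs.
topWords :: String -> [(Int, String)]
topWords =
    take 22
  . sort
  . map (\ws -> (negate (length ws), head ws))  -- the golfed h
  . group
  . sort
  . filter (`notElem` stopWords)
  . words
  . map normalise

main :: IO ()
main = mapM_ print (topWords "The cat and the hat: a cat, a hat, a cat!")
```

For the sample sentence this prints (-3,"cat") followed by (-2,"hat").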
The tricky bit is getting the bar length right. I assumed that only underscores count towards the length of the bar, so || would be a bar of zero length. The function b maps (x!) over x, where x is the list of histogram entries, i.e. the (-frequency, word) tuples. The entire list is passed to !, so that each invocation of ! can compute the scale factor for itself by calling ? on every entry and taking the minimum. In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.
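The integer-only scaling can be sketched in ungolfed form as follows. The names scale, barWidth, bar and sample are hypothetical; scale mirrors the operator ? and bar mirrors the operator ! from the listing. Reading the constant 77 as an 80-column line minus the three characters of decoration is my assumption, not something stated above.

```haskell
-- Hypothetical names: scale plays the role of (?), bar the role of (!).

-- Width the bar for negated frequency q would have if the bar for the
-- entry (g, w) were stretched to fill its line. Assumption: 77 is an
-- 80-column limit minus the three characters "|", "|" and the space.
scale :: Int -> (Int, String) -> Int
scale q (g, w) = q * (77 - length w) `div` g

-- Bar width for q: the tightest bound over every entry in the list.
barWidth :: [(Int, String)] -> Int -> Int
barWidth entries q = minimum (map (scale q) entries)

-- One chart line: "|", the underscores, "| " and the word.
bar :: [(Int, String)] -> (Int, String) -> String
bar entries (q, w) = '|' : replicate (barWidth entries q) '_' ++ "| " ++ w

-- Made-up sample data in the (-frequency, word) form.
sample :: [(Int, String)]
sample = [(-4, "cat"), (-2, "zebra")]

main :: IO ()
main = do
  print (map (barWidth sample . fst) sample)  -- [74,37]
  mapM_ (putStrLn . bar sample) sample
```

With the sample data, the "cat" line comes out at exactly 80 characters (1 + 74 + 2 + 3) and the "zebra" bar is half as long, all without leaving Int arithmetic.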
Note the trick of using -frequency. This removes the need to reverse the sort, since sorting -frequency in ascending order places the words with the largest frequency first. Later, in the function ?, one -frequency value is multiplied in and another is divided out, so the two negations cancel.
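A small illustration of the negation trick, with made-up counts; the names counts and negated are hypothetical and do not appear in the golfed program.

```haskell
import Data.List (sort)

-- Made-up (frequency, word) counts.
counts :: [(Int, String)]
counts = [(2, "hat"), (5, "cat"), (3, "dog")]

-- Negate the frequency, then use the ordinary ascending sort:
-- the most frequent word ends up first, with no reverse needed.
negated :: [(Int, String)]
negated = sort [(negate n, w) | (n, w) <- counts]

main :: IO ()
main = do
  print negated
  -- A quotient of two negated frequencies equals the quotient of the
  -- plain ones, so the scaling step never has to undo the negation.
  print ((-3) * 70 `div` (-5), 3 * 70 `div` 5)
```

The first line printed is [(-5,"cat"),(-3,"dog"),(-2,"hat")], and the second is (42,42).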