Mathematica (297 284 248 244 242 199 chars) Pure Functional and Zipf's Law Testing
Look Mamma ... no vars, no hands, .. no head
Edit 1 > Some shorthands defined (284 chars)
f[x_, y_] := Flatten[Take[x, All, y]];
BarChart[f[{##}, -1],
  BarOrigin -> Left,
  ChartLabels -> Placed[f[{##}, 1], After],
  Axes -> None
] & @@
  Take[
    SortBy[
      Tally[
        Select[
          StringSplit[ToLowerCase[Import[i]], RegularExpression["\\W+"]],
          !MemberQ[{"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}, #] &]
      ],
      Last],
    -22]
Some explanations
Import[]
# Get The File
ToLowerCase[]
# To Lower Case :)
StringSplit[ STRING , RegularExpression["\\W+"]]
# Split By Words, getting a LIST
Select[ LIST, !MemberQ[{LIST_TO_AVOID}, #]&]
# Select from LIST except those words in LIST_TO_AVOID
# Note that !MemberQ[{LIST_TO_AVOID}, #]& is a FUNCTION for the test
Tally[LIST]
# Take the LIST {word, word, ...}
and produce another LIST {{word, counter}, {word, counter}, ...}
SortBy[ LIST ,Last]
# Take the list produced by Tally and sort it by counter
Note that the counter is the LAST element of each {word, counter} pair
Take[ LIST ,-22]
# Once sorted, take the 22 entries with the biggest counters
BarChart[f[{##}, -1], ChartLabels -> Placed[f[{##}, 1], After]] &@@ LIST
# Get the list produced by Take as input and produce a bar chart
f[x_, y_] := Flatten[Take[x, All, y]]
# Auxiliary function that extracts the first or second element of each sublist of x_,
depending on y
# So f[{##}, -1] is the list of counters
# and f[{##}, 1] is the list of words (labels for the chart)
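To make the steps above concrete, here is a small sketch of the same pipeline run on a literal string instead of an imported file. The sample sentence and the names text, stop, words, kept, counts and top are made up for illustration (the golfed version deliberately uses no variables), and only the top 2 entries are taken instead of 22.

```mathematica
(* Hypothetical sample text standing in for Import[i] *)
text = "The cat and the dog. The dog saw a cat; the cat ran.";
stop = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"};

words  = StringSplit[ToLowerCase[text], RegularExpression["\\W+"]];
kept   = Select[words, !MemberQ[stop, #] &];   (* drop the stop words *)
counts = SortBy[Tally[kept], Last];            (* ascending by counter *)
top    = Take[counts, -2];                     (* the two largest counters *)

f[x_, y_] := Flatten[Take[x, All, y]];
f[top, 1]    (* words:    {"dog", "cat"} *)
f[top, -1]   (* counters: {2, 3} *)
```

Because SortBy sorts in ascending order, the most frequent word ends up last, which is why Take uses a negative count.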
Output
(image: http://i49.tinypic.com/2n8mrer.jpg)
Mathematica is not well suited to golfing, simply because of its long, descriptive function names. Functions like "RegularExpression[]" or "StringSplit[]" just make me sob :(.
Zipf's Law Testing
Zipf's law predicts that, for a natural language text, a plot of Log (Rank) vs Log (occurrences) follows a linear relationship.
The law is used in developing algorithms for cryptography and data compression. (But it's NOT the "Z" in the LZW algorithm).
In our text, we can test it with the following:
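Written out, the linear relationship looks like this. The symbols are notation introduced here, not taken from the post: f(r) is the number of occurrences of the word with rank r, C is a constant that depends on the text, and s is the exponent (close to 1 in the classic statement of the law).

```latex
f(r) \approx \frac{C}{r^{s}}
\quad\Longrightarrow\quad
\log f(r) \approx \log C - s \,\log r
```

So on log-log axes the points should fall roughly on a straight line with slope -s.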
f[x_, y_] := Flatten[Take[x, All, y]];
(* b is assumed to hold the imported text, e.g. b = Import[i] *)
ListLogLogPlot[
  Reverse[f[{##}, -1]],
  AxesLabel -> {"Log (Rank)", "Log Counter"},
  PlotLabel -> "Testing Zipf's Law"
] & @@
  Take[
    SortBy[
      Tally[
        StringSplit[ToLowerCase[b], RegularExpression["\\W+"]]
      ],
      Last],
    -1000]
The result is pretty close to linear:
(image: http://i46.tinypic.com/33fcmdk.jpg)
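If you want a number rather than an eyeball check, one option is to fit a straight line to the log-log data and look at the slope. The sketch below is not part of the golfed answer and uses synthetic counters that follow an ideal 1/rank curve, not the counts from the text above. The names counts, logData, fit and slope are made up for illustration.

```mathematica
(* Synthetic counters following an ideal Zipf curve; not data from the post *)
counts  = Table[1000./r, {r, 1, 50}];

(* Pairs of {Log[rank], Log[counter]} *)
logData = Transpose[{Log[Range[50.]], Log[counts]}];

(* Least-squares straight line a + b x through the log-log points *)
fit   = Fit[logData, {1, x}, x];
slope = Coefficient[fit, x]   (* close to -1 for this ideal input *)
```

With real text you would build counts from the sorted Tally output instead, and the slope would only be approximately -1.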
Edit 6 > (242 Chars)
- Refactored the regex (no Select function anymore)
- Dropped 1-char words
- More compact definition for function "f"
f = Flatten[Take[#1, All, #2]] &;
BarChart[
  f[{##}, -1],
  BarOrigin -> Left,
  ChartLabels -> Placed[f[{##}, 1], After],
  Axes -> None
] & @@
  Take[
    SortBy[
      Tally[
        StringSplit[ToLowerCase[Import[i]],
          RegularExpression["(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"]]
      ],
      Last],
    -22]
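A quick way to see what the refactored regex does is to run it on a short string. The sentence below is a made-up example, not from the challenge input. The regex treats runs of non-word characters, the listed stop words, and any one-character word as a single delimiter, so StringSplit returns only the words we want to count.

```mathematica
(* Hypothetical sample sentence, not from the original post *)
StringSplit[
  ToLowerCase["The cat and the dog sat in a hat"],
  RegularExpression["(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"]]
(* {"cat", "dog", "sat", "hat"} *)
```

The \\b word boundaries keep the stop words from matching inside longer words, and i[tns] covers "it", "in" and "is" in one go.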
Edit 7 → 199 characters
BarChart[#2, BarOrigin->Left, ChartLabels->Placed[#1, After], Axes->None]&@@
Transpose@Take[SortBy[Tally@StringSplit[ToLowerCase@Import@i,
RegularExpression@"(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"],Last], -22]
- Replaced f with Transpose and Slot (#1/#2) arguments.
- We don't need no stinkin' brackets (use f@x instead of f[x] where possible).
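Both tricks can be seen on a tiny list. The pairs below are a made-up stand-in for the sorted Tally output. Transpose turns the list of {word, counter} pairs into {words, counters}, and @@ then hands those two lists to the pure function as #1 and #2.

```mathematica
pairs = {{"dog", 2}, {"cat", 3}};   (* hypothetical {word, counter} pairs *)

(* Transpose gives {words, counters}; @@ feeds them to #1 and #2 *)
{#1, #2} & @@ Transpose@pairs        (* {{"dog", "cat"}, {2, 3}} *)

(* Prefix application: f@x is the same expression as f[x] *)
Length@pairs === Length[pairs]       (* True *)
```

In the 199-char version, #1 becomes the chart labels and #2 the bar heights.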