wordfreq
The wordfreq script counts the number of occurrences of each word in its input. If you give it files, it reads from them; otherwise it reads standard input. The -i option folds uppercase into lowercase (uppercase letters will count the same as lowercase).
Here's this book's Preface run through wordfreq:
% wordfreq ch00
 141 the
  98 to
  84 and
  84 of
  71 a
  55 in
  44 that
  38 book
  32 we
 ...
The script was taken from a long-ago Usenet (1.33) posting by Carl Brandauer. Here is Carl's original script (with a few small edits):
cat $* |                  # tr reads the standard input
tr "[A-Z]" "[a-z]" |      # Convert all uppercase to lowercase
tr -cs "a-z'" "\012" |    # replace all characters not a-z or '
                          # with a new line. i.e. one word per line
sort |                    # uniq expects sorted input
uniq -c |                 # Count number of times each word appears
sort +0nr +1d |           # Sort first from most to least frequent,
                          # then alphabetically
pr -w80 -4 -h "Concordance for $*"    # Print in four columns
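Many current sort commands no longer accept the old +0nr +1d key syntax. As a rough sketch only (not Carl's script and not the disc version), the same pipeline can be written with POSIX -k keys. The function name wordfreq_sketch, the LC_ALL=C setting, and the omission of the pr step are all assumptions made for this illustration:

```shell
#!/bin/sh
# wordfreq_sketch: hypothetical modern spelling of the pipeline above.
# LC_ALL=C is an assumption; it keeps the a-z range and the sort order
# predictable regardless of the user's locale.
wordfreq_sketch() (
    LC_ALL=C; export LC_ALL
    cat "$@" |                      # read the named files, or standard input
    tr '[:upper:]' '[:lower:]' |    # fold uppercase into lowercase
    tr -cs "a-z'" '\n' |            # anything not a-z or ' becomes a newline
    sort |                          # uniq expects sorted input
    uniq -c |                       # count occurrences of each word
    sort -k1,1nr -k2d               # most frequent first, then alphabetical
)

# Small demonstration on one line of text:
printf 'The cat and the dog and the bird\n' | wordfreq_sketch
```

The exact width of the count column depends on your uniq, so the spacing of the output may differ from the sample run shown earlier.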
The version on the disc is somewhat different. It adjusts the tr commands for the script's -i option. The disc version also doesn't use pr to make output in four columns, though you can add that to your copy of the script - or just pipe the wordfreq output through pr on the command line when you need it.
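The disc version isn't reproduced here, so the following is only a guess at how a -i option could change the tr commands. The function name wordfreq_i and the option handling are invented for this sketch. The idea is that without -i, uppercase letters have to survive the word-splitting step, so A-Z is added to the set of characters that tr keeps:

```shell
#!/bin/sh
# wordfreq_i: hypothetical sketch of a -i option; not the disc version.
wordfreq_i() (
    LC_ALL=C; export LC_ALL         # assumption: predictable ranges and order
    fold=no
    if [ "$1" = "-i" ]; then        # -i: count uppercase the same as lowercase
        fold=yes
        shift
    fi
    if [ "$fold" = yes ]; then
        cat "$@" | tr '[:upper:]' '[:lower:]' | tr -cs "a-z'" '\n'
    else
        cat "$@" | tr -cs "A-Za-z'" '\n'    # keep uppercase letters distinct
    fi |
    sort | uniq -c | sort -k1,1nr -k2d
)

printf 'The the THE\n' | wordfreq_i       # three separate words
printf 'The the THE\n' | wordfreq_i -i    # one word, counted three times
```

To get the four-column layout back, pipe the result through pr with the options from the original script, for example: wordfreq_i -i ch00 | pr -w80 -4 -h "Concordance for ch00"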
The second tr command above (with the -cs options) is for the Berkeley version of tr. For System V tr, the command should be:
tr -cs "[a-z]'" "[\012*]"
If you aren't sure which version of tr you have, see article 35.11. You could use deroff (29.10) instead.
One of the beauties of a simple script like this is that you can tweak it if you don't like the way it counts. For example, if you want hyphenated words like copy-editor to count as one, add a hyphen to the tr -cs expression: "[a-z]'-" (System V) or "-a-z'" (Berkeley).
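To see what the hyphen tweak does to the word-splitting step, here is a small sketch that compares the two expressions. The function names split_words and split_words_hyphen are made up for this example. The hyphen is placed at the end of the set here, on the assumption that some tr versions would otherwise try to read a leading hyphen as an option:

```shell
#!/bin/sh
# Hypothetical helpers comparing the word-splitting step with and
# without the hyphen in the set of characters that tr keeps.
split_words()        { LC_ALL=C tr -cs "a-z'" '\n'; }
split_words_hyphen() { LC_ALL=C tr -cs "a-z'-" '\n'; }

text='the copy-editor met the copy editor'

# Without the hyphen, copy-editor is split into copy and editor:
printf '%s\n' "$text" | split_words | sort | uniq -c

# With the hyphen, copy-editor is counted as a single word:
printf '%s\n' "$text" | split_words_hyphen | sort | uniq -c
```

One side effect to watch for: a dash standing alone between words will now be counted as a "word" too.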