words - list, and optionally count, words in a list of files
doc generated from the script with gendoc
ruby script, version=4.01

Synopsis

words [options] files

-c,--count
Report the number of occurrences of each word.
-s,--sum
Report only the total number of words.
-f,--fold
Convert input to lower case before detecting words.
-p,--pattern=RE
Set the pattern defining the word separators.
Default: [^[:alpha:]_]
-V,--version
Print version and exit.
-h
Print short help and exit.
--help
Print full documentation via less and exit.

Description

words lists, and optionally counts, words occurring in a list of files or, if no arguments are present, in standard input. Words are defined as character sequences separated by the regexp set by the --pattern option. By default, any character other than underscore and alphabetic characters (including accented characters) acts as a separator.
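The splitting rule described above can be sketched in Ruby as follows. This is a minimal illustration, not the script's actual code: the names DEFAULT_SEPARATOR and split_words are hypothetical, and it assumes the separator pattern is applied as a Ruby Regexp to UTF-8 input.

```ruby
# Hypothetical helper, not taken from the words script itself.
# One or more characters that are neither alphabetic nor underscore
# act as a separator (the documented default pattern, repeated with +).
DEFAULT_SEPARATOR = /[^[:alpha:]_]+/

# Split text into words, dropping the empty string that appears when
# the text starts with a separator.
def split_words(text, separator = DEFAULT_SEPARATOR)
  text.split(separator).reject(&:empty?)
end

line = "The Prêt-à-porter robe is priced at € 77.50,"
puts split_words(line).join(" ")
# prints: The Prêt à porter robe is priced at
```

Because [[:alpha:]] is Unicode-aware for UTF-8 strings in Ruby, accented letters such as ê and à count as word characters, while €, digits and punctuation act as separators.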

Without the --count option, the output is a single column of words, sorted in case-insensitive order. With the --count option, two tab-separated columns appear, with the counts in column 1 and the words in column 2; the rows are sorted by descending count on column 1, and rows with equal counts are sorted in normal order on column 2.
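The ordering of the --count output can be sketched in Ruby as below. This is only an illustration under assumptions: count_listing is a hypothetical name, and ties are ordered with Ruby's plain string comparison, which may differ from the script's own collation for accented words.

```ruby
# Hypothetical helper sketching the --count output; not the script's code.
# Returns lines of the form "count<TAB>word", highest counts first,
# with equal counts ordered by word (plain string comparison, an assumption).
def count_listing(words)
  counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  counts.sort_by { |word, n| [-n, word] }
        .map { |word, n| "#{n}\t#{word}" }
end

# Lower-casing first mimics the effect of --fold.
words = %w[the shoes at The robe at].map(&:downcase)
puts count_listing(words)
# prints (columns separated by a tab):
# 2	at
# 2	the
# 1	robe
# 1	shoes
```

Negating the count in the sort key gives the descending order on column 1 while keeping the ascending order on column 2.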

The --fold option converts all input to lowercase.

Examples

Given an input file test containing:

  The Prêt-à-porter robe is priced at € 77.50,
  the shoes (ladies' only) at € 255.

To show the words in it:

    words test #=> 
    à
    at
    is
    ladies
    only
    porter
    priced
    Prêt
    robe
    shoes
    the
    The

To count the words, after folding upper to lower case:

    words --count --fold test #=>
    2	at
    2	the
    1	à
    1	is
    1	ladies
    1	only
    1	porter
    1	prêt
    1	priced
    1	robe
    1	shoes

To treat - as a possible word character, thus finding words like avant-garde:

    words -p '[^[:alpha:]-]' test #=>
    at
    is
    ladies
    only
    priced
    Prêt-à-porter
    robe
    shoes
    the
    The

Note that the - must be at the end of the bracket expression; elsewhere it would be interpreted as a range operator.
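The effect of the hyphen's position can be checked directly in Ruby. The sketch below assumes the pattern is interpreted as a Ruby Regexp; the variable names are hypothetical and the letters a and c are arbitrary.

```ruby
# Between two characters, - denotes a range: a, b or c.
range_class   = /[a-c]/
# At the end of the bracket expression, - is a literal hyphen: a, c or -.
literal_class = /[ac-]/

puts "b".match?(range_class)    # prints: true
puts "b".match?(literal_class)  # prints: false
puts "-".match?(literal_class)  # prints: true

# The same rule applied to a separator pattern with a trailing hyphen:
separator = /[^[:alpha:]-]+/
line = "The Prêt-à-porter robe is priced at € 77.50,"
puts line.split(separator).reject(&:empty?).join(" ")
# prints: The Prêt-à-porter robe is priced at
```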

To count the number of backslashes in a TeX file:

    words --pattern='[^\\]' -c test #=>

but, of course, this is a lot faster:

    tr -dc '\\' <test |wc -c

Author

Wybo Dekker

Copyright

Released under the GNU General Public License