SWI-Prolog -- tokenize_atom/2

Availability: :- use_module(library(porter_stem)). (can be autoloaded)

tokenize_atom(+In, -TokenList)

Break the text In into words, numbers and punctuation characters. Tokens are created according to the following rules:

`[-+][0-9]+(\.[0-9]+)?([eE][-+][0-9]+)?`	number
`[:alpha:][:alnum:]+`	word
`[:space:]+`	skipped
anything else	single-character

Character classification is based on the C-library iswalnum() etc. functions. Recognised numbers are passed to Prolog read/1, supporting unbounded integers.
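
The rules above can be exercised with a small sketch. The helper text_numbers/2 is a hypothetical name introduced here for illustration and is not part of library(porter_stem). It assumes that number tokens come back as Prolog numbers and all other tokens as atoms.

```prolog
:- use_module(library(porter_stem)).
:- use_module(library(apply)).

% text_numbers(+Text, -Numbers)
% Hypothetical helper: tokenize Text and keep only the tokens
% that tokenize_atom/2 returned as Prolog numbers.
text_numbers(Text, Numbers) :-
    tokenize_atom(Text, Tokens),
    include(number, Tokens, Numbers).
```

For instance, `?- text_numbers('abc 42 def', Ns).` should bind Ns to a list holding only the number 42, because the word tokens are atoms and are filtered out.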

It is likely that future versions of this library will provide tokenize_atom/3 with additional options to modify space handling as well as the definition of words.

Examples

Counting word frequency

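% word_frequency_count(+Words, -Counts)
% Counts is a list of Word-Count pairs, ignoring case.
word_frequency_count(Words, Counts) :-
    maplist(downcase_atom, Words, LwrWords),  % normalise case
    msort(LwrWords, Sorted),                  % sort, keeping duplicates
    clumped(Sorted, Counts).                  % run-length encode equal words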

?- word_frequency_count([a,b,'A',c,d,'B',b,e], Counts).
Counts = [a-2, b-3, c-1, d-1, e-1].
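
The same counting idea can be applied directly to a piece of text by tokenizing it first. The following is a minimal sketch. The predicate name text_token_counts/2 is hypothetical and not part of the library. It drops numeric tokens before lower-casing, and punctuation tokens are still counted, since they are returned as atoms.

```prolog
:- use_module(library(porter_stem)).
:- use_module(library(apply)).
:- use_module(library(lists)).

% text_token_counts(+Text, -Counts)
% Hypothetical helper: Counts is a list of Token-Count pairs over
% the lower-cased atom tokens (words and punctuation) of Text.
text_token_counts(Text, Counts) :-
    tokenize_atom(Text, Tokens),
    include(atom, Tokens, Atoms),           % drop numeric tokens
    maplist(downcase_atom, Atoms, Lower),   % normalise case
    msort(Lower, Sorted),                   % sort, keeping duplicates
    clumped(Sorted, Counts).                % run-length encode equal tokens

% ?- text_token_counts('the cat the dog', Counts).
% Counts = [cat-1, dog-1, the-2].
```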

See also: tokenize_atom/2 or split_string/4 may be used to split a text into tokens.