The library(isub) implements a similarity measure between strings,
comparable in purpose to the Levenshtein distance. The measure is based
on the length of common substrings.
- author
- - Giorgos Stoilos
- See also
- - "A string metric for ontology alignment" by Giorgos Stoilos,
2005: http://www.image.ece.ntua.gr/papers/378.pdf
- isub(+Text1:text, +Text2:text, -Similarity:float, +Options:list) is det
- Similarity is a measure of how similar Text1 and Text2 are.
For example:
?- isub('E56.Language', 'languange', D, [normalize(true)]).
D = 0.4226950354609929. % [-1,1] range
?- isub('E56.Language', 'languange', D, [normalize(true),zero_to_one(true)]).
D = 0.7113475177304964. % [0,1] range
?- isub('E56.Language', 'languange', D, []). % without normalization
D = 0.19047619047619047. % [-1,1] range
?- isub(aa, aa, D, []). % does not work for short substrings
D = -0.8.
?- isub(aa, aa, D, [substring_threshold(0)]). % works with short substrings
D = 1.0. % but may give unwanted values
% between e.g. 'store' and 'spore'.
?- isub(joe, hoe, D, [substring_threshold(0)]).
D = 0.5315315315315314.
?- isub(joe, hoe, D, []).
D = -1.0.
This version of isub/4 replaces the old one while remaining backwards
compatible. It accepts several options that tune the algorithm.
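As a usage sketch, the following picks the candidate most similar to a query term. The predicate name best_match/4 is hypothetical and not part of library(isub). The option list is written as a literal so that it is visible at compile time.

```prolog
:- use_module(library(isub)).
:- use_module(library(lists)).

%!  best_match(+Query, +Candidates, -Best, -Score) is semidet.
%
%   Hypothetical helper (not part of library(isub)): Best is the
%   element of Candidates with the highest isub/4 similarity to
%   Query and Score is that similarity in the [0,1] range.
%   Fails if Candidates is empty.
best_match(Query, Candidates, Best, Score) :-
    findall(S-C,
            ( member(C, Candidates),
              isub(Query, C, S, [normalize(true), zero_to_one(true)])
            ),
            Pairs),
    keysort(Pairs, Sorted),          % ascending by similarity
    last(Sorted, Score-Best).        % the highest score is last
```

A call such as `?- best_match('E56.Language', [languange, label], Best, Score).` would then bind Best to the closest candidate. Whether normalization and the [0,1] range are appropriate depends on the application.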
- Arguments:
-
Text1 | - and Text2 are each an atom, a string, or a list of
characters or character codes. |
Similarity | - is a float in the range [-1.0,1.0], where 1.0
means most similar. The range can be set to [0,1] with
the zero_to_one option described below. |
Options | - is a list with elements described below. Note that
the options are processed at compile time using
goal_expansion/2, which makes calls considerably faster. Supported
options are:
- normalize(+Boolean)
- Applies string normalization as implemented by the original
authors: Text1 and Text2 are mapped to lowercase and the
characters "._ " are removed. Lowercase mapping is done with the
C-library function towlower(). In general, the required
normalization is domain dependent and is better left to the caller.
See e.g., unaccent_atom/2. The default is to skip normalization
(false).
- zero_to_one(+Boolean)
- The old isub implementation deviated from the original algorithm
by returning a value in the [0,1] range. This new isub/4 implementation
defaults to the original range of [-1,1], but this option can be set
to true to set the output range to [0,1].
- substring_threshold(+Nonneg)
- The original algorithm was meant to compare terms in semantic web
ontologies, and it had a hard-coded parameter so that only common
substrings longer than 2 characters were considered. This caused the
similarity between, for example, 'aa' and 'aa' to be -0.8, which
is unexpected. This option allows the user to set any threshold,
such as 0, so that the similarity between short strings can be
properly recognized. The default value is 2, which is what the
original algorithm used.
|
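The two output ranges appear to be related by a linear rescaling: in the examples above, 0.7113475177304964 is (0.4226950354609929 + 1) / 2. The sketch below assumes that relationship, which is inferred from those example values and is not stated by the library. The name to_zero_one/2 is hypothetical.

```prolog
%!  to_zero_one(+S, -U) is det.
%
%   Hypothetical helper: map a similarity S in [-1,1] onto U in
%   [0,1], assuming the linear rescaling suggested by the
%   documented examples.
to_zero_one(S, U) :-
    U is (S + 1) / 2.
```

Under that assumption, `?- to_zero_one(0.4226950354609929, U).` yields a value that agrees with the zero_to_one(true) example up to floating point rounding.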
Undocumented predicates
The following predicates are exported, but are either undocumented or incorrectly documented.
- $isub(Arg1, Arg2, Arg3, Arg4, Arg5)