Grouping words with equivalent substrings by automatic clustering based on suffix relationships
Abstract
A set of words of a natural language is grouped by automatically obtaining suffix relation data that indicate a relation value for each of a set of relationships between suffixes that occur in the natural language, and, then, by automatically clustering the words in the set using the relation values from the suffix relation data, to obtain group data indicating groups of words. Two or more words in a group have suffixes as in one of the relationships and, preceding the suffixes, equivalent substrings. The relationships can be pairwise relationships, and the relation value can indicate the number of occurrences of a suffix pair. The suffix relation data can be obtained using an inflectional lexicon. Complete link clustering can be used.
19 Claims
1. A method of grouping a set of words that may occur in a natural language set, comprising:
automatically obtaining suffix relation data indicating a relation value for each of a set of relationships between suffixes that occur in the natural language set; and
automatically clustering the words in the set of words using the relation values from the suffix relation data, to obtain group data indicating groups of words;
two or more words in a group having suffixes as in one of the relationships and, preceding the suffixes, equivalent substrings.
obtaining, for each of a set of pairs of words, a pairwise similarity value based on the relation value of a suffix pair, if any, that relates the words to each other; and
performing automatic clustering using the pairwise similarity values for the pairs of words.
6. The method of claim 5 in which the act of performing automatic clustering performs complete link clustering.
7. The method of claim 5 in which the pairwise similarity value for a pair of words is equal to the greatest relation value of the relationships between suffixes that relate the words in the pair to each other.
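Claims 5 to 7 can be illustrated with a minimal Python sketch. It is not taken from the patent: it assumes that "equivalent substrings" can be approximated by identical, non-empty prefixes, that relation values are stored in a dictionary keyed by alphabetically sorted suffix pairs, and that merging stops below a hypothetical `threshold`. The function names and the toy relation values are illustrative only.

```python
from itertools import combinations


def candidate_suffix_pairs(w1, w2):
    """Return the suffix pairs left after stripping each shared, non-empty prefix."""
    pairs = []
    for i in range(1, min(len(w1), len(w2)) + 1):
        if w1[:i] != w2[:i]:
            break
        pairs.append(tuple(sorted((w1[i:], w2[i:]))))
    return pairs


def pairwise_similarity(w1, w2, relation_values):
    """Greatest relation value among the suffix pairs relating w1 and w2 (claim 7)."""
    values = [relation_values.get(p, 0) for p in candidate_suffix_pairs(w1, w2)]
    return max(values, default=0)


def complete_link_clusters(words, relation_values, threshold=1):
    """Agglomerative clustering in which two clusters are only as similar as
    their LEAST similar pair of members (complete link, claim 6).
    The stopping threshold is an assumption of this sketch."""
    clusters = [[w] for w in words]

    def link(a, b):
        return min(pairwise_similarity(x, y, relation_values) for x in a for y in b)

    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: link(clusters[ij[0]], clusters[ij[1]]),
        )
        if link(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Toy relation values (made up for illustration), keyed by sorted suffix pair.
relation_values = {("", "ed"): 50, ("", "ing"): 45, ("ed", "ing"): 40, ("", "s"): 60}
words = ["walk", "walked", "walking", "talk", "talks"]
clusters = complete_link_clusters(words, relation_values)
print(clusters)
```

Complete link is the conservative choice here: a word joins a group only if it is related by a known suffix pair to every word already in the group, which limits chaining through a single accidental match.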
8. The method of claim 3 in which the natural language set includes one natural language and the act of automatically obtaining suffix relation data comprises:
using a lexicon for the language to obtain a word list indicating the set of words;
using the word list to obtain suffix pair data indicating pairs of suffixes that relate words in the set of words to each other; and
for each pair of suffixes indicated by the suffix pair data, obtaining a relation value indicating a number of times the suffix pair occurs in the set of words.
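The three steps of claim 8 can be sketched as follows. This is a simplified illustration and not the patented procedure: it assumes the word list is already extracted from a lexicon, treats two words as related whenever they share a prefix of at least `min_stem` characters (a hypothetical parameter), and uses the count of shared prefixes as the relation value.

```python
from collections import Counter, defaultdict
from itertools import combinations


def suffix_relation_values(word_list, min_stem=3):
    """Count how often each pair of suffixes follows the same prefix.

    Returns a Counter keyed by alphabetically sorted (suffix, suffix) pairs.
    """
    # Step 1: index every suffix under each sufficiently long prefix.
    suffixes_by_stem = defaultdict(set)
    for word in set(word_list):
        for i in range(min_stem, len(word) + 1):
            suffixes_by_stem[word[:i]].add(word[i:])

    # Steps 2 and 3: every prefix with two or more suffixes contributes
    # one occurrence to each suffix pair it supports.
    counts = Counter()
    for suffixes in suffixes_by_stem.values():
        for pair in combinations(sorted(suffixes), 2):
            counts[pair] += 1
    return counts


word_list = ["walk", "walked", "walking", "talk", "talked", "jump", "jumped"]
counts = suffix_relation_values(word_list)
print(counts.most_common(3))
```

On this toy list the pair of the empty suffix and "ed" is counted three times, while an accidental pair such as "k" and "ked" is counted twice. Frequent pairs therefore tend to reflect real morphology, and rare ones tend to be noise, which is why the counts are useful as relation values.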
9. The method of claim 8 in which the lexicon is an inflectional lexicon for the language.
10. The method of claim 8 in which the suffix pair data further indicate, for each suffix in a pair, a part of speech;
the relation value indicating the number of times the suffixes in the suffix pair occur in the set of words with the indicated parts of speech.
11. The method of claim 1, further comprising:
automatically obtaining, for each group of words indicated by the group data, a representative; and
automatically producing a data structure that can be accessed with a word in a group to obtain the group's representative.
12. The method of claim 11 in which the act of automatically obtaining a representative selects the shortest word in a group as the representative.
13. The method of claim 11 in which the data structure can also be accessed with a group's representative to obtain a list of words in the group.
14. The method of claim 11 in which the data structure is a finite state transducer data structure.
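Claims 11 to 13 can be illustrated with a short Python sketch. Plain dictionaries stand in for the finite state transducer named in claim 14, so this shows only the two lookup directions, not the claimed data structure. The function name and the alphabetical tie-break between equally short words are assumptions of the sketch.

```python
def build_lookup(groups):
    """Build word -> representative and representative -> words mappings.

    The representative is the shortest word in each group (claim 12);
    ties are broken alphabetically, which is an assumption of this sketch.
    """
    word_to_rep = {}
    rep_to_words = {}
    for group in groups:
        rep = min(group, key=lambda w: (len(w), w))
        rep_to_words[rep] = sorted(group)
        for word in group:
            word_to_rep[word] = rep
    return word_to_rep, rep_to_words


groups = [["walked", "walk", "walking"], ["talks", "talk"]]
word_to_rep, rep_to_words = build_lookup(groups)
print(word_to_rep["walking"])
print(rep_to_words["talk"])
```

The first mapping supports normalization (replacing any word with its group's representative), and the second supports expansion (listing every word in the group that a representative stands for).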
15. A system for grouping a set of words that occur in a natural language, comprising:
memory for storing data; and
a processor connected for accessing the memory;
the processor operating to:
automatically obtain suffix relation data indicating a relation value for each of a set of relationships between suffixes that occur in the natural language;
the processor storing the suffix relation data in memory; and
automatically cluster the words in the set using the relation values from the suffix relation data, to obtain group data indicating groups of words;
two or more words in a group having suffixes as in one of the relationships and, preceding the suffixes, equivalent substrings;
the processor storing the group data in memory.
an inflectional lexicon stored in memory;
the processor, in automatically obtaining suffix relation data, accessing the inflectional lexicon in memory.
17. The system of claim 15 in which the processor further operates to:
automatically obtain, for each group of words indicated by the group data, a representative; and
automatically produce a data structure that can be accessed with a word in a group to obtain the group's representative;
the processor storing the data structure in memory.
18. The system of claim 17, further comprising a storage medium access device for accessing a storage medium;
the processor being connected for providing data to the storage medium access device;
the processor further operating to provide the data structure to the storage medium access device;
the storage medium access device storing the data structure on the storage medium.
19. The system of claim 17, in which the processor is further connected for establishing connections with machines over a network;
the processor operating to:
establish a connection to a machine over the network; and
transfer the data structure to the machine over the network.
Specification