Statistical stemming
First Claim
1. A computer-implemented method, comprising:
- receiving, for a canonical suffix-rewriting rule, a plurality of applicable words and a plurality of non-applicable words, wherein each applicable word is a word to which the rule should be applied, and each non-applicable word is a word to which the rule should not be applied, and wherein the canonical suffix-rewriting rule specifies a suffix replacement;
generating a suffix tree from the applicable words and the non-applicable words, the suffix tree comprising a plurality of nodes and a plurality of leaves arranged in a hierarchical structure wherein each node has one or more nodes or one or more leaves below the node in the tree, wherein each leaf corresponds to either a unique applicable word or a unique non-applicable word, and each node corresponds to a suffix of its children in the tree;
selecting, using one or more processors, a minimum colored subset of the nodes and leaves in the suffix tree, wherein each node and leaf in the minimum colored subset has an associated optimal status, wherein the minimum colored subset and the optimal status of each node and leaf in the minimum colored subset are selected such that a derived status for each leaf is valid when the leaf corresponds to an applicable word, and the derived status for each leaf is not valid when the leaf corresponds to a non-applicable word, wherein the derived status for each leaf is the optimal status for the leaf if the leaf is included in the minimum colored subset, and otherwise is the optimal status of a first node above the leaf in the tree that is in the minimum colored subset; and
generating a plurality of suffix-rewriting rules, wherein each rule corresponds to a node in the minimum colored subset with a valid status, and maps the suffix of the node to the suffix of the node with the replacement specified by the suffix-rewriting rule.
2 Assignments
0 Petitions
Accused Products
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for generating suffix rewriting rules. A method includes obtaining a plurality of canonical suffix-rewriting rules each associated with one or more words, generating a suffix tree from the words, selecting a minimum colored subset of the nodes and leaves in the suffix tree, and generating a plurality of final suffix-rewriting rules from the nodes in the minimum colored subset. Another method includes receiving applicable and non-applicable words for a suffix-rewriting rule, generating a suffix tree from the applicable words and the non-applicable words, selecting a minimum colored subset of the nodes and leaves in the suffix tree, and generating a plurality of suffix-rewriting rules, wherein each rule corresponds to a node in the minimum colored subset with a valid status.
-
Citations
20 Claims
-
1. A computer-implemented method, comprising:
-
receiving, for a canonical suffix-rewriting rule, a plurality of applicable words and a plurality of non-applicable words, wherein each applicable word is a word to which the rule should be applied, and each non-applicable word is a word to which the rule should not be applied, and wherein the canonical suffix-rewriting rule specifies a suffix replacement; generating a suffix tree from the applicable words and the non-applicable words, the suffix tree comprising a plurality of nodes and a plurality of leaves arranged in a hierarchical structure wherein each node has one or more nodes or one or more leaves below the node in the tree, wherein each leaf corresponds to either a unique applicable word or a unique non-applicable word, and each node corresponds to a suffix of its children in the tree; selecting, using one or more processors, a minimum colored subset of the nodes and leaves in the suffix tree, wherein each node and leaf in the minimum colored subset has an associated optimal status, wherein the minimum colored subset and the optimal status of each node and leaf in the minimum colored subset are selected such that a derived status for each leaf is valid when the leaf corresponds to an applicable word, and the derived status for each leaf is not valid when the leaf corresponds to a non-applicable word, wherein the derived status for each leaf is the optimal status for the leaf if the leaf is included in the minimum colored subset, and otherwise is the optimal status of a first node above the leaf in the tree that is in the minimum colored subset; and generating a plurality of suffix-rewriting rules, wherein each rule corresponds to a node in the minimum colored subset with a valid status, and maps the suffix of the node to the suffix of the node with the replacement specified by the suffix-rewriting rule. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
-
-
9. A system comprising:
-
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising; receiving, for a canonical suffix-rewriting rule, a plurality of applicable words and a plurality of non-applicable words, wherein each applicable word is a word to which the rule should be applied, and each non-applicable word is a word to which the rule should not be applied, and wherein the canonical suffix-rewriting rule specifies a suffix replacement; generating a suffix tree from the applicable words and the non-applicable words, the suffix tree comprising a plurality of nodes and a plurality of leaves arranged in a hierarchical structure wherein each node has one or more nodes or one or more leaves below the node in the tree, wherein each leaf corresponds to either a unique applicable word or a unique non-applicable word, and each node corresponds to a suffix of its children in the tree; selecting a minimum colored subset of the nodes and leaves in the suffix tree, wherein each node and leaf in the minimum colored subset has an associated optimal status, wherein the minimum colored subset and the optimal status of each node and leaf in the minimum colored subset are selected such that a derived status for each leaf is valid when the leaf corresponds to an applicable word, and the derived status for each leaf is not valid when the leaf corresponds to a non-applicable word, wherein the derived status for each leaf is the optimal status for the leaf if the leaf is included in the minimum colored subset, and otherwise is the optimal status of a first node above the leaf in the tree that is in the minimum colored subset; and generating a plurality of suffix-rewriting rules, wherein each rule corresponds to a node in the minimum colored subset with a valid status, and maps the suffix of the node to the suffix of the node with the replacement specified by the suffix-rewriting rule. - View Dependent Claims (10, 11, 12, 13, 14, 15, 16)
-
-
17. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
receiving, for a canonical suffix-rewriting rule, a plurality of applicable words and a plurality of non-applicable words, wherein each applicable word is a word to which the rule should be applied, and each non-applicable word is a word to which the rule should not be applied, and wherein the canonical suffix-rewriting rule specifies a suffix replacement; generating a suffix tree from the applicable words and the non-applicable words, the suffix tree comprising a plurality of nodes and a plurality of leaves arranged in a hierarchical structure wherein each node has one or more nodes or one or more leaves below the node in the tree, wherein each leaf corresponds to either a unique applicable word or a unique non-applicable word, and each node corresponds to a suffix of its children in the tree; selecting a minimum colored subset of the nodes and leaves in the suffix tree, wherein each node and leaf in the minimum colored subset has an associated optimal status, wherein the minimum colored subset and the optimal status of each node and leaf in the minimum colored subset are selected such that a derived status for each leaf is valid when the leaf corresponds to an applicable word, and the derived status for each leaf is not valid when the leaf corresponds to a non-applicable word, wherein the derived status for each leaf is the optimal status for the leaf if the leaf is included in the minimum colored subset, and otherwise is the optimal status of a first node above the leaf in the tree that is in the minimum colored subset; and generating a plurality of suffix-rewriting rules, wherein each rule corresponds to a node in the minimum colored subset with a valid status, and maps the suffix of the node to the suffix of the node with the replacement specified by the suffix-rewriting rule. - View Dependent Claims (18, 19, 20)
Specification