Automatic clustering of tokens from a corpus for grammar acquisition
Abstract
In a method of learning grammar from a corpus, context words are identified from the corpus. For the other, non-context words, the method counts the occurrences of predetermined relationships with the context words and maps the counted occurrences to a multidimensional frequency space. Clusters are grown from the frequency vectors. The clusters represent classes of words; words in the same cluster possess the same lexical significance and provide an indicator of grammatical structure.
40 Citations
19 Claims
1. A grammar learning method from a corpus, comprising:
- identifying context tokens within the corpus,
- for each non-context token in the corpus, counting occurrences of a predetermined relationship of the non-context token to a context token,
- generating frequency vectors for each non-context token based upon the counted occurrences, and
- clustering non-context tokens based upon the frequency vectors.
Dependent claims: 2, 3, 4, 5, 6, 7
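The counting and vector-generation steps of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes that the most frequent tokens serve as context tokens and that the "predetermined relationship" is adjacency at a fixed offset. The names `frequency_vectors`, `num_context`, and `offsets` are hypothetical.

```python
from collections import Counter

def frequency_vectors(tokens, num_context=1, offsets=(-1, 1)):
    """Return (context_tokens, {non_context_token: frequency_vector})."""
    # Assumption: the most frequent tokens act as context tokens.
    counts = Counter(tokens)
    context = [tok for tok, _ in counts.most_common(num_context)]
    context_set = set(context)

    # One vector dimension per (context token, offset) pair.
    pairs = [(c, o) for c in context for o in offsets]
    dims = {pair: i for i, pair in enumerate(pairs)}

    vectors = {}
    for pos, tok in enumerate(tokens):
        if tok in context_set:
            continue
        # Every non-context token gets a vector, even if it stays all zeros.
        vec = vectors.setdefault(tok, [0] * len(dims))
        for o in offsets:
            j = pos + o
            if 0 <= j < len(tokens) and tokens[j] in context_set:
                # Count one occurrence of "context token at offset o".
                vec[dims[(tokens[j], o)]] += 1
    return context, vectors

corpus = "the cat sat on the mat the dog sat on the rug".split()
context, vectors = frequency_vectors(corpus)
```

Here offset -1 means the context token immediately precedes the non-context token, and offset 1 means it immediately follows. Tokens with similar vectors (for example "cat" and "dog" in this toy corpus) are the ones a later clustering step would tend to group.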
8. A method of phrase grammar learning from a corpus, comprising:
- identifying context tokens within the corpus,
- for each non-context token in the corpus, counting occurrences of a predetermined relationship to a context token,
- generating frequency vectors for each non-context token based upon the counted occurrences, and
- clustering non-context tokens based upon the frequency vectors into a cluster tree.
Dependent claims: 9, 10, 11, 12, 13, 15, 16, 17, 18, 19
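One way to picture clustering frequency vectors into a cluster tree is greedy agglomerative clustering. The sketch below is illustrative only. Euclidean distance, centroid merging, and the nested-tuple tree layout are assumptions, not details taken from the claims, and `build_cluster_tree` is a hypothetical name.

```python
import math

def build_cluster_tree(vectors):
    """Merge the two closest clusters repeatedly until one tree remains.

    Leaves are token strings; internal nodes are (left, right, distance).
    Assumes `vectors` is a non-empty dict of equal-length numeric lists.
    """
    # Each active cluster is (tree, centroid, size).
    clusters = [(tok, [float(x) for x in vec], 1)
                for tok, vec in sorted(vectors.items())]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (ta, ca, na), (tb, cb, nb) = clusters[i], clusters[j]
        # Size-weighted centroid of the merged cluster.
        centroid = [(x * na + y * nb) / (na + nb) for x, y in zip(ca, cb)]
        merged = ((ta, tb, d), centroid, na + nb)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

tree = build_cluster_tree({"cat": [1, 0], "dog": [1, 0], "on": [0, 2]})
```

In this toy input, "cat" and "dog" have identical vectors, so they merge first at distance 0, and "on" joins the tree last. The pairwise search is quadratic per merge, which is acceptable for a sketch but not for a large vocabulary.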
14. A method of phrase grammar learning from a corpus, comprising:
- identifying context words from a corpus,
- for each non-context word in the corpus, counting occurrences of the non-context word within a predetermined adjacency of a context word,
- generating frequency vectors for each non-context word based upon the counted occurrences,
- clustering non-context tokens based on the frequency vectors into a cluster tree, and
- cutting the cluster tree along a cutting line.
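The final step of claim 14, cutting the cluster tree along a cutting line, can be illustrated with a small sketch. It assumes a hypothetical tree representation in which leaves are token strings and internal nodes are `(left, right, distance)` tuples, and it interprets the cutting line as a distance threshold. Neither choice is stated in the claim.

```python
def leaves(node):
    """Collect all tokens under a node, left to right."""
    if isinstance(node, str):
        return [node]
    left, right, _ = node
    return leaves(left) + leaves(right)

def cut_tree(node, threshold):
    """Split the tree into clusters at the given distance threshold."""
    if isinstance(node, str):
        return [[node]]
    left, right, distance = node
    if distance <= threshold:
        # The whole subtree was merged below the cutting line: one cluster.
        return [leaves(node)]
    # The merge happened above the cutting line: keep the branches apart.
    return cut_tree(left, threshold) + cut_tree(right, threshold)

# Hypothetical tree: "cat" and "dog" merged at distance 0.0,
# then "on" joined them at distance 2.236.
tree = ("on", ("cat", "dog", 0.0), 2.236)
classes = cut_tree(tree, threshold=1.0)
# classes == [["on"], ["cat", "dog"]]
```

Raising the threshold produces fewer, broader classes, and lowering it produces more, narrower ones. With a threshold of 3.0 the example tree collapses into a single class.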
Specification