Automatic clustering of tokens from a corpus for grammar acquisition
Abstract
In a method of learning grammar from a corpus, context words are identified from the corpus. For the other, non-context words, the method counts the occurrences of predetermined relationships with the context words, and maps the counted occurrences to a multidimensional frequency space. Clusters are grown from the frequency vectors. The clusters represent classes of words; words in the same cluster possess the same lexical significance and provide an indicator of grammatical structure.
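The counting step described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: context words are taken to be the `num_context` most frequent tokens, the "predetermined relationship" is taken to be adjacency at a fixed offset (one position before or after), and the names `frequency_vectors`, `num_context`, and `offsets` are hypothetical.

```python
from collections import Counter


def frequency_vectors(tokens, num_context=2, offsets=(-1, 1)):
    """Build one frequency vector per non-context token.

    Assumptions (not from the source): context tokens are the
    `num_context` most frequent tokens, and the relationship counted
    is "a context token appears at the given offset".
    """
    counts = Counter(tokens)
    # Rank by descending frequency; break ties alphabetically so the
    # choice of context tokens is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    context = [tok for tok, _ in ranked[:num_context]]
    context_index = {tok: i for i, tok in enumerate(context)}

    # One dimension per (offset, context token) pair.
    dims = len(offsets) * len(context)
    vectors = {}
    for pos, tok in enumerate(tokens):
        if tok in context_index:
            continue  # context tokens do not get a vector of their own
        vec = vectors.setdefault(tok, [0] * dims)
        for k, off in enumerate(offsets):
            j = pos + off
            if 0 <= j < len(tokens) and tokens[j] in context_index:
                vec[k * len(context) + context_index[tokens[j]]] += 1
    return context, vectors


# Toy corpus, invented for this example.
corpus = ("to boston from denver to dallas from boston "
          "to denver from dallas").split()
context, vectors = frequency_vectors(corpus)
for token, vec in sorted(vectors.items()):
    print(token, vec)
```

Each vector places a non-context word at a point in the multidimensional frequency space; words that occur in similar positions relative to the context words end up with similar vectors, which is what a later clustering step would exploit.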
30 Claims
1. A method of grammar learning from a corpus, comprising:
- identifying context tokens within the corpus,
- for each non-context token in the corpus, counting occurrences of a predetermined relationship of the non-context token to the context tokens,
- generating frequency vectors for each non-context token based upon the counted occurrences,
- clustering non-context tokens based upon the frequency vectors into a cluster, and
- writing a representation of the cluster to a file,
- wherein the cluster indicates a lexical correlation among the clustered non-context tokens.
2. A machine-readable medium having stored thereon executable instructions that when executed by a processor, cause the processor to:
- identify context tokens within a corpus,
- count occurrences of predetermined relationships of non-context tokens to the context tokens,
- generating frequency vectors for each non-context token based on the counted occurrences, and
- clustering the non-context tokens based upon the frequency vectors into a cluster tree.
Dependent claims: 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A grammar modeling apparatus, comprising:
- means for identifying context tokens within an input corpus,
- for each non-context token in the corpus, means for counting occurrences of predetermined relationships of the non-context token to a plurality of context tokens,
- means for generating frequency vectors for each non-context token based upon the counted occurrences, and
- means for clustering the non-context tokens into clusters based upon the frequency vectors, thereby forming a grammatical model of the corpus.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20, 21
22. A file storing a grammar model of a corpus of speech, created according to a method comprising:
- identifying context tokens within a corpus,
- for each non-context token in the corpus, counting each occurrence of an identified context token having a predetermined relationship to the non-context token,
- generating frequency vectors for each non-context token based upon the counted occurrences,
- generating clusters of non-context tokens based upon a correlation between the frequency vectors, and
- storing the non-context tokens and a representation of the clusters in a file.
Dependent claims: 23, 24, 25, 26, 27, 28, 29, 30
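The clustering and storage steps recited in the claims (clustering into a cluster tree based on a correlation between frequency vectors, then storing the tokens and a representation of the clusters in a file) might be sketched as follows. This is an illustrative sketch only, under stated assumptions: the claims do not name a correlation measure, a merge rule, or a file format, so Pearson correlation, greedy average-linkage merging, and JSON output are all choices made here. The frequency vectors and the names `correlation`, `members`, `build_cluster_tree`, and `save_model` are hypothetical.

```python
import json
import math
import os
import tempfile


def correlation(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def members(node):
    """Flatten a tree node (a token string or a pair of nodes) into its tokens."""
    if isinstance(node, str):
        return [node]
    return members(node[0]) + members(node[1])


def build_cluster_tree(vectors):
    """Greedy agglomerative clustering (assumed merge rule: average linkage)."""
    def similarity(a, b):
        pairs = [(x, y) for x in members(a) for y in members(b)]
        return sum(correlation(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    nodes = sorted(vectors)
    while len(nodes) > 1:
        # Pick the most strongly correlated pair of clusters and merge it.
        i, j = max(
            ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))),
            key=lambda ij: similarity(nodes[ij[0]], nodes[ij[1]]),
        )
        merged = (nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]


def save_model(path, vectors, tree):
    """Store the tokens and the cluster tree as JSON (assumed format)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"tokens": sorted(vectors), "tree": tree}, f, indent=2)


# Hypothetical frequency vectors for four non-context tokens.
vectors = {
    "boston": [4, 5, 1, 0],
    "denver": [5, 4, 0, 1],
    "monday": [0, 1, 5, 4],
    "tuesday": [1, 0, 4, 5],
}
tree = build_cluster_tree(vectors)
model_path = os.path.join(tempfile.mkdtemp(), "grammar_model.json")
save_model(model_path, vectors, tree)
print(tree)
```

With these invented vectors the two city names merge into one subtree and the two day names into another before the final merge joins them, so cutting the tree one level below the root yields two word classes.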
Specification