Automatic clustering of tokens from a corpus for grammar acquisition

US 6,751,584 B2
Filed: 07/26/2001
Issued: 06/15/2004
Est. Priority Date: 12/07/1998
Status: Expired due to Term

Abstract

In a method of learning grammar from a corpus, context words are identified from a corpus. For the other non-context words, the method counts the occurrence of predetermined relationships with the context words, and maps the counted occurrences to a multidimensional frequency space. Clusters are grown from the frequency vectors. The clusters represent classes of words; words in the same cluster possess the same lexical significancy and provide an indicator of grammatical structure.
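
The counting step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that context tokens are simply the most frequent tokens in the corpus, and that the predetermined relationships are adjacency at offsets -1 and +1. The function name `frequency_vectors` and its parameters are hypothetical.

```python
from collections import Counter


def frequency_vectors(tokens, num_context=2, offsets=(-1, 1)):
    """Count how often each non-context token sits next to each context token.

    Assumption: the `num_context` most frequent tokens act as context tokens.
    Assumption: each offset is one "predetermined relationship"
    (-1 = context token directly before, +1 = directly after).
    """
    counts = Counter(tokens)
    context = [tok for tok, _ in counts.most_common(num_context)]
    ctx_index = {tok: i for i, tok in enumerate(context)}
    dims = len(context) * len(offsets)  # one dimension per (relationship, context token)

    vectors = {}
    for pos, tok in enumerate(tokens):
        if tok in ctx_index:
            continue  # context tokens do not get a vector of their own
        vec = vectors.setdefault(tok, [0] * dims)
        for k, off in enumerate(offsets):
            j = pos + off
            if 0 <= j < len(tokens) and tokens[j] in ctx_index:
                vec[k * len(context) + ctx_index[tokens[j]]] += 1
    return context, vectors


tokens = "the cat sat on the mat the dog sat on the rug".split()
context, vectors = frequency_vectors(tokens, num_context=1)
print(context)        # ['the']
print(vectors["on"])  # [0, 2]
```

Here "on" is followed by "the" twice and never preceded by it, so its vector is `[0, 2]`. Tokens with similar vectors, such as "cat" and "dog", are candidates for the same cluster.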

62 Citations

30 Claims

1. A method of grammar learning from a corpus, comprising:
- identifying context tokens within the corpus,
- for each non-context token in the corpus, counting occurrences of a predetermined relationship of the non-context token to the context tokens,
- generating frequency vectors for each non-context token based upon the counted occurrences,
- clustering non-context tokens based upon the frequency vectors into a cluster, and
- writing a representation of the cluster to a file, wherein the cluster indicates a lexical correlation among the clustered non-context tokens.

2. A machine-readable medium having stored thereon executable instructions that when executed by a processor, cause the processor to:
- identify context tokens within a corpus,
- count occurrences of predetermined relationships of non-context tokens to the context tokens,
- generate frequency vectors for each non-context token based on the counted occurrences, and
- cluster the non-context tokens based upon the frequency vectors into a cluster tree.
- Dependent Claims (3, 4, 5, 6, 7, 8, 9, 10, 11)
  - 3. The machine-readable medium of claim 2, wherein the cluster tree represents a grammatical relationship among the non-context tokens.
  - 4. The machine-readable medium of claim 2, wherein the instructions further cause the processor to cut the cluster tree along a cutting line to separate large clusters from small clusters.
  - 5. The machine-readable medium of claim 4, wherein small clusters are ranked according to a compactness value.
  - 6. The machine-readable medium of claim 2, wherein the predetermined relationship is a measure of adjacency.
  - 7. The machine-readable medium of claim 2, wherein the clustering is performed based on Euclidean distances between the frequency vectors.
  - 8. The machine-readable medium of claim 2, wherein the clustering is performed based on Manhattan distances between the frequency vectors.
  - 9. The machine-readable medium of claim 2, wherein the clustering is performed based on maximum distance metrics between the frequency vectors.
  - 10. The machine-readable medium of claim 2, further comprising normalizing the frequency vectors based upon a number of occurrences of the non-context token in the corpus.
  - 11. The machine-readable medium of claim 2, wherein the frequency vectors are multi-dimensional vectors, the number of dimensions being determined by the number of context tokens and a number of predetermined relationships of non-context tokens to the context token being counted.
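
Claims 7 through 10 name three distance measures and a normalization step. The sketch below shows one plausible reading of each. It assumes that "maximum distance" means the largest per-dimension difference (Chebyshev distance) and that normalization divides each count by the token's total occurrences. All function names are hypothetical.

```python
def normalize(vec, token_count):
    """Scale raw counts by how often the non-context token occurs (claim 10).

    Assumption: normalization is a plain division by the occurrence count.
    """
    return [c / token_count for c in vec]


def euclidean(a, b):
    """Straight-line distance between two frequency vectors (claim 7)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute per-dimension differences (claim 8)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def maximum(a, b):
    """Largest absolute per-dimension difference (claim 9).

    Assumption: the claim's "maximum distance metric" is read here as
    Chebyshev distance.
    """
    return max(abs(x - y) for x, y in zip(a, b))


a, b = [0, 0], [3, 4]
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7
print(maximum(a, b))    # 4
```

Normalizing first keeps a frequent token and a rare token comparable: without it, the distance would mostly reflect how often each token occurs rather than where it occurs.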

12. A grammar modeling apparatus, comprising:
- means for identifying context tokens within an input corpus,
- for each non-context token in the corpus, means for counting occurrences of predetermined relationships of the non-context token to a plurality of context tokens,
- means for generating frequency vectors for each non-context token based upon the counted occurrences, and
- means for clustering the non-context tokens into clusters based upon the frequency vectors, thereby forming a grammatical model of the corpus.
- Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20, 21)
  - 13. The apparatus of claim 12, wherein the clusters form a hierarchical cluster tree.
  - 14. The apparatus of claim 13, wherein the hierarchical cluster tree is cut along a cutting line.
  - 15. The apparatus of claim 12, wherein a cluster may be represented by a centroid vector.
  - 16. The apparatus of claim 12, wherein the predetermined relationship is a measure of adjacency.
  - 17. The apparatus of claim 12, wherein the clusters are created based on Euclidean distances between the frequency vectors.
  - 18. The apparatus of claim 12, wherein the clusters are created based on Manhattan distances between the frequency vectors.
  - 19. The apparatus of claim 12, wherein the clusters are created based on maximum distance metrics between the frequency vectors.
  - 20. The apparatus of claim 12, wherein the frequency vectors are normalized based upon the number of occurrences of the non-context token in the corpus.
  - 21. The apparatus of claim 12, wherein the frequency vectors are multi-dimensional vectors, the number of dimensions of which is determined by the number of context tokens and the number of predetermined relationships of non-context tokens to context tokens.
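
Claims 13 through 15 describe a hierarchical cluster tree, a cutting line, and centroid vectors. One way these could fit together is bottom-up merging of the two clusters with the closest centroids, stopping once the closest pair is farther apart than the cutting line. The sketch below assumes that reading, uses Euclidean distance, and uses hypothetical names throughout. It is not taken from the patent's specification.

```python
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_tokens(vectors, cutting_line):
    """Merge closest clusters bottom-up until the cutting line is reached.

    `vectors` maps each non-context token to its frequency vector.
    Returns the clusters as sorted lists of tokens.
    """
    # Start with one singleton cluster per token: (tokens, member vectors).
    clusters = [([tok], [vec]) for tok, vec in sorted(vectors.items())]
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: euclidean(centroid(clusters[p[0]][1]),
                                    centroid(clusters[p[1]][1])),
        )
        gap = euclidean(centroid(clusters[i][1]), centroid(clusters[j][1]))
        if gap > cutting_line:
            break  # everything left is farther apart than the cutting line
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(toks) for toks, _ in clusters)


vectors = {"cat": [1.0, 0.0], "dog": [1.0, 0.0],
           "rug": [0.9, 0.1], "on": [0.0, 2.0]}
print(cluster_tokens(vectors, cutting_line=0.5))
# [['cat', 'dog', 'rug'], ['on']]
```

A lower cutting line yields more, tighter clusters. A higher one merges further up the tree until all tokens share a single cluster.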

22. A file storing a grammar model of a corpus of speech, created according to a method comprising:
- identifying context tokens within a corpus,
- for each non-context token in the corpus, counting each occurrence of an identified context token having a predetermined relationship to the non-context token,
- generating frequency vectors for each non-context token based upon the counted occurrences,
- generating clusters of non-context tokens based upon a correlation between the frequency vectors, and
- storing the non-context tokens and a representation of the clusters in a file.
- Dependent Claims (23, 24, 25, 26, 27, 28, 29, 30)
  - 23. The file of claim 22, wherein the clusters may be represented by centroid vectors.
  - 24. The file of claim 22, wherein the predetermined relationship is adjacency.
  - 25. The file of claim 22, wherein the correlation is based on Euclidean distance.
  - 26. The file of claim 22, wherein the correlation is based on Manhattan distance.
  - 27. The file of claim 22, wherein the correlation is based on a maximum distance metric.
  - 28. The file of claim 22, wherein the frequency vectors are normalized based upon the number of occurrences of the non-context token in the corpus.
  - 29. The file of claim 22, wherein the frequency vectors are multi-dimensional vectors, the number of dimensions of which is determined by the number of context tokens and the number of predetermined relationships of non-context tokens to context tokens.
  - 30. The file of claim 22, wherein the file is stored on a machine-readable medium.
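
Claims 22 and 23 describe a file that stores the non-context tokens together with a representation of the clusters, optionally as centroid vectors. The claims do not fix a file format, so the sketch below assumes JSON purely for illustration. The function name `write_cluster_file` and the key names are hypothetical.

```python
import json


def write_cluster_file(path, clusters, vectors):
    """Store each cluster's tokens and centroid vector in a JSON file.

    Assumption: JSON layout and key names are illustrative only.
    `clusters` is a list of token lists; `vectors` maps token -> vector.
    """
    records = []
    for tokens in clusters:
        members = [vectors[tok] for tok in tokens]
        # Centroid = component-wise mean of the member vectors.
        center = [sum(col) / len(members) for col in zip(*members)]
        records.append({"tokens": tokens, "centroid": center})
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"clusters": records}, fh, indent=2)
    return records
```

Storing the centroid next to the member tokens lets a later process assign an unseen token to the nearest cluster without recomputing the member vectors.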

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: AT&T Corporation (AT&T, Inc.)
Inventors: Bangalore, Srinivas; Riccardi, Giuseppe
Primary Examiner(s): EDOUARD, PATRICK NESTOR
Application Number: US09/912,461
Publication Number: US 20020002454A1
Time in Patent Office: 1,055 Days
Field of Search: 704/1, 704/9, 704/10, 704/255, 704/256, 704/257
US Class Current: 704/1
CPC Class Codes:
- G06F 18/24323   Tree-organised classifiers
- G06F 40/253   Grammatical analysis; Style...
- G06F 40/279   Recognition of textual enti...
- G06F 40/284   Lexical analysis, e.g. toke...
