Automatic clustering of tokens from a corpus for grammar acquisition

US 6,317,707 B1
Filed: 12/07/1998
Issued: 11/13/2001
Est. Priority Date: 12/07/1998
Status: Expired due to Term



Abstract

In a method of learning grammar from a corpus, context words are identified from a corpus. For the other non-context words, the method counts the occurrence of predetermined relationships with the context words, and maps the counted occurrences to a multidimensional frequency space. Clusters are grown from the frequency vectors. The clusters represent classes of words; words in the same cluster possess the same lexical significancy and provide an indicator of grammatical structure.
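The counting step in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `frequency_vectors`, the `offsets` parameter, and the choice to pass the context words in as an argument are assumptions made for illustration, since this listing does not say how context words are identified. Here the "predetermined relationship" is taken to be adjacency at a fixed offset, so each dimension is one (offset, context word) pair.

```python
from collections import Counter, defaultdict


def frequency_vectors(tokens, context_words, offsets=(-1, 1)):
    """Count how often each non-context word sits at a given offset from a context word.

    Hypothetical helper: `context_words` and `offsets` are supplied by the caller.
    """
    context = set(context_words)
    # One dimension per (adjacency offset, context word) pair.
    dims = [(off, c) for off in offsets for c in sorted(context)]
    counts = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if tok in context:
            continue
        vec = counts[tok]  # registers the word even if it never neighbours a context word
        for off in offsets:
            j = i + off
            if 0 <= j < len(tokens) and tokens[j] in context:
                vec[(off, tokens[j])] += 1
    return dims, {word: [c[d] for d in dims] for word, c in counts.items()}


tokens = "the cat sat on the mat".split()
dims, vectors = frequency_vectors(tokens, context_words={"the", "on"})
```

In this toy corpus "cat" and "mat" both follow "the" once and have no other context neighbour, so they receive identical vectors and would fall into the same cluster.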


21 Claims

1. A grammar learning method from a corpus, comprising:
- identifying context tokens within the corpus, for each non-context token in the corpus, counting occurrences of predetermined relationships of the non-context token to a context token, generating frequency vectors for each non-context token based upon the counted occurrences, and clustering non-context tokens based upon the frequency vectors, whereby the clusters of non-context tokens form a grammatical model of the corpus.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The method of claim 1, wherein the clustering is performed based on Euclidean distances between frequency vectors.
  - 3. The method of claim 1, wherein the clustering is performed based on Manhattan distances between frequency vectors.
  - 4. The method of claim 1, wherein the clustering is performed based on maximum distance metrics between frequency vectors.
  - 5. The method of claim 1, wherein the frequency vectors of a non-context token are normalized based upon the number of occurrences of the non-context token in the corpus.
  - 6. The method of claim 1, wherein the frequency vectors are multi dimensional vectors, the number of dimensions determined by the number of context tokens and different levels of adjacency identified in the counting step.
  - 7. The method of claim 1, wherein clusters may be represented by a centroid vector.
  - 8. The method of claim 1, wherein the grammar learning method is unsupervised.
  - 9. The method of claim 1, wherein the corpus is a string of words in a language.
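The distance measures, normalization, and centroid named in claims 2 through 7 can be sketched in a few lines of Python. This is an illustrative reading, not the patent's implementation: all function names are hypothetical, and "maximum distance" is assumed here to mean the largest per-dimension difference (the Chebyshev distance).

```python
import math


def normalize(vector, occurrences):
    """Scale raw counts by the word's total number of occurrences in the corpus."""
    return [v / occurrences for v in vector]


def euclidean(a, b):
    """Straight-line distance between two frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def maximum(a, b):
    """Largest absolute per-dimension difference (assumed reading of 'maximum distance')."""
    return max(abs(x - y) for x, y in zip(a, b))


def centroid(vectors):
    """Per-dimension mean of a cluster's member vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]
```

Normalizing before measuring distance keeps a frequent word and a rare word comparable when they appear in the same contexts in the same proportions.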

10. A method of phrase grammar learning from a corpus, comprising:
- identifying context tokens within the corpus, for each non-context token in the corpus, counting occurrences of predetermined relationships of a context token, generating frequency vectors for each non-context token based upon the counted occurrences, and clustering non-context tokens based upon the frequency vectors into a cluster tree, whereby the cluster tree forms a grammatical model of the corpus.
- Dependent claims (11, 12, 13, 14, 15)
  - 11. The method of claim 10, wherein the clustering is performed based on Euclidean distances between frequency vectors.
  - 12. The method of claim 10, wherein the clustering is performed based on Manhattan distances between frequency vectors.
  - 13. The method of claim 10, wherein the frequency vectors of a non-context token are normalized based upon the number of occurrences of the non-context token in the corpus.
  - 14. The method of claim 10, wherein the frequency vectors are multi dimensional vectors, the number of dimensions determined by the number of context tokens and different levels of adjacency identified in the counting step.
  - 15. The method of claim 10, wherein clusters may be represented by a centroid vector.
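One way to picture the cluster tree of claim 10 is greedy bottom-up merging, sketched below in Python. The listing does not specify the merging procedure, so this is an assumption: the sketch repeatedly merges the two clusters whose centroids are closest, and it represents the tree as nested tuples `(left, right, height)` with words as leaves. The names `build_cluster_tree`, `centroid`, and `manhattan` are hypothetical.

```python
def centroid(vectors):
    """Per-dimension mean of a cluster's member vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def manhattan(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def build_cluster_tree(vectors, distance=manhattan):
    """Merge the closest pair of clusters until one tree remains.

    `vectors` maps each word to its frequency vector and must not be empty.
    A leaf is a word; an internal node is (left, right, merge_distance).
    """
    clusters = [(word, [vec]) for word, vec in sorted(vectors.items())]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(centroid(clusters[i][1]), centroid(clusters[j][1]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = ((clusters[i][0], clusters[j][0], d), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


tree = build_cluster_tree({"cat": [0, 1], "mat": [0, 1], "sat": [1, 0]})
```

With these toy vectors, "cat" and "mat" merge first at distance 0, and "sat" joins them only at the root.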

16. A method of phrase grammar learning from a corpus, comprising:
- identifying context words from a corpus, for each non-context word in the corpus, counting occurrences of the non-context word within a predetermined adjacency of a context word, generating frequency vectors for each non-context word based upon the counted occurrences, clustering non-context words based on the frequency vectors into a cluster tree, and cutting the cluster tree along a cutting line, thereby forming a grammatical model of the corpus.
- Dependent claims (17, 18, 19, 20, 21)
  - 17. The method of claim 16, wherein the clustering is performed based on Euclidean distances between frequency vectors.
  - 18. The method of claim 16, wherein the clustering is performed based on maximum distance metrics between frequency vectors.
  - 19. The method of claim 16, wherein the frequency vectors of a non-context word are normalized based upon the number of occurrences of the non-context word in the corpus.
  - 20. The method of claim 16, wherein the frequency vectors are multi dimensional vectors, the number of dimensions determined by the number of context words and different levels of adjacency to context words to be recognized.
  - 21. The method of claim 16, wherein clusters may be represented by a centroid vector.
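The cutting step of claim 16 can be sketched as follows, assuming (hypothetically) that the cluster tree is stored as nested tuples `(left, right, height)` with words as leaves, where `height` is the distance at which the two branches were merged. The function names and the threshold semantics are illustrative assumptions, not the patent's definition of a cutting line.

```python
def leaves(tree):
    """Collect the words under a subtree."""
    if isinstance(tree, str):
        return [tree]
    left, right, _height = tree
    return leaves(left) + leaves(right)


def cut_tree(tree, cutting_height):
    """Split the tree at `cutting_height` and return the resulting word classes.

    A subtree merged at or below the cutting height is kept whole as one class;
    a subtree merged above it is split into its two branches.
    """
    if isinstance(tree, str) or tree[2] <= cutting_height:
        return [sorted(leaves(tree))]
    left, right, _height = tree
    return cut_tree(left, cutting_height) + cut_tree(right, cutting_height)


tree = ("sat", ("cat", "mat", 0.0), 2.0)
classes = cut_tree(tree, cutting_height=1.0)
```

Raising the cutting height yields fewer, broader classes; lowering it yields more, narrower ones.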


Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: AT&T Corporation (AT&T, Inc.)
Inventors: Bangalore, Srinivas; Riccardi, Giuseppe
Primary Examiner(s): Edouard, Patrick N.
Application Number: US09/207,326
Time in Patent Office: 1,072 Days
Field of Search: 704/1, 704/9, 704/10, 704/255, 704/256, 704/257, 704/245, 704/238-240
US Class Current: 704/9
CPC Class Codes:
G06F 18/24323   Tree-organised classifiers
G06F 40/253   Grammatical analysis; Style...
G06F 40/279   Recognition of textual enti...
G06F 40/284   Lexical analysis, e.g. toke...
