Automatic clustering of tokens from a corpus for grammar acquisition

US 20020002454A1
Filed: 07/26/2001
Published: 01/03/2002
Est. Priority Date: 12/07/1998
Status: Active Grant

First Claim

1. A grammar learning method from a corpus, comprising:

identifying context tokens within the corpus, for each non-context token in the corpus, counting occurrences of a predetermined relationship of the non-context token to a context token, generating frequency vectors for each non-context token based upon the counted occurrences, and clustering non-context tokens based upon the frequency vectors.

4 Assignments

0 Petitions

Abstract

In a method of learning grammar from a corpus, context words are identified from the corpus. For the remaining non-context words, the method counts occurrences of predetermined relationships with the context words and maps the counted occurrences to a multidimensional frequency space. Clusters are grown from the frequency vectors. The clusters represent classes of words; words in the same cluster possess the same lexical significancy and provide an indicator of grammatical structure.
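
The counting and mapping steps in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration, not the patent's disclosed implementation: the function name `frequency_vectors`, the choice of the most frequent words as context words, and the reading of a "predetermined relationship" as a context word sitting at a fixed offset from the non-context word. Each vector has one dimension per (context word, offset) pair and is divided by the word's total occurrence count.

```python
from collections import Counter


def frequency_vectors(sentences, n_context=2, offsets=(-1, 1)):
    """Return (context words, dimensions, normalized vector per non-context word).

    Assumptions (hypothetical, not from the source): context words are the
    n_context most frequent words, and the counted relationship is "context
    word found at position i + offset relative to the non-context word".
    """
    counts = Counter(w for s in sentences for w in s)
    # Rank by descending frequency, then alphabetically, so the pick is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    context = ranked[:n_context]

    # One dimension per (context word, offset) pair.
    dims = [(c, off) for c in context for off in offsets]
    index = {d: i for i, d in enumerate(dims)}
    vectors = {w: [0.0] * len(dims) for w in ranked[n_context:]}

    for s in sentences:
        for i, w in enumerate(s):
            if w not in vectors:
                continue  # context words are not clustered themselves
            for off in offsets:
                j = i + off
                if 0 <= j < len(s) and (s[j], off) in index:
                    vectors[w][index[(s[j], off)]] += 1.0

    # Normalize each vector by how often its word occurs in the corpus.
    for w, vec in vectors.items():
        vectors[w] = [x / counts[w] for x in vec]
    return context, dims, vectors


corpus = [
    "to boston from denver".split(),
    "to denver from dallas".split(),
    "to dallas from boston".split(),
]
context, dims, vectors = frequency_vectors(corpus)
for word in sorted(vectors):
    print(word, vectors[word])
```

In this toy corpus the three city names occur in the same positions relative to "to" and "from", so they receive identical vectors, which is the property a later clustering step would rely on.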

40 Citations

19 Claims

1. A grammar learning method from a corpus, comprising:
- identifying context tokens within the corpus, for each non-context token in the corpus, counting occurrences of a predetermined relationship of the non-context token to a context token, generating frequency vectors for each non-context token based upon the counted occurrences, and clustering non-context tokens based upon the frequency vectors.
  - 2. The method of claim 1, wherein the clustering is performed based on Euclidean distances between frequency vectors.
  - 3. The method of claim 1, wherein the clustering is performed based on Manhattan distances between frequency vectors.
  - 4. The method of claim 1, wherein the clustering is performed based on maximum distance metrics between frequency vectors.
  - 5. The method of claim 1, wherein the frequency vectors of a non-context token are normalized based upon the number of occurrences of the non-context token in the corpus.
  - 6. The method of claim 1, wherein the frequency vectors are multidimensional vectors, the number of dimensions determined by the number of context tokens and the different levels of adjacency identified in the counting step.
  - 7. The method of claim 1, wherein clusters may be represented by a centroid vector.
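
The distance measures named in claims 2 to 4 and the centroid representation in claim 7 can be sketched as follows. This is an illustrative reading rather than the patent's own definitions: "maximum distance" is assumed to mean the largest per-dimension absolute difference (the Chebyshev distance), the centroid is assumed to be the component-wise mean, and all function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of per-dimension absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def maximum(u, v):
    """Largest per-dimension absolute difference (assumed meaning of 'maximum distance')."""
    return max(abs(a - b) for a, b in zip(u, v))


def centroid(vectors):
    """Component-wise mean of a cluster's member vectors (assumed centroid definition)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


u, v = [0.0, 0.0], [3.0, 4.0]
print(euclidean(u, v), manhattan(u, v), maximum(u, v))
print(centroid([u, v]))
```

Representing a cluster by its centroid lets a new token be compared against one vector per cluster instead of against every member.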

8. A method of phrase grammar learning from a corpus, comprising:
- identifying context tokens within the corpus, for each non-context token in the corpus, counting occurrences of a predetermined relationship of the non-context token to a context token, generating frequency vectors for each non-context token based upon the counted occurrences, and clustering non-context tokens based upon the frequency vectors into a cluster tree.
  - 9. The method of claim 8, wherein the clustering is performed based on Euclidean distances between frequency vectors.
  - 10. The method of claim 8, wherein the clustering is performed based on Manhattan distances between frequency vectors.
  - 11. The method of claim 8, wherein the frequency vectors of a non-context token are normalized based upon the number of occurrences of the non-context token in the corpus.
  - 12. The method of claim 8, wherein the frequency vectors are multidimensional vectors, the number of dimensions determined by the number of context tokens and the different levels of adjacency identified in the counting step.
  - 13. The method of claim 8, wherein clusters may be represented by a centroid vector.
  - 15. The method of claim 8, wherein the clustering is performed based on Euclidean distances between frequency vectors.
  - 16. The method of claim 8, wherein the clustering is performed based on maximum distance metrics between frequency vectors.
  - 17. The method of claim 8, wherein the frequency vectors of a non-context token are normalized based upon the number of occurrences of the non-context token in the corpus.
  - 18. The method of claim 8, wherein the frequency vectors are multidimensional vectors, the number of dimensions determined by the number of context tokens and different levels of adjacency to context tokens to be recognized.
  - 19. The method of claim 8, wherein clusters may be represented by a centroid vector.

14. A method of phrase grammar learning from a corpus, comprising:
- identifying context words from a corpus, for each non-context word in the corpus, counting occurrences of the non-context word within a predetermined adjacency of a context word, generating frequency vectors for each non-context word based upon the counted occurrences, clustering non-context tokens based on the frequency vectors into a cluster tree, and cutting the cluster tree along a cutting line.
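
Claims 8 and 14 describe clustering into a cluster tree and then cutting the tree along a cutting line. One way to picture this, under assumptions of my own, is bottom-up (agglomerative) merging of the closest clusters, followed by a cut that keeps only merges below a distance threshold. The claims do not specify the merge rule or how the cutting line is chosen; the centroid-based merging, the `threshold` parameter, and the names `build_cluster_tree` and `cut_tree` are all hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_cluster_tree(vectors):
    """Repeatedly merge the two clusters with the closest centroids.

    Returns the merges in order as (distance, members_a, members_b);
    the sequence of merges is the cluster tree, leaves first.
    """
    clusters = {i: ([w], vec) for i, (w, vec) in enumerate(sorted(vectors.items()))}
    next_id = len(clusters)
    merges = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        d, a, b = min(
            (euclidean(clusters[a][1], clusters[b][1]), a, b)
            for i, a in enumerate(ids)
            for b in ids[i + 1:]
        )
        (wa, ca), (wb, cb) = clusters.pop(a), clusters.pop(b)
        members = wa + wb
        # Size-weighted centroid of the merged cluster.
        merged = [(x * len(wa) + y * len(wb)) / len(members) for x, y in zip(ca, cb)]
        clusters[next_id] = (members, merged)
        merges.append((d, wa, wb))
        next_id += 1
    return merges


def cut_tree(words, merges, threshold):
    """Replay merges until one exceeds the threshold (the assumed 'cutting line')."""
    cluster_of = {w: frozenset([w]) for w in words}
    for d, wa, wb in merges:
        if d > threshold:
            break
        merged = frozenset(wa + wb)
        for w in merged:
            cluster_of[w] = merged
    return sorted(sorted(c) for c in set(cluster_of.values()))


# Toy normalized frequency vectors (made-up values for illustration only).
toy = {
    "boston": [1.0, 0.0],
    "denver": [0.9, 0.1],
    "monday": [0.0, 1.0],
    "tuesday": [0.1, 0.9],
}
tree = build_cluster_tree(toy)
print(cut_tree(toy, tree, threshold=0.5))
```

A lower threshold yields more, tighter word classes; a higher one yields fewer, broader classes, up to a single cluster holding every word.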

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: AT&T Corporation (AT&T, Inc.)
Inventors: Riccardi, Giuseppe; Bangalore, Srinivas

Granted Patent: US 6,751,584 B2
US Class Current: 704/9

CPC Class Codes
- G06F 18/24323   Tree-organised classifiers
- G06F 40/253   Grammatical analysis; Style...
- G06F 40/279   Recognition of textual enti...
- G06F 40/284   Lexical analysis, e.g. toke...
