Method and apparatus for measuring the degree of polysemy in polysemous words

US 6,256,629 B1
Filed: 11/25/1998
Issued: 07/03/2001
Est. Priority Date: 11/25/1998
Status: Expired due to Fees

Abstract

A system and apparatus are disclosed for identifying polysemous terms and for measuring their degree of polysemy. A polysemy index provides a quantitative measure of how polysemous a word is. A list of words can be ranked by their polysemy indices, with the most polysemous words appearing at the top of the list. A polysemy evaluation process collects a set of terms near a target term. Inter-term distances of the set of terms occurring near the target term are computed and the multi-dimensional distance space is reduced to two dimensions. The two-dimensional representation is converted into radial coordinates. Isotonic/antitonic regression techniques are used to compute the degree to which the distribution deviates from unimodality. The amount of deviation is the polysemy index. A corpus can be preprocessed using the polysemy indices to identify words having clearly separated senses, allowing an information retrieval system to return a separate list of documents for each sense of a word. Self-organizing sense disambiguation techniques can use the polysemy indices to select canonical contexts for the various senses identified for a given word. Contexts are selected containing terms falling in radial bins near each peak. Such contexts can then be used for subsequent training of a classifier.
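
The ranking step mentioned in the abstract can be illustrated with a minimal sketch. The function name `rank_by_polysemy` and the index values are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
def rank_by_polysemy(indices):
    """Return (word, index) pairs sorted from most to least polysemous."""
    return sorted(indices.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical polysemy indices, for illustration only.
example_indices = {"bank": 0.42, "telephone": 0.03, "crane": 0.31}
ranked = rank_by_polysemy(example_indices)
top_word = ranked[0][0]
```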

77 Citations

33 Claims

1. A method for measuring the degree of polysemy of a target term in a corpus, said method comprising the steps of:
- collecting a set of terms from said corpus within a certain window of said target term;
- computing a matrix of inter-term distances of said set of terms;
- reducing the dimension of said matrix of inter-term distances to a two dimensional representation;
- converting said two dimensional representation into radial coordinates; and
- deriving a polysemy index for said target term based on the degree to which the radial distribution deviates from unimodality.

2. The method according to claim 1, wherein isotonic/antitonic regression techniques are used to determine the degree to which the radial distribution deviates from unimodality.
3. The method according to claim 1, wherein an inverse goodness-of-fit metric is used as the polysemy index.
4. The method according to claim 1, further comprising the step of ranking a plurality of words according to their polysemy index.
5. The method according to claim 1, further comprising the step of filtering said set of terms.
6. The method according to claim 5, wherein said filter ensures that the ratio of the frequency of each term within said certain window to the frequency of said term in said corpus exceeds a predefined threshold.
7. The method according to claim 5, wherein said filter ensures that the frequency of each term within said certain window exceeds a predefined threshold.
8. The method according to claim 1, wherein said inter-term distances are computed from the equation:
   - $\mathrm{DIST}_{w}(q, q') = 1 - \frac{\frac{\mathrm{CO}_{w}(q \mid q')}{\mathrm{FREQ}_{C}(q)} + \frac{\mathrm{CO}_{w}(q' \mid q)}{\mathrm{FREQ}_{C}(q')}}{2}$
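
The distance equation of claim 8 can be sketched in code as follows. The claim does not define its symbols, so this sketch assumes that the CO term for an ordered pair counts how often the first term co-occurs with the second inside the window around the target word, and that FREQ is the term's frequency in the whole corpus. The function name, the dictionary layout, the zero diagonal, and the counts are all illustrative assumptions.

```python
import numpy as np


def distance_matrix(terms, co, freq):
    """Pairwise distances: 1 - (CO(q, q')/FREQ(q) + CO(q', q)/FREQ(q')) / 2."""
    n = len(terms)
    dist = np.zeros((n, n))  # diagonal left at 0 (an assumption of this sketch)
    for i, q in enumerate(terms):
        for j, q2 in enumerate(terms):
            if i == j:
                continue
            a = co.get((q, q2), 0) / freq[q]
            b = co.get((q2, q), 0) / freq[q2]
            dist[i, j] = 1.0 - (a + b) / 2.0
    return dist


# Hypothetical counts, for illustration only.
terms = ["money", "river"]
co = {("money", "river"): 2, ("river", "money"): 2}
freq = {"money": 10, "river": 4}
dist = distance_matrix(terms, co, freq)
```

Because the two normalized co-occurrence terms are averaged, the resulting matrix is symmetric by construction.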
9. The method according to claim 1, wherein said inter-term distances measure the similarity of each pair of co-occurring words.
10. The method according to claim 1, wherein said dimension reduction step selects the first two eigenvectors of the matrix as the two dimensions.
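
A minimal sketch of the dimension reduction in claim 10, using NumPy's symmetric eigensolver. The claim does not say how eigenvectors are ordered or scaled; this sketch assumes "first two" means the two eigenvalues of largest magnitude and returns unscaled unit eigenvectors. The function name `reduce_to_2d` and the sample matrix are hypothetical.

```python
import numpy as np


def reduce_to_2d(dist):
    """Project terms onto two eigenvectors of the (symmetrized) distance matrix."""
    sym = (dist + dist.T) / 2.0              # guard against small asymmetries
    eigvals, eigvecs = np.linalg.eigh(sym)   # eigenvalues in ascending order
    # Assumption: "first two" = the two eigenvalues largest in magnitude.
    order = np.argsort(np.abs(eigvals))[::-1][:2]
    return eigvecs[:, order]                 # shape (n_terms, 2)


# Hypothetical 3-term distance matrix, for illustration only.
sample = np.array([[0.0, 0.2, 0.9],
                   [0.2, 0.0, 0.8],
                   [0.9, 0.8, 0.0]])
coords = reduce_to_2d(sample)
```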
11. The method according to claim 1, wherein said converting step further comprises the steps of computing a distance_w(q) from an origin in said two dimensional representation and a cosine_w(q) with respect to a horizontal axis and, given a cosine_w(q), selecting a radial bin, and incrementing the entry for that bin by distance_w(q).
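
The radial binning of claim 11 might look like the sketch below. The claim does not state how a cosine value maps to a bin, so equal-width bins over the cosine range [-1, 1] are an assumption here, as are the function name, the default bin count, and the choice to skip points at the origin (where the cosine is undefined).

```python
import math


def radial_histogram(points, n_bins=12):
    """Accumulate each point's distance from the origin into a cosine-indexed bin."""
    hist = [0.0] * n_bins
    for x, y in points:
        distance = math.hypot(x, y)
        if distance == 0.0:
            continue  # cosine is undefined at the origin
        cosine = x / distance  # cosine of the angle with the horizontal axis
        # Map cosine in [-1, 1] to a bin index in [0, n_bins - 1].
        idx = min(int((cosine + 1.0) / 2.0 * n_bins), n_bins - 1)
        hist[idx] += distance
    return hist


# Hypothetical 2-D coordinates, for illustration only.
hist = radial_histogram([(1.0, 0.0), (0.0, 2.0), (-3.0, 0.0)], n_bins=4)
```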

12. A method for measuring the degree of polysemy of a target term in a corpus, said method comprising the steps of:
- collecting a set of terms from said corpus within a certain window of said target term;
- generating a radial distribution of inter-term distances for said set of terms; and
- deriving a polysemy index for said target term based on the degree to which the radial distribution deviates from unimodality.

13. The method according to claim 12, wherein isotonic/antitonic regression techniques are used to determine the degree to which the radial distribution deviates from unimodality.
14. The method according to claim 12, wherein an inverse goodness-of-fit metric is used as the polysemy index.
15. The method according to claim 12, further comprising the step of ranking a plurality of words according to their polysemy index.
16. The method according to claim 12, wherein said inter-term distances are computed from the equation:
    - $\mathrm{DIST}_{w}(q, q') = 1 - \frac{\frac{\mathrm{CO}_{w}(q \mid q')}{\mathrm{FREQ}_{C}(q)} + \frac{\mathrm{CO}_{w}(q' \mid q)}{\mathrm{FREQ}_{C}(q')}}{2}$
17. The method according to claim 12, wherein said inter-term distances measure the similarity of each pair of co-occurring words.
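
One way to measure deviation from unimodality with isotonic/antitonic regression (claims 13 and 14) is sketched below. It is an assumption of this sketch, not a statement of the patented procedure, that the histogram is treated as a linear sequence, that the fit is least-squares via the pool-adjacent-violators algorithm, and that the residual sum of squares of the best rise-then-fall fit serves as the inverse goodness-of-fit. All function names are hypothetical.

```python
def pava(values):
    """Least-squares non-decreasing (isotonic) fit via pool-adjacent-violators."""
    blocks = []  # each block is [mean, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the ordering constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks.pop()
            blocks.append([(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2])
    fit = []
    for mean, count in blocks:
        fit.extend([mean] * count)
    return fit


def sse(values, fit):
    """Sum of squared residuals between data and fit."""
    return sum((v - f) ** 2 for v, f in zip(values, fit))


def unimodal_deviation(hist):
    """Smallest residual over all rise-then-fall fits; 0 means perfectly unimodal."""
    best = float("inf")
    for k in range(len(hist) + 1):
        left, right = hist[:k], hist[k:]
        up = pava(left)                 # isotonic part
        down = pava(right[::-1])[::-1]  # antitonic part (fit the reversed data)
        best = min(best, sse(left, up) + sse(right, down))
    return best


single_peak = unimodal_deviation([1, 3, 5, 4, 2])
two_peaks = unimodal_deviation([1, 5, 1, 5, 1])
```

A single-peaked histogram yields zero deviation, while a histogram with two separated peaks yields a positive value, which is the behaviour a polysemy index of this kind would need.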

18. A system for measuring the degree of polysemy of a target term in a corpus, said system comprising:
- a memory for storing computer readable code; and
- a processor operatively coupled to said memory, said processor configured to:
  - collect a set of terms from said corpus within a certain window of said target term;
  - compute a matrix of inter-term distances of said set of terms;
  - reduce the dimension of said matrix of inter-term distances to a two dimensional representation;
  - convert said two dimensional representation into radial coordinates; and
  - derive a polysemy index for said target term based on the degree to which the radial distribution deviates from unimodality.

19. The system according to claim 18, wherein said processor employs isotonic/antitonic regression techniques to determine the degree to which the radial distribution deviates from unimodality.
20. The system according to claim 18, wherein an inverse goodness-of-fit metric is used as the polysemy index.
21. The system according to claim 18, wherein said processor is further configured to rank a plurality of words according to their polysemy index.
22. The system according to claim 18, wherein said processor is further configured to filter said set of terms.
23. The system according to claim 22, wherein said filter ensures that the ratio of the frequency of each term within said certain window to the frequency of said term in said corpus exceeds a predefined threshold.
24. The system according to claim 22, wherein said filter ensures that the frequency of each term within said certain window exceeds a predefined threshold.
25. The system according to claim 18, wherein said inter-term distances are computed from the equation:
    - $\mathrm{DIST}_{w}(q, q') = 1 - \frac{\frac{\mathrm{CO}_{w}(q \mid q')}{\mathrm{FREQ}_{C}(q)} + \frac{\mathrm{CO}_{w}(q' \mid q)}{\mathrm{FREQ}_{C}(q')}}{2}$
26. The system according to claim 18, wherein said inter-term distances measure the similarity of each pair of co-occurring words.
27. The system according to claim 18, wherein said processor selects the first two eigenvectors of the matrix as the two dimensions during said dimension reduction step.
28. The system according to claim 18, wherein said processor is further configured to compute a distance_w(q) from an origin in said two dimensional representation and a cosine_w(q) with respect to a horizontal axis and given a cosine_w(q), select a radial bin, and increment the entry for that bin by distance_w(q) during said converting step.
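
The collection and filtering operations recited for the processor (claims 18 and 22 to 24) can be sketched as follows. The window size, the thresholds, the token-list representation of the corpus, and both function names are assumptions made for illustration. The two filter criteria are separate claims; this sketch exposes both as optional thresholds on one function.

```python
from collections import Counter


def collect_window_terms(tokens, target, window=5):
    """Count terms that occur within `window` tokens of each target occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighbours = tokens[lo:i] + tokens[i + 1:hi]
        counts.update(t for t in neighbours if t != target)
    return counts


def filter_terms(window_counts, corpus_counts, min_ratio=0.0, min_count=0):
    """Keep terms whose window frequency and window/corpus ratio exceed thresholds."""
    return {
        term: c
        for term, c in window_counts.items()
        if c > min_count and c / corpus_counts[term] > min_ratio
    }


# Hypothetical toy corpus, for illustration only.
tokens = "the bank of the river near the bank vault".split()
window_counts = collect_window_terms(tokens, "bank", window=2)
kept = filter_terms(window_counts, Counter(tokens), min_count=1)
```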

29. A method for retrieving information from a corpus, said method comprising the steps of:
- deriving a polysemy index for words in said corpus;
- identifying words that are ambiguous in said corpus using said polysemy index;
- storing a list of words that are associated with each sense of each of said identified ambiguous words; and
- classifying returned documents according to one of said senses if a query includes one of said identified ambiguous words.

30. The method according to claim 29, wherein said polysemy index is based on the degree to which a radial distribution deviates from a single-peak model.
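
The classification step of claim 29 could be approximated as in the sketch below. The claim does not specify a classifier; counting the overlap between a document's tokens and each stored sense word list is an assumption of this sketch, and the sense labels, word lists, and function name are hypothetical.

```python
def classify_by_sense(doc_tokens, sense_words):
    """Assign a document to the sense whose word list it overlaps most, else None."""
    doc = set(doc_tokens)
    scores = {sense: len(doc & set(words)) for sense, words in sense_words.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical sense word lists for an ambiguous query word, for illustration only.
sense_words = {"finance": ["money", "loan"], "river": ["water", "shore"]}
label_a = classify_by_sense("the loan was paid with money".split(), sense_words)
label_b = classify_by_sense("we walked along the shore".split(), sense_words)
label_c = classify_by_sense("nothing relevant here".split(), sense_words)
```

Grouping returned documents by these labels gives one result list per sense.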

31. A system for retrieving information from a corpus, said system comprising:
- a memory for storing computer readable code; and
- a processor operatively coupled to said memory, said processor configured to:
  - derive a polysemy index for words in said corpus;
  - identify words that are ambiguous in said corpus using said polysemy index;
  - store a list of words that are associated with each sense of each of said identified ambiguous words; and
  - classify returned documents according to one of said senses if a query includes one of said identified ambiguous words.

32. A method for selecting canonical contexts for the plurality of senses of words in a corpus, said method comprising the steps of:
- deriving a polysemy index for one or more words in said corpus;
- identifying words that are ambiguous in said corpus using said polysemy index;
- collecting a list of words that are associated with each sense of said identified ambiguous words; and
- obtaining seed examples using said collected list of words to train a self-organizing sense-disambiguation algorithm.
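
Seed-example selection as recited in claim 32 might be sketched like this. Keeping only contexts that match exactly one sense list is a design choice of this sketch rather than something the claim states; the function name, the `min_hits` parameter, and the sample data are hypothetical.

```python
def select_seed_contexts(contexts, sense_words, min_hits=1):
    """Collect contexts that contain words from exactly one sense list."""
    seeds = {sense: [] for sense in sense_words}
    for ctx in contexts:
        tokens = set(ctx)
        matched = [
            sense
            for sense, words in sense_words.items()
            if len(tokens & set(words)) >= min_hits
        ]
        if len(matched) == 1:  # skip contexts that are silent or mixed
            seeds[matched[0]].append(ctx)
    return seeds


# Hypothetical contexts around the word "bank", for illustration only.
sense_lists = {"finance": ["money", "loan"], "river": ["water", "shore"]}
contexts = [
    "the bank approved the loan".split(),
    "the bank by the water".split(),
    "money washed up on the shore by the bank".split(),
]
seeds = select_seed_contexts(contexts, sense_lists)
```

The resulting per-sense context lists could then serve as initial training examples for a sense classifier.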

33. A system for selecting canonical contexts for the plurality of senses of words in a corpus, said system comprising:
- a memory for storing computer readable code; and
- a processor operatively coupled to said memory, said processor configured to:
  - derive a polysemy index for one or more words in said corpus;
  - identify words that are ambiguous in said corpus using said polysemy index;
  - collect a list of words that are associated with each sense of said identified ambiguous words; and
  - obtain seed examples using said collected list of words to train a self-organizing sense-disambiguation algorithm.

Current Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee
Lucent Technologies, Inc. (Nokia Corporation)
Inventors
Sproat, Richard William, VanSanten, Jan Pieter
Primary Examiner(s)
Black, Thomas
Assistant Examiner(s)
CHEUNG, MARY DA ZHI WANG

Application Number

US09/199,670
Time in Patent Office

951 Days
Field of Search

707/1, 707/2, 707/3, 707/4, 707/6, 707/7, 704/1, 704/2, 704/3, 704/4, 704/7, 704/9, 704/10, 704/200, 704/216, 704/217, 704/218, 704/219, 704/220, 704/222, 704/237, 704/238, 704/239, 704/240, 704/243, 704/245
US Class Current

1/1
CPC Class Codes

G06F 40/216   using statistical methods

G06F 40/289   Phrasal analysis, e.g. fini...

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99936   Pattern matching access
