Method and apparatus for measuring the degree of polysemy in polysemous words
Abstract
A system and apparatus are disclosed for identifying polysemous terms and for measuring their degree of polysemy. A polysemy index provides a quantitative measure of how polysemous a word is. A list of words can be ranked by their polysemy indices, with the most polysemous words appearing at the top of the list. A polysemy evaluation process collects a set of terms near a target term. Inter-term distances of the set of terms occurring near the target term are computed, and the multi-dimensional distance space is reduced to two dimensions. The two-dimensional representation is converted into radial coordinates. Isotonic/antitonic regression techniques are used to compute the degree to which the distribution deviates from unimodality. The amount of deviation is the polysemy index. A corpus can be preprocessed using the polysemy indices to identify words having clearly separated senses, allowing an information retrieval system to return a separate list of documents for each sense of a word. Self-organizing sense disambiguation techniques can use the polysemy indices to select canonical contexts for the various senses identified for a given word. Contexts are selected containing terms falling in radial bins near each peak. Such contexts can then be used for subsequent training of a classifier.
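The abstract names isotonic/antitonic regression as the tool for measuring deviation from unimodality but does not spell out a procedure here. The following is a minimal sketch of one common way to do it, not the patented procedure itself: for every candidate peak position, fit a non-decreasing curve to the bins on the left and a non-increasing curve to the bins on the right, then report the smallest total squared error. The function names are hypothetical, and treating the radial histogram as a linear (non-circular) sequence is a simplifying assumption.

```python
def pava_increasing(values):
    """Least-squares non-decreasing fit (pool-adjacent-violators)."""
    blocks = []  # each block is [mean, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge the last two blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    fit = []
    for mean, count in blocks:
        fit.extend([mean] * count)
    return fit


def pava_decreasing(values):
    """Least-squares non-increasing (antitonic) fit."""
    return pava_increasing(values[::-1])[::-1]


def squared_error(values, fit):
    return sum((v - f) ** 2 for v, f in zip(values, fit))


def unimodality_deviation(bins):
    """Smallest squared error of any rise-then-fall fit to `bins` (a list).

    Assumption: this error is used as a stand-in for the polysemy index.
    """
    best = None
    for k in range(len(bins) + 1):
        left, right = bins[:k], bins[k:]
        err = (squared_error(left, pava_increasing(left))
               + squared_error(right, pava_decreasing(right)))
        if best is None or err < best:
            best = err
    return best
```

Under this sketch a single-peaked histogram such as `[1, 3, 5, 4, 2]` scores zero, while a two-peaked histogram scores above zero, so larger values suggest more clearly separated senses.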
77 Citations
33 Claims
-
1. A method for measuring the degree of polysemy of a target term in a corpus, said method comprising the steps of:
-
collecting a set of terms from said corpus within a certain window of said target term;
computing a matrix of inter-term distances of said set of terms;
reducing the dimension of said matrix of inter-term distances to a two dimensional representation;
converting said two dimensional representation into radial coordinates; and
deriving a polysemy index for said target term based on the degree to which the radial distribution deviates from unimodality.
where FREQ_w(q) is the frequency of the term within such certain window, and FREQ_c(q) is the frequency of the term in said corpus, and CO_w(q′|q) indicates the frequency of one term, q′, within a W-term window of all instances of another term, q.
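The formula these definitions belong to is missing from this listing, so the sketch below covers only the counting itself: the corpus frequency of each term and the co-occurrence count of every term q′ around all instances of a term q. The function names are hypothetical. A symmetric window of `window` tokens on each side is an assumption, as is counting a token once for every instance of q it falls near. Under this reading, the window frequency is the same count taken around the target term.

```python
from collections import Counter


def corpus_frequency(tokens):
    """Frequency of every term in the corpus (a list of tokens)."""
    return Counter(tokens)


def cooccurrence(tokens, q, window):
    """Count each term q' found within `window` tokens of any instance of q."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != q:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the instance of q itself
                counts[tokens[j]] += 1
    return counts
```

For example, `cooccurrence("the bank of the river bank".split(), "bank", 2)` counts the neighbours of both occurrences of "bank".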
-
-
9. The method according to claim 1, wherein said inter-term distances measure the similarity of each pair of co-occurring words.
-
10. The method according to claim 1, wherein said dimension reduction step selects the first two eigenvectors of the matrix as the two dimensions.
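The eigenvector selection in claim 10 can be sketched without any external library using shifted power iteration with deflation. This is an illustrative sketch, not the patented implementation. It assumes the matrix is symmetric and that "first two" means the eigenvectors with the two largest eigenvalues; both are assumptions, and all function names are hypothetical.

```python
import math


def mat_vec(a, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def top_two_eigenpairs(a, iterations=1000):
    """Return [(eigenvalue, eigenvector), ...] for the two largest eigenvalues."""
    n = len(a)
    # Shift by a Gershgorin bound so every eigenvalue of b is non-negative.
    shift = max(sum(abs(x) for x in row) for row in a)
    b = [[a[i][j] + (shift if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    pairs = []
    for _pair in range(2):
        v = normalize([float(i + 1) for i in range(n)])
        for _step in range(iterations):
            v = normalize(mat_vec(b, v))
        lam = sum(x * y for x, y in zip(v, mat_vec(b, v)))  # Rayleigh quotient
        pairs.append((lam - shift, v))
        # Deflate: remove the component just found before the next pass.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs


def embed_2d(a):
    """Give each term the pair of its entries in the two eigenvectors."""
    (_, v1), (_, v2) = top_two_eigenpairs(a)
    return list(zip(v1, v2))
```

Each row of `embed_2d(matrix)` is then the two-dimensional position of one term. A production system would more likely call a tested linear-algebra routine; the hand-rolled iteration is used here only to keep the sketch self-contained.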
-
11. The method according to claim 1, wherein said converting step further comprises the steps of computing a distance_w(q) from an origin in said two dimensional representation and a cosine_w(q) with respect to a horizontal axis and, given a cosine_w(q), selecting a radial bin and incrementing the entry for that bin by distance_w(q).
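The binning step in claim 11 can be sketched as follows. The claim states only that the bin is selected from the cosine and incremented by the distance. The bin layout here (equal-width angular bins over 0 to π, obtained from the arccosine) is an assumption, as are the function name, the default bin count, and the choice to skip points lying exactly at the origin.

```python
import math


def radial_histogram(points, n_bins=8):
    """Accumulate each point's distance from the origin into an angular bin."""
    bins = [0.0] * n_bins
    for x, y in points:
        distance = math.hypot(x, y)
        if distance == 0.0:
            continue  # the angle is undefined at the origin
        # Cosine of the angle with the horizontal axis, clamped for safety.
        cosine = max(-1.0, min(1.0, x / distance))
        angle = math.acos(cosine)  # lies in [0, pi]
        index = min(int(angle / math.pi * n_bins), n_bins - 1)
        bins[index] += distance
    return bins
```

Because only the cosine is used, points mirrored across the horizontal axis fall into the same bin in this sketch.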
-
12. A method for measuring the degree of polysemy of a target term in a corpus, said method comprising the steps of:
-
collecting a set of terms from said corpus within a certain window of said target term;
generating a radial distribution of inter-term distances for said set of terms; and
deriving a polysemy index for said target term based on the degree to which the radial distribution deviates from unimodality.
where FREQ_w(q) is the frequency of the term within such certain window, and FREQ_c(q) is the frequency of the term in said corpus, and CO_w(q′|q) indicates the frequency of one term, q′, within a W-term window of all instances of another term, q.
-
-
17. The method according to claim 12, wherein said inter-term distances measure the similarity of each pair of co-occurring words.
-
18. A system for measuring the degree of polysemy of a target term in a corpus, said system comprising:
-
a memory for storing computer readable code; and
a processor operatively coupled to said memory, said processor configured to:
collect a set of terms from said corpus within a certain window of said target term;
compute a matrix of inter-term distances of said set of terms;
reduce the dimension of said matrix of inter-term distances to a two dimensional representation;
convert said two dimensional representation into radial coordinates; and
derive a polysemy index for said target term based on the degree to which the radial distribution deviates from unimodality.
where FREQ_w(q) is the frequency of the term within such certain window, and FREQ_c(q) is the frequency of the term in said corpus, and CO_w(q′|q) indicates the frequency of one term, q′, within a W-term window of all instances of another term, q.
-
-
26. The system according to claim 18, wherein said inter-term distances measure the similarity of each pair of co-occurring words.
-
27. The system according to claim 18, wherein said processor selects the first two eigenvectors of the matrix as the two dimensions during said dimension reduction step.
-
28. The system according to claim 18, wherein said processor is further configured to compute a distance_w(q) from an origin in said two dimensional representation and a cosine_w(q) with respect to a horizontal axis and, given a cosine_w(q), select a radial bin and increment the entry for that bin by distance_w(q) during said converting step.
-
29. A method for retrieving information from a corpus, said method comprising the steps of:
-
deriving a polysemy index for words in said corpus;
identifying words that are ambiguous in said corpus using said polysemy index;
storing a list of words that are associated with each sense of each of said identified ambiguous words; and
classifying returned documents according to one of said senses if a query includes one of said identified ambiguous words.
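The classification step of claim 29 could look like the sketch below, which groups returned documents under the sense whose stored word list they overlap most. The claim does not say how a document is matched to a sense, so the overlap scoring, the whitespace tokenization, the "unclassified" group, and the function name are all illustrative assumptions.

```python
def classify_by_sense(documents, sense_words):
    """Group documents by the sense whose word list they share most terms with.

    `sense_words` maps a sense label to a set of lower-case words.
    """
    groups = {sense: [] for sense in sense_words}
    groups["unclassified"] = []
    for doc in documents:
        tokens = set(doc.lower().split())
        scores = {sense: len(tokens & words)
                  for sense, words in sense_words.items()}
        best = max(scores, key=scores.get)
        # A document sharing no words with any sense stays unclassified.
        groups[best if scores[best] > 0 else "unclassified"].append(doc)
    return groups
```

With word lists for a financial and a riverside sense of "bank", a query containing "bank" would yield one document list per sense rather than a single mixed list.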
-
-
31. A system for retrieving information from a corpus, said system comprising:
-
a memory for storing computer readable code; and
a processor operatively coupled to said memory, said processor configured to:
derive a polysemy index for words in said corpus;
identify words that are ambiguous in said corpus using said polysemy index;
store a list of words that are associated with each sense of each of said identified ambiguous words; and
classify returned documents according to one of said senses if a query includes one of said identified ambiguous words.
-
-
32. A method for selecting canonical contexts for the plurality of senses of words in a corpus, said method comprising the steps of:
-
deriving a polysemy index for one or more words in said corpus;
identifying words that are ambiguous in said corpus using said polysemy index;
collecting a list of words that are associated with each sense of said identified ambiguous words; and
obtaining seed examples using said collected list of words to train a self-organizing sense-disambiguation algorithm.
-
-
33. A system for selecting canonical contexts for the plurality of senses of words in a corpus, said system comprising:
-
a memory for storing computer readable code; and
a processor operatively coupled to said memory, said processor configured to:
derive a polysemy index for one or more words in said corpus;
identify words that are ambiguous in said corpus using said polysemy index;
collect a list of words that are associated with each sense of said identified ambiguous words; and
obtain seed examples using said collected list of words to train a self-organizing sense-disambiguation algorithm.
-
Specification