Word sense disambiguation
First Claim
1. In a collection of n documents and a reference collection, each document containing terms, the reference collection containing at least one meaning associated with a term, the total number of terms occurring at least once in the document collection equal to at least m, a computer-implemented method for determining a meaning for a sense of a subject term, the subject term found in at least one document and associated with at least one meaning, the computer-implemented method comprising:
forming an m by n matrix, where each matrix element (i, j) corresponds to the number of occurrences of term i in document j;
performing singular value decomposition and dimensionality reduction on the matrix to form a latent semantic indexed vector space;
determining at least one cluster of documents within the vector space, each cluster corresponding to a subset of the document collection, each member of the subset having at least one occurrence of a subject term;
discerning an implicit position of a sense of the subject term, each implicit position corresponding to at least one determined cluster;
discerning at least one non-subject term within the vicinity of the implicit position of the sense; and
assigning to the sense having a discerned implicit position, the meaning, associated with the term in the reference collection, that correlates best with the discerned non-subject terms closest to the implicit position of the sense.
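The steps of this claim can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patented implementation: the toy corpus, the `reference` glosses, the choice of k = 2 dimensions, the greedy cosine-threshold clustering, the use of a cluster centroid as the "implicit position", and the word-overlap score standing in for "correlates best" are all hypothetical choices made for the example. The claim itself does not name a clustering algorithm or a correlation measure.

```python
import numpy as np

# Toy document collection (hypothetical); "bank" is the subject term.
docs = [
    "bank money loan deposit",
    "bank loan interest money",
    "bank river water shore",
    "bank river fishing water",
    "money interest deposit loan",
    "river water fishing shore",
]
# Hypothetical reference collection: candidate meanings of the subject term.
reference = {
    "financial institution": "institution that accepts a deposit of money and makes a loan",
    "river edge": "land beside a river or other body of water",
}
subject = "bank"

def unit(x):
    """Scale a vector (or each row of a matrix) to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Step 1: m-by-n matrix of term occurrence counts.
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1

# Step 2: SVD, then keep the k largest singular values (assumed k = 2).
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * s[:k]   # one row per term
doc_vecs = Vt[:k].T * s[:k]    # one row per document

# Step 3: cluster the documents that contain the subject term.
# Greedy rule (an assumption): join the first cluster whose centroid has
# cosine similarity >= threshold, otherwise start a new cluster.
def cluster(indices, vecs, threshold=0.8):
    clusters = []
    for j in indices:
        for c in clusters:
            centroid = unit(vecs[c].mean(axis=0))
            if unit(vecs[j]) @ centroid >= threshold:
                c.append(j)
                break
        else:
            clusters.append([j])
    return clusters

subject_docs = [j for j in range(len(docs)) if A[row[subject], j] > 0]
clusters = cluster(subject_docs, doc_vecs)

# Step 5: non-subject terms closest (by cosine) to an implicit position.
def nearest_terms(position, top=3):
    sims = unit(term_vecs) @ unit(position)
    ranked = [vocab[i] for i in np.argsort(-sims) if vocab[i] != subject]
    return ranked[:top]

# Step 6: pick the reference meaning sharing the most words with those terms.
def assign_meaning(context_terms):
    overlap = {m: len(set(text.split()) & set(context_terms))
               for m, text in reference.items()}
    return max(overlap, key=overlap.get)

senses = []
for c in clusters:
    position = doc_vecs[c].mean(axis=0)   # Step 4: implicit position of the sense
    context = nearest_terms(position)
    senses.append((c, context, assign_meaning(context)))

for c, context, meaning in senses:
    print(c, context, meaning)
```

On this toy corpus the documents containing "bank" split into two clusters (documents 0 and 1, documents 2 and 3), and the overlap score assigns the financial gloss to the first and the river gloss to the second. A real collection would need a larger k, term weighting, and a more careful correlation measure.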
Abstract
Systems and methods for word sense disambiguation, including discerning one or more senses or occurrences, distinguishing between senses or occurrences, and determining a meaning for a sense or occurrence of a subject term. In a collection of documents containing terms and a reference collection containing at least one meaning associated with a term, the method includes forming a vector space representation of terms and documents. In some embodiments, the vector space is a latent semantic index vector space. In some embodiments, occurrences are clustered to discern or distinguish a sense of a term. In preferred embodiments, meaning of a sense or occurrence is assigned based on either correlation with an external reference source, or proximity to a reference source that has been indexed into the space.
4 Claims
2. In a collection of n source documents and a collection of x reference documents, each document containing terms, each reference document containing at least one meaning associated with a term, the total number of terms occurring at least once in the combined collections equal to at least m, a computer-implemented method for determining a meaning for a sense of a subject term, the subject term found in at least one source document and associated with at least one meaning, the computer-implemented method comprising:
forming an m by [n+x] matrix, where each matrix element (i, j) corresponds to the number of occurrences of term i in document j;
performing singular value decomposition and dimensionality reduction on the matrix to form a latent semantic indexed vector space;
determining at least one cluster of documents within the vector space, each cluster corresponding to a subset of the [n+x] documents having at least one occurrence of a subject term;
discerning the implicit position of at least one sense of the subject term corresponding to at least one determined cluster; and
assigning to at least one sense corresponding to at least one discerned implicit position, the meaning of the subject term closest within the vector space to the implicit position of the sense.
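The variant in claim 2, where the reference documents are indexed into the same space as the source documents, might look like the following sketch. The corpus, the two `reference_docs` entries, k = 2, the cosine threshold, and the centroid-as-position choice are assumptions made for illustration. As a simplification, only the source documents are clustered here, whereas the claim draws clusters from all n + x documents.

```python
import numpy as np

# Toy source collection (hypothetical); "bank" is the subject term.
source_docs = [
    "bank money loan deposit",
    "bank loan interest money",
    "bank river water shore",
    "bank river fishing water",
    "money interest deposit loan",
    "river water fishing shore",
]
# Hypothetical reference documents, one per candidate meaning.
reference_docs = {
    "financial institution": "bank money loan interest deposit",
    "river edge": "bank river water shore fishing",
}
subject = "bank"
n, x = len(source_docs), len(reference_docs)
all_docs = source_docs + list(reference_docs.values())

# m-by-(n+x) matrix: source documents first, reference documents last.
vocab = sorted({w for d in all_docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), n + x))
for j, d in enumerate(all_docs):
    for w in d.split():
        A[row[w], j] += 1

# SVD and reduction to k dimensions (assumed k = 2).
k = 2
_, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = Vt[:k].T * s[:k]
# Unit-length rows, so a dot product between rows is a cosine similarity.
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)

# Greedy cosine-threshold clustering of source documents containing the
# subject term (the clustering rule is an assumption, not from the claim).
clusters = []
for j in range(n):
    if A[row[subject], j] == 0:
        continue
    for c in clusters:
        centroid = doc_vecs[c].mean(axis=0)
        if doc_vecs[j] @ centroid / np.linalg.norm(centroid) >= 0.8:
            c.append(j)
            break
    else:
        clusters.append([j])

# Assign each sense the meaning whose reference document lies closest
# (highest cosine) to the cluster centroid, taken as the implicit position.
meanings = list(reference_docs)
ref_vecs = doc_vecs[n:]
assigned = {}
for c in clusters:
    position = doc_vecs[c].mean(axis=0)
    assigned[tuple(c)] = meanings[int(np.argmax(ref_vecs @ position))]

print(assigned)
```

Because the reference documents occupy columns of the same matrix, no separate correlation step is needed: meaning assignment reduces to a nearest-neighbour lookup among the last x document vectors.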
4. In a collection of documents, each document containing a plurality of terms, a computer program product for discerning the presence of at least one sense of a subject term when executed on a computer system, the computer program product comprising:
a computer-readable medium;
a matrix-forming module stored on the medium, the matrix-forming module operative to form an m by n matrix, where each matrix element (i, j) corresponds to the number of occurrences of term i in document j;
a singular value decomposition and dimensionality reduction module stored on the medium and coupled to the matrix forming module, the singular value decomposition and dimensionality reduction module operative to form a latent semantic indexed vector space from the matrix;
a clustering module stored on the medium, the clustering module operative to determine at least one cluster of documents within the vector space, each cluster corresponding to a subset of documents within the vector space containing a subject term; and
a sense position determining module stored on the medium, the sense position module operative to determine an implicit position within the vector space of at least one sense of the subject term, the implicit position corresponding to at least one determined cluster.
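One way to picture the coupled modules of claim 4 is as a chain of small functions, each feeding the next, where the number of implicit positions returned indicates how many senses were discerned. The function names, the toy corpus, k = 2, and the greedy cosine-threshold clustering below are hypothetical; the claim does not prescribe them.

```python
import numpy as np

def form_matrix(docs):
    """Matrix-forming module: term-by-document occurrence counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            A[vocab.index(w), j] += 1
    return vocab, A

def reduce_space(A, k=2):
    """SVD and dimensionality reduction module: k-dimensional document vectors."""
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt[:k].T * s[:k]

def cluster_documents(doc_vecs, members, threshold=0.8):
    """Clustering module: greedy cosine-threshold grouping (assumed algorithm)."""
    clusters = []
    for j in members:
        v = doc_vecs[j] / np.linalg.norm(doc_vecs[j])
        for c in clusters:
            centroid = doc_vecs[c].mean(axis=0)
            if v @ centroid / np.linalg.norm(centroid) >= threshold:
                c.append(j)
                break
        else:
            clusters.append([j])
    return clusters

def sense_positions(doc_vecs, clusters):
    """Sense position module: one implicit position (centroid) per cluster."""
    return [doc_vecs[c].mean(axis=0) for c in clusters]

def discern_senses(docs, subject):
    """Couple the modules: matrix -> reduced space -> clusters -> positions."""
    vocab, A = form_matrix(docs)
    doc_vecs = reduce_space(A)
    members = [j for j in range(len(docs)) if A[vocab.index(subject), j] > 0]
    return sense_positions(doc_vecs, cluster_documents(doc_vecs, members))

# Toy collection (hypothetical): "bank" is used in two ways, "loan" in one.
docs = [
    "bank money loan deposit",
    "bank loan interest money",
    "bank river water shore",
    "bank river fishing water",
    "money interest deposit loan",
    "river water fishing shore",
]
print(len(discern_senses(docs, "bank")), len(discern_senses(docs, "loan")))
```

Keeping each module behind its own function makes it straightforward to swap in a different clustering method or a different way of deriving a position from a cluster without touching the rest of the chain.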
Specification