Automatic stop word identification and compensation
First Claim
1. A computer-based method for automatically compensating for stop words contained in documents during a query of the documents, the method comprising:
(a) generating an abstract mathematical space based on documents included in a collection of documents, wherein each document has a vector representation in the abstract mathematical space;
(b) receiving a user query;
(c) generating a vector representation of the user query in the abstract mathematical space;
(d) computing a similarity between the vector representation of the user query and the vector representation of each document, wherein computing a similarity between the vector representation of the user query and the vector representation of a first document in the collection of documents comprises applying a weighting function that compensates for stop words, wherein the weighting function is applied to only one component of the vector representation of the first document that is associated with the most frequently occurring word contained in the first document, thereby automatically compensating for a stop word contained in the first document, wherein computing the similarity between the vector representation of the user query and the vector representation of the first document comprises computing the following equation;
wherein d is the vector representation of the first document, q is the vector representation of the user query, d|q is the similarity between the vector representation of the user query and the vector representation of the first document, w is the weighting function that compensates for stop words, and d1 is the one component of the vector representation of the first document that is associated with the most frequently occurring word contained in the first document; and
(e) displaying a result based on the similarity computations.
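Steps (a) through (e) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the claimed implementation: the "abstract mathematical space" is replaced by a plain term-count space (one axis per word), the weighting function is taken to be multiplication by a constant `w`, and the similarity is assumed to be cosine-style, since the claim's equation is not reproduced in the text above. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization on lowercased text.
    return text.lower().split()


def to_vector(text, index):
    """Map a text to a term-count vector over the axes in `index`."""
    vec = [0.0] * len(index)
    for word, count in Counter(tokenize(text)).items():
        if word in index:  # words outside the space are ignored
            vec[index[word]] = float(count)
    return vec


def build_space(docs):
    """Step (a): one axis per distinct word; every document gets a vector."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {word: i for i, word in enumerate(vocab)}
    return index, [to_vector(doc, index) for doc in docs]


def similarity(d, q, w):
    """Step (d): assumed cosine-style score with ONE component of d scaled by w.

    The scaled component is the axis of the most frequent word in d.
    """
    top = max(range(len(d)), key=lambda i: d[i])
    scaled = [w * x if i == top else x for i, x in enumerate(d)]
    norm = math.sqrt(sum(x * x for x in scaled)) * math.sqrt(sum(x * x for x in q))
    return sum(a * b for a, b in zip(scaled, q)) / norm if norm else 0.0


def search(docs, query, w=0.1):
    """Steps (b), (c) and (e): embed the query, score every document, rank."""
    index, vectors = build_space(docs)
    q = to_vector(query, index)
    scores = [similarity(d, q, w) for d in vectors]
    return sorted(zip(scores, docs), reverse=True)
```

Calling `search(docs, "the cat")` on a small collection returns `(score, document)` pairs, best match first. With `w` below 1, a document whose only overlap with the query is its most frequent word scores lower than it would unweighted.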
Abstract
Disclosed are methods and computer program products for automatically identifying and compensating for stop words in a text processing system. This automatic stop word compensation allows such operations as performing queries on an abstract mathematical space built using all words from all texts, with the ability to compensate for the skew that the inclusion of the stop words may have introduced into the space. Documents are represented by document vectors in the abstract mathematical space. To compensate for stop words, a weight function is applied to a predetermined component of the document vectors associated with frequently occurring word(s) contained in the documents. The weight function may be applied dynamically during query processing. Alternatively, the weight function may be applied statically to all document vectors.
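The two ways of applying the weight function described in the abstract can be compared directly. The sketch below assumes the weight is a constant multiplier `w` and the score is a plain inner product. Both are simplifying assumptions, and the function names are hypothetical. Under those assumptions, scaling one term while scoring (dynamic) and scaling the stored vector once in advance (static) produce the same score.

```python
def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def score_dynamic(d, q, w, component=0):
    """Dynamic: leave the stored vector untouched, scale one term while scoring."""
    return sum((w if i == component else 1.0) * x * y
               for i, (x, y) in enumerate(zip(d, q)))


def reweight_static(d, w, component=0):
    """Static: return a copy of d with one component scaled, to store instead of d."""
    return [w * x if i == component else x for i, x in enumerate(d)]


doc = [4.0, 0.5, -1.0]    # illustrative document vector
query = [1.0, 2.0, 0.5]   # illustrative query vector
w = 0.25                  # illustrative weight

dynamic = score_dynamic(doc, query, w)
static = dot(reweight_static(doc, w), query)
```

The dynamic form keeps the original vectors available, so `w` can be changed per query. The static form pays the cost once and leaves query-time scoring unchanged.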
7 Claims
4. A tangible computer program product for automatically compensating for stop words contained in documents during a query of the documents, the computer program product comprising:
a computer usable medium comprising a storage unit, wherein the computer usable medium has computer readable program code embodied therein for causing an application program to execute on an operating system of a computer, the computer readable program code comprising:
a computer readable first program code to enable a processor to generate an abstract mathematical space based on documents in a collection of documents, wherein each document has a vector representation in the abstract mathematical space;
a computer readable second program code to enable a processor to receive a user query;
a computer readable third program code to enable a processor to generate a vector representation of the user query in the abstract mathematical space;
a computer readable fourth program code to enable a processor to compute a similarity between the vector representation of the user query and the vector representation of each document, wherein computing a similarity between the vector representation of the user query and the vector representation of a first document in the collection of documents comprises applying a weighting function that compensates for stop words, wherein the weighting function is applied to only one component of the vector representation of the first document that is associated with the most frequently occurring word contained in the first document, thereby automatically compensating for a stop word contained in the first document, wherein the computer readable fourth program code comprises code to enable a processor to compute the following equation;
wherein d is the vector representation of the first document, q is the vector representation of the user query, d|q is the similarity between the vector representation of the user query and the vector representation of the first document, w is the weighting function that compensates for stop words, and d1 is the one component of the vector representation of the first document that is associated with the most frequently occurring word contained in the first document; and
a computer readable fifth program code to enable a processor to display a result based on the similarity computations.
7. A tangible computer program product for automatically compensating for stop words contained in documents, the computer program product comprising:
a computer usable medium comprising a storage unit, wherein the computer usable medium has computer readable program code embodied therein for causing an application program to execute on an operating system of a computer, the computer readable program code comprising:
a computer readable first program code to enable a processor to generate an abstract mathematical space based on documents in a collection of documents, wherein each document has a vector representation in the abstract mathematical space; and
a computer readable second program code to enable a processor to apply a weighting function to compensate for stop words, wherein the weighting function is applied to only one component of the vector representation of each document in the collection of documents, wherein the computer readable second program code comprises code to enable a processor to apply the weighting function to a first computed dimension of the vector representation of a first document, as follows:
d = (wd1, d2, ..., dk)
wherein d is the vector representation of the first document, w is the weighting function that compensates for stop words, and d1 is the first computed dimension of the vector representation of the first document.
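The transformation d = (wd1, d2, ..., dk) in claim 7 can be illustrated at the level of a whole collection. The sketch assumes each document vector is stored as a list whose element 0 is the first computed dimension, and it treats `w` as an arbitrary callable applied to that element. The function name and the sample numbers are hypothetical.

```python
def compensate_collection(vectors, w):
    """Apply w to the first component of every vector; keep d2..dk unchanged.

    Assumes every vector has at least one component.
    """
    return [[w(vec[0])] + list(vec[1:]) for vec in vectors]


# Illustrative vectors with a large first component.
collection = [[8.0, 0.2, -0.3], [6.0, -0.1, 0.7]]

# Assumption: the weighting function halves the first component.
damped = compensate_collection(collection, lambda x: 0.5 * x)
```

Because the function returns new lists, the original collection stays available for comparison or for trying a different weighting function.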
Specification