Automatic stop word identification and compensation

US 7,720,792 B2
Filed: 02/07/2006
Issued: 05/18/2010
Est. Priority Date: 04/05/2005
Status: Expired due to Fees

Abstract

Disclosed are methods and computer program products for automatically identifying and compensating for stop words in a text processing system. This automatic stop word compensation allows such operations as performing queries on an abstract mathematical space built using all words from all texts, with the ability to compensate for the skew that the inclusion of the stop words may have introduced into the space. Documents are represented by document vectors in the abstract mathematical space. To compensate for stop words, a weight function is applied to a predetermined component of the document vectors associated with frequently occurring word(s) contained in the documents. The weight function may be applied dynamically during query processing. Alternatively, the weight function may be applied statically to all document vectors.
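As a rough illustration of the "automatic identification" idea in the abstract, the sketch below finds the most frequently occurring word in a single document by simple term counting. The function name `most_frequent_word` and the whitespace tokenization are assumptions made for this example; the abstract does not say how a frequent word is mapped onto a component of the document vector, so that step is not shown.

```python
from collections import Counter


def most_frequent_word(text):
    """Return (word, count) for the most frequent token, or None if the text is empty.

    Hypothetical helper: lowercases the text and splits on whitespace,
    which is an assumption of this sketch, not a detail from the patent.
    """
    tokens = text.lower().split()
    if not tokens:
        return None
    # most_common(1) returns a one-element list of (token, count) pairs.
    return Counter(tokens).most_common(1)[0]


if __name__ == "__main__":
    print(most_frequent_word("the cat sat on the mat with the hat"))
```

In ordinary English text the top-ranked token is usually a function word such as "the", which is why the most frequent word can serve as a stand-in for a stop word without a hand-built stop list.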

7 Claims

1. A computer-based method for automatically compensating for stop words contained in documents during a query of the documents, the method comprising:
- (a) generating an abstract mathematical space based on documents included in a collection of documents, wherein each document has a vector representation in the abstract mathematical space;
  
  (b) receiving a user query;
  
  (c) generating a vector representation of the user query in the abstract mathematical space;
  
  (d) computing a similarity between the vector representation of the user query and the vector representation of each document, wherein computing a similarity between the vector representation of the user query and the vector representation of a first document in the collection of documents comprises applying a weighting function that compensates for stop words, wherein the weighting function is applied to only one component of the vector representation of the first document that is associated with the most frequently occurring word contained in the first document, thereby automatically compensating for a stop word contained in the first document, wherein computing the similarity between the vector representation of the user query and the vector representation of the first document comprises computing the following equation:
  
  $\langle d \mid q \rangle = \frac{1}{\lVert d \rVert \, \lVert q \rVert} \left[ w (d_{1} \cdot q_{1}) + \sum_{i = 2}^{k} d_{i} \cdot q_{i} \right],$ wherein d is the vector representation of the first document, q is the vector representation of the user query, ⟨d|q⟩ is the similarity between the vector representation of the user query and the vector representation of the first document, w is the weighting function that compensates for stop words, and d₁ is the one component of the vector representation of the first document that is associated with the most frequently occurring word contained in the first document; and
  
  (e) displaying a result based on the similarity computations.
- Dependent claims (2, 3):
  - 2. The method of claim 1, wherein applying the weighting function of step (d) comprises decreasing the one component of the vector representation of the first document by a factor of two.
  - 3. The method of claim 1, wherein steps (a) and (c) comprise:
    - (a) generating a Latent Semantic Indexing (LSI) space based on documents included in a collection of documents, wherein each document has a multi-dimensional vector representation in the LSI space; and
      
      (c) generating a multi-dimensional vector representation of the user query in the LSI space.
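
The similarity computation recited in claim 1, with the factor-of-two reduction of claim 2, can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `weighted_similarity`, the use of plain Python lists, and the convention that index 0 holds the component d₁ are all assumptions of this example. The norms are taken over the unweighted vectors, which is how the claim equation reads.

```python
import math


def weighted_similarity(d, q, w=0.5):
    """Cosine-style similarity with the first document component scaled by w.

    d, q: equal-length sequences of floats (document and query vectors).
    w:    weight applied only to the d[0] * q[0] term; w=0.5 corresponds to
          "decreasing by a factor of two" (assumed default for this sketch).
    """
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(x * x for x in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # similarity is undefined for a zero vector; return 0 by convention
    first_term = w * (d[0] * q[0])
    remaining = sum(di * qi for di, qi in zip(d[1:], q[1:]))
    return (first_term + remaining) / (norm_d * norm_q)


if __name__ == "__main__":
    doc = [3.0, 4.0]
    query = [3.0, 4.0]
    print(weighted_similarity(doc, query, w=1.0))  # unweighted cosine similarity
    print(weighted_similarity(doc, query))         # first component down-weighted
```

With w = 1 the function reduces to ordinary cosine similarity, so the weight can be read as a single dial on how much the stop-word-dominated component contributes to the ranking.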

4. A tangible computer program product for automatically compensating for stop words contained in documents during a query of the documents, the computer program product comprising:
- a computer usable medium comprising a storage unit, wherein the computer usable medium has computer readable program code embodied therein for causing an application program to execute on an operating system of a computer, the computer readable program code comprising:
  
  a computer readable first program code to enable a processor to generate an abstract mathematical space based on documents in a collection of documents, wherein each document has a vector representation in the abstract mathematical space;
  
  a computer readable second program code to enable a processor to receive a user query;
  
  a computer readable third program code to enable a processor to generate a vector representation of the user query in the abstract mathematical space;
  
  a computer readable fourth program code to enable a processor to compute a similarity between the vector representation of the user query and the vector representation of each document, wherein computing a similarity between the vector representation of the user query and the vector representation of a first document in the collection of documents comprises applying a weighting function that compensates for stop words, wherein the weighting function is applied to only one component of the vector representation of the first document that is associated with the most frequently occurring word contained in the first document, thereby automatically compensating for a stop word contained in the first document, wherein the computer readable fourth program code comprises code to enable a processor to compute the following equation:
  
  $\langle d \mid q \rangle = \frac{1}{\lVert d \rVert \, \lVert q \rVert} \left[ w (d_{1} \cdot q_{1}) + \sum_{i = 2}^{k} d_{i} \cdot q_{i} \right],$ wherein d is the vector representation of the first document, q is the vector representation of the user query, ⟨d|q⟩ is the similarity between the vector representation of the user query and the vector representation of the first document, w is the weighting function that compensates for stop words, and d₁ is the one component of the vector representation of the first document that is associated with the most frequently occurring word contained in the first document; and
  
  a computer readable fifth program code to enable a processor to display a result based on the similarity computations.
- Dependent claims (5, 6):
  - 5. The computer program product of claim 4, wherein the fourth computer readable program code comprises:
    - code to enable a processor to decrease the one component of the vector representation of the first document by a factor of two.
  - 6. The computer program product of claim 4, wherein:
    - the first computer readable program code comprises code to enable a processor to generate a Latent Semantic Indexing (LSI) space based on documents in a collection of documents, wherein each document has a vector representation of the LSI space; and
      
      the third computer readable program code comprises code to enable a processor to generate a multi-dimensional vector representation of the user query in the LSI space.

7. A tangible computer program product for automatically compensating for stop words contained in documents, the computer program product comprising:
- a computer usable medium comprising a storage unit, wherein the computer usable medium has computer readable program code embodied therein for causing an application program to execute on an operating system of a computer, the computer readable program code comprising:
  
  a computer readable first program code to enable a processor to generate an abstract mathematical space based on documents in a collection of documents, wherein each document has a vector representation in the abstract mathematical space; and
  
  a computer readable second program code to enable a processor to apply a weighting function to compensate for stop words, wherein the weighting function is applied to only one component of the vector representation of each document in the collection of documents, wherein the computer readable second program code comprises code to enable a processor to apply the weighting function to a first computed dimension of the vector representation of a first document, as follows
  $d = (w d_{1}, d_{2}, \ldots, d_{k})$ wherein d is the vector representation of the first document, w is the weighting function that compensates for stop words, and d₁ is the first computed dimension of the vector representation of the first document.
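
Claim 7 describes the static variant, in which the weight is written into the stored document vectors once instead of being applied at query time. A hedged sketch of that idea follows; the names `apply_static_weight` and `dot`, the list-of-lists layout, and the default w = 0.5 are assumptions of this example, and every vector is assumed to have at least one component.

```python
def apply_static_weight(doc_vectors, w=0.5):
    """Return new vectors d' = (w*d1, d2, ..., dk) for every document vector.

    The input list is left unmodified. w=0.5 is an assumed default.
    """
    return [[w * vec[0]] + list(vec[1:]) for vec in doc_vectors]


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    docs = [[4.0, 1.0, 2.0], [2.0, 0.0, 6.0]]
    weighted_docs = apply_static_weight(docs)
    query = [1.0, 1.0, 1.0]
    # After static weighting, an ordinary dot product already includes w.
    print([dot(d, query) for d in weighted_docs])
```

Once the vectors are rewritten this way, the bracketed sum in the claim 1 equation becomes an ordinary inner product, so the per-query work no longer needs a special case for the first component.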

Current Assignee: Relativity ODA LLC
Original Assignee: Content Analyst Company, LLC (Relativity ODA LLC)
Inventors: Price, Robert Jenson
Primary Examiner(s): Ali; Mohammad
Assistant Examiner(s): Smith; Brannon W
Application Number: US11/348,303
Publication Number: US 20060224572A1
Time in Patent Office: 1,561 Days
Field of Search: 707/5, 707/101, 707/E17.002
US Class Current: 707/739
CPC Class Codes:
G06F 16/313 Selection or weighting of t...
G06F 16/3335 Syntactic pre-processing, e...
