Automatic stop word identification and compensation

  • US 7,720,792 B2
  • Filed: 02/07/2006
  • Issued: 05/18/2010
  • Est. Priority Date: 04/05/2005
  • Status: Expired due to Fees
First Claim

1. A computer-based method for automatically compensating for stop words contained in documents during a query of the documents, the method comprising:

    (a) generating an abstract mathematical space based on documents included in a collection of documents, wherein each document has a vector representation in the abstract mathematical space;

    (b) receiving a user query;

    (c) generating a vector representation of the user query in the abstract mathematical space;

    (d) computing a similarity between the vector representation of the user query and the vector representation of each document, wherein computing a similarity between the vector representation of the user query and the vector representation of a first document in the collection of documents comprises applying a weighting function that compensates for stop words, wherein the weighting function is applied to only one component of the vector representation of the first document that is associated with the most frequently occurring word contained in the first document, thereby automatically compensating for a stop word contained in the first document, wherein computing the similarity between the vector representation of the user query and the vector representation of the first document comprises computing the following equation:



    $$\langle d \mid q \rangle = \frac{1}{\lVert d \rVert \, \lVert q \rVert} \left[ w(d_1 \cdot q_1) + \sum_{i=2}^{k} d_i \cdot q_i \right],$$
    wherein d is the vector representation of the first document, q is the vector representation of the user query, $\langle d \mid q \rangle$ is the similarity between the vector representation of the user query and the vector representation of the first document, w is the weighting function that compensates for stop words, and $d_1$ is the one component of the vector representation of the first document that is associated with the most frequently occurring word contained in the first document; and

    (e) displaying a result based on the similarity computations.
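
The similarity computation of step (d) and the ranking implied by step (e) might be sketched as follows. This is an illustrative reading, not the patented implementation. The claim does not specify the form of the weighting function w, so `stop_word_weight` and its `damping` parameter are hypothetical. The sketch also assumes that each vector is ordered so that index 0 holds the component associated with the document's most frequently occurring word, and that the normalizing factor is the product of the Euclidean norms of d and q.

```python
import math


def stop_word_weight(x, damping=0.5):
    """Hypothetical weighting function w: scales its argument by a constant.

    The claim only says that w compensates for stop words; this form is an assumption.
    """
    return damping * x


def compensated_similarity(d, q, w=stop_word_weight):
    """Return (1 / (|d| |q|)) * [w(d_1 * q_1) + sum over i = 2..k of d_i * q_i].

    Assumes d[0] is the component associated with the most frequently
    occurring word in the document (an ordering chosen for this sketch).
    """
    if len(d) != len(q):
        raise ValueError("document and query vectors must have the same length")
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(x * x for x in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # no similarity is defined for a zero vector; return 0 by convention
    head = w(d[0] * q[0])  # w is applied to exactly one component product
    tail = sum(di * qi for di, qi in zip(d[1:], q[1:]))
    return (head + tail) / (norm_d * norm_q)


def rank_documents(docs, q, w=stop_word_weight):
    """Return document indices ordered from most to least similar to the query."""
    scores = [compensated_similarity(d, q, w) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Passing the identity function as `w` reduces the expression to ordinary cosine similarity, which makes the effect of the weighting easy to compare, for example `compensated_similarity(d, q, w=lambda x: x)`.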
