Document similarity scoring and ranking method, device and computer program product

US 7,689,559 B2
Filed: 02/08/2006
Issued: 03/30/2010
Est. Priority Date: 02/08/2006
Status: Active Grant

Abstract

A device, computer program product and a method for searching, navigating or retrieving documents in a set of electronic documents, including performing a link analysis of the set of electronic documents. The link analysis includes one of analyzing at least two of the set of documents with at least a portion of a similarity graph constructed among the set of documents and analyzing the at least two of the set of documents with the at least a portion of the similarity graph and at least a portion of a hyperlink graph constructed from hyperlinks between the set of documents. Also described is a method for building a similarity matrix.
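
The abstract does not say which link-analysis computation is run over the graphs, so the following is only a minimal sketch under stated assumptions: it scores documents by an eigenvector-style centrality (power iteration) over the similarity graph, optionally adding hyperlink edges with a fixed weight. The function name `link_analysis_scores`, the `link_weight` parameter, the additive way the two graphs are combined, and the choice of centrality measure are all illustrative assumptions, not taken from the patent.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Pair = Tuple[str, str]


def link_analysis_scores(
    doc_ids: List[str],
    similarity: Dict[Pair, float],
    hyperlinks: Optional[Iterable[Pair]] = None,
    link_weight: float = 1.0,   # hypothetical weight for one hyperlink
    iterations: int = 200,
) -> Dict[str, float]:
    """Eigenvector-style centrality over a similarity graph (assumed method).

    `similarity` maps an unordered document pair to its score; pairs that
    are absent count as zero. `hyperlinks` is an optional list of
    (source, target) pairs. Every ID used must appear in `doc_ids`.
    """
    if not doc_ids:
        return {}

    # incoming[d][n] = weight with which document n contributes to d.
    incoming: Dict[str, Dict[str, float]] = {d: {} for d in doc_ids}
    for (a, b), s in similarity.items():
        # Similarity is symmetric, so it contributes in both directions.
        incoming[a][b] = incoming[a].get(b, 0.0) + s
        incoming[b][a] = incoming[b].get(a, 0.0) + s
    for src, dst in hyperlinks or []:
        # A hyperlink is directed: the target gains score from the source.
        incoming[dst][src] = incoming[dst].get(src, 0.0) + link_weight

    scores = {d: 1.0 / len(doc_ids) for d in doc_ids}
    for _ in range(iterations):
        # The scores[d] self-term shifts the matrix by the identity, which
        # keeps the same eigenvectors but avoids oscillation on graphs
        # such as simple chains.
        updated = {
            d: scores[d] + sum(w * scores[n] for n, w in incoming[d].items())
            for d in doc_ids
        }
        total = sum(updated.values())
        scores = {d: v / total for d, v in updated.items()}
    return scores
```

Called with only a similarity graph, this corresponds to the first alternative in the abstract; passing `hyperlinks` as well corresponds to the second. Other link-analysis methods could be substituted without changing the inputs.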

12 Claims

1. A computer-based method of searching, navigating or retrieving, from a set of electronic documents, comprising:
- electronically constructing and storing a word corpus from said set of electronic documents, each document in said set of electronic documents having a corresponding document ID;
  
  electronically constructing an inverted index, based on the electronic corpus and set of electronic documents;
  
  for each word in the inverted index, obtaining a plurality of document similarity scores by;
  
  - sorting the document IDs of said documents according to a word similarity score to form a sorted set of document IDs, wherein said word appears in each document represented in said sorted set, and wherein said sorted set is an index-word document list, said documents being sorted into decreasing order of similarity,
  - calculating a document similarity score between pairs of documents identified in said index-word document list,
  - entering the calculated document similarity scores into a matrix of similarity scores wherein each similarity score represents a degree of similarity between a pair of documents, said matrix being a similarity graph (S),
  - in said matrix, treating a degree of similarity between each pair of documents for which a similarity score has not been calculated as being a zero value;
  
  using said similarity graph (S) when performing a similarity analysis of said documents for at least one of;
  
  - searching said set of electronic documents based on said similarity analysis,
  - navigating said set of electronic documents based on said similarity analysis, and
  - retrieving from set of documents based on said similarity analysis,
  
  wherein the obtaining a plurality of document similarity scores further includes at least one of;
  
  - the step of sorting the document IDs of said documents further includes truncating said sorted list by removing documents whose similarity is less than a threshold τ_word, and
  - the step of calculating a document similarity score further includes calculating the document similarity score between pairs of documents identified in said index-word document list until a first occurrence of a similarity score lower than a threshold τ_set.
- - 2. The method of claim 1, wherein the inverted index identifies a set of documents in which a particular word appears.
  - 3. The method of claim 2, further comprising:
    - calculating said word similarity score between said word and each document in said set of electronic documents in which the word appears.
  - 4. The method of claim 1, wherein the following operations are performed on at least two documents in the index-word document list, in order of decreasing word similarity:
    - for each of said at least two documents, setting one of said at least two documents as a first document D;
      
      choosing a second document D2 to be the highest-lying document on the index-word document list which (i) has not been chosen before, and (ii) lies lower on the index-word document list than first document D;
      
      checking, for each second document D2 lying lower on the index-word document list than first document D, whether the similarity S(D,D2) has been previously calculated;
      
      for each second document D2 for which S(D,D2) has not been previously calculated, calculating a similarity S(D,D2) of first document D to second document D2; and
      
      enforcing a similarity threshold τ_SIM by stopping said step of choosing second documents, and by stopping said step of calculating S(D,D2) for any further second document D2, when a similarity between document D and some second document D2 is less than the similarity threshold.
  - 5. The method of claim 4, further comprising:
    - storing each similarity score which is calculated in said similarity matrix.
  - 6. The method of claim 3, wherein the step of calculating said word similarity score comprises setting said word similarity score equal to a normalized word frequency for each document.
  - 7. The method of claim 6, wherein the normalized word frequency for word w in document D_i is given by
  - 8. The method of claim 1, wherein said step of truncating the sorted list of document IDs, comprises predetermining the word threshold τ_word, said step of predetermining including one of;
    - setting the predetermined word threshold τ_word equal to zero; and
      
      setting the predetermined word threshold τ_word equal to a value greater than zero.
  - 9. The method of claim 1, wherein said step of truncating the sorted list of document IDs, comprises:
    - choosing τ_word independently for each word in the index.
  - 10. The method of claim 4, wherein said step of enforcing a similarity threshold τ_SIM, comprises predetermining the similarity threshold τ_SIM, said step of predetermining including one of;
    - setting the predetermined similarity threshold τ_SIM equal to zero; and
      
      setting the predetermined similarity threshold τ_SIM equal to a value greater than zero.
  - 11. The method of claim 4, wherein said step of calculating a similarity of document D to all other documents of said at least two documents with a lower word similarity rank, and stopping said calculation when a calculated similarity is less than a similarity threshold τ_SIM, comprises;
    - choosing τ_SIM independently for each word in the index.
  - 12. A computer readable medium including stored thereon a computer program product containing instructions configured to cause a computing device to execute the method recited in any one of claims 1-3 or 4-11.
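
The procedure recited in claims 1, 4 and 5 can be sketched roughly as follows. This is an illustrative reading rather than the patented implementation, and it rests on several assumptions: the word similarity score is taken to be a word count divided by document length (the exact formula of claim 7 is not reproduced above), the document similarity function is a simple Jaccard overlap of word sets chosen only for the example, and a previously calculated score below the threshold is also treated as a stopping point. The names `word_scores`, `doc_similarity`, `build_similarity_graph` and `lookup` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def word_scores(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Inverted index: word -> {doc ID: word similarity score}."""
    index: Dict[str, Dict[str, float]] = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for word, count in Counter(tokens).items():
            # Assumed normalisation: occurrences divided by document length.
            index[word][doc_id] = count / len(tokens)
    return index


def doc_similarity(a: List[str], b: List[str]) -> float:
    """Assumed document similarity: Jaccard overlap of the word sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def build_similarity_graph(
    docs: Dict[str, List[str]],
    tau_word: float = 0.0,
    tau_sim: float = 0.0,
) -> Dict[Pair, float]:
    """Sparse similarity graph S, keyed by ordered pairs of doc IDs."""
    S: Dict[Pair, float] = {}
    for postings in word_scores(docs).values():
        # Index-word document list: decreasing word similarity score
        # (ties broken by doc ID so the result is deterministic).
        ranked = sorted(postings, key=lambda d: (-postings[d], d))
        # Truncate: drop documents scoring below the word threshold.
        ranked = [d for d in ranked if postings[d] >= tau_word]
        for i, first in enumerate(ranked):
            # Second documents are taken in list order, below `first`.
            for second in ranked[i + 1:]:
                key = (first, second) if first < second else (second, first)
                if key not in S:
                    # Calculate only pairs not calculated before, and
                    # store every calculated score.
                    S[key] = doc_similarity(docs[first], docs[second])
                if S[key] < tau_sim:
                    # First score below the threshold: stop choosing
                    # further second documents for this first document.
                    break
    return S


def lookup(S: Dict[Pair, float], a: str, b: str) -> float:
    """Pairs whose similarity was never calculated count as zero."""
    key = (a, b) if a < b else (b, a)
    return S.get(key, 0.0)
```

Because only documents sharing an index word are ever compared, and each word's list can be cut short by the two thresholds, the stored graph stays sparse; `lookup` supplies the zero value for every pair that was skipped. Setting both thresholds to zero corresponds to the first option in claims 8 and 10.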

Current Assignee: Telenor Asa (Government of Norway)
Original Assignee: Telenor Asa (Government of Norway)
Inventors: Engo-Monsen, Kenth; Canright, Geoffrey
Primary Examiner(s): Vo; Tim T.
Assistant Examiner(s): Morrison; Jay A
Application Number: US11/349,235
Publication Number: US 20070185871A1
Time in Patent Office: 1,511 Days
Field of Search: None
US Class Current: 1/1

CPC Class Codes

G06F 16/3347   using vector based model

G06F 16/951   Indexing; Web crawling tech...

Y10S 707/99937   Sorting
