Deriving document similarity indices

US 8,793,242 B2
Filed: 06/19/2013
Issued: 07/29/2014
Est. Priority Date: 12/16/2010
Status: Active Grant

Abstract

Methods, systems, and computer program products are provided for deriving and updating document similarity indices for a plurality of documents. The number of maintained similarities can be controlled to conserve CPU and storage resources.

20 Claims

1. A computing system comprising:
- at least one processor; and
  
  one or more storage devices having stored computer-executable instructions which, when executed by the at least one processor, implement a method for deriving a document similarity index for a plurality of documents, the method comprising:
  
  an act of accessing a document;
  
  an act of computing a tag index for the document, the tag index including one or more keyword/weight pairs, each keyword/weight pair mapping a keyword to a corresponding weight for the keyword to indicate a significance of the keyword within the document;
  
  an act of identifying a specified number of most significant keywords in the document based on weights in the tag index;
  
  for at least one keyword in the specified number of the most significant keywords, an act of determining the corresponding weight of the at least one keyword in each document in the plurality of documents;
  
  an act of identifying a plurality of candidate documents, from among the plurality of documents, based on the corresponding weights of the specified number of the most significant keywords in the plurality of documents, at least some of the specified number of the most significant keywords in the document also being significant keywords in each of the plurality of candidate documents;
  
  for each candidate document in the plurality of candidate documents, an act of calculating a full similarity between the document and candidate document by determining the weight of additional keywords from the document within the candidate document; and
  
  an act of selecting full similarities for one or more candidate documents for inclusion in the document similarity index to indicate documents that are similar to the document, selection of the full similarities for the one or more candidate documents being based on at least the full similarity calculations.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The computing system of claim 1, wherein the method further includes:
    - for each candidate document included in the prescribed number of candidate documents, an act of providing information from the full similarity between the document and the candidate document for inclusion in the document similarity index.
  - 3. The computing system of claim 1, wherein the method further includes:
    - for each candidate document included in the prescribed number of candidate documents, an act of storing information from the full similarity between the document and the candidate document in the document similarity index.
  - 4. The computing system as recited in claim 1, wherein the act of computing a tag index for the document comprises computing keyword weights based on keyword frequency within the document and document length.
  - 5. The computing system as recited in claim 1, wherein the act of identifying a plurality of candidate documents, from among the plurality of documents, based on the corresponding weights of the specified number of the most significant keywords in the plurality of documents comprises an act of accessing at least one keyword/weight pair from a least recently used (“LRU”) cache.
  - 6. The computing system as recited in claim 1, wherein for each candidate document in the plurality of candidate documents, the act of calculating a full similarity between the document and candidate document comprises an act of using a cosine-similarity function to calculate the similarity between the document and the candidate document.
  - 7. The computing system as recited in claim 1, wherein an act of selecting full similarities includes selecting only a prescribed number of candidate documents for inclusion in the document similarity index in accordance with one of a hard limit or an express threshold, the hard limit or the express threshold limiting the number of candidate documents that can be selected for inclusion in the document similarity index.
  - 8. The computing system as recited in claim 3, wherein for each candidate document included in the prescribed number of candidate documents, the act of storing information from the full similarity in the document similarity index comprises an act of storing a mapping that quantifies the similarity between the document and the candidate document in a similarity value.
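
The steps recited in claims 1, 4, 6, and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes whitespace tokenization, a keyword weight equal to term frequency divided by document length (one possible reading of claim 4), and a hard limit on the number of retained similarities. All function names, parameter names, and default values (`compute_tag_index`, `top_keywords`, `derive_similarities`, `k`, `hard_limit`) are hypothetical and do not appear in the patent.

```python
import math
from collections import Counter


def compute_tag_index(text):
    """Map each keyword to a weight (assumed here: frequency / document length)."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return {word: count / total for word, count in counts.items()}


def top_keywords(tag_index, k):
    """Return the k highest-weighted keywords, ties broken alphabetically."""
    return sorted(tag_index, key=lambda w: (-tag_index[w], w))[:k]


def cosine_similarity(a, b):
    """Cosine similarity between two keyword/weight mappings."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def derive_similarities(doc_id, corpus, k=3, hard_limit=10):
    """Return up to hard_limit (candidate_id, similarity) pairs for doc_id.

    corpus maps each document id to its tag index.
    """
    query = corpus[doc_id]
    significant = set(top_keywords(query, k))
    # Cheap candidate filter: keep documents whose own most significant
    # keywords overlap with those of the query document.
    candidates = [
        other_id
        for other_id, other in corpus.items()
        if other_id != doc_id and significant & set(top_keywords(other, k))
    ]
    # Full similarity uses every keyword, not only the top k.
    scored = [(cosine_similarity(query, corpus[c]), c) for c in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(c, s) for s, c in scored[:hard_limit]]


texts = {"a": "cat cat dog", "b": "cat cat bird", "c": "fish fish water"}
corpus = {doc_id: compute_tag_index(text) for doc_id, text in texts.items()}
similar_to_a = derive_similarities("a", corpus, k=1)
```

In this toy corpus only document `b` shares a most significant keyword with document `a`, so the more expensive full similarity is computed for a single candidate instead of the whole corpus.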

9. A computing system comprising:
- at least one processor; and
  
  one or more storage devices having stored computer-executable instructions which, when executed by the at least one processor, implement a method for updating a document similarity index, wherein the computer system has access to a plurality of documents and the document similarity index, the document similarity index indicating similarities between different documents in the plurality of documents, the method comprising:
  
  an act of accessing a batch of documents;
  
  for each document in the batch of documents, an act of computing a tag index for the document, the tag index including one or more keyword/weight pairs, each keyword/weight pair mapping a keyword to a corresponding weight for the keyword to indicate a significance of the keyword within the document;
  
  for each document in the batch of documents subsequent to computing the tag indices:
  
  an act of identifying a specified number of the most significant keywords in the document based on weights in the tag index;
  
  for each keyword in the specified number of most significant keywords, an act of determining the corresponding weight of the keyword in each document in the plurality of documents and in the batch of documents;
  
  an act of identifying a plurality of candidate documents, from among the plurality of documents and the batch of documents, based on the corresponding weights of the specified number of the most significant keywords in the plurality of documents and in the batch of documents, at least some of the specified number of the most significant keywords in the document also being significant keywords in each of the plurality of candidate documents;
  
  for at least one candidate document identified from within the plurality of documents:
  
  an act of calculating a full similarity between the document and candidate document by determining the weight of additional keywords from the document within the candidate document;
  
  an act of identifying a weakest similarity, from among a specified number of top similarities, for the candidate document from within a document similarity index, the weakest similarity indicating the similarity between the candidate document and another document in the plurality of documents;
  
  an act of determining that the candidate document and the document are more similar than the candidate document and the other document by comparing the calculated full similarity to the identified weakest similarity; and
  
  an act of replacing the weakest similarity with information from the calculated full similarity within the document similarity index to incrementally update the document similarity index, the replacement based on the determination.
- Dependent claims (10, 11, 12):
  - 10. The computing system as recited in claim 9, wherein the method further comprises for at least one other candidate document identified from within the plurality of documents:
    - an act of calculating a full similarity between the document and candidate document by determining the weight of additional keywords from the document within the candidate document;
      
      an act of identifying the weakest similarity, from among a specified number of top similarities, for the candidate document from within the document similarity index, the weakest similarity indicating the similarity between the candidate document and a second other document in the plurality of documents;
      
      an act of determining that the candidate document and the second other document are more similar than the candidate document and the document by comparing the calculated full similarity to the identified weakest similarity; and
      
      an act of retaining the weakest similarity within the document similarity index based on the determination.
  - 11. The computing system as recited in claim 9, wherein for each document in the batch of documents, the act of computing a tag index for the document comprises an act of computing keyword weights based on keyword frequency within the document and document length.
  - 12. The computing system as recited in claim 9, wherein the act of replacing the weakest similarity with information from the calculated full similarity within the document similarity index comprises an act of overwriting a similarity value that quantifies the similarity between the candidate document and the other document with a similarity value that quantifies the similarity between the candidate document and the document.
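
The incremental update recited in claims 9, 10, and 12 can be sketched as follows. This is an illustration only, not the patented implementation. It assumes the similarity index is a dictionary mapping each document id to a dictionary of its top similarities, and it assumes that a candidate holding fewer than `top_n` entries simply gains a new entry (a case the claims do not address). The name `update_with_new_document` and its parameters are hypothetical.

```python
def update_with_new_document(index, candidate_id, new_doc_id,
                             full_similarity, top_n=10):
    """Offer a new similarity to a candidate's top-N list.

    index maps candidate_id -> {other_doc_id: similarity}.
    Returns True if the index was changed, False if it was left as-is.
    """
    entries = index.setdefault(candidate_id, {})
    # Assumption beyond the claims: fill the list while it has room.
    if len(entries) < top_n:
        entries[new_doc_id] = full_similarity
        return True
    # Identify the weakest of the stored top similarities.
    weakest_id = min(entries, key=entries.get)
    if full_similarity > entries[weakest_id]:
        # The new document is more similar: overwrite the weakest entry.
        del entries[weakest_id]
        entries[new_doc_id] = full_similarity
        return True
    # Otherwise the weakest similarity is retained.
    return False


index = {"c": {"x": 0.9, "y": 0.2}}
replaced = update_with_new_document(index, "c", "new", 0.5, top_n=2)
retained = update_with_new_document(index, "c", "weak", 0.1, top_n=2)
```

Because only the weakest stored entry is compared and possibly overwritten, each candidate's list stays at a fixed size and the index does not need to be rebuilt when a batch arrives.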

13. A computing system comprising:
- at least one processor; and
  
  one or more storage devices having stored computer-executable instructions which, when executed by the at least one processor, implement a method for updating a document similarity index, wherein the computer system has access to a plurality of documents and the document similarity index, the document similarity index indicating similarities between different documents in the plurality of documents, the method comprising:
  
  an act of accessing a batch of documents;
  
  for each document in the batch of documents, an act of computing a tag index for the document, the tag index including one or more keyword/weight pairs, each keyword/weight pair mapping a keyword to a corresponding weight for the keyword to indicate a significance of the keyword within the document;
  
  for each document in the batch of documents subsequent to computing the tag indices:
  
  an act of identifying a specified number of the most significant keywords in the document based on weights in the tag index;
  
  for each keyword in the specified number of most significant keywords, an act of determining the corresponding weight of the keyword in each document in the plurality of documents and in the batch of documents;
  
  an act of identifying a plurality of candidate documents, from among the plurality of documents and the batch of documents, based on the corresponding weights of the specified number of the most significant keywords in the plurality of documents and in the batch of documents, at least some of the specified number of the most significant keywords in the document also being significant keywords in each of the plurality of candidate documents; and
  
  for any candidate documents identified from within the batch of documents:
  
  an act of calculating a full similarity between the document and candidate document by determining the weight of additional keywords from the document within the candidate document;
  
  an act of selecting a prescribed number of candidate documents for inclusion in the document similarity index as documents that are similar to the document, selection of the prescribed number of candidate documents based on the full similarity calculations and in accordance with one of a hard limit or an express threshold, the hard limit or the express threshold limiting the number of candidate documents that can be selected for inclusion in the document similarity index; and
  
  an act of providing information from the calculated full similarity between the document and the candidate document to the document similarity index.
- Dependent claims (14, 15, 16, 17, 18, 19, 20):
  - 14. The computing system as recited in claim 13, wherein for each document in the batch of documents, the act of computing a tag index for the document comprises an act of computing keyword weights based on keyword frequency within the document and document length.
  - 15. The computing system as recited in claim 13, wherein the act of replacing the weakest similarity with information from the calculated full similarity within the document similarity index comprises an act of overwriting a similarity value that quantifies the similarity between the candidate document and the other document with a similarity value that quantifies the similarity between the candidate document and the document.
  - 16. The computing system as recited in claim 13, wherein for any candidate documents identified from within the batch of documents, the act of calculating a full similarity between the document and candidate document comprises an act of using a cosine-similarity function to calculate the similarity between the document and the candidate document.
  - 17. The computing system as recited in claim 16, wherein selecting full similarities for a prescribed number of candidate documents for inclusion in the document similarity index comprises an act of selecting full similarities for a prescribed number of candidate documents in accordance with a hard limit that limits the number of similarities that can be selected for inclusion in the document similarity index to ten or less.
  - 18. The computing system of claim 17, wherein the method further includes:
    - for each candidate document included in the prescribed number of candidate documents, an act of storing information from the full similarity between the document and the candidate document for inclusion in the document similarity index.
  - 19. The computing system as recited in claim 18, wherein the act of storing information from the full similarity in the document similarity index comprises an act of storing a mapping that quantifies the similarity between the document and the candidate document in a similarity value.
  - 20. The computing system as recited in claim 19, wherein for at least one candidate document identified from within the plurality of documents, the act of determining that the candidate document and the document are more similar than the candidate document and the other document by comparing the calculated full similarity to the identified weakest similarity comprises an act of comparing a first similarity value quantifying the similarity between the candidate document and the document to a second similarity value quantifying the similarity between the candidate document and the other document.
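
The selection step in claims 13 and 17 limits how many similarities are stored per document. A minimal Python sketch of that step follows, under two assumptions not confirmed by the claims: the "express threshold" is read as a minimum similarity value, and the default hard limit of ten mirrors the upper bound in claim 17. The function name `select_similarities` and its parameters are hypothetical.

```python
def select_similarities(scored, hard_limit=10, threshold=None):
    """Pick which full similarities are kept for one document.

    scored maps candidate_id -> full similarity.
    threshold (assumed meaning): drop candidates scoring below this value.
    hard_limit: keep at most this many candidates; None disables the cap.
    """
    # Strongest similarity first, ties broken by candidate id.
    ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
    if threshold is not None:
        ranked = [(doc, sim) for doc, sim in ranked if sim >= threshold]
    if hard_limit is not None:
        ranked = ranked[:hard_limit]
    return ranked


scored = {"a": 0.9, "b": 0.4, "c": 0.7}
by_limit = select_similarities(scored, hard_limit=2)
by_threshold = select_similarities(scored, hard_limit=None, threshold=0.5)
```

Either control bounds the size of the index, which is how the abstract's goal of conserving CPU and storage resources would be met in this sketch.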

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Gherman, Sorin; Mukerjee, Kunal; Prout, Adam
Primary Examiner(s): Coby, Frantz
Application Number: US13/922,168
Publication Number: US 20130282730A1
Time in Patent Office: 405 Days
Field of Search: 707/673, 707/711, 707/715, 707/769
US Class Current: 707/715
CPC Class Codes:
G06F 16/35   Clustering; Classification
G06F 16/40   of multimedia data, e.g. sl...
G06F 16/41   Indexing; Data structures t...
G06F 16/93   Document management systems
