Identifying similarities within large collections of unstructured data

  • US 6,947,933 B2
  • Filed: 12/17/2003
  • Issued: 09/20/2005
  • Est. Priority Date: 01/23/2003
  • Status: Active Grant
First Claim

1. A method for determining if a first and second document stored in a digital format in a data processing system are similar by comparing sparse representations of the two documents, the method comprising the steps of:

  • breaking the first and second documents into chunks of data of predefined sizes;

  • selecting a subset of all chunks as representative of the data in each document;

  • determining a set of coefficients that represent the selected chunks in each document;

  • combining sets of coefficients for each document into coefficient clusters for each document, a coefficient cluster containing coefficients which are similar according to a predetermined similarity metric; and

  • evaluating a degree of similarity between the two documents by counting coefficient clusters into which chunks from both documents fall.
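
The steps of the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not say how chunks are selected, what the coefficients are, or which similarity metric is used, so every concrete choice below is an assumption made only for illustration. The sketch assumes fixed-size byte chunks, selection of every `step`-th chunk, a four-value byte-range histogram as the coefficient set, and greedy clustering by Euclidean distance within a hypothetical `radius`. All function names and parameters are hypothetical.

```python
from math import dist


def chunks(data: bytes, size: int) -> list[bytes]:
    """Break data into consecutive chunks of a predefined size."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def select(chunk_list: list[bytes], step: int) -> list[bytes]:
    """Keep every step-th chunk as the representative subset (assumed rule)."""
    return chunk_list[::step]


def coefficients(chunk: bytes) -> tuple[float, ...]:
    """Assumed coefficients: share of bytes falling in each of four value ranges."""
    counts = [0, 0, 0, 0]
    for b in chunk:
        counts[b >> 6] += 1  # 0-63, 64-127, 128-191, 192-255
    return tuple(c / len(chunk) for c in counts)


def shared_clusters(doc_a: bytes, doc_b: bytes, size: int = 8,
                    step: int = 2, radius: float = 0.25) -> tuple[int, int]:
    """Return (clusters containing chunks from both documents, total clusters)."""
    labelled = [(coefficients(c), "a") for c in select(chunks(doc_a, size), step)]
    labelled += [(coefficients(c), "b") for c in select(chunks(doc_b, size), step)]

    # Greedy clustering (assumed metric: Euclidean distance to the cluster's
    # first member). Each cluster records which documents contributed to it.
    clusters: list[tuple[tuple[float, ...], set[str]]] = []
    for vec, label in labelled:
        for leader, labels in clusters:
            if dist(leader, vec) <= radius:
                labels.add(label)
                break
        else:
            clusters.append((vec, {label}))

    shared = sum(1 for _, labels in clusters if labels == {"a", "b"})
    return shared, len(clusters)
```

Under these assumptions, comparing a document with itself places every cluster in the shared count, while two documents whose bytes fall in entirely different value ranges share none. A ratio such as shared clusters over total clusters could serve as the degree of similarity, though the claim itself only specifies counting.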
