Identifying similarities within large collections of unstructured data
First Claim
1. A method for determining if a first and second document stored in a digital format in a data processing system are similar by comparing sparse representations of the two documents, the method comprising the steps of:
breaking the first and second documents into chunks of data of predefined sizes;
selecting a subset of all chunks as representative of the data in each document;
determining a set of coefficients that represent the selected chunks in each document;
combining sets of coefficients for each document into coefficient clusters for each document, a coefficient cluster containing coefficients which are similar according to a predetermined similarity metric; and
evaluating a degree of similarity between the two documents by counting coefficient clusters into which chunks from both documents fall.
13 Assignments
0 Petitions
Abstract
A technique for determining when documents stored in digital format in a data processing system are similar. A method compares a sparse representation of two or more documents by breaking the documents into “chunks” of data of predefined sizes. Selected subsets of the chunks are determined as being representative of data in the documents and coefficients are developed to represent such chunks. Coefficients are then combined into coefficient clusters containing coefficients that are similar according to a predetermined similarity metric. The degree of similarity between documents is then evaluated by counting clusters into which chunks of similar documents fall.
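The steps described above can be illustrated with a minimal sketch. It is not the patented implementation, and every concrete choice in it is an assumption made for illustration: fixed-size byte chunks, selection of every `step`-th chunk, a coarse byte-value histogram standing in for the "coefficients", Euclidean distance as the "predetermined similarity metric", and a simple greedy (leader) clustering over the pooled coefficients of both documents. All function names and parameters are hypothetical.

```python
import math
from typing import List, Set, Tuple

Vector = Tuple[float, ...]


def chunk(data: bytes, size: int) -> List[bytes]:
    """Break data into consecutive chunks of a predefined size.

    A trailing partial chunk is dropped (an assumption of this sketch).
    """
    return [data[i:i + size] for i in range(0, len(data) - size + 1, size)]


def select(chunks: List[bytes], step: int) -> List[bytes]:
    """Pick a representative subset: here, simply every `step`-th chunk."""
    return chunks[::step]


def coefficients(chunk_bytes: bytes, bins: int = 8) -> Vector:
    """Stand-in coefficients: normalized histogram of byte values."""
    counts = [0] * bins
    for b in chunk_bytes:
        counts[b * bins // 256] += 1
    n = len(chunk_bytes)
    return tuple(c / n for c in counts)


def cluster(labelled: List[Tuple[str, Vector]], threshold: float) -> List[Set[str]]:
    """Greedy leader clustering.

    Each vector joins the first cluster whose leader lies within
    `threshold` (Euclidean distance); otherwise it starts a new cluster.
    Returns, per cluster, the set of document labels it contains.
    """
    leaders: List[Vector] = []
    members: List[Set[str]] = []
    for label, vec in labelled:
        for i, leader in enumerate(leaders):
            if math.dist(vec, leader) <= threshold:
                members[i].add(label)
                break
        else:
            leaders.append(vec)
            members.append({label})
    return members


def similarity(doc_a: bytes, doc_b: bytes, size: int = 16,
               step: int = 1, threshold: float = 0.1) -> float:
    """Fraction of clusters that contain chunks from both documents."""
    labelled: List[Tuple[str, Vector]] = []
    for label, doc in (("a", doc_a), ("b", doc_b)):
        for c in select(chunk(doc, size), step):
            labelled.append((label, coefficients(c)))
    clusters = cluster(labelled, threshold)
    if not clusters:
        return 0.0
    shared = sum(1 for labels in clusters if len(labels) == 2)
    return shared / len(clusters)
```

Counting shared clusters and dividing by the total number of clusters is one way to turn the cluster count into a score between 0 and 1; the normalization is a design choice of this sketch and is not stated in the abstract.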
108 Citations
16 Claims
Specification