Approach For Application-Specific Duplicate Detection
Abstract
Techniques are provided for extracting view data from documents, where the view data corresponds to an application-specific view and includes a plurality of components. Component data is identified within the view data, and a view signature is generated for the view data. The view signature includes a component signature for each of the components that make up the view data, and each component signature is generated from the component data corresponding to that component. The generated signatures are used to detect duplicates among the documents.
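The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the component names (`title`, `price`, `description`), the whitespace and case normalization, and the use of SHA-256 as the component signature are all assumptions made for illustration. Each view signature is an ordered tuple of component signatures, and two documents are treated as duplicates when their view signatures are identical.

```python
import hashlib
from typing import Dict, Tuple

# Hypothetical view definition: these component names are illustrative only.
VIEW_COMPONENTS = ("title", "price", "description")


def component_signature(datum: str) -> str:
    """Hash one component datum after light normalization (an assumed choice)."""
    normalized = " ".join(datum.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def view_signature(view_data: Dict[str, str]) -> Tuple[str, ...]:
    """Build a view signature as an ordered tuple of component signatures."""
    return tuple(
        component_signature(view_data.get(name, "")) for name in VIEW_COMPONENTS
    )


def find_duplicates(documents: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Map each duplicate document id to the first document with the same view signature."""
    seen: Dict[Tuple[str, ...], str] = {}
    duplicates: Dict[str, str] = {}
    for doc_id, view_data in documents.items():
        signature = view_signature(view_data)
        if signature in seen:
            duplicates[doc_id] = seen[signature]
        else:
            seen[signature] = doc_id
    return duplicates
```

Keeping the component signatures separate inside the view signature, rather than hashing the whole view at once, leaves room for component-level comparison, which is the direction the second independent claim takes.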
10 Claims
1. A computer-implemented method for detecting duplicate information, the computer-implemented method comprising:
- extracting, from a certain document, first view data of a view, wherein the view includes a plurality of view components;
- identifying within said first view data, a first view component datum for each of the plurality of view components;
- generating, for the first view data, a first view signature that includes a plurality of first view component signatures;
- wherein each first view component signature of said first view signature is generated based on a first view component datum of at least one view component of said plurality of view components;
- making a determination of whether the first view data matches any other view data extracted from a plurality of other documents by comparing the plurality of first view signatures against other view signatures of said plurality of other documents; and
- establishing the certain document as a duplicate based on the determination.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A computer-implemented method for detecting duplicate information, the computer-implemented method comprising:
- extracting, from a certain document, first view data of a view, wherein the view includes a plurality of view components;
- identifying within said first view data, a first view component datum for each of the plurality of view components;
- generating a plurality of first view component signatures, wherein each first view component signature of said plurality of first view signatures is generated based on a first view component datum of at least one view component of said plurality of view components;
- making a determination of whether the first view data matches any other view data extracted from a plurality of other documents;
- wherein making a determination includes; for each view component of said plurality of view components, generating a similarity value reflecting similarity between a respective first view component datum of said certain document and a respective view component datum of another document, wherein generating a similarity value is based on a respective first view component signature of said plurality of first view component signatures; and
- establishing the certain document as a duplicate based on the similarity values generated for each view component of said plurality of view components.
Dependent claims: 10
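The per-component similarity test recited in claim 9 can be sketched as follows. This is an illustrative reading only: the claim does not specify the form of the signature, the similarity measure, or how the similarity values are combined. The sketch assumes a word-token set as the component signature, Jaccard similarity as the similarity value, and a hypothetical per-component threshold that every component must meet.

```python
from typing import Dict, FrozenSet


def component_signature(datum: str) -> FrozenSet[str]:
    """Assumed signature: the set of lowercase word tokens in the datum."""
    return frozenset(datum.lower().split())


def similarity(sig_a: FrozenSet[str], sig_b: FrozenSet[str]) -> float:
    """Jaccard similarity of two component signatures, in the range [0, 1]."""
    if not sig_a and not sig_b:
        return 1.0  # two empty components are treated as identical
    return len(sig_a & sig_b) / len(sig_a | sig_b)


def is_duplicate(
    view_a: Dict[str, str],
    view_b: Dict[str, str],
    thresholds: Dict[str, float],
) -> bool:
    """Return True when every component meets its (hypothetical) threshold."""
    for name, threshold in thresholds.items():
        sig_a = component_signature(view_a.get(name, ""))
        sig_b = component_signature(view_b.get(name, ""))
        if similarity(sig_a, sig_b) < threshold:
            return False
    return True
```

Per-component thresholds let an application require an exact match on one component (for example a price, threshold 1.0) while tolerating small wording differences in another (for example a title, threshold 0.5).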
Specification