Method and apparatus for detecting and summarizing document similarity within large document sets
Abstract
A method and apparatus are described for comparing an input or query file to a set of files to detect similarities and for formatting the output comparison data. An input query file that can be segmented into multiple query file substrings is received. A query file substring is selected and used to search a storage area containing multiple ordered file substrings that were taken from previously analyzed files. If the selected query file substring matches any of the multiple ordered file substrings, match data relating to the match between the selected query file substring and the matching ordered file substring is stored in a temporary file. The matching ordered file substring and a second ordered file substring are joined if the matching ordered file substring and the second ordered file substring are in a particular sequence and if the selected query file substring and a second query file substring are in the same particular sequence. If the second ordered file substring and the second query file substring match, a coalesced matching ordered substring and a coalesced query file substring are formed that can be used to format output comparison data.
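The segment, preprocess, and search steps summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation. Everything in it is an assumption made for illustration: "substrings" are taken to be overlapping three-word windows, preprocessing is lowercasing plus whitespace collapsing, the "storage area" is an in-memory sorted list searched with `bisect`, and the names `preprocess`, `segment`, `build_index`, and `search` are hypothetical.

```python
import bisect
import re


def preprocess(text):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def segment(text, size=3):
    """Split text into overlapping word windows (an assumed notion of 'substring')."""
    words = preprocess(text).split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def build_index(files):
    """Build a sorted list of (substring, file_id, position) from stored files."""
    entries = []
    for file_id, text in files.items():
        for pos, sub in enumerate(segment(text)):
            entries.append((sub, file_id, pos))
    entries.sort()  # ordering is what makes binary search possible
    return entries


def search(index, query_text):
    """Return match data as (query_pos, file_id, file_pos) tuples."""
    matches = []
    for qpos, sub in enumerate(segment(query_text)):
        # (sub,) sorts before every (sub, file_id, pos), so this finds the first hit.
        i = bisect.bisect_left(index, (sub,))
        while i < len(index) and index[i][0] == sub:
            _, file_id, fpos = index[i]
            matches.append((qpos, file_id, fpos))
            i += 1
    return matches


stored = {"a": "The quick brown fox jumps", "b": "A lazy dog sleeps"}
index = build_index(stored)
hits = search(index, "quick  BROWN fox jumps high")
```

Here `hits` records which query windows were found in which stored file and at which position; keeping both positions is what later allows adjacent matches to be joined.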
14 Claims
1. A method of comparing a query file to one or more stored files, the method comprising:
receiving a query file having a plurality of query file substrings;
selecting a first query file substring from the plurality of query file substrings;
preprocessing the first query file substring thereby making the substring more suitable for searching in the storage area;
searching a storage area storing a plurality of ordered file substrings for the first query file substring;
storing match data relating to a match between the first query file substring and a first ordered file substring; and
joining the first ordered file substring and a second ordered file substring if the first ordered file substring and the second ordered file substring are in a particular sequence and joining the first query file substring and a second query file substring if the first query file substring and the second query file substring are in the same particular sequence wherein the second ordered file substring and the second query file substring match, thereby forming a third coalesced ordered file substring and a third coalesced query file substring that can be used to format output comparison data.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A system for comparing a query file to one or more stored files, the system comprising:
a file segmenter for creating a plurality of query file substrings from a query file;
a substring preprocessor for preprocessing the first query file substring thereby making the substring more suitable for searching in the storage area;
a storage searcher for searching a storage area storing a plurality of ordered file substrings for a first query file substring;
a data storer for storing match data relating to a match between the first query file substring and a first ordered file substring; and
a substring coalescer for joining the first ordered file substring and a second ordered file substring if the first ordered file substring and the second ordered file substring are in a particular sequence and for joining the first query file substring and a second query file substring if the first query file substring and the second query file substring are in the same particular sequence wherein the second ordered file substring and the second query file substring match, the substring coalescer thereby forming a third coalesced ordered file substring and a third coalesced query file substring that can be used to format output comparison data.
14. A computer readable medium containing programmed instructions for comparing a query file to one or more stored files, the programmed instructions comprising:
a computer code for receiving a query file having a plurality of query file substrings;
a computer code for selecting a first query file substring from the plurality of query file substrings;
a computer code for preprocessing the first query file substring thereby making the substring more suitable for searching in the storage area;
a computer code for searching a storage area storing a plurality of ordered file substrings for the first query file substring;
a computer code for storing match data relating to a match between the first query file substring and a first ordered file substring; and
a computer code for joining the first ordered file substring and a second ordered file substring if the first ordered file substring and the second ordered file substring are in a particular sequence and for joining the first query file substring and a second query file substring if the first query file substring and the second query file substring are in the same particular sequence wherein the second ordered file substring and the second query file substring match, thereby forming a third coalesced ordered file substring and a third coalesced query file substring that can be used to format output comparison data.
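The joining step recited in the independent claims can be sketched as follows. This is a hedged illustration, not the claimed implementation: it assumes match data is a list of hypothetical `(query_pos, file_id, file_pos)` tuples, and it assumes "in a particular sequence" means that both the query position and the stored-file position advance by exactly one.

```python
def coalesce(matches):
    """Join adjacent matches into runs of (file_id, query_start, file_start, length).

    Two matches are joined only when they come from the same stored file and
    are consecutive in both the query and the stored file (assumed meaning of
    'in the same particular sequence').
    """
    # Group by file, then by offset between file and query position, so that
    # matches that can extend one another end up next to each other.
    ordered = sorted(set(matches), key=lambda m: (m[1], m[2] - m[0], m[0]))
    runs = []
    for qpos, file_id, fpos in ordered:
        if runs:
            last = runs[-1]
            extends_query = qpos == last[1] + last[3]
            extends_file = fpos == last[2] + last[3]
            if last[0] == file_id and extends_query and extends_file:
                last[3] += 1  # grow the coalesced run by one substring
                continue
        runs.append([file_id, qpos, fpos, 1])
    return [tuple(run) for run in runs]


runs = coalesce([(0, "a", 1), (1, "a", 2), (5, "a", 9), (2, "b", 0)])
```

Each resulting run describes one coalesced query substring and the coalesced stored substring it corresponds to, which is the kind of record an output formatter could report as a single similar passage rather than many small hits.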
Specification