Determining a document similarity metric
Abstract
To perform multi-pattern searching, a preprocessing engine populates a SUFFIX table, a PREFIX table and a PATTERN table. The SUFFIX table combines data conventionally stored in SHIFT and HASH tables. Pointers in the SUFFIX table refer to corresponding segments in the PREFIX table. Each PREFIX table segment is sorted by a prefix hash. The PATTERN table includes a hash of each full pattern, sorted and grouped into segments, with each segment corresponding to a suffix hash and prefix hash combination. Pointers in the PREFIX table refer to corresponding segments in the PATTERN table. The PREFIX and PATTERN tables can be kept in secondary storage, allowing potentially billions of patterns to be used. After preprocessing, patterns are evaluated against a source file. A document metric is determined to qualitatively describe the similarity between the source file and each pattern file.
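The three-table layout described in the abstract can be illustrated with a minimal in-memory sketch. It is not the patented implementation: it assumes every pattern has the same fixed length `m`, uses a block size `B = 2`, keeps all three tables in Python dictionaries rather than secondary storage, and uses raw suffix blocks as keys where the abstract speaks of suffix hashes. The names `build_tables`, `search`, `h` and `B` are hypothetical.

```python
import hashlib
from bisect import bisect_left

B = 2  # block size used for suffix and prefix blocks (an assumption)

def h(s: str) -> int:
    """Stable 64-bit hash; a stand-in for whatever hash a real system uses."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def build_tables(patterns, m):
    """Build SUFFIX, PREFIX and PATTERN tables for patterns of fixed length m."""
    default_shift = m - B + 1
    suffix = {}  # block -> [shift, pointer to a PREFIX segment or None]
    groups = {}  # (suffix block, prefix hash) -> hashes of full patterns
    for p in patterns:
        assert len(p) == m, "this sketch assumes fixed-length patterns"
        for j in range(m - B + 1):
            entry = suffix.setdefault(p[j:j + B], [default_shift, None])
            # Shift = distance from this block to the end of the pattern.
            entry[0] = min(entry[0], m - B - j)
        groups.setdefault((p[-B:], h(p[:B])), []).append(h(p))
    prefix = {}   # suffix block -> segment of (prefix hash, PATTERN pointer)
    pattern = {}  # (suffix block, prefix hash) -> sorted full-pattern hashes
    for (sfx, pfx_hash), hashes in groups.items():
        pattern[(sfx, pfx_hash)] = sorted(hashes)
        prefix.setdefault(sfx, []).append((pfx_hash, (sfx, pfx_hash)))
        suffix[sfx][1] = sfx  # SUFFIX entry points at its PREFIX segment
    for segment in prefix.values():
        segment.sort()  # each PREFIX segment is sorted by prefix hash
    return suffix, prefix, pattern

def search(text, tables, m):
    """Return (position, pattern) for every pattern occurrence in text."""
    suffix, prefix, pattern = tables
    hits = []
    i = m - 1  # index of the last character of the current window
    while i < len(text):
        shift, ptr = suffix.get(text[i - B + 1:i + 1], (m - B + 1, None))
        if shift > 0:
            i += shift  # no pattern can end here, so skip ahead
            continue
        start = i - m + 1
        segment = prefix[ptr]
        pfx_hash = h(text[start:start + B])
        k = bisect_left(segment, (pfx_hash,))
        if k < len(segment) and segment[k][0] == pfx_hash:
            full = pattern[segment[k][1]]
            candidate = h(text[start:start + m])
            n = bisect_left(full, candidate)
            if n < len(full) and full[n] == candidate:
                hits.append((start, text[start:start + m]))
        i += 1
    return hits
```

With the patterns `abcd`, `bcde` and `xyzw` and `m = 4`, searching the text `zzabcdezzxyzw` yields matches at positions 2, 3 and 9. Only the SUFFIX table is consulted on every step; the PREFIX and PATTERN segments are touched only when the shift is zero, which is what makes it plausible to keep those two tables off the main-memory path.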
12 Claims
1. A method for determining similarity of a source document to a pattern file by a computer, comprising:
creating a plurality of tables storing data associated with a plurality of patterns included in a pattern file;
determining which of the plurality of patterns in the pattern file exists in the source document, by analyzing the source document with reference to the plurality of tables;
determining a coverage metric, a count metric, a clustering metric and a uniqueness metric responsive to determining which of the patterns exist in the source document, the coverage metric indicative of the frequency of patterns in the pattern file appearing in the source document, the count metric indicative of a count of the patterns in the pattern file existing in the source document, the clustering metric indicative of the degree of proximity between the patterns of the pattern file in the source document, the uniqueness metric indicative of the frequency of a pattern in the pattern file appearing in other pattern files; and
determining a document similarity metric for each pattern file based on the coverage metric, the count metric, the clustering metric and the uniqueness metric, the document similarity metric indicative of the degree of similarity between the source document and the pattern file.
Dependent claims: 2, 3, 4.
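Claim 1 names four metrics and a combined document similarity metric but gives no formulas here. The sketch below shows one way the pieces could fit together; every formula in it is an assumption chosen for illustration (coverage as the fraction of distinct patterns found, clustering as the inverse of the mean gap between match positions, uniqueness as the inverse of how many pattern files contain a pattern, and a weighted sum as the combination). The names `Match`, `similarity` and the default weights are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class Match:
    pattern: str   # pattern from the pattern file found in the source document
    position: int  # offset of that occurrence in the source document

def coverage(matches, pattern_file):
    """Assumed: fraction of the pattern file's patterns seen at least once."""
    found = {m.pattern for m in matches} & pattern_file
    return len(found) / len(pattern_file) if pattern_file else 0.0

def count(matches):
    """Assumed: total number of pattern occurrences in the source document."""
    return len(matches)

def clustering(matches):
    """Assumed: closer matches score higher, 1 / (1 + mean gap)."""
    positions = sorted(m.position for m in matches)
    if len(positions) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return 1.0 / (1.0 + mean(gaps))

def uniqueness(matches, other_files):
    """Assumed: patterns shared with many other pattern files score lower."""
    found = {m.pattern for m in matches}
    if not found:
        return 0.0
    return mean([1.0 / (1 + sum(p in f for f in other_files)) for p in found])

def similarity(matches, pattern_file, other_files, weights=(0.4, 0.2, 0.2, 0.2)):
    """Assumed combination: weighted sum, with the raw count squashed into [0, 1)."""
    n = count(matches)
    parts = (
        coverage(matches, pattern_file),
        n / (n + 1),
        clustering(matches),
        uniqueness(matches, other_files),
    )
    return sum(w * x for w, x in zip(weights, parts))
```

For example, with a pattern file `{"abcd", "bcde", "xyzw", "none"}`, matches of `abcd`, `bcde` and `xyzw` at positions 2, 3 and 9, and two other pattern files that both contain `abcd`, the four parts come to 0.75, 0.75, 2/9 and 7/9, giving a combined score of 0.65 under the default weights.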
5. A system for determining similarity of a source document to a pattern file, the system comprising:
a computer readable storage medium storing a plurality of pattern files, each pattern file including a plurality of patterns;
a preprocessing engine adapted to create a plurality of tables storing data associated with a plurality of patterns included in a pattern file;
a pattern analysis engine, coupled to the preprocessing engine, adapted to determine which of the plurality of patterns in the pattern file exists in the source document, by analyzing the source document with reference to the plurality of tables, the pattern analysis engine further adapted to generate match information identifying a pattern of the pattern file existing in the source document and a position of the pattern in the source document; and
a document similarity engine, coupled to the pattern analysis engine, adapted to determine a coverage metric, a count metric, a clustering metric and a uniqueness metric based on the match information, the coverage metric indicative of the frequency of patterns in the pattern file appearing in the source document, the count metric indicative of a count of the patterns in the pattern file existing in the source document, the clustering metric indicative of the degree of proximity between the patterns of the pattern file in the source document, the uniqueness metric indicative of the frequency of a pattern in the pattern file appearing in other pattern files, the document similarity engine further adapted to determine a document similarity metric for each pattern file based on the coverage metric, the count metric, the clustering metric and the uniqueness metric, the document similarity metric indicative of the degree of similarity between the source document and the pattern file.
Dependent claims: 6, 7, 8.
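The coupling of the three engines in claim 5 can be pictured as a small pipeline in which each stage consumes the output of the previous one. The sketch below is only an illustration of that wiring: the class and method names are hypothetical, the preprocessing step builds a plain set per pattern file instead of the hashed tables, the scan is a naive substring probe, and the scoring function passed in (`fraction_found`) is a placeholder rather than the four-metric combination.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MatchInfo:
    pattern: str   # which pattern of the pattern file was found
    position: int  # where it was found in the source document

class PreprocessingEngine:
    def build(self, pattern_files):
        """Create one lookup table per pattern file (simplified to a frozenset)."""
        return {name: frozenset(pats) for name, pats in pattern_files.items()}

class PatternAnalysisEngine:
    def __init__(self, tables):
        self.tables = tables  # output of the preprocessing engine

    def analyze(self, source):
        """Return match information per pattern file via a naive scan."""
        out = {name: [] for name in self.tables}
        for name, pats in self.tables.items():
            for length in sorted({len(p) for p in pats}):
                for i in range(len(source) - length + 1):
                    piece = source[i:i + length]
                    if piece in pats:
                        out[name].append(MatchInfo(piece, i))
        return out

class DocumentSimilarityEngine:
    def __init__(self, analysis, scorer):
        self.analysis = analysis  # coupled pattern analysis engine
        self.scorer = scorer      # callable(matches, patterns) -> float

    def score(self, source):
        """Return one similarity value per pattern file."""
        matches = self.analysis.analyze(source)
        return {name: self.scorer(ms, self.analysis.tables[name])
                for name, ms in matches.items()}

def fraction_found(matches, patterns):
    """Placeholder scorer: share of the file's patterns seen in the source."""
    return len({m.pattern for m in matches}) / len(patterns) if patterns else 0.0

tables = PreprocessingEngine().build({"A": ["abcd", "bcde"], "B": ["xyzw", "nope"]})
engine = DocumentSimilarityEngine(PatternAnalysisEngine(tables), fraction_found)
scores = engine.score("zzabcdezzxyzw")
```

Here `scores` comes out as `{"A": 1.0, "B": 0.5}`: both patterns of file A occur in the source, but only one of the two patterns in file B does. Passing the scorer in as a callable keeps the wiring independent of how the metrics are actually defined.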
9. A computer readable storage medium storing instructions adapted to identify patterns in a source document, the instructions when executed by a processor causing the processor to:
create a plurality of tables storing data associated with a plurality of patterns included in a pattern file;
determine which of the plurality of patterns in the pattern file exists in the source document, by analyzing the source document with reference to the plurality of tables;
generate match information identifying a pattern of the pattern table existing in the source document and a position of the pattern existing in the source document;
determine a coverage metric, a count metric, a clustering metric and a uniqueness metric based on the match information, the coverage metric indicative of the frequency of patterns in the pattern file appearing in the source document, the count metric indicative of a count of the patterns in the pattern file existing in the source document, the clustering metric indicative of the degree of proximity between the patterns of the pattern file in the source document, the uniqueness metric indicative of the frequency of a pattern in the pattern file appearing in other pattern files; and
determine a document similarity metric for each pattern file based on the coverage metric, the count metric, the clustering metric and the uniqueness metric.
Dependent claims: 10, 11, 12.
Specification