Document matching engine using asymmetric signature generation
First Claim
1. An automated method of document matching using asymmetrical signature generation, the method comprising:
receiving documents from a document repository;
generating signatures for each of the documents using a first signature generator;
providing the signatures and a document identifier for each of the documents to a signature database;
receiving an input document;
generating signatures for the input document using a second signature generator; and
searching the signature database using the signatures generated for the input document,
wherein the first and second signature generators are configured such that different numbers of signatures are generated for a same document, and
wherein the first and second signature generators each;
receive a document comprising a plurality of characters;
normalize the document to remove non-informative characters from the plurality of characters;
calculate a score for each informative character of the plurality of characters based on an occurrence frequency and distribution in the document;
rank each informative character of the plurality of characters based on the calculated score;
select, from the ranked informative characters, character occurrences; and
generate a signature for each selected character occurrence.
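The per-generator steps recited above (normalize, score, rank, select, generate) can be illustrated with a minimal Python sketch. The claim does not fix any concrete formula, so everything specific here is an assumption made for illustration: `char_signatures`, `top_chars`, and `window` are hypothetical names, "non-informative" is taken to mean non-alphanumeric, the score is occurrence count multiplied by positional spread, and each signature is a truncated SHA-1 hash of the text window starting at a selected occurrence.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    # Assumption: "non-informative" characters are anything that is not a letter or digit.
    return "".join(ch.lower() for ch in text if ch.isalnum())


def char_signatures(text, top_chars=2, window=8):
    """Hypothetical character-based signature generator (illustrative only)."""
    doc = normalize(text)

    # Record every position at which each informative character occurs.
    positions = defaultdict(list)
    for i, ch in enumerate(doc):
        positions[ch].append(i)

    def score(ch):
        # Assumed scoring: occurrence count weighted by how widely the
        # occurrences are spread across the document.
        pos = positions[ch]
        spread = (pos[-1] - pos[0] + 1) / len(doc)
        return len(pos) * spread

    # Rank characters by score (ties broken alphabetically for determinism).
    ranked = sorted(positions, key=lambda ch: (-score(ch), ch))

    # Select the occurrences of the top-ranked characters and hash the
    # window of text that starts at each selected occurrence.
    signatures = set()
    for ch in ranked[:top_chars]:
        for i in positions[ch]:
            chunk = doc[i:i + window]
            if len(chunk) == window:
                digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
                signatures.add(digest[:16])
    return signatures
```

Under these assumptions, calling the same function with different `top_chars` values is one way two generators could yield different numbers of signatures for a same document.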
Abstract
An automated method of matching an input document to a set of documents from a document repository. A signature database is stored, the signature database including a document identifier and signatures generated by a first signature generator for each of the set of documents. The input document is received and signatures are generated for the input document using a second signature generator, and the signature database is searched using the signatures generated for the input document. The first and second signature generators are configured such that different numbers of signatures are generated for a same document. Other embodiments, aspects and features are also disclosed.
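The overall flow in the abstract (index the repository with one generator, query with another, search the signature database) might look like the following Python sketch. It does not implement the generators of the claims; as a stand-in, it hashes sliding windows of the normalized text and keeps only hashes divisible by a sampling factor. `SignatureDatabase`, `make_generator`, `keep_every`, and `window` are hypothetical names, and the choice to make the repository side the denser one is likewise an assumption.

```python
import hashlib
from collections import Counter, defaultdict


def window_hashes(text, window=12):
    # Normalize to lowercase alphanumerics, then hash every sliding window.
    doc = "".join(ch.lower() for ch in text if ch.isalnum())
    for i in range(max(len(doc) - window + 1, 0)):
        yield hashlib.sha1(doc[i:i + window].encode("utf-8")).hexdigest()[:16]


def make_generator(keep_every):
    # Stand-in generator (not the claimed one): keeps only those hashes
    # whose value is divisible by keep_every, so a larger keep_every
    # yields fewer signatures for the same document.
    def generate(text):
        return {h for h in window_hashes(text) if int(h, 16) % keep_every == 0}
    return generate


class SignatureDatabase:
    """Maps each signature to the identifiers of documents containing it."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, doc_id, signatures):
        for sig in signatures:
            self._index[sig].add(doc_id)

    def search(self, signatures):
        # Count how many query signatures each stored document shares.
        hits = Counter()
        for sig in signatures:
            for doc_id in self._index.get(sig, ()):
                hits[doc_id] += 1
        return hits.most_common()


first_generator = make_generator(keep_every=1)   # repository side: many signatures
second_generator = make_generator(keep_every=4)  # input side: fewer signatures
```

In use, each repository document would be passed through `first_generator` and stored with `db.add(doc_id, signatures)`, and an input document would be passed through `second_generator` before calling `db.search(...)`. Because the sparse set in this sketch is a subset of the dense set, every signature generated for an excerpt of a stored document can still be found in the database.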
39 Citations
20 Claims
1. An automated method of document matching using asymmetrical signature generation, the method comprising:
receiving documents from a document repository;
generating signatures for each of the documents using a first signature generator;
providing the signatures and a document identifier for each of the documents to a signature database;
receiving an input document;
generating signatures for the input document using a second signature generator; and
searching the signature database using the signatures generated for the input document,
wherein the first and second signature generators are configured such that different numbers of signatures are generated for a same document, and
wherein the first and second signature generators each;
receive a document comprising a plurality of characters;
normalize the document to remove non-informative characters from the plurality of characters;
calculate a score for each informative character of the plurality of characters based on an occurrence frequency and distribution in the document;
rank each informative character of the plurality of characters based on the calculated score;
select, from the ranked informative characters, character occurrences; and
generate a signature for each selected character occurrence.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. An automated method of matching an input document to a set of documents from a document repository, the method comprising:
storing a signature database including a document identifier and signatures generated by a first signature generator for each of the set of documents;
receiving an input document;
generating signatures for the input document using a second signature generator; and
searching the signature database using the signatures generated for the input document,
wherein the first and second signature generators are configured such that different numbers of signatures are generated for a same document, and
wherein the first and second signature generators each;
receive a document comprising text;
parse the document to generate a token set comprising a plurality of tokens, each token corresponding to the text in the document separated by a predefined character characteristic;
calculate a score for each token in the token set based on a frequency and distribution of the text in the document;
rank each token in the token set based on the calculated score;
select, from the ranked tokens, a subset of ranked tokens; and
generate a signature for each occurrence of the selected tokens.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18
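The token-based variant recited in claim 10 can be sketched in Python along similar lines. The claim leaves the details open, so the following are assumptions for illustration only: the "predefined character characteristic" is taken to be any non-alphanumeric character, the score is occurrence count multiplied by positional spread, and each signature is a truncated SHA-1 hash of the selected token together with the tokens that follow it. `token_signatures`, `top_tokens`, and `context` are hypothetical names.

```python
import hashlib
import re
from collections import defaultdict


def token_signatures(text, top_tokens=3, context=3):
    """Hypothetical token-based signature generator (illustrative only)."""
    # Parse: split on any non-alphanumeric character (assumed delimiter rule).
    tokens = re.findall(r"[a-z0-9]+", text.lower())

    # Record the index of every occurrence of each distinct token.
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        positions[tok].append(i)

    def score(tok):
        # Assumed scoring: occurrence count weighted by positional spread.
        pos = positions[tok]
        spread = (pos[-1] - pos[0] + 1) / len(tokens)
        return len(pos) * spread

    # Rank tokens by score (ties broken alphabetically) and keep a subset.
    ranked = sorted(positions, key=lambda tok: (-score(tok), tok))

    # One signature per occurrence of each selected token, hashed together
    # with up to context - 1 following tokens.
    signatures = []
    for tok in ranked[:top_tokens]:
        for i in positions[tok]:
            phrase = " ".join(tokens[i:i + context])
            digest = hashlib.sha1(phrase.encode("utf-8")).hexdigest()[:16]
            signatures.append((tok, i, digest))
    return signatures
```

Each returned tuple holds the selected token, the index of the occurrence, and the signature, which makes it easy to see that one signature is produced per occurrence rather than per distinct token.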
19. A computer readable storage medium structured to store instructions executable by a processor, the instructions when executed causing a processor to:
store a signature database including a document identifier and signatures generated by a first signature generator for each of the set of documents;
receive an input document;
generate signatures for the input document using a second signature generator; and
search the signature database using the signatures generated for the input document,
wherein the first and second signature generators are configured such that different numbers of signatures are generated for a same document, and
wherein the first and second signature generators are each configured to;
receive a document comprising a plurality of characters;
normalize the document to remove non-informative characters from the plurality of characters;
calculate a score for each informative character of the plurality of characters based on an occurrence frequency and distribution in the document;
rank each informative character of the plurality of characters based on the calculated score;
select, from the ranked informative characters, character occurrences; and
generate a signature for each selected character occurrence.
20. A computer apparatus comprising:
a processor configured to execute computer-readable instructions;
memory configured to store data, including said computer-readable instructions; and
a communications system interconnecting said processor and memory,
wherein said computer-readable instructions are configured to store a signature database including a document identifier and signatures generated by a first signature generator for each of the set of documents, receive an input document, generate signatures for the input document using a second signature generator, search the signature database using the signatures generated for the input document, and
wherein the first and second signature generators are configured such that different numbers of signatures are generated for a same document, and
wherein the first and second signature generators are each configured to;
receive a document comprising text;
parse the document to generate a token set comprising a plurality of tokens, each token corresponding to the text in the document separated by a predefined character characteristic;
calculate a score for each token in the token set based on a frequency and distribution of the text in the document;
rank each token in the token set based on the calculated score;
select, from the ranked tokens, a subset of ranked tokens; and
generate a signature for each occurrence of the selected tokens.
Specification