Method and system for document similarity analysis
First Claim
1. A method for document similarity analysis, the method comprising:
receiving, by a processor, an indication of a reference document, the indication received based on user interaction with a user interface;
generating, by the processor, a reference document content identifier for the reference document, comprising:
identifying frequently occurring terms in reference document content;
encoding each frequently occurring term of the identified frequently occurring terms in a term identifier; and
combining the term identifiers to form the reference document content identifier associated with the reference document, the reference document content identifier associated with the reference document comprising a bit array representation with locations in the bit array representation representing corresponding terms and bits of the bit array representation set at respective locations in the bit array representation, the respective locations corresponding to numbers representing the encoded frequently occurring terms;
obtaining at least one document similarity value by comparing the reference document content identifier to a plurality of archived document content identifiers associated with archived documents stored in a document repository, each of the plurality of archived document content identifiers comprising a representation of frequently occurring terms in content of an associated archived document, wherein the comparing includes determining a degree of difference between reference document content and archived document content of the archived documents based on at least one count of deviating bits between the reference document content identifier and each of the plurality of archived document content identifiers; and
returning, by the processor, a document list for presentation in the user interface, the document list listing a set of documents identified from the archived documents based on the at least one document similarity value.
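The identifier-generation steps recited above can be sketched as follows. This is a minimal illustration and not the patented implementation: the tokenizer, the `top_n` cutoff, the 64-bit array width, and the use of a SHA-256 digest to turn each term into a number are all assumptions made for the example, and every function name is hypothetical. The claim itself fixes none of these choices.

```python
import hashlib
import re
from collections import Counter

NUM_BITS = 64  # assumed width of the bit array; the claim does not specify one


def frequent_terms(text, top_n=5):
    """Return the top_n most frequent lowercase word tokens in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [term for term, _ in Counter(tokens).most_common(top_n)]


def term_identifier(term, num_bits=NUM_BITS):
    """Encode a term as a number in [0, num_bits) (hypothetical hash-based encoding)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bits


def content_identifier(text, top_n=5, num_bits=NUM_BITS):
    """Combine term identifiers into a bit array, held here as a Python int."""
    bits = 0
    for term in frequent_terms(text, top_n):
        # Set the bit whose location corresponds to the term's number.
        bits |= 1 << term_identifier(term, num_bits)
    return bits


doc = "bit array bit array hash bit term term term"
identifier = content_identifier(doc, top_n=3)
print(f"{identifier:064b}")
```

Because two terms may hash to the same location, the number of set bits can be smaller than the number of frequent terms; a wider array reduces such collisions at the cost of a longer identifier.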
Abstract
A method for document similarity analysis. The method includes generating a reference document content identifier for a reference document, including identifying frequently occurring terms in reference document content, encoding each frequently occurring term in a term identifier and combining the term identifiers to form the reference document content identifier associated with the reference document. The method also includes obtaining at least one document similarity value by comparing the reference document content identifier to a set of archived document content identifiers stored in a document repository.
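As a rough illustration of the comparison step, the sketch below treats two content identifiers as integers, counts the bit positions at which they differ (the "count of deviating bits" recited in the claims), and converts that count into a similarity value. The normalization to a score between 0 and 1 and the 16-bit width are assumptions made for the example; the source does not say how the count is mapped to a similarity value.

```python
def deviating_bits(id_a: int, id_b: int) -> int:
    """Count bit positions at which the two identifiers differ (Hamming distance)."""
    return bin(id_a ^ id_b).count("1")


def similarity_value(id_a: int, id_b: int, num_bits: int = 16) -> float:
    """Map the deviating-bit count to [0, 1]; 1.0 means identical identifiers."""
    return 1.0 - deviating_bits(id_a, id_b) / num_bits


reference = 0b0000_1010_0110_0001
archived = 0b0000_1010_0100_0011
print(deviating_bits(reference, archived))    # 2
print(similarity_value(reference, archived))  # 0.875
```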
23 Claims
1. A method for document similarity analysis, recited in full under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A non-transitory computer readable medium (CRM) storing computer-readable instructions for document similarity analysis, the computer-readable instructions executable to:
receive, based on user interaction with a user interface, an indication of a reference document;
generate a reference document content identifier for the reference document, comprising:
identify frequently occurring terms in reference document content;
encode each frequently occurring term of the identified frequently occurring terms in a term identifier;
combine the term identifiers to form the reference document content identifier associated with the reference document, wherein the reference document content identifier comprises a bit array representation with locations in the bit array representation representing corresponding terms and bits of the bit array representation set at respective locations in the bit array representation, the respective locations corresponding to numbers representing the encoded frequently occurring terms; and
obtain at least one document similarity value by comparing the reference document content identifier to a plurality of archived document content identifiers associated with archived documents stored in a document repository, each of the plurality of archived document content identifiers comprising a representation of frequently occurring terms in content of an associated archived document, wherein the comparing includes determining a degree of difference between reference document content and archived document content of the plurality of archived documents based on at least one count of deviating bits between the reference document content identifier and each of the plurality of archived document content identifiers; and
return a document list for presentation in the user interface, the document list listing a set of documents identified from the archived documents based on the at least one document similarity value.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A system for document similarity analysis, comprising:
a computing device comprising a computer processor;
a document content identifier encoding engine, executable on the computer processor to:
receive, based on user interaction with a user interface, an indication of a reference document;
identify frequently occurring terms in reference document content;
encode each frequently occurring term of the identified frequently occurring terms in a term identifier;
combine the term identifiers to form a reference document content identifier associated with the reference document, wherein the reference document content identifier associated with the reference document comprises a bit array representation with locations in the bit array representation representing corresponding terms and bits of the bit array representation set at respective locations in the bit array representation, the respective locations corresponding to numbers representing the encoded frequently occurring terms; and
a document content identifier similarity analysis engine, executable on the computer processor to:
obtain at least one document similarity value by comparing the reference document content identifier to a plurality of archived document content identifiers associated with archived documents stored in a document repository, each of the plurality of archived document content identifiers comprising a representation of frequently occurring terms in content of an associated archived document, wherein the comparing includes determining a degree of difference between reference document content and archived document content of the archived documents based on at least one count of deviating bits between the reference document content identifier and each of the plurality of archived document content identifiers; and
return a document list for presentation in the user interface, the document list listing a set of documents identified from the archived documents based on the at least one document similarity value.
Dependent claims: 18, 19, 20, 21, 22, 23
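The role of the similarity analysis engine can be sketched as a ranking over stored identifiers. The repository contents, the document names, the 8-bit identifiers, and the `max_deviation` threshold below are all hypothetical; the claim only requires that the returned document list be based on the similarity value, and does not prescribe a threshold or an ordering.

```python
from typing import Dict, List, Tuple


def rank_archived(reference_id: int,
                  repository: Dict[str, int],
                  max_deviation: int = 2) -> List[Tuple[str, int]]:
    """Return (document name, deviating-bit count) pairs, most similar first."""
    scored = [
        (name, bin(reference_id ^ archived_id).count("1"))
        for name, archived_id in repository.items()
    ]
    # Keep only documents whose identifiers differ in at most max_deviation bits.
    kept = [pair for pair in scored if pair[1] <= max_deviation]
    # Sort by deviation, then by name so ties are ordered deterministically.
    return sorted(kept, key=lambda pair: (pair[1], pair[0]))


# Hypothetical archive of precomputed 8-bit content identifiers.
repository = {
    "contract_v1.txt": 0b1011_0010,
    "contract_v2.txt": 0b1011_0110,
    "newsletter.txt": 0b0100_1101,
}
document_list = rank_archived(0b1011_0010, repository)
print(document_list)
```

Storing the archived identifiers ahead of time means a query only needs one XOR and one bit count per archived document, which is the practical appeal of comparing compact bit arrays instead of full document text.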
Specification