
Method and system for document similarity analysis

  • US 10,572,544 B1
  • Filed: 12/14/2015
  • Issued: 02/25/2020
  • Est. Priority Date: 12/14/2015
  • Status: Active Grant
First Claim

1. A method for document similarity analysis, the method comprising:

    receiving, by a processor, an indication of a reference document, the indication received based on user interaction with a user interface;

    generating, by the processor, a reference document content identifier for the reference document, comprising:

    identifying frequently occurring terms in reference document content;

    encoding each frequently occurring term of the identified frequently occurring terms in a term identifier; and

    combining the term identifiers to form the reference document content identifier associated with the reference document, the reference document content identifier associated with the reference document comprising a bit array representation with locations in the bit array representation representing corresponding terms and bits of the bit array representation set at respective locations in the bit array representation, the respective locations corresponding to numbers representing the encoded frequently occurring terms;

    obtaining at least one document similarity value by comparing the reference document content identifier to a plurality of archived document content identifiers associated with archived documents stored in a document repository, each of the plurality of archived document content identifiers comprising a representation of frequently occurring terms in content of an associated archived document, wherein the comparing includes determining a degree of difference between reference document content and archived document content of the archived documents based on at least one count of deviating bits between the reference document content identifier and each of the plurality of archived document content identifiers; and

    returning, by the processor, a document list for presentation in the user interface, the document list listing a set of documents identified from the archived documents based on the at least one document similarity value.
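
The steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the claim does not specify a tokenizer, an encoding function, a bit-array width, or a cutoff for "frequently occurring", so the regular-expression tokenizer, the SHA-256-based encoding, `NUM_BITS`, `TOP_K`, and every function name below are assumptions chosen for illustration. The bit array is held as a Python integer, and the "count of deviating bits" is computed as a Hamming distance.

```python
from collections import Counter
import hashlib
import re

NUM_BITS = 256  # assumed identifier width; the claim does not fix a size
TOP_K = 8       # assumed cutoff for "frequently occurring" terms


def frequent_terms(text, k=TOP_K):
    """Return the k most common lowercase word tokens in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [term for term, _ in Counter(words).most_common(k)]


def term_identifier(term, num_bits=NUM_BITS):
    """Encode a term as a number that names one location in the bit array."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bits


def content_identifier(text):
    """Combine the term identifiers into a bit array (stored as an int)."""
    bits = 0
    for term in frequent_terms(text):
        bits |= 1 << term_identifier(term)  # set the bit at the term's location
    return bits


def deviating_bits(a, b):
    """Count the positions at which two identifiers differ."""
    return bin(a ^ b).count("1")


def similar_documents(reference_text, archive, limit=10):
    """Rank archived documents by ascending count of deviating bits.

    `archive` maps a document name to its precomputed content identifier.
    """
    ref_id = content_identifier(reference_text)
    scored = [(deviating_bits(ref_id, doc_id), name)
              for name, doc_id in archive.items()]
    return [name for _, name in sorted(scored)[:limit]]
```

In this sketch an archive would be built once, for example `archive = {name: content_identifier(text) for name, text in docs.items()}` for some hypothetical `docs` mapping, so that each query only needs one identifier computation plus an XOR and bit count per archived document.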

  • 9 Assignments