Method and system for document similarity analysis

US 10,572,544 B1
Filed: 12/14/2015
Issued: 02/25/2020
Est. Priority Date: 12/14/2015
Status: Active Grant

Abstract

A method for document similarity analysis. The method includes generating a reference document content identifier for a reference document, including identifying frequently occurring terms in reference document content, encoding each frequently occurring term in a term identifier and combining the term identifiers to form the reference document content identifier associated with the reference document. The method also includes obtaining at least one document similarity value by comparing the reference document content identifier to a set of archived document content identifiers stored in a document repository.
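
The steps in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the 64-bit identifier width, the SHA-256-based `term_position` encoding, the whitespace tokenization, and the `top_n` cutoff are all assumptions chosen for the example, and every function name is hypothetical.

```python
from collections import Counter
import hashlib

NUM_BITS = 64  # assumed identifier width; the source does not specify one


def term_position(term: str, num_bits: int = NUM_BITS) -> int:
    """Encode a term as a number that doubles as a bit position (assumed: hash mod width)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bits


def content_identifier(text: str, top_n: int = 5, num_bits: int = NUM_BITS) -> int:
    """Set one bit per frequently occurring term and combine the bits into one bit array."""
    counts = Counter(text.lower().split())
    identifier = 0
    for term, _count in counts.most_common(top_n):
        identifier |= 1 << term_position(term, num_bits)
    return identifier


def deviating_bits(a: int, b: int) -> int:
    """Count the positions at which two identifiers differ (Hamming distance)."""
    return bin(a ^ b).count("1")


reference = content_identifier("contract renewal terms for the storage contract")
archived = content_identifier("storage contract renewal notice and contract terms")
print(deviating_bits(reference, archived))
```

A smaller count of deviating bits indicates more overlap in frequent terms. Two different terms can hash to the same position in this sketch, so a real system would need to choose the width and encoding with that in mind.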

23 Claims

1. A method for document similarity analysis, the method comprising:
- receiving, by a processor, an indication of a reference document, the indication received based on user interaction with a user interface;
- generating, by the processor, a reference document content identifier for the reference document, comprising:
  - identifying frequently occurring terms in reference document content;
  - encoding each frequently occurring term of the identified frequently occurring terms in a term identifier; and
  - combining the term identifiers to form the reference document content identifier associated with the reference document, the reference document content identifier associated with the reference document comprising a bit array representation with locations in the bit array representation representing corresponding terms and bits of the bit array representation set at respective locations in the bit array representation, the respective locations corresponding to numbers representing the encoded frequently occurring terms;
- obtaining at least one document similarity value by comparing the reference document content identifier to a plurality of archived document content identifiers associated with archived documents stored in a document repository, each of the plurality of archived document content identifiers comprising a representation of frequently occurring terms in content of an associated archived document, wherein the comparing includes determining a degree of difference between reference document content and archived document content of the archived documents based on at least one count of deviating bits between the reference document content identifier and each of the plurality of archived document content identifiers; and
- returning, by the processor, a document list for presentation in the user interface, the document list listing a set of documents identified from the archived documents based on the at least one document similarity value.
- - 2. The method of claim 1, further comprising:
    - for the at least one document similarity value, identifying a corresponding archived document in the document repository; and
    - obtaining the identified corresponding archived document from the document repository.
  - 3. The method of claim 1, further comprising:
    - storing the reference document and the reference document content identifier in the document repository.
  - 4. The method of claim 1, wherein generating the reference document content identifier further comprises, prior to identifying the frequently occurring terms in the reference document content:
    - tokenizing the reference document content;
    - removing stop words from the reference document content; and
    - stemming the reference document content.
  - 5. The method of claim 1, wherein the plurality of archived document content identifiers are organized in a binary search tree.
  - 6. The method of claim 5, wherein obtaining the at least one document similarity value further comprises:
    - traversing the binary search tree to identify an archived document content identifier with a highest similarity to the reference document content identifier.
  - 7. The method of claim 6, further comprising, after identifying the archived document content identifier with the highest similarity to the reference document content identifier:
    - making a determination that additional document similarity values are required; and
    - based on the determination that additional document similarity values are required, traversing the binary search tree to identify an archived document content identifier with a second highest similarity to the reference document content identifier.
  - 8. The method of claim 1, wherein obtaining the at least one document similarity value by comparing the reference document content identifier to the plurality of archived document content identifiers associated with the archived documents stored in the document repository includes:
    - traversing a binary search tree representation of the plurality of archived document content identifiers.
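
The preprocessing recited in claim 4 (tokenizing, removing stop words, stemming) might look like the sketch below. The stop-word list and the suffix-stripping `stem` function are deliberately tiny stand-ins introduced for illustration; the claims do not name a particular tokenizer, stop-word list, or stemming algorithm.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for"}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little content on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Crude suffix stripping, standing in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem, in the order recited in claim 4."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The archived documents are stored in the repository"))
# → ['archiv', 'document', 'stor', 'repository']
```

Running these steps before counting term frequencies means that variants such as "documents" and "document" contribute to the same term.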

9. A non-transitory computer readable medium (CRM) storing computer-readable instructions for document similarity analysis, the computer-readable instructions executable to:
- receive, based on user interaction with a user interface, an indication of a reference document;
- generate a reference document content identifier for the reference document, comprising:
  - identify frequently occurring terms in reference document content;
  - encode each frequently occurring term of the identified frequently occurring terms in a term identifier;
  - combine the term identifiers to form the reference document content identifier associated with the reference document, wherein the reference document content identifier comprises a bit array representation with locations in the bit array representation representing corresponding terms and bits of the bit array representation set at respective locations in the bit array representation, the respective locations corresponding to numbers representing the encoded frequently occurring terms; and
- obtain at least one document similarity value by comparing the reference document content identifier to a plurality of archived document content identifiers associated with archived documents stored in a document repository, each of the plurality of archived document content identifiers comprising a representation of frequently occurring terms in content of an associated archived document, wherein the comparing includes determining a degree of difference between reference document content and archived document content of the plurality of archived documents based on at least one count of deviating bits between the reference document content identifier and each of the plurality of archived document content identifiers; and
- return a document list for presentation in the user interface, the document list listing a set of documents identified from the archived documents based on the at least one document similarity value.
- - 10. The non-transitory CRM of claim 9, wherein the computer-readable instructions for document similarity analysis are further executable to:
    - for the at least one document similarity value, identify a corresponding archived document in the document repository; and
    - obtain the identified corresponding archived document from the document repository.
  - 11. The non-transitory CRM of claim 9, wherein the computer-readable instructions for document similarity analysis are further executable to:
    - store the reference document and the reference document content identifier in the document repository.
  - 12. The non-transitory CRM of claim 9, wherein the computer-readable instructions comprise instructions executable to:
    - prior to identifying the frequently occurring terms in the reference document content:
      - tokenizing the reference document content;
      - removing stop words from the reference document content; and
      - stemming the reference document content.
  - 13. The non-transitory CRM of claim 9, wherein the plurality of archived document content identifiers are organized in a binary search tree.
  - 14. The non-transitory CRM of claim 13, wherein the computer-readable instructions executable to obtain the at least one document similarity value further comprise instructions executable to:
    - traverse the binary search tree to identify an archived document content identifier with a highest similarity to the reference document content identifier.
  - 15. The non-transitory CRM of claim 14, wherein the computer-readable instructions executable to obtain the at least one document similarity value further comprise instructions executable to:
    - make a determination that additional document similarity values are required; and
    - based on the determination that additional document similarity factors are required, traverse the binary search tree to identify an archived document content identifier with a second highest similarity to the reference document content identifier.
  - 16. The non-transitory CRM of claim 9, wherein obtaining the at least one document similarity value by comparing the reference document content identifier to the plurality of archived document content identifiers associated with the archived documents stored in the document repository includes:
    - traversing a binary search tree representation of the plurality of archived document content identifiers.
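
The binary search tree recited in the dependent claims could be sketched as below. The claims do not say how the tree is keyed or how the traversal is ordered, so this example makes its own assumptions: identifiers are plain integers, nodes are ordered by numeric value, and the traversal visits every node. `Node`, `insert`, `traverse`, and `most_similar` are hypothetical names.

```python
import heapq
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    identifier: int  # archived document content identifier (bit array as an int)
    doc_id: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], identifier: int, doc_id: str) -> Node:
    """Insert an identifier, ordering nodes by numeric value (an assumption)."""
    if root is None:
        return Node(identifier, doc_id)
    if identifier < root.identifier:
        root.left = insert(root.left, identifier, doc_id)
    else:
        root.right = insert(root.right, identifier, doc_id)
    return root


def traverse(root: Optional[Node]) -> Iterator[Node]:
    """In-order traversal of the tree."""
    if root is not None:
        yield from traverse(root.left)
        yield root
        yield from traverse(root.right)


def most_similar(root: Optional[Node], reference: int, k: int = 1) -> List[Tuple[int, str]]:
    """Return the k archived documents with the fewest deviating bits."""
    scored = ((bin(n.identifier ^ reference).count("1"), n.doc_id) for n in traverse(root))
    return heapq.nsmallest(k, scored)


root = None
for ident, name in [(0b1111, "a"), (0b1000, "b"), (0b0111, "c")]:
    root = insert(root, ident, name)
print(most_similar(root, 0b1110, k=2))  # → [(1, 'a'), (2, 'b')]
```

Calling `most_similar` with `k=1` yields the highest-similarity identifier, and raising `k` to 2 adds the second highest. Numeric ordering does not by itself let the search skip subtrees when the metric is a count of deviating bits, which is why this sketch scores every node.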

17. A system for document similarity analysis, comprising:
- a computing device comprising a computer processor;
- a document content identifier encoding engine, executable on the computer processor to:
  - receive, based on user interaction with a user interface, an indication of a reference document;
  - identify frequently occurring terms in reference document content;
  - encode each frequently occurring term of the identified frequently occurring terms in a term identifier;
  - combine the term identifiers to form a reference document content identifier associated with the reference document, wherein the reference document content identifier associated with the reference document comprises a bit array representation with locations in the bit array representation representing corresponding terms and bits of the bit array representation set at respective locations in the bit array representation, the respective locations corresponding to numbers representing the encoded frequently occurring terms; and
- a document content identifier similarity analysis engine, executable on the computer processor to:
  - obtain at least one document similarity value by comparing the reference document content identifier to a plurality of archived document content identifiers associated with archived documents stored in a document repository, each of the plurality of archived document content identifiers comprising a representation of frequently occurring terms in content of an associated archived document, wherein the comparing includes determining a degree of difference between reference document content and archived document content of the archived documents based on at least one count of deviating bits between the reference document content identifier and each of the plurality of archived document content identifiers; and
  - return a document list for presentation in the user interface, the document list listing a set of documents identified from the archived documents based on the at least one document similarity value.
- - 18. The system of claim 17, wherein the document content identifier similarity analysis engine is executable to, for the at least one document similarity value:
    - identify a corresponding archived document in the document repository; and
    - obtain the identified corresponding archived document from the document repository.
  - 19. The system of claim 17, wherein the document content identifier encoding engine is executable to store the reference document and the reference document content identifier in the document repository.
  - 20. The system of claim 17, wherein the document content identifier encoding engine is executable to, prior to identifying the frequently occurring terms in the reference document content:
    - tokenize the reference document content;
    - remove stop words from the reference document content; and
    - stem the reference document content.
  - 21. The system of claim 17, wherein the plurality of archived document content identifiers are organized in a binary search tree.
  - 22. The system of claim 21, wherein the binary search tree is traversed to identify an archived document content identifier with a highest similarity to the reference document content identifier.
  - 23. The system of claim 17, wherein obtaining the at least one document similarity value by comparing the reference document content identifier to the plurality of archived document content identifiers associated with the archived documents stored in the document repository includes:
    - traversing a binary search tree representation of the plurality of archived document content identifiers.

Current Assignee: EMC Corporation (Dell Technologies Inc.)
Original Assignee: Open Text Corporation
Inventors: Zhang, Lei; Chen, Chao; Zhao, Kun; Liu, Jingjing; Teng, Ying
Primary Examiner(s): Jalil, Neveen Abel
Assistant Examiner(s): Conyers, Dawaune A
Application Number: US14/968,421
Time in Patent Office: 1,534 Days

CPC Class Codes
- G06F 16/2246: Trees, e.g. B+trees
- G06F 16/2455: Query execution
- G06F 16/24578: using ranking
- G06F 16/383: using metadata automaticall...
- G06F 16/93: Document management systems
