Method and system for document similarity analysis based on common denominator similarity

US 10,248,626 B1
Filed: 09/29/2016
Issued: 04/02/2019
Est. Priority Date: 09/29/2016
Status: Active Grant



Abstract

A method for document similarity analysis. The method includes obtaining a document to be archived and identifying a document category similar to the document to be archived. The similar document category is identified by: identifying a document category that includes indexing terms that are identical to indexing terms in the document to be archived; obtaining term frequency vectors for the identical indexing terms in the document to be archived and in the identified document category; generating normalized term frequency vectors from the term frequency vectors; calculating a common denominator similarity based on the normalized term frequency vectors and a common denominator; and determining that the document category is similar to the document to be archived based on the common denominator similarity. The method further includes registering the document to be archived in the document category.
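The similarity calculation summarized in the abstract can be illustrated with a minimal Python sketch. The text here does not define the "common denominator", so the sketch assumes it is a function that reduces a vector to a single number, with the Euclidean length as a hypothetical default. The names `l2_denominator`, `normalize`, and `common_denominator_similarity` are illustrative and do not come from the patent.

```python
import math
from typing import Callable, List, Mapping, Sequence

Denominator = Callable[[Sequence[float]], float]


def l2_denominator(values: Sequence[float]) -> float:
    """Assumed 'common denominator': the Euclidean length of the vector."""
    return math.sqrt(sum(v * v for v in values))


def normalize(values: Sequence[float],
              denominator: Denominator = l2_denominator) -> List[float]:
    """Divide every element by the common denominator of the vector."""
    d = denominator(values)
    if d == 0:
        return [0.0] * len(values)
    return [v / d for v in values]


def common_denominator_similarity(doc_tf: Mapping[str, float],
                                  cat_tf: Mapping[str, float],
                                  denominator: Denominator = l2_denominator) -> float:
    """Compare a document and a category over their identical indexing terms."""
    shared = sorted(set(doc_tf) & set(cat_tf))
    if not shared:
        return 0.0
    doc_vec = normalize([doc_tf[t] for t in shared], denominator)
    cat_vec = normalize([cat_tf[t] for t in shared], denominator)
    # Keep the smaller element of each pair, then apply the denominator
    # to the vector of smaller elements.
    smaller = [min(a, b) for a, b in zip(doc_vec, cat_vec)]
    return denominator(smaller)
```

Under this assumption, identical term frequency profiles score 1 (up to rounding) and documents with no identical indexing terms score 0. Passing `sum` as the denominator gives an alternative reading in which each normalized vector sums to 1.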


20 Claims

1. A method for document similarity analysis, the method comprising:
- obtaining a document to be archived;
- identifying a document category similar to the document to be archived, based on indexing terms and corresponding term frequencies, comprising:
  - identifying a document category that includes a plurality of indexing terms that are identical to indexing terms identified in the document to be archived;
  - obtaining a term frequency vector for the identical indexing terms in the document to be archived;
  - generating a normalized term frequency vector, from the term frequency vector for the document to be archived;
  - obtaining a term frequency vector for the identical indexing terms in the identified document category;
  - generating a normalized term frequency vector, from the term frequency vector for the identified document category;
  - calculating a common denominator similarity based on the normalized term frequency vector for the document to be archived, the normalized term frequency vector for the identified document category, and a common denominator;
  - making a determination that the document category is similar to the document to be archived based on the common denominator similarity; and
- registering the document to be archived in the document category.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The method of claim 1, further comprising:
    - obtaining a target document;
    - obtaining indexing terms for the target document;
    - identifying document categories similar to the target document, based on indexing terms and corresponding term frequencies of the target document and of the document categories;
    - identifying, in the identified document categories, at least one similar document; and
    - returning the at least one similar document.
  - 3. The method of claim 1, wherein the term frequency vector for the document to be archived specifies frequencies of the identical indexing terms in the document to be archived.
  - 4. The method of claim 1, wherein the normalized term frequency vector is obtained from the term frequency vector by applying the common denominator to vector elements of the term frequency vector.
  - 5. The method of claim 1, wherein calculating the common denominator similarity comprises:
    - for each pair of a vector element of the normalized term frequency vector for the document to be archived and a corresponding vector element of the normalized term frequency vector for the identified document category:
      - identifying the smaller vector element; and
    - applying the common denominator to a vector composed of the identified smaller vector elements.
  - 6. The method of claim 1, further comprising:
    - identifying a second document category similar to the document to be archived; and
    - registering the document in the second document category.
  - 7. The method of claim 1, further comprising:
    - making a second determination that the similarity between the document to be archived and the document category is weak, and based on the second determination:
      - generating a new document category and registering the document to be archived in the new document category.
  - 8. The method of claim 7, wherein a common denominator similarity of at least 0.4 indicates that the document category is at least weakly similar to the document to be archived.
  - 9. The method of claim 7, wherein generating a new document category comprises:
    - assigning the indexing terms and the term frequencies for the document to be archived to the new document category.
  - 10. The method of claim 1, further comprising:
    - obtaining a second document to be archived;
    - obtaining indexing terms for the second document to be archived;
    - identifying document categories similar to the second document to be archived, based on indexing terms and corresponding term frequencies for the second document to be archived; and
    - making a second determination that a highly similar document category exists and based on the second determination:
      - registering the document to be archived with the highly similar document category.
  - 11. The method of claim 10, wherein a common denominator similarity between the highly similar document category and the document to be archived is at least 0.7.
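The thresholds in claims 8 and 11 suggest a simple archiving decision, sketched below in Python. The values 0.4 and 0.7 come from the claims. The function names, the return format, and the handling of scores below 0.4 (treated here like a weak match, so a new category is created) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

WEAK_THRESHOLD = 0.4  # claim 8: at least weakly similar
HIGH_THRESHOLD = 0.7  # claim 11: highly similar


def classify_similarity(score: float) -> str:
    """Map a common denominator similarity score to a coarse label."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= WEAK_THRESHOLD:
        return "weak"
    return "none"


def choose_archive_action(scores: Dict[str, float]) -> Tuple[str, List[str]]:
    """Decide where to register a document given per-category scores.

    Returns ("register", [category names]) when at least one category is
    highly similar, otherwise ("new_category", []).
    Assumption: scores below the weak threshold also lead to a new category.
    """
    high = [name for name, score in sorted(scores.items())
            if classify_similarity(score) == "high"]
    if high:
        return ("register", high)
    return ("new_category", [])
```

Returning every highly similar category, rather than only the best one, mirrors claim 6, under which a document may be registered in a second category.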

12. A non-transitory computer readable medium (CRM) comprising instructions that enable a system for document similarity analysis to:
- obtain a document to be archived;
- identify a document category similar to the document to be archived, based on index terms and corresponding term frequencies, comprising:
  - identifying a document category that includes a plurality of indexing terms that are identical to indexing terms identified in the document to be archived;
  - obtaining a term frequency vector for the identical indexing terms in the document to be archived;
  - generating a normalized term frequency vector, from the term frequency vector for the document to be archived;
  - obtaining a term frequency vector for the identical indexing terms in the identified document category;
  - generating a normalized term frequency vector, from the term frequency vector for the identified document category;
  - calculating a common denominator similarity based on the normalized term frequency vector for the document to be archived, the normalized term frequency vector for the identified document category, and a common denominator;
  - making a determination that the document category is similar to the document to be archived based on the common denominator similarity; and
- register the document to be archived in the document category.
- Dependent claims (13, 14, 15, 16):
  - 13. The non-transitory CRM of claim 12, wherein the instructions further enable the system for document similarity analysis to:
    - obtain a target document;
    - obtain indexing terms for the target document;
    - identify document categories similar to the target document, based on indexing terms and corresponding term frequencies of the target document and of the document categories;
    - identify, in the identified document categories, at least one similar document; and
    - return the at least one similar document.
  - 14. The non-transitory CRM of claim 12, wherein the normalized term frequency vector is obtained from the term frequency vector by applying the common denominator to vector elements of the term frequency vector.
  - 15. The non-transitory CRM of claim 12, wherein calculating the common denominator similarity comprises:
    - for each pair of a vector element of the normalized term frequency vector for the document to be archived and a corresponding vector element of the normalized term frequency vector for the identified document category:
      - identifying the smaller vector element; and
    - applying the common denominator to a vector composed of the identified smaller vector elements.
  - 16. The non-transitory CRM of claim 12, wherein the instructions further enable the system for document similarity analysis to:
    - identify a second document category similar to the document to be archived; and
    - register the document in the second document category.
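The first steps of claim 12, finding the identical indexing terms and building a term frequency vector for each side, can be sketched in Python as follows. The claims do not say how indexing terms are extracted, so the regex tokenizer and optional stop-word filter are stand-in assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import FrozenSet, List, Mapping, Tuple


def term_frequencies(text: str,
                     stop_words: FrozenSet[str] = frozenset()) -> Counter:
    """Count candidate indexing terms (assumed: lowercase alphanumeric tokens)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


def shared_term_vectors(doc_tf: Mapping[str, int],
                        cat_tf: Mapping[str, int]
                        ) -> Tuple[List[str], List[int], List[int]]:
    """Return the identical terms and the two aligned term frequency vectors."""
    shared = sorted(set(doc_tf) & set(cat_tf))
    doc_vec = [doc_tf[t] for t in shared]
    cat_vec = [cat_tf[t] for t in shared]
    return shared, doc_vec, cat_vec
```

Sorting the shared terms keeps both vectors in the same element order, which the later pairwise comparison relies on.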

17. A system for document similarity analysis, the system comprising:
- a document categorization and search engine; and
- a document repository;
- wherein the document categorization and search engine:
  - obtains a document to be archived;
  - identifies, in the document repository, a document category similar to the document to be archived, based on indexing terms and corresponding term frequencies, comprising:
    - identifying a document category that includes a plurality of indexing terms that are identical to indexing terms identified in the document to be archived;
    - obtaining a term frequency vector for the identical indexing terms in the document to be archived;
    - generating a normalized term frequency vector, from the term frequency vector for the document to be archived;
    - obtaining a term frequency vector for the identical indexing terms in the identified document category;
    - generating a normalized term frequency vector, from the term frequency vector for the identified document category;
    - calculating a common denominator similarity based on the normalized term frequency vector for the document to be archived, the normalized term frequency vector for the identified document category, and a common denominator;
    - making a determination that the document category is similar to the document to be archived based on the common denominator similarity; and
  - registers the document to be archived in the document category.
- Dependent claims (18, 19, 20):
  - 18. The system of claim 17, wherein the document categorization and search engine further:
    - obtains a target document;
    - obtains indexing terms for the target document;
    - identifies document categories similar to the target document, based on indexing terms and corresponding term frequencies of the target document and of the document categories;
    - identifies, in the identified document categories, at least one similar document; and
    - returns the at least one similar document.
  - 19. The system of claim 17, wherein the normalized term frequency vector is obtained from the term frequency vector by applying the common denominator to vector elements of the term frequency vector.
  - 20. The system of claim 17, wherein calculating the common denominator similarity comprises:
    - for each pair of a vector element of the normalized term frequency vector for the document to be archived and a corresponding vector element of the normalized term frequency vector for the identified document category:
      - identifying the smaller vector element; and
    - applying the common denominator to a vector composed of the identified smaller vector elements.
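The search path of claim 18 (target document, then similar categories, then similar documents within them) might look like the Python sketch below. The `Category` structure, the in-memory dictionary standing in for the document repository, and the reuse of 0.7 as a search threshold are all assumptions. The similarity measure is passed in as a parameter, and `term_overlap` is only a simple stand-in for demonstration, not the common denominator similarity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Mapping

TermFreqs = Mapping[str, int]
Similarity = Callable[[TermFreqs, TermFreqs], float]


@dataclass
class Category:
    """Hypothetical repository entry: category terms plus member documents."""
    term_freqs: Dict[str, int]
    documents: Dict[str, Dict[str, int]] = field(default_factory=dict)


def find_similar_documents(target_tf: TermFreqs,
                           repository: Dict[str, Category],
                           similarity: Similarity,
                           threshold: float = 0.7) -> List[str]:
    """Search only categories similar to the target, then their documents."""
    hits: List[str] = []
    for name in sorted(repository):
        category = repository[name]
        if similarity(target_tf, category.term_freqs) < threshold:
            continue  # skip categories that are not similar enough
        for doc_id in sorted(category.documents):
            if similarity(target_tf, category.documents[doc_id]) >= threshold:
                hits.append(doc_id)
    return hits


def term_overlap(a: TermFreqs, b: TermFreqs) -> float:
    """Stand-in measure: Jaccard overlap of the two term sets."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0
```

Filtering by category first means documents in unrelated categories are never compared against the target, which is the practical benefit of categorizing at archive time.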


Current Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Original Assignee: EMC IP Holding Company LLC (Dell Technologies Inc.)
Inventors: Zhang, Lei; Chen, Chao; Huang, Kunwu; Dai, Hongtao; Liu, Jingjing; Teng, Ying
Primary Examiner(s): Dang, Thanh-Ha
Application Number: US15/279,919
Time in Patent Office: 915 Days
Field of Search

707740
US Class Current
CPC Class Codes

G06F 16/313   Selection or weighting of t...

G06F 16/334   Query execution G06F16/335 ...

G06F 16/35   Clustering; Classification

G06F 16/93   Document management systems
