Method and system for document similarity analysis based on common denominator similarity
First Claim
1. A method for document similarity analysis, the method comprising:
obtaining a document to be archived;
identifying a document category similar to the document to be archived, based on indexing terms and corresponding term frequencies, comprising:
identifying a document category that includes a plurality of indexing terms that are identical to indexing terms identified in the document to be archived;
obtaining a term frequency vector for the identical indexing terms in the document to be archived;
generating a normalized term frequency vector, from the term frequency vector for the document to be archived;
obtaining a term frequency vector for the identical indexing terms in the identified document category;
generating a normalized term frequency vector, from the term frequency vector for the identified document category;
calculating a common denominator similarity based on the normalized term frequency vector for the document to be archived, the normalized term frequency vector for the identified document category, and a common denominator;
making a determination that the document category is similar to the document to be archived based on the common denominator similarity; and
registering the document to be archived in the document category.
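The steps recited above can be sketched in code. This is a minimal illustration, not the patented method: the claim names a "common denominator similarity" and a "common denominator" but does not define either here, so the sketch swaps in a plain overlap measure (the sum of element-wise minima of the two normalized vectors, divided by a caller-supplied denominator). The L1 normalization, the function names, and the sample tokens are all assumptions made for illustration.

```python
from collections import Counter


def shared_terms(doc_tokens, category_tokens):
    """Indexing terms that occur in both the document and the category."""
    return sorted(set(doc_tokens) & set(category_tokens))


def term_frequency_vector(tokens, terms):
    """Raw occurrence counts of each shared term, in the order of `terms`."""
    counts = Counter(tokens)
    return [counts[t] for t in terms]


def normalize(vector):
    """L1-normalize so the entries sum to 1 (assumed normalization)."""
    total = sum(vector)
    if total == 0:
        return [0.0 for _ in vector]
    return [v / total for v in vector]


def common_denominator_similarity(doc_vec, cat_vec, common_denominator):
    """Stand-in measure (assumption): vector overlap over a shared denominator."""
    if common_denominator <= 0:
        raise ValueError("common_denominator must be positive")
    overlap = sum(min(d, c) for d, c in zip(doc_vec, cat_vec))
    return overlap / common_denominator


# Hypothetical sample data: token lists for one document and one category.
doc_tokens = ["invoice", "payment", "invoice", "due"]
category_tokens = ["invoice", "payment", "payment", "tax"]

terms = shared_terms(doc_tokens, category_tokens)
doc_vec = normalize(term_frequency_vector(doc_tokens, terms))
cat_vec = normalize(term_frequency_vector(category_tokens, terms))
similarity = common_denominator_similarity(doc_vec, cat_vec, common_denominator=1.0)
```

With both vectors normalized to sum to 1, a denominator of 1.0 keeps the score between 0 and 1, which makes a fixed similarity threshold easy to apply in the determination step.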
Abstract
A method for document similarity analysis. The method includes obtaining a document to be archived, and identifying a document category similar to the document to be archived. The similar document category is identified by: identifying a document category that includes indexing terms that are identical to indexing terms in the document to be archived; obtaining term frequency vectors for the identical indexing terms in the document to be archived and in the identified document category; generating normalized term frequency vectors from the term frequency vectors; calculating a common denominator similarity based on the normalized term frequency vectors and a common denominator; and determining that the document category is similar to the document to be archived based on the common denominator similarity. The method further includes registering the document to be archived in the document category.
20 Claims
1. A method for document similarity analysis, the method comprising:
obtaining a document to be archived;
identifying a document category similar to the document to be archived, based on indexing terms and corresponding term frequencies, comprising:
identifying a document category that includes a plurality of indexing terms that are identical to indexing terms identified in the document to be archived;
obtaining a term frequency vector for the identical indexing terms in the document to be archived;
generating a normalized term frequency vector, from the term frequency vector for the document to be archived;
obtaining a term frequency vector for the identical indexing terms in the identified document category;
generating a normalized term frequency vector, from the term frequency vector for the identified document category;
calculating a common denominator similarity based on the normalized term frequency vector for the document to be archived, the normalized term frequency vector for the identified document category, and a common denominator;
making a determination that the document category is similar to the document to be archived based on the common denominator similarity; and
registering the document to be archived in the document category.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A non-transitory computer readable medium (CRM) comprising instructions that enable a system for document similarity analysis to:
obtain a document to be archived;
identify a document category similar to the document to be archived, based on index terms and corresponding term frequencies, comprising:
identifying a document category that includes a plurality of indexing terms that are identical to indexing terms identified in the document to be archived;
obtaining a term frequency vector for the identical indexing terms in the document to be archived;
generating a normalized term frequency vector, from the term frequency vector for the document to be archived;
obtaining a term frequency vector for the identical indexing terms in the identified document category;
generating a normalized term frequency vector, from the term frequency vector for the identified document category;
calculating a common denominator similarity based on the normalized term frequency vector for the document to be archived, the normalized term frequency vector for the identified document category, and a common denominator;
making a determination that the document category is similar to the document to be archived based on the common denominator similarity; and
register the document to be archived in the document category.
Dependent claims: 13, 14, 15, 16
17. A system for document similarity analysis, the system comprising:
a document categorization and search engine; and
a document repository;
wherein the document categorization and search engine:
obtains a document to be archived;
identifies, in the document repository, a document category similar to the document to be archived, based on indexing terms and corresponding term frequencies, comprising:
identifying a document category that includes a plurality of indexing terms that are identical to indexing terms identified in the document to be archived;
obtaining a term frequency vector for the identical indexing terms in the document to be archived;
generating a normalized term frequency vector, from the term frequency vector for the document to be archived;
obtaining a term frequency vector for the identical indexing terms in the identified document category;
generating a normalized term frequency vector, from the term frequency vector for the identified document category;
calculating a common denominator similarity based on the normalized term frequency vector for the document to be archived, the normalized term frequency vector for the identified document category, and a common denominator;
making a determination that the document category is similar to the document to be archived based on the common denominator similarity; and
registers the document to be archived in the document category.
Dependent claims: 18, 19, 20
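The two components recited in the system claim, an engine and a repository, might be arranged roughly as follows. This is a hypothetical sketch under stated assumptions: the class names, the in-memory storage, the 0.5 threshold, and the best-score selection rule are all invented for illustration. The similarity function is pluggable, and the Jaccard placeholder used here is not the claimed common denominator similarity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class DocumentRepository:
    """Hypothetical in-memory store: category name -> registered documents."""
    categories: Dict[str, List[List[str]]] = field(default_factory=dict)

    def category_terms(self, name: str) -> List[str]:
        """All tokens of every document already registered in a category."""
        return [t for doc in self.categories[name] for t in doc]

    def register(self, name: str, doc_tokens: List[str]) -> None:
        self.categories.setdefault(name, []).append(doc_tokens)


@dataclass
class CategorizationEngine:
    """Hypothetical engine: scores each category, registers on a match."""
    repository: DocumentRepository
    similarity: Callable[[List[str], List[str]], float]
    threshold: float = 0.5  # assumed cut-off for "similar"

    def archive(self, doc_tokens: List[str]) -> Optional[str]:
        best_name, best_score = None, self.threshold
        for name in self.repository.categories:
            score = self.similarity(doc_tokens, self.repository.category_terms(name))
            if score >= best_score:
                best_name, best_score = name, score
        if best_name is not None:
            self.repository.register(best_name, doc_tokens)
        return best_name


def jaccard(a: List[str], b: List[str]) -> float:
    """Placeholder similarity (assumption), NOT the claimed measure."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


# Hypothetical sample data.
repo = DocumentRepository({
    "finance": [["invoice", "payment", "tax"]],
    "travel": [["flight", "hotel"]],
})
engine = CategorizationEngine(repository=repo, similarity=jaccard)
matched = engine.archive(["invoice", "payment", "due"])
unmatched = engine.archive(["weather"])
```

Keeping the similarity function as a constructor argument means the engine does not depend on any one formula, so a measure built on normalized term frequency vectors could be substituted without changing the repository or the registration logic.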
Specification