System, method, and apparatus for pairing a short document to another short document from a plurality of short documents
First Claim
1. A computer-implemented method for pairing a new document to a document from a plurality of documents in a document repository, comprising:
for each of the new document and the plurality of documents in the document repository, generating a vector uniquely associated with a document of the new document and the plurality of documents, wherein:
the vector comprises a number of elements equal to a number of terms of interest; and
for each term of interest, an associated element value of the vector is assigned as zero if the term of interest does not occur in the document and one if the term does occur in the document;
for each document from the plurality of documents, determining a similarity between the vector for the new document and the vector for the document from the plurality of documents comprising calculating a cosine measurement of similarity between the vector for the new document and the vector for the document from the plurality of documents;
if it is determined that the similarity between the vector for the new document and the vector for a document from the plurality of documents is greater than or equal to a threshold value then:
selecting the document from the plurality of documents;
generating a merged document by merging the new document with the document from the plurality of documents in response to the document from the plurality of documents being selected, wherein the merging comprises combining at least a portion of the new document with at least a portion of the selected document into the merged document;
removing the selected document from the document repository and adding the merged document to the document repository; and
generating a new vector for the merged document; and
if it is determined that the similarity is less than the threshold value then adding the new document to the document repository without merging the new document.
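The steps of the claim can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`binary_vector`, `cosine`, `add_or_merge`), the whitespace tokenization, the choice of the first document that meets the threshold, and merging by simple concatenation are all hypothetical choices that the claim leaves open.

```python
import math

def binary_vector(text, terms):
    """One element per term of interest: 1 if the term occurs in the text, else 0."""
    words = set(text.lower().split())  # assumption: lowercase whitespace tokenization
    return [1 if t in words else 0 for t in terms]

def cosine(u, v):
    """Cosine similarity; treated as 0.0 when either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def add_or_merge(new_doc, repository, terms, threshold):
    """Merge new_doc into the first sufficiently similar document, else store it as-is.

    Returns the vector of the document that ends up in the repository.
    """
    new_vec = binary_vector(new_doc, terms)
    for i, doc in enumerate(repository):
        if cosine(new_vec, binary_vector(doc, terms)) >= threshold:
            merged = doc + " " + new_doc         # assumption: merge = concatenation
            del repository[i]                    # remove the selected document
            repository.append(merged)            # add the merged document
            return binary_vector(merged, terms)  # new vector for the merged document
    repository.append(new_doc)                   # below threshold: add without merging
    return new_vec

# Hypothetical terms of interest and repository contents.
terms = ["printer", "jam", "paper", "login", "password"]
repo = ["paper jam in printer", "forgot login password"]
add_or_merge("printer reports paper jam again", repo, terms, threshold=0.8)
```

In this example the new document shares all three of its terms of interest with the first repository entry, so the two are merged and the repository still holds two documents.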
Abstract
A computer-implemented method for pairing a new document to a document from a plurality of documents. Embodiments include, for each of the new document and the plurality of documents, generating a vector of terms of interest uniquely associated with a document of the new document and the plurality of documents. For each term of interest, an associated element value of the vector is assigned as zero if the term of interest does not occur in the document and one otherwise. The method also includes, for each document from the plurality of documents, determining a similarity between the vectors. The method also includes selecting a document from the plurality of documents as related to the new document if the similarity between the vector for the new document and the vector for the document from the plurality of documents is greater than or equal to a threshold value.
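Because every vector element is zero or one, the cosine measurement reduces to a count of shared terms. The notation below is introduced here for illustration and does not appear in the patent text: a and b are the vectors of two documents over n terms of interest, and A and B are the sets of terms of interest that occur in each document.

```latex
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i} \, \sqrt{\sum_{i=1}^{n} b_i}}
  = \frac{\lvert A \cap B \rvert}{\sqrt{\lvert A \rvert \, \lvert B \rvert}},
  \qquad a_i, b_i \in \{0, 1\}
```

For example, two documents that contain 2 and 3 terms of interest respectively and share 2 of them have a similarity of 2/√6 ≈ 0.816, which would meet a threshold of 0.8.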
16 Claims
1. A computer-implemented method for pairing a new document to a document from a plurality of documents in a document repository, comprising:
for each of the new document and the plurality of documents in the document repository, generating a vector uniquely associated with a document of the new document and the plurality of documents, wherein:
the vector comprises a number of elements equal to a number of terms of interest; and
for each term of interest, an associated element value of the vector is assigned as zero if the term of interest does not occur in the document and one if the term does occur in the document;
for each document from the plurality of documents, determining a similarity between the vector for the new document and the vector for the document from the plurality of documents comprising calculating a cosine measurement of similarity between the vector for the new document and the vector for the document from the plurality of documents;
if it is determined that the similarity between the vector for the new document and the vector for a document from the plurality of documents is greater than or equal to a threshold value then:
selecting the document from the plurality of documents;
generating a merged document by merging the new document with the document from the plurality of documents in response to the document from the plurality of documents being selected, wherein the merging comprises combining at least a portion of the new document with at least a portion of the selected document into the merged document;
removing the selected document from the document repository and adding the merged document to the document repository; and
generating a new vector for the merged document; and
if it is determined that the similarity is less than the threshold value then adding the new document to the document repository without merging the new document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 14, 15, 16
10. A non-transitory computer useable storage medium to store a computer readable program, wherein the computer readable program, when executed on a computer, causes the computer to perform operations for merging a new document with a document from a plurality of documents, the operations comprising:
receiving a new document;
for each of the new document and each document of a plurality of documents in a document repository, generating a vector uniquely associated with each of the documents of the plurality of documents and the new document, wherein:
the vector comprises a number of elements equal to a number of terms of interest; and
for each term of interest, an associated element value of the vector is assigned as zero if the term of interest does not occur in the document and one if the term does occur in the document;
for each document from the plurality of documents, determining a similarity between the vector for the new document and the vector for the document from the plurality of documents;
selecting a document from the plurality of documents as related to the new document in response to:
determining that the similarity between the vector for the new document and the vector for the document from the plurality of documents is greater than or equal to a threshold value, wherein determining the similarity comprises calculating a cosine measurement of similarity between the vector for the new document and the vector for the document from the plurality of documents; and
determining that the similarity between the vector for the new document and the vector for the document from the plurality of documents is greater than or equal to the similarity between the vector for the new document and the vector for any other document from the plurality of documents; and
merging the new document with the selected document if a document from the plurality of documents is selected, wherein the merging comprises combining at least a portion of the new document with at least a portion of the selected document into a merged document;
removing the selected document from the document repository and adding the merged document to the document repository; and
generating a new vector for the merged document; and
if it is determined that the similarity is less than the threshold value then adding the new document to the document repository without merging the new document with a document from the plurality of documents.
Dependent claims: 11
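Claim 10 adds a second selection condition: the selected document must both meet the threshold and be at least as similar to the new document as any other document in the repository. A minimal Python sketch of that selection step is shown below. The names `cosine` and `best_match` are hypothetical, and the tie-breaking rule (keep the earliest document) is an assumption, since the claim does not say how ties are resolved.

```python
import math

def cosine(u, v):
    """Cosine similarity of two 0/1 vectors; 0.0 if either is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    # For 0/1 vectors the sum of squares equals the plain sum of the elements.
    norm = math.sqrt(sum(u)) * math.sqrt(sum(v))
    return dot / norm if norm else 0.0

def best_match(new_vec, repo_vecs, threshold):
    """Index of the most similar repository vector, or None if it is below threshold."""
    best_index, best_score = None, -1.0
    for i, vec in enumerate(repo_vecs):
        score = cosine(new_vec, vec)
        if score > best_score:  # assumption: ties keep the earliest document
            best_index, best_score = i, score
    if best_index is not None and best_score >= threshold:
        return best_index
    return None

# Hypothetical vectors over four terms of interest.
repo_vecs = [[1, 0, 0, 1], [1, 1, 1, 0], [0, 0, 0, 1]]
new_vec = [1, 1, 0, 0]
match = best_match(new_vec, repo_vecs, threshold=0.6)
```

A return value of `None` corresponds to the final branch of the claim, where the new document is added to the repository without merging.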
12. A system comprising:
a document repository to store a plurality of documents;
a comparison engine comprising:
a vector generator to generate a vector for each of a new document and a document from the document repository, wherein:
the vector comprises a number of elements equal to a number of terms of interest; and
for each term of interest, an associated element value of the vector is assigned as zero if the term of interest does not occur in the document, and one if the term does occur in the document; and
a similarity generator to determine a similarity between the document and the document from the document repository, the similarity based on the vector for the new document and the vector for the document from the document repository, wherein to determine the similarity between the document and the document from the document repository comprises calculating a cosine measurement of similarity between the vector for the new document and the vector for the document from the plurality of documents; and
a pairing engine comprising:
a document receiver to receive the new document to be paired with the documents in the document repository;
a document submitter to submit the new document to the comparison engine and direct the comparison engine to determine a similarity between the new document and each document of the plurality of documents in the document repository; and
a pair indicator to indicate a pairing between the new document and a document in the document repository having the highest similarity with the new document if the similarity between the new document and the document in the document repository having the highest similarity with the new document is greater than or equal to a threshold value; and
a merge engine comprising:
a document receiver to receive a new document to be incorporated with the documents in the document repository;
a document submitter to submit the new document to the comparison engine and direct the comparison engine to determine a similarity between the new document and each document of the plurality of documents in the document repository;
a merged document generator to generate a merged document by combining at least a portion of the new document with at least a portion of a document in the document repository having the highest similarity with the new document if the similarity between the new document and the document in the document repository having the highest similarity with the new document is greater than or equal to a threshold value;
a document remover to remove the document in the document repository having the highest similarity with the new document; and
a merged document adder to add the merged document to the document repository;
wherein the vector generator further generates a new vector for the merged document.
Dependent claims: 13
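One way to picture how the components of claim 12 fit together is the Python sketch below. It is an illustration under assumptions rather than the claimed system: the class and function names are hypothetical, vectors are recomputed on demand instead of being stored, merging is plain concatenation, and storing an unmatched document unchanged is an assumption, because claim 12 does not recite what happens below the threshold.

```python
import math
from dataclasses import dataclass, field

@dataclass
class ComparisonEngine:
    terms: list  # terms of interest

    def vector(self, doc):
        """Vector generator: 1 where a term of interest occurs in doc, else 0."""
        words = set(doc.lower().split())  # assumption: whitespace tokenization
        return [int(t in words) for t in self.terms]

    def similarity(self, a, b):
        """Similarity generator: cosine of the two 0/1 vectors."""
        u, v = self.vector(a), self.vector(b)
        norm = math.sqrt(sum(u) * sum(v))
        return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0

def closest(engine, repository, doc, threshold):
    """Pair indicator: index of the highest-similarity document, or None if below threshold."""
    scores = [engine.similarity(doc, d) for d in repository]
    if scores and max(scores) >= threshold:
        return scores.index(max(scores))
    return None

@dataclass
class MergeEngine:
    engine: ComparisonEngine
    threshold: float
    repository: list = field(default_factory=list)

    def receive(self, new_doc):
        """Merge new_doc into its closest match if one qualifies; return True if merged."""
        i = closest(self.engine, self.repository, new_doc, self.threshold)
        if i is None:
            self.repository.append(new_doc)  # assumption: store unmatched documents
            return False
        merged = self.repository.pop(i) + " " + new_doc  # document remover + merged document generator
        self.repository.append(merged)                   # merged document adder
        return True

# Hypothetical terms of interest and repository contents.
engine = ComparisonEngine(["printer", "jam", "paper", "login", "password"])
merger = MergeEngine(engine, 0.8, ["paper jam in printer", "forgot login password"])
merged_flag = merger.receive("printer paper jam")
```

Keeping the comparison logic in a single `ComparisonEngine` mirrors the claim, in which both the pairing engine and the merge engine submit documents to the same comparison engine.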
Specification