Automated determination of document utility for a document corpus
First Claim
1. A computer implemented method for determining whether to add a document to an electronic document corpus maintained in at least one electronic data store, the method comprising:
creating, based at least in part on the content of the document corpus, a corpus vector including corpus cells, each corpus cell indicating a word in the document corpus and corpus word count indicating a number of instances of the word in the document corpus; and
receiving, via one or more processors, an electronic candidate document;
creating, based at least in part on the content of the candidate document, a document vector including document cells, each document cell indicating a word in the candidate document and document word count indicating a number of instances of the word in the candidate document;
determining, by at least one of the processors, a relevance value indicating relevance of the candidate document to the document corpus, wherein determining the relevance value includes:
for each corpus cell in the document corpus vector, determining a product by multiplying the corpus word count by a document word count of a corresponding document cell in the document vector;
determining a numerator by summing the products;
determining a first sum by summing a square of the corpus word count in each corpus cell in the corpus vector,
determining a second sum by summing a square of the document word count in each document cell in the document vector,
determining a first square root of the first sum and a second square root of the second sum;
determining a denominator by multiplying the first square root by the second square root;
dividing the numerator by the denominator;
determining, by the one or more processors, that the candidate document is novel with respect to the document corpus based on the relevance value; and
in response to determining that the candidate document is relevant to the document corpus and novel with respect to the document corpus, adding the candidate document to the document corpus to make at least a portion of the content of the candidate document available for a response to a search query.
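The relevance computation recited above (sum of products divided by the product of the two square roots) has the form of a cosine similarity between word-count vectors. The following is a minimal sketch under stated assumptions, not the patented implementation: the function names `word_counts` and `relevance`, the lowercase regex tokenization, and the choice to return 0.0 for an empty vector are all illustrative and do not come from the claim.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Build a word -> count vector. The tokenization rule is an assumption."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def relevance(corpus_vec, doc_vec):
    """Relevance value computed step by step as recited in the claim."""
    # Numerator: for each corpus cell, multiply the corpus word count by the
    # count in the corresponding document cell (0 if the word is absent),
    # then sum the products.
    numerator = sum(count * doc_vec.get(word, 0)
                    for word, count in corpus_vec.items())
    # First and second sums: squared counts over each vector.
    first_sum = sum(count * count for count in corpus_vec.values())
    second_sum = sum(count * count for count in doc_vec.values())
    # Denominator: product of the two square roots.
    denominator = math.sqrt(first_sum) * math.sqrt(second_sum)
    if denominator == 0:
        return 0.0  # assumption: an empty vector is treated as not relevant
    return numerator / denominator


corpus_vec = word_counts("a a b")
doc_vec = word_counts("a b b")
print(relevance(corpus_vec, doc_vec))  # approximately 0.8
```

With raw counts the value lies between 0 (no shared words) and 1 (identical word distributions), which is what allows a single number to be compared against thresholds.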
Abstract
A candidate document is received, for example, by a document filter. Based on the content of the candidate document, a determination is made as to whether the candidate document is relevant to a document corpus. A further determination is made, also based on the content of the candidate document, as to whether the candidate document is novel with respect to the document corpus. In response to determining that the candidate document is both relevant to the document corpus and novel with respect to it, the candidate document is added to the document corpus to make at least a portion of the content of the candidate document available for a response to a search query.
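The filtering decision described in the abstract can be sketched as follows. This is an illustration under assumptions, not the method as specified: the `Corpus` class, the `filter_document` function, and both threshold values are hypothetical, and treating "novel" as "relevance below an upper bound" is one possible reading that the abstract itself does not spell out. The relevance value is taken as an input so the sketch stays independent of how it is computed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Corpus:
    """Hypothetical in-memory corpus: stored documents plus word counts."""
    documents: list = field(default_factory=list)
    word_counts: Counter = field(default_factory=Counter)

    def add(self, text):
        # Store the document and fold its words into the corpus counts.
        self.documents.append(text)
        self.word_counts.update(text.lower().split())


def filter_document(corpus, text, relevance,
                    min_relevance=0.2, max_relevance=0.9):
    """Add `text` to `corpus` only if it is both relevant and novel.

    Assumption: relevant means relevance >= min_relevance, and novel means
    relevance <= max_relevance (not a near-duplicate of existing content).
    Both thresholds are illustrative values.
    """
    is_relevant = relevance >= min_relevance
    is_novel = relevance <= max_relevance
    if is_relevant and is_novel:
        corpus.add(text)
        return True
    return False


corpus = Corpus()
corpus.add("solar panels convert sunlight")
added = filter_document(corpus, "wind turbines generate power", relevance=0.5)
```

Keeping the two tests separate makes it explicit that a document can be rejected for two different reasons: it is off-topic, or it adds nothing the corpus does not already contain.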
14 Claims
1. A computer implemented method for determining whether to add a document to an electronic document corpus maintained in at least one electronic data store, the method comprising:
creating, based at least in part on the content of the document corpus, a corpus vector including corpus cells, each corpus cell indicating a word in the document corpus and corpus word count indicating a number of instances of the word in the document corpus; and
receiving, via one or more processors, an electronic candidate document;
creating, based at least in part on the content of the candidate document, a document vector including document cells, each document cell indicating a word in the candidate document and document word count indicating a number of instances of the word in the candidate document;
determining, by at least one of the processors, a relevance value indicating relevance of the candidate document to the document corpus, wherein determining the relevance value includes:
for each corpus cell in the document corpus vector, determining a product by multiplying the corpus word count by a document word count of a corresponding document cell in the document vector;
determining a numerator by summing the products;
determining a first sum by summing a square of the corpus word count in each corpus cell in the corpus vector,
determining a second sum by summing a square of the document word count in each document cell in the document vector,
determining a first square root of the first sum and a second square root of the second sum;
determining a denominator by multiplying the first square root by the second square root;
dividing the numerator by the denominator;
determining, by the one or more processors, that the candidate document is novel with respect to the document corpus based on the relevance value; and
in response to determining that the candidate document is relevant to the document corpus and novel with respect to the document corpus, adding the candidate document to the document corpus to make at least a portion of the content of the candidate document available for a response to a search query.
Dependent claims: 2, 3, 4, 5, 6.
7. A computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to determine document utility, the program instructions comprising:
program instructions to create, based at least in part on the content of the document corpus, a corpus vector including corpus cells, each corpus cell indicating a word in the document corpus and corpus word count indicating a number of instances of the word in the document corpus; and
program instructions to receive, via one or more processors, an electronic candidate document;
program instructions to create, based at least in part on the content of the candidate document, a document vector including document cells, each document cell indicating a word in the candidate document and document word count indicating a number of instances of the word in the candidate document;
program instructions to determine, by at least one of the processors, a relevance value indicating relevance of the candidate document to the document corpus, wherein the program instructions to determine the relevance value include program instructions to:
for each corpus cell in the document corpus vector, determine a product by multiplying the corpus word count by a document word count of a corresponding candidate document cell;
determine a numerator by summing the products;
determine a first sum by summing a square of the corpus word count in each corpus cell in the corpus vector,
determine a second sum by summing a square of the document word count in each document cell in the document vector,
determine a first square root of the first sum and a second square root of the second sum;
determine a denominator by multiplying the first square root by the second square root;
divide the numerator by the denominator;
program instructions to determine, by the one or more processors, that the candidate document is novel with respect to the document corpus based on the relevance value; and
program code to, in response to the determination that the candidate document is relevant to the document corpus and novel with respect to the document corpus, add the candidate document to the document corpus to make at least a portion of the content of the candidate document available for a response to a search query.
Dependent claims: 8, 9, 10.
11. An apparatus comprising:
one or more processors;
a computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by the one or more processors to cause the computer to determine document utility, the program instructions comprising:
program instructions to create, based at least in part on the content of the document corpus, a corpus vector including corpus cells, each corpus cell indicating a word in the document corpus and corpus word count indicating a number of instances of the word in the document corpus; and
program instructions to receive, via one or more processors, an electronic candidate document;
program instructions to create, based at least in part on the content of the candidate document, a document vector including document cells, each document cell indicating a word in the candidate document and document word count indicating a number of instances of the word in the candidate document;
program instructions to determine, by at least one of the processors, a relevance value indicating relevance of the candidate document to the document corpus, wherein the program instructions to determine the relevance value include program instructions to:
for each corpus cell in the document corpus vector, determine a product by multiplying the corpus word count by a document word count of a corresponding document cell;
determine a numerator by summing the products;
determine a first sum by summing a square of the corpus word count in each corpus cell in the corpus vector,
determine a second sum by summing a square of the document word count in each document cell in the document vector,
determine a first square root of the first sum and a second square root of the second sum;
determine a denominator by multiplying the first square root by the second square root;
divide the numerator by the denominator;
program instructions to determine, by the one or more processors, that the candidate document is novel with respect to the document corpus based on the relevance value; and
program code to, in response to the determination that the candidate document is relevant to the document corpus and novel with respect to the document corpus, add the candidate document to the document corpus to make at least a portion of the content of the candidate document available for a response to a search query.
Dependent claims: 12, 13, 14.
Specification