System of document representation retrieval by successive iterated probability sampling
First Claim
1. In a computer system for identifying a predetermined number of documents of a document collection containing representations that have high probabilities of matching a query containing a plurality of concepts, in which the system has a database containing identifications of documents in the document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining the query, apparatus comprising:
- sample selection means for iteratively selecting successive samples of a plurality of documents from the collection, each sample containing fewer documents than the entire collection and each successive sample containing documents different from each previous sample;
processing means responsive to the sample selection means for calculating, during each iteration, probabilities that documents contained in the sample contain representations that match the query and for identifying a preselected number of documents having the highest probabilities, the documents being identified during an iteration from a group consisting of the respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and
output means outputting the identifications of the predetermined number of documents identified by the processing means.
2 Assignments
0 Petitions
Accused Products
Abstract
An information retrieval system based on probabilities that documents meet information needs. The frequency of occurrence of a representation in a collection of documents is estimated by identifying the frequency of occurrence of the representation in a sample of documents and calculating the difference between the maximum and minimum probable frequencies of occurrence of the representation in the collection. If the difference does not exceed a limit, a midpoint of the maximum and minimum probable frequencies is the estimated frequency of occurrence of the representation.
Document distribution probabilities are optimized and probability thresholds are established for the identification of documents. An initial probability threshold is established and is adjusted as the probabilities are scored for documents in samples. The document result list is iteratively adjusted through the samples.
647 Citations
46 Claims
-
1. In a computer system for identifying a predetermined number of documents of a document collection containing representations that have high probabilities of matching a query containing a plurality of concepts, in which the system has a database containing identifications of documents in the document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining the query, apparatus comprising:
-
sample selection means for iteratively selecting successive samples of a plurality of documents from the collection, each sample containing fewer documents than the entire collection and each successive sample containing documents different from each previous sample; processing means responsive to the sample selection means for calculating, during each iteration, probabilities that documents contained in the sample contain representations that match the query and for identifying a preselected number of documents having the highest probabilities, the documents being identified during an iteration from a group consisting of the respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and output means outputting the identifications of the predetermined number of documents identified by the processing means. - View Dependent Claims (2, 3, 4, 5, 6, 7)
-
-
8. A system for identifying documents matching a comprising:
-
a memory containing a database containing identification of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, the database further containing indications of the frequencies of occurrence of documents containing first representations in the collection; computer means responsive to a query defining a plurality of concepts, the computer means including matching means for matching the concepts to representations, estimating means for estimating the frequency of occurrence of documents containing a second selected representation in the collection, the second selected representation being different from any of the first representations, the estimating means including sample selection means for selecting a sample comprising a plurality of documents from the collection, the sample containing fewer documents than the entire collection; frequency identifying means responsive to the sample selection means for identifying the frequency of occurrence of documents containing the second selected representation in the selected sample of documents; processor means responsive to the memory means and to the frequency identifying means for calculating a maximum and a minimum probable frequency of occurrence of documents containing the second selected representation in the collection; and selection means responsive to the processor means for selecting the midpoint, of the maximum and minimum probable frequencies as the estimated frequency of occurrence of the second selected representation; retrieval means for selecting documents meeting the query based on the frequencies of occurrence of documents containing first representations which match the concepts and the estimated frequencies of occurrence of documents containing second representations which match the concepts, and output means responsive to the retrieval means and the memory for outputting identifications of the selected documents. - View Dependent Claims (9, 10, 11)
-
-
12. In a system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection compromising a plurality of documents and the database containing a frequency of occurrence of documents containing each of at least some of the representations in the collection of documents, query means for defining a query containing a plurality of concepts, matching means for matching concepts to representations means for selecting documents meeting the query based on frequencies of occurrence of documents in the collection containing representations matching the concepts, and output means responsive to the means for selecting documents for outputting identifications of selected documents, the improvement comprising a process of estimating the frequency of occurrence of documents containing a representation in the collection of documents for which the database does not contain a frequency of occurrence, comprising:
-
identifying, on the basis of concepts in the query, a representation for which the database does not contain a frequency of occurrence; selecting a sample comprising a plurality of documents from the collection, the sample containing fewer documents than the entire collection; identifying the frequency of occurrence of documents containing the identified representation in the selected sample of documents; calculating a maximum and a minimum probable frequency of occurrence of documents containing the identified representation in the collection; and selecting a midpoint of the maximum and minimum probable frequencies as the estimated frequency of occurrence of documents containing the identified representation, whereby the means for selecting documents meeting the query is responsive to the frequencies of occurrence in the database of documents in the collection containing representations matching the concepts and to estimated frequencies of occurrence to select documents in the collection containing representations matching the concepts. - View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20)
-
-
21. In a computer system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining a query containing a plurality of concepts, apparatus for identifying documents of the document collection containing representations that match the query containing a plurality of concepts, the apparatus comprising:
-
processing means for calculating probabilities that documents match the query and for identifying a first document having a calculated probability; threshold setting means responsive to the processing means for setting a probability threshold equal to the probability of the first document; estimating means responsive to the processing means for estimating a maximum probability for a second document different from the first document based on a partially calculated probability and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated; the processing means being responsive to the estimating means to calculate partial probabilities that representations in the second document match concepts of the query until either the estimated maximum probability for the second document does not at least equal the probability threshold or the probability is calculated for all the concepts in the query; the estimating means being further responsive to the processing means ceasing or completing the calculation of the probability for the second document to estimate a maximum probability for a third document different from the first and second documents; and output means responsive to the processing means for outputting identifications of only documents whose probability is calculated for all concepts in the query. - View Dependent Claims (22, 23, 24)
-
-
25. A document identification system for identifying a predetermined number of documents matching a query, comprising:
-
a read-only memory containing a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of documents in the document collection, the collection comprising a plurality of documents; query means for defining the query containing a plurality of concepts; computer means responsive to the query containing a plurality of concepts, the computer means including matching means for matching the concepts to representations; sample selection means for iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection, and each successive sample containing documents different from each previous sample; processing means responsive to the sample selection means for calculating, during each iteration probabilities that documents contained in the sample contain representations that match the query and for identifying up to a preselected number of documents having the highest probabilities, the documents being identified during each iteration from a group consisting of the respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and output means outputting identifications of the predetermined number of documents identified by the processing means. - View Dependent Claims (26, 27, 28, 29, 30)
-
-
31. A document identification system for identifying documents matching a query, comprising:
-
a read-only memory containing a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of documents in a document collection, the collection comprising a plurality of documents, the database further containing indications of the frequencies of occurrences of a plurality of representations in the documents; query means for defining the query containing a plurality of concepts; computer means responsive to the query, the computer means including matching means for matching the concepts to representations; calculating means for calculating the probabilities that documents meet the query based on the frequencies of occurrence of representations in the respective documents which match the concepts; processing means responsive to the calculating means for identifying a first document contained in the sample having the highest calculated probability; threshold setting means responsive to the processing means for setting a probability threshold equal to the probability of the first document; estimating means responsive to the calculating means for estimating a maximum probability for a second document different from the first document based on a partially calculated probability for the second document and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated, said calculating means being responsive to the estimating means to calculate partial probabilities that representations in the second document match concepts in the query until either the estimated maximum probability for the second document does not at least equal the probability threshold or the probability is calculated for all concepts in the query, the estimating means being further responsive to the calculating means ceasing or the completing the calculation of the probability for the second document to estimate a maximum probability for a third document different from the first and second documents; and output means responsive to the processing means for outputting identifications of only documents whose probability is calculated for all concepts in the query. - View Dependent Claims (32)
-
-
33. In a computer system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining a query containing a plurality of concepts, a process of identifying a predetermined number of documents of the document collection containing representations that have high probabilities of matching the query containing a plurality of concepts, the process comprising:
-
iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection, and each successive sample containing documents different from each previous sample; calculating the probabilities that documents contained in the sample contain representations that match the query; identifying, during each iteration, a preselected number of documents having the highest probabilities, the documents being selected from a group consisting of a respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and outputting identifications of the predetermined number of identified documents upon completion of the last iteration. - View Dependent Claims (34, 35, 36, 37, 38)
-
-
39. In a computer system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining a query containing a plurality of concepts, a process of identifying documents of the document collection containing representations that match the query containing a plurality of concepts, the process comprising:
-
computing the full probability that a first document matches the concepts in the query; setting a probability threshold equal to the full probability of the first document; calculating a partial probability that a second document matches some but not all concepts in the query; estimating a maximum probability for the second document based on the calculated probability and an assumption that the representations in the document match the concepts of the query for which probabilities have not been calculated; repeating the steps of calculating and estimating for additional query concepts until either the estimated maximum probability tier the second document is not as large as the probability threshold or the full probability of the second document is calculated for all concepts in the query; repeating the repetitive steps of calculating and estimating for a third document different from the first and second documents; and outputting identifications of only documents having a full probability at least as great as the probability threshold. - View Dependent Claims (40)
-
-
41. In a system identifying a predetermined number of documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, query means for defining containing a plurality of concepts, means for determining a probability that a document meets the query based on matches of representations in the document and concepts in the query, and output means for outputting the identifications of documents having a probability at least as great as a probability threshold, apparatus for establishing the probability threshold comprising:
-
sample selection means for iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection and each successive sample containing documents different from each previous sample; calculating means for calculating probabilities that documents contained in the sample contain representations that match the query; processing means responsive to the sample selection means to identify, during each iteration, up to a preselected number of documents having the highest probabilities, the documents being identified during each iteration from a group consisting of a respective sample of documents end the documents identified during the previous iteration; and threshold setting means responsive to the processing means for setting the probability threshold to the probability of the identified document having the lowest probability. - View Dependent Claims (42, 43)
-
-
44. In a system for identifying a predetermined number of documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, query means for defining a query containing a plurality of concepts, means for determining a probability that a document meets the query based on a match of representations in the document and concepts in the query, and output means for outputting the identifications of documents having a probability at least as great as a probability threshold, a process for establishing the probability threshold comprising:
-
iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection, and each successive sample containing documents different from each previous sample; calculating probabilities that documents in the sample contain representations that match the query; identifying, during each iteration, up to a preselected number of documents having the highest probabilities, the documents being identified during each iteration from a group consisting of a respective sample of documents and the documents identified during the next previous iteration; and setting the probability threshold to the probability of the identified document having the lowest probability. - View Dependent Claims (45, 46)
-
Specification