System of document representation retrieval by successive iterated probability sampling

US 5,488,725 A
Filed: 03/30/1993
Issued: 01/30/1996
Est. Priority Date: 10/08/1991
Status: Expired due to Term

First Claim

Patent Images

1. In a computer system for identifying a predetermined number of documents of a document collection containing representations that have high probabilities of matching a query containing a plurality of concepts, in which the system has a database containing identifications of documents in the document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining the query, apparatus comprising:

sample selection means for iteratively selecting successive samples of a plurality of documents from the collection, each sample containing fewer documents than the entire collection and each successive sample containing documents different from each previous sample;

processing means responsive to the sample selection means for calculating, during each iteration, probabilities that documents contained in the sample contain representations that match the query and for identifying a preselected number of documents having the highest probabilities, the documents being identified during an iteration from a group consisting of the respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and

output means outputting the identifications of the predetermined number of documents identified by the processing means.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

An information retrieval system based on probabilities that documents meet information needs. The frequency of occurrence of a representation in a collection of documents is estimated by identifying the frequency of occurrence of the representation in a sample of documents and calculating the difference between the maximum and minimum probable frequencies of occurrence of the representation in the collection. If the difference does not exceed a limit, a midpoint of the maximum and minimum probable frequencies is the estimated frequency of occurrence of the representation.

Document distribution probabilities are optimized and probability thresholds are established for the identification of documents. An initial probability threshold is established and is adjusted as the probabilities are scored for documents in samples. The document result list is iteratively adjusted through the samples.

647 Citations

46 Claims

1. In a computer system for identifying a predetermined number of documents of a document collection containing representations that have high probabilities of matching a query containing a plurality of concepts, in which the system has a database containing identifications of documents in the document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining the query, apparatus comprising:
- sample selection means for iteratively selecting successive samples of a plurality of documents from the collection, each sample containing fewer documents than the entire collection and each successive sample containing documents different from each previous sample;
  
  processing means responsive to the sample selection means for calculating, during each iteration, probabilities that documents contained in the sample contain representations that match the query and for identifying a preselected number of documents having the highest probabilities, the documents being identified during an iteration from a group consisting of the respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and
  
  output means outputting the identifications of the predetermined number of documents identified by the processing means.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. The apparatus according to claim 1 further including threshold setting means responsive to the processing means for setting a probability threshold equal to the probability of a first identified document.
  - 3. The apparatus according to claim 2 including determining means operable during each respective iteration and responsive to the identification of the preselected number of documents by the processing means to determine if an additional document of the respective sample has a probability greater than the probability threshold, the processing means being responsive to the determining means identifying an additional document having a probability greater that the probability threshold to replace the previously-identified document having the lowest probability by the additional document, and the threshold setting means being responsive to the processing means to reset the probability threshold to the probability of the identified document having the new lowest probability.
  - 4. The apparatus according to claim 1 wherein the preselected number is equal to the number of the respective iteration, and the predetermined number is equal to the number of the last iteration.
  - 5. The apparatus according to claim 1 including estimating means responsive to the processing means for estimating a maximum probability for a second document different from the first document based on a partially calculated probability for the second document and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated, the processing means being responsive to the estimating means to calculate partial probabilities that representations in the second document match concepts of the query until either the estimated maximum probability does not at least equal the threshold or the probability is calculated for all the concepts in the query.
  - 6. The apparatus according to claim 5 wherein the output means includes a result list ranking the identified documents in probability order, the threshold setting means being responsive to the result list to reset the probability threshold equal to the probability of the document lowest on the result list.
  - 7. The apparatus according to claim 1 wherein the processing means includes a result list ranking the identified documents in probability order.

8. A system for identifying documents matching a comprising:
- a memory containing a database containing identification of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, the database further containing indications of the frequencies of occurrence of documents containing first representations in the collection;
  
  computer means responsive to a query defining a plurality of concepts, the computer means includingmatching means for matching the concepts to representations,estimating means for estimating the frequency of occurrence of documents containing a second selected representation in the collection, the second selected representation being different from any of the first representations, the estimating means includingsample selection means for selecting a sample comprising a plurality of documents from the collection, the sample containing fewer documents than the entire collection;
  
  frequency identifying means responsive to the sample selection means for identifying the frequency of occurrence of documents containing the second selected representation in the selected sample of documents;
  
  processor means responsive to the memory means and to the frequency identifying means for calculating a maximum and a minimum probable frequency of occurrence of documents containing the second selected representation in the collection; and
  
  selection means responsive to the processor means for selecting the midpoint, of the maximum and minimum probable frequencies as the estimated frequency of occurrence of the second selected representation;
  
  retrieval means for selecting documents meeting the query based on the frequencies of occurrence of documents containing first representations which match the concepts and the estimated frequencies of occurrence of documents containing second representations which match the concepts, andoutput means responsive to the retrieval means and the memory for outputting identifications of the selected documents.
- View Dependent Claims (9, 10, 11)
- - 9. The system according to claim 8 wherein the processor means includes means for identifying if the difference between the maximum and minimum probable frequencies is within a preselected limit, and further including adjusting means responsive to the processor means for adding additional documents from the collection to the sample of documents if the calculating difference between the maximum and minimum probable frequencies exceeds the preselected limit.
  - 10. The system according to claim 8 where the processor means calculates the maximum probable frequency, f_max, and the minimum probable frequency, f_min, in accordance with relationships based on the number of gaps between documents in the sample containing the second selected representation (n_i), the number of documents in the collection (n_c), and the number of documents in the sample (x_i).
  - 11. The system according to claim 10 where f_max and f_min are calculated accordance with the relationships ##EQU5## where s_i is the greater of x_i /n_i or the standard deviation of the n_i gaps, andz is the standard critical value for normal distribution for a preselected reliability.

12. In a system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection compromising a plurality of documents and the database containing a frequency of occurrence of documents containing each of at least some of the representations in the collection of documents, query means for defining a query containing a plurality of concepts, matching means for matching concepts to representations means for selecting documents meeting the query based on frequencies of occurrence of documents in the collection containing representations matching the concepts, and output means responsive to the means for selecting documents for outputting identifications of selected documents, the improvement comprising a process of estimating the frequency of occurrence of documents containing a representation in the collection of documents for which the database does not contain a frequency of occurrence, comprising:
- identifying, on the basis of concepts in the query, a representation for which the database does not contain a frequency of occurrence;
  
  selecting a sample comprising a plurality of documents from the collection, the sample containing fewer documents than the entire collection;
  
  identifying the frequency of occurrence of documents containing the identified representation in the selected sample of documents;
  
  calculating a maximum and a minimum probable frequency of occurrence of documents containing the identified representation in the collection; and
  
  selecting a midpoint of the maximum and minimum probable frequencies as the estimated frequency of occurrence of documents containing the identified representation,whereby the means for selecting documents meeting the query is responsive to the frequencies of occurrence in the database of documents in the collection containing representations matching the concepts and to estimated frequencies of occurrence to select documents in the collection containing representations matching the concepts.
- View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20)
- - 13. The process according to claim 12 further including identifying whether the difference between the maximum and minimum probable frequencies is within a preselected limit, and adding additional documents to the sample from the collection if the calculated difference between the maximum and minimum probable frequencies exceeds the preselected limit.
  - 14. The process according to claim 13 where the preselected limit is 0.05.
  - 15. The process according to claim 12 where the maximum probable frequency, f_max, and the minimum probable frequency, f_min, are calculated in accordance with the relationships ##EQU6## where n_i is the number of gaps between documents in the sample containing the selected representation,n_c is the number of documents in the collection,x_i is the number of documents in the sample,s_i is the greater of x_i /n_i or the standard deviation of the n_i gaps, andz is the standard critical value for normal distribution for a preselected reliability.
  - 16. The process according to claim 15 where the selected representation contains a plurality of terms, the method including setting f_min equal to n_i if the calculated f_min is smaller than n_i, setting f_max equal to n_i +(n_c 31 x_i) if the calculated f_max is smaller than zero or smaller than n_i, and setting f_max equal to an a priori maximum if the calculated f_max is greater than the a priori maximum.
  - 17. The process according to claim 16 wherein the selected representation is a synonym represented by a plurality of terms, and wherein the a priori maximum is equal to the sum of all frequencies of occurrence of documents in the collection containing a term of the synonym, said method including setting f_min equal to an a priori minimum if the calculated f_min is smaller than the a priori minimum, where the a priori minimum is equal to the frequency of occurrence of documents containing the term of the synonym appearing in the greatest number of documents in the collection.
  - 18. The process according to claim 16 wherein the selected representation is a phrase containing a plurality of terms, and the a priori maximum is equal to the frequency of occurrence of documents containing the term of the phrase appearing in the least number of documents in the collection.
  - 19. The process according to claim 15 where the preselected reliability is 0.995 and z is 2.8070.
  - 20. The process according to claim 12 wherein the midpoint selected between the maximum and minimum probable frequencies is the mean of the maximum and minimum probable frequencies.

21. In a computer system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining a query containing a plurality of concepts, apparatus for identifying documents of the document collection containing representations that match the query containing a plurality of concepts, the apparatus comprising:
- processing means for calculating probabilities that documents match the query and for identifying a first document having a calculated probability;
  
  threshold setting means responsive to the processing means for setting a probability threshold equal to the probability of the first document;
  
  estimating means responsive to the processing means for estimating a maximum probability for a second document different from the first document based on a partially calculated probability and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated;
  
  the processing means being responsive to the estimating means to calculate partial probabilities that representations in the second document match concepts of the query until either the estimated maximum probability for the second document does not at least equal the probability threshold or the probability is calculated for all the concepts in the query;
  
  the estimating means being further responsive to the processing means ceasing or completing the calculation of the probability for the second document to estimate a maximum probability for a third document different from the first and second documents; and
  
  output means responsive to the processing means for outputting identifications of only documents whose probability is calculated for all concepts in the query.
- View Dependent Claims (22, 23, 24)
- - 22. The apparatus according to claim 21 wherein the output means includes a result list identifying in probability order, up to a predetermined number of documents whose probability is calculated for all concepts in the query, the threshold setting means being responsive to the result list to reset the probability threshold equal to the probability of the document lowest on the result list.
  - 23. Apparatus according to claim 21 wherein the threshold setting means is responsive to the processing means calculating the probability for the second document for all the concepts in the query to set the probability threshold equal to the probability of the second document.
  - 24. The apparatus according to claim 21 wherein the output means includes a result list identifying in probability order, up to a predetermined number of documents whose probability is calculated for all concepts in the query.

25. A document identification system for identifying a predetermined number of documents matching a query, comprising:
- a read-only memory containing a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of documents in the document collection, the collection comprising a plurality of documents;
  
  query means for defining the query containing a plurality of concepts;
  
  computer means responsive to the query containing a plurality of concepts, the computer means including matching means for matching the concepts to representations;
  
  sample selection means for iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection, and each successive sample containing documents different from each previous sample;
  
  processing means responsive to the sample selection means for calculating, during each iteration probabilities that documents contained in the sample contain representations that match the query and for identifying up to a preselected number of documents having the highest probabilities, the documents being identified during each iteration from a group consisting of the respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and
  
  output means outputting identifications of the predetermined number of documents identified by the processing means.
- View Dependent Claims (26, 27, 28, 29, 30)
- - 26. The system according to claim 25 further including threshold setting means responsive to the processing means for setting a probability threshold equal to the probability of a first identified document.
  - 27. The system according to claim 26 including determining means operable during each respective iteration and responsive to the identification of the preselected number of documents by the processing means to determine if an additional document of the respective sample has a probability grater than the probability threshold, the processing means being responsive to the determining means identifying an additional document having a probability greater than the probability threshold to replace the previously-identified document having the lowest probability with the additional document, and the threshold setting means is responsive to the processing means to reset the probability threshold to the probability of the identified document having the new lowest probability.
  - 28. The system according to claim 25 including estimating means responsive to the processing means for estimating a maximum probability for a second document different from the first document based on a partially calculated probability for the second document and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated, the processing means being responsive to the estimating means to calculate partial probabilities that representations in the second document match concepts of the query until either the estimated maximum probability for the second document does not at least equal the threshold or the probability is calculated for all the concepts in the query.
  - 29. The system according to claim 28 wherein the output means includes a result list ranking the identified documents in probability order, the threshold setting means being responsive to the result list to reset the probability threshold equal to the probability of the document lowest on the result list.
  - 30. The system according to claim 25 wherein the output means includes a result list ranking the identified documents in probability order.

31. A document identification system for identifying documents matching a query, comprising:
- a read-only memory containing a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of documents in a document collection, the collection comprising a plurality of documents, the database further containing indications of the frequencies of occurrences of a plurality of representations in the documents;
  
  query means for defining the query containing a plurality of concepts;
  
  computer means responsive to the query, the computer means including matching means for matching the concepts to representations;
  
  calculating means for calculating the probabilities that documents meet the query based on the frequencies of occurrence of representations in the respective documents which match the concepts;
  
  processing means responsive to the calculating means for identifying a first document contained in the sample having the highest calculated probability;
  
  threshold setting means responsive to the processing means for setting a probability threshold equal to the probability of the first document;
  
  estimating means responsive to the calculating means for estimating a maximum probability for a second document different from the first document based on a partially calculated probability for the second document and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated,said calculating means being responsive to the estimating means to calculate partial probabilities that representations in the second document match concepts in the query until either the estimated maximum probability for the second document does not at least equal the probability threshold or the probability is calculated for all concepts in the query,the estimating means being further responsive to the calculating means ceasing or the completing the calculation of the probability for the second document to estimate a maximum probability for a third document different from the first and second documents; and
  
  output means responsive to the processing means for outputting identifications of only documents whose probability is calculated for all concepts in the query.
- View Dependent Claims (32)
- - 32. The document identification system according to claim 31 wherein said output means includes a result list responsive to the calculating means to identify in probability order up to a predetermined number of those documents whose probability is calculated for all concepts in the query, said threshold setting means being responsive to the result list to reset the probability threshold equal to the probability of the document lowest on the result list.

33. In a computer system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining a query containing a plurality of concepts, a process of identifying a predetermined number of documents of the document collection containing representations that have high probabilities of matching the query containing a plurality of concepts, the process comprising:
- iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection, and each successive sample containing documents different from each previous sample;
  
  calculating the probabilities that documents contained in the sample contain representations that match the query;
  
  identifying, during each iteration, a preselected number of documents having the highest probabilities, the documents being selected from a group consisting of a respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and
  
  outputting identifications of the predetermined number of identified documents upon completion of the last iteration.
- View Dependent Claims (34, 35, 36, 37, 38)
- - 34. The process according to claim 33 including setting a probability threshold to the probability of the identified document having the lowest probability of all identified documents, and during each respective iteration and after the preselected number of documents has been identified, determining if an additional document of the respective sample has been identified having a probability greater than the probability threshold, and if so, replacing the previously-identified document having the lowest probability with the additional document and resetting the probability threshold to the probability of the identified document having the new lowest probability.
  - 35. The process according to claim 33 wherein the preselected number is equal to the number of the respective iteration, and the predetermined number is equal to the number of the last iteration.
  - 36. The process according to claim 33 including setting a probability threshold equal to the probability of a first document, estimating a maximum probability for a second document different from the first document based on a partially calculated probability for the second document and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated, and calculating partial probabilities that representations in the second document match concepts in the query until either the estimated maximum probability for the second document does not at least equal the threshold or the probability is calculated for all the concepts in the query.
  - 37. The process according to claim 36 including ranking the identified documents in probability order, and resetting the probability threshold equal to the probability of the document lowest on the list.
  - 38. The process according to claim 33 including ranking the identified documents in probability order.

39. In a computer system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining a query containing a plurality of concepts, a process of identifying documents of the document collection containing representations that match the query containing a plurality of concepts, the process comprising:
- computing the full probability that a first document matches the concepts in the query;
  
  setting a probability threshold equal to the full probability of the first document;
  
  calculating a partial probability that a second document matches some but not all concepts in the query;
  
  estimating a maximum probability for the second document based on the calculated probability and an assumption that the representations in the document match the concepts of the query for which probabilities have not been calculated;
  
  repeating the steps of calculating and estimating for additional query concepts until either the estimated maximum probability tier the second document is not as large as the probability threshold or the full probability of the second document is calculated for all concepts in the query;
  
  repeating the repetitive steps of calculating and estimating for a third document different from the first and second documents; and
  
  outputting identifications of only documents having a full probability at least as great as the probability threshold.
- View Dependent Claims (40)
- - 40. The process according to claim 39 wherein a predetermined number of documents of the document collection is identified and wherein documents whose probabilities are calculated for all concepts in the query are identified to a result list in probability order, up to said predetermined number, said process further including resetting the probability threshold equal to the probability of the document lowest on the result list.

41. In a system identifying a predetermined number of documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, query means for defining containing a plurality of concepts, means for determining a probability that a document meets the query based on matches of representations in the document and concepts in the query, and output means for outputting the identifications of documents having a probability at least as great as a probability threshold, apparatus for establishing the probability threshold comprising:
- sample selection means for iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection and each successive sample containing documents different from each previous sample;
  
  calculating means for calculating probabilities that documents contained in the sample contain representations that match the query;
  
  processing means responsive to the sample selection means to identify, during each iteration, up to a preselected number of documents having the highest probabilities, the documents being identified during each iteration from a group consisting of a respective sample of documents end the documents identified during the previous iteration; and
  
  threshold setting means responsive to the processing means for setting the probability threshold to the probability of the identified document having the lowest probability.
- View Dependent Claims (42, 43)
- - 42. The apparatus according to claim 41 including determining means operable during each respective iteration and responsive to the identification of the preselected number of documents by the processing means to determine if the processing means identifies an additional document of the respective sample having a probability greater than the probability threshold, the processing means being responsive to the determining means to replace the previously-identified document having the lowest probability by the additional document, and the threshold setting means is responsive to the processing means to reset the probability threshold to the probability of the identified document having the new lowest probability.
  - 43. The apparatus according to claim 41 wherein the preselected number is equal to the number of the respective iteration.

44. In a system for identifying a predetermined number of documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, query means for defining a query containing a plurality of concepts, means for determining a probability that a document meets the query based on a match of representations in the document and concepts in the query, and output means for outputting the identifications of documents having a probability at least as great as a probability threshold, a process for establishing the probability threshold comprising:
- iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection, and each successive sample containing documents different from each previous sample;
  
  calculating probabilities that documents in the sample contain representations that match the query;
  
  identifying, during each iteration, up to a preselected number of documents having the highest probabilities, the documents being identified during each iteration from a group consisting of a respective sample of documents and the documents identified during the next previous iteration; and
  
  setting the probability threshold to the probability of the identified document having the lowest probability.
- View Dependent Claims (45, 46)
- - 45. The process according to claim 44 including during each respective iteration and after the preselected number of documents has been identified, determining if an additional document of the sample has been identified having a probability greater than the probability threshold, replacing the previously-identified document having the lowest probability by the additional document, and resetting the probability threshold to the probability of the identified document having the new lowest probability.
  - 46. The process according to claim 44 wherein the preselected number is equal to the number of the respective iteration.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
West Services
Original Assignee
West Publishing Corporation (The Woodbridge Co. Ltd.)
Inventors
Morton, Gerald J., Larntz, F. Kinley, Turtle, Howard R.
Primary Examiner(s)
Black, Thomas G.
Assistant Examiner(s)
AMSBURY, WAYNE P

Application Number

US08/039,757
Time in Patent Office

1,036 Days
Field of Search

395/600, 364/419.19
US Class Current

1/1
CPC Class Codes

G06F 16/3335   Syntactic pre-processing, e...

G06F 16/3346   using probabilistic model

G06F 16/93   Document management systems

Y10S 707/99935   Query augmenting and refini...

System of document representation retrieval by successive iterated probability sampling

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

647 Citations

46 Claims

Specification

Solutions

Use Cases

Quick Links

System of document representation retrieval by successive iterated probability sampling

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

647 Citations

46 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links