Method and device for classification using iterative information retrieval techniques
Abstract
Disclosed is a method and device for improving the quality of documents selected in response to a user query for documents such as Web pages or sites. The method is one of iteration, and involves the successive review by the user of a limited number of documents as being relevant or not relevant, the analysis of the characteristics of the documents so graded by means of information retrieval techniques, and the modification of the search query based upon that analysis, until the user is satisfied with the quality of the documents presented to him.
48 Claims
1. A method for choosing documents of interest from a collection of documents, comprising:
-
(a) determining an initial selection criterion;
(b) applying the initial selection criterion to each document in the collection, to generate a rank-ordered list of documents;
(c) if further refinement of the list is desired, evaluating a subset of the documents on the list to determine whether each document in the subset is relevant;
(d) modifying the selection criteria by at least one of:
adjusting weights assigned to each element of the selection criteria in the prior iteration, removing elements of the selection criteria from the prior iteration, and adding elements to the selection criteria, based upon features of the documents determined to be relevant;
(e) applying the modified selection criterion to each document in the collection, to generate a new rank-ordered list of documents;
(f) repeating the steps of (c), (d), and (e) until classification is sufficiently accurate for use.
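The loop recited in steps (a) through (f) can be sketched as follows. This is a minimal illustration only, not the patented implementation: the whitespace tokenizer, the additive weight update, the `judge` callback standing in for the user's relevance decision, and the stopping rule (stop when the ranking no longer changes, or after `max_rounds`) are all assumptions introduced here.

```python
from collections import Counter


def tokenize(text):
    # Assumption: a document's "features" are its lowercase whitespace tokens.
    return set(text.lower().split())


def rank(documents, criteria):
    """Steps (b)/(e): score every document, return ids in descending score order."""
    scores = {}
    for doc_id, text in documents.items():
        tokens = tokenize(text)
        scores[doc_id] = sum(w for term, w in criteria.items() if term in tokens)
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def refine(documents, criteria, judge, subset_size=2, max_rounds=5):
    """Steps (c)-(f): review a subset, update the criteria, re-rank, repeat."""
    ranked = rank(documents, criteria)
    for _ in range(max_rounds):
        # Step (c): the reviewer judges only the top few documents.
        relevant = [d for d in ranked[:subset_size] if judge(d)]
        if not relevant:
            break
        # Step (d): add terms and adjust weights from the relevant documents
        # (hypothetical update rule: add the fraction of relevant documents
        # that contain the term).
        counts = Counter(t for d in relevant for t in tokenize(documents[d]))
        criteria = dict(criteria)
        for term, n in counts.items():
            criteria[term] = criteria.get(term, 0.0) + n / len(relevant)
        # Step (e): apply the modified criteria to the whole collection.
        new_ranked = rank(documents, criteria)
        if new_ranked == ranked:  # assumed proxy for "sufficiently accurate"
            break
        ranked = new_ranked
    return ranked, criteria
```

For example, starting from the single-term criterion `{"paris": 1.0}`, a document about flight and hotel booking that never mentions "paris" can move above an unrelated document once terms from the judged-relevant documents enter the criteria.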
-
-
7. A method for choosing documents of interest from a collection of documents, comprising:
-
(a) determining an initial selection criterion;
(b) applying the initial selection criterion to each document in the collection, to generate a rank-ordered list of documents;
(c) if further refinement of the list is desired, evaluating a subset of the documents on the list to determine whether each document in the subset is relevant;
(d) modifying the selection criteria by at least one of:
adjusting weights assigned to each element of the selection criteria in the prior iteration, removing elements of the selection criteria in the prior iteration, and adding additional elements to the selection criteria, based upon features of the documents determined not to be relevant as well as features of the documents determined to be relevant;
(e) applying the modified selection criterion to each document in the collection, to generate a new rank-ordered list of documents;
(f) repeating the steps of (c), (d), and (e) until classification is sufficiently accurate for use, wherein modifying the selection criteria at step (d) comprises:
(1) giving each term found in the collection of documents a score based upon how often the term occurs in documents determined to be relevant, compared to how often the term occurs in the collection of documents as a whole, and based upon how often the term occurs in documents determined to be irrelevant, compared to how often the term occurs in the collection of documents as a whole;
(2) choosing terms with the highest positive weights thus determined to be the terms in the selection criteria; and
(3) weighing the terms in the selection criteria according to the scores achieved in the above process, and the relative frequency of the terms in the collection.
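Sub-steps (1) through (3) can be illustrated with a short sketch. The claim does not state the scoring formula, so the one used here is an assumption: the fraction of relevant documents containing the term, minus the fraction of irrelevant documents containing it, divided by the fraction of the whole collection containing it. The final weighting by `log(1 + N / df)` and the `top_k` limit are likewise placeholders chosen for illustration.

```python
import math


def document_frequencies(docs):
    """Return per-document token sets and the number of documents containing each term."""
    tokens = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    df = {}
    for toks in tokens.values():
        for term in toks:
            df[term] = df.get(term, 0) + 1
    return tokens, df


def score_terms(docs, relevant_ids, irrelevant_ids):
    """Sub-step (1): hypothetical score contrasting judged documents with the collection."""
    tokens, df = document_frequencies(docs)
    n_docs = len(docs)
    scores = {}
    for term, count in df.items():
        rel = sum(term in tokens[d] for d in relevant_ids) / max(len(relevant_ids), 1)
        irr = sum(term in tokens[d] for d in irrelevant_ids) / max(len(irrelevant_ids), 1)
        scores[term] = (rel - irr) / (count / n_docs)
    return scores, df


def build_criteria(docs, relevant_ids, irrelevant_ids, top_k=3):
    scores, df = score_terms(docs, relevant_ids, irrelevant_ids)
    n_docs = len(docs)
    # Sub-step (2): keep the terms with the highest positive scores.
    chosen = sorted((t for t in scores if scores[t] > 0),
                    key=lambda t: (-scores[t], t))[:top_k]
    # Sub-step (3): weigh by score and by rarity in the collection (assumed form).
    return {t: scores[t] * math.log(1 + n_docs / df[t]) for t in chosen}
```

Terms that appear mostly in documents judged irrelevant receive negative scores under this assumed formula and therefore never enter the criteria.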
-
-
9. The method of claim 8, wherein the terms chosen at step b are the terms whose scores WT exceed an average score WT by two or more standard deviations.
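The two-standard-deviation selection rule can be sketched as below. The claim does not say whether a population or a sample standard deviation is meant; this sketch assumes the population form (`statistics.pstdev`), and the function name `select_outlier_terms` is hypothetical.

```python
import statistics


def select_outlier_terms(scores, num_sd=2.0):
    """Keep terms whose score exceeds the mean score by num_sd or more standard deviations."""
    values = list(scores.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        return {}  # all scores are identical, so no term stands out
    return {term: w for term, w in scores.items() if w - mean >= num_sd * sd}
```

With twenty terms scoring 1.0 and one term scoring 10.0, only the high-scoring term clears the threshold.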
-
10. The method of claim 9, wherein weights ST assigned to terms at step c are determined by a formula:
-
11. The method of claim 10, wherein K3 is 0.5, and K4 is 1.0.
-
12. The method of claim 10, wherein in applying the modified selection criterion to each document in the collection, to generate a new rank-ordered list of documents, documents are ranked in order of their scores SD, where:
-
13. The method of claim 12, wherein K1 is 0.5, and K2 is 1.5.
-
14. A method for identifying documents in a collection as having a particular characteristic, comprising:
-
(a) choosing an initial list of documents from among the documents in the collection;
(b) evaluating a subset of the documents on the list to determine whether each document in the subset has the characteristic;
(c) modifying the selection criteria by at least one of:
adjusting the weights assigned to each element of the selection criteria in the prior iteration, removing elements of the selection criteria in the prior iteration, and adding additional elements to the criteria, based upon features of the documents determined to have the characteristic, and based upon features of the documents determined not to have the characteristic;
(d) applying the modified selection criterion to each document in the initial list of documents, to generate a new rank-ordered list of documents;
(e) repeating the steps of (b), (c), and (d) until the classification is sufficiently accurate;
(f) choosing a cutoff score to be applied;
(g) concluding that all documents in the collection with scores above the cutoff score have the characteristic.
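Steps (f) and (g) can be sketched as follows. The claim does not say how the cutoff is chosen; as an assumption, this sketch picks the candidate cutoff that classifies the most human-judged documents correctly, then labels every document scoring above it. The names `choose_cutoff` and `classify_by_cutoff` are hypothetical.

```python
def choose_cutoff(scores, labels):
    """Step (f): pick the cutoff that agrees best with the judged documents.

    scores maps document id -> final score; labels maps judged document id -> bool.
    """
    judged = sorted({scores[d] for d in labels})
    # Also try a cutoff below every judged score, so "all positive" is possible.
    candidates = [judged[0] - 1.0] + judged

    def correct(cutoff):
        return sum((scores[d] > cutoff) == labels[d] for d in labels)

    # max() returns the first best candidate, i.e. the lowest cutoff among ties.
    return max(candidates, key=correct)


def classify_by_cutoff(scores, cutoff):
    """Step (g): every document scoring above the cutoff is taken to have the characteristic."""
    return {doc_id for doc_id, score in scores.items() if score > cutoff}
```

Documents that were never judged by hand are classified purely by their score relative to the chosen cutoff.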
-
-
15. A method for identifying documents in a collection as having a particular characteristic, comprising:
-
(a) choosing an initial list of documents from among the documents in the collection;
(b) evaluating a subset of the documents on the list to determine whether each document in the subset has the characteristic;
(c) modifying selection criteria by at least one of:
adjusting weights assigned to each element of the selection criteria in the prior iteration, removing elements of the selection criteria in the prior iteration, and adding additional elements to the criteria, based upon features of the documents determined to have the characteristic, and based upon features of the documents determined not to have the characteristic;
(d) applying the modified selection criterion to each document in the initial list of documents, to generate a new rank-ordered list of documents;
(e) repeating the steps of (b), (c), and (d) until classification is sufficiently accurate;
(f) choosing a cutoff score to be applied;
(g) concluding that all documents in the collection with scores above the cutoff score have the characteristic, wherein modifying the selection criteria at step (c) comprises:
(1) giving each term found in the subset of documents a score based upon how often the term occurs in documents determined to have the characteristic, compared to how often the term occurs in the subset of documents as a whole, and based upon how often the term occurs in documents determined not to have the characteristic, compared to how often the term occurs in the subset of documents as a whole;
(2) choosing terms with the highest positive weights thus determined to be the terms in the selection criteria; and
(3) weighing the terms in the selection criteria according to the scores achieved in the above process, and their relative frequency in the subset.
-
-
17. The method of claim 16, wherein the terms chosen at step b are the terms whose scores WT exceed an average score WT by two or more standard deviations.
-
18. The method of claim 17, wherein weights ST assigned to the terms at step c are determined by a formula:
-
19. The method of claim 18, wherein K3 is 0.5, and K4 is 1.0.
-
20. The method of claim 18, wherein in applying the modified selection criterion to each document in the subset, to generate a new rank-ordered list of documents, documents are ranked in order of their scores SD, where:
-
21. The method of claim 20, wherein K1 is 0.5, and K2 is 1.5.
-
22. The method of claim 20, where the documents are Web pages.
-
23. The method of claim 20, where the documents are Web sites.
-
24. The method of claim 23, where the particular characteristic is being an electronic commerce site.
-
25. A device for choosing documents of interest from a collection of documents, comprising:
-
(a) means for determining an initial selection criterion;
(b) means for applying the initial selection criterion to each document in the collection, to generate a rank-ordered list of documents;
(c) means for evaluating a subset of the documents on the list to determine whether each document in the subset is relevant, in response to further refinement of the list being desired;
(d) means for modifying the selection criteria by at least one of:
adjusting the weights assigned to each element of the selection criteria in the prior iteration and adding additional elements to the criteria, based upon features of the documents determined to be relevant;
(e) means for applying the modified selection criterion to each document in the collection, to generate a new rank-ordered list of documents;
(f) means for repeating the steps of (c), (d), and (e) until the classification is sufficiently accurate for use.
-
-
31. A device for choosing documents of interest from a collection of documents, comprising:
-
(a) means for determining an initial selection criterion;
(b) means for applying the initial selection criterion to each document in the collection, to generate a rank-ordered list of documents;
(c) means for evaluating a subset of the documents on the list to determine whether each document in the subset is relevant, in response to further refinement of the list being desired;
(d) means for modifying the selection criteria by at least one of means for adjusting weights assigned to each element of the selection criteria in the prior iteration, means for removing elements of the selection criteria in the prior iteration, and means for adding additional elements to the criteria, based upon features of the documents determined not to be relevant as well as features of the documents determined to be relevant;
(e) means for applying the modified selection criterion to each document in the collection, to generate a new rank-ordered list of documents;
(f) means for repeating operations of said evaluating means, said modifying means, and said applying means, in the order named, until classification is sufficiently accurate for use, wherein the means for modifying the selection criteria at step (d) comprises:
(1) means for giving each term found in the collection of documents a score based upon how often the term occurs in documents determined to be relevant, compared to how often the term occurs in the collection of documents as a whole, and based upon how often the term occurs in documents determined to be irrelevant, compared to how often the term occurs in the collection of documents as a whole;
(2) means for choosing terms with the highest positive weights thus determined to be the terms in the selection criteria; and
(3) means for weighing the terms in the selection criteria according to the scores achieved in the above process, and the relative frequency of the terms in the collection.
-
-
33. The device of claim 32, wherein the terms chosen at step b are the terms whose scores WT exceed an average score WT by two or more standard deviations.
-
34. The device of claim 33, wherein weights ST assigned to terms at step c are determined by a formula:
-
35. The device of claim 34, wherein K3 is 0.5, and K4 is 1.0.
-
36. The device of claim 34, wherein in applying the modified selection criterion to each document in the collection, to generate a new rank-ordered list of documents, documents are ranked in order of their scores SD, where:
-
37. The device of claim 36, wherein K1 is 0.5, and K2 is 1.5.
-
38. A device for identifying documents in a collection as having a particular characteristic, comprising:
-
(a) means for choosing an initial list of documents from among the documents in the collection;
(b) means for evaluating a subset of the documents on the list to determine whether each document in the subset has the characteristic;
(c) means for modifying the selection criteria by at least one of:
means for adjusting the weights assigned to each element of the selection criteria in the prior iteration, means for removing elements of the selection criteria in the prior iteration, and means for adding additional elements to the criteria, based upon features of the documents determined to have the characteristic, and based upon features of the documents determined not to have the characteristic;
(d) means for applying the modified selection criterion to each document in the initial list of documents, to generate a new rank-ordered list of documents;
(e) means for repeating the steps of (b), (c), and (d) until the classification is sufficiently accurate;
(f) means for choosing a cutoff score to be applied;
(g) means for concluding that all documents in the collection with scores above the cutoff score have the characteristic.
-
-
39. A device for identifying documents in a collection as having a particular characteristic, comprising:
-
(a) means for choosing an initial list of documents from among the documents in the collection;
(b) means for evaluating a subset of the documents on the list to determine whether each document in the subset has the characteristic;
(c) means for modifying the selection criteria by at least one of:
means for adjusting weights assigned to each element of the selection criteria in the prior iteration, means for removing elements of the selection criteria in the prior iteration, and means for adding additional elements to the criteria, based upon features of the documents determined to have the characteristic, and based upon features of the documents determined not to have the characteristic;
(d) means for applying the modified selection criterion to each document in the initial list of documents, to generate a new rank-ordered list of documents;
(e) means for repeating operations of said evaluating means, said modifying means, and said applying means, in the order named, until the classification is sufficiently accurate;
(f) means for choosing a cutoff score to be applied;
(g) means for concluding that all documents in the collection with scores above the cutoff score have the characteristic, wherein the means for modifying the selection criteria at said modifying means comprises:
(1) means for giving each term found in the subset of documents a score based upon how often the term occurs in documents determined to have the characteristic, compared to how often the term occurs in the subset of documents as a whole, and based upon how often the term occurs in documents determined not to have the characteristic, compared to how often the term occurs in the subset of documents as a whole;
(2) means for choosing terms with the highest positive weights thus determined to be the terms in the selection criteria; and
(3) means for weighing the terms in the selection criteria according to the scores achieved in the above process and their frequency in the subset.
-
-
41. The device of claim 40, wherein the terms chosen at step b are the terms whose scores WT exceed an average score WT by two or more standard deviations.
-
42. The device of claim 41, wherein weights ST assigned to the terms at step c are determined by a formula:
-
ST = WT * IDFT, where:
IDFT = log((N + K3) / NT) / log(N + K4), where:
N is a number of documents in the subset, NT is a number of documents containing the term T in the subset, and K3 and K4 are constants.
-
-
43. The device of claim 42, wherein K3 is 0.5, and K4 is 1.0.
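The weighting formula of claim 42, with the constants of claim 43 as defaults, can be computed directly. The function names `idf` and `term_weight` are hypothetical; the logarithm base does not affect the result, because IDFT is a ratio of two logarithms.

```python
import math


def idf(n_docs, n_with_term, k3=0.5, k4=1.0):
    """IDFT = log((N + K3) / NT) / log(N + K4)."""
    if n_with_term < 1 or n_with_term > n_docs:
        raise ValueError("n_with_term must be between 1 and n_docs")
    return math.log((n_docs + k3) / n_with_term) / math.log(n_docs + k4)


def term_weight(w_t, n_docs, n_with_term, k3=0.5, k4=1.0):
    """ST = WT * IDFT."""
    return w_t * idf(n_docs, n_with_term, k3, k4)
```

With K3 = 0.5 and K4 = 1.0, IDFT stays strictly between 0 and 1: it approaches 1 for a term found in a single document and approaches 0 for a term found in every document, so rare terms keep most of their score WT while ubiquitous terms are discounted.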
-
44. The device of claim 42, wherein in applying the modified selection criterion to each document in the subset, to generate a new rank-ordered list of documents, documents are ranked in order of their scores SD, where:
-
45. The device of claim 44, wherein K1 is 0.5, and K2 is 1.5.
-
46. The device of claim 44, where the documents are Web pages.
-
47. The device of claim 44, where the documents are Web sites.
-
48. The device of claim 47, where the particular characteristic is being an electronic commerce site.
Specification