Semi-automatic index term augmentation in document retrieval
Abstract
Disclosed are methods and systems for indexing or retrieving materials accessible through computer networks.
44 Claims
-
1. A method for assigning index terms to a document Di in a collection of documents, where other documents in the collection have previously had index terms assigned by another method, comprising:
-
(a) selecting a term Ij from among a set of terms from which the index terms are being assigned, which term Ij has not yet been processed,
(b) calculating a likelihood function for the document Di and a document Dk in the collection to which the term Ij has previously been assigned as an index term by another method, which likelihood function is based upon the likelihood that a term occurring in the document Di also occurs in the document Dk,
(c) repeating step (b) for a plurality of other documents Dk in the collection to which the term Ij has previously been assigned as an index term by another method,
(d) calculating a total score for the Document Di for the Index Term Ij, which total score is based upon the likelihood functions for the document Di and the documents Dk in the collection to which the term Ij has previously been assigned as an index term by another method,
(e) repeating steps (a)-(d) for a plurality of other terms Ij from among the set of terms from which index terms are being assigned, and
(f) choosing index terms to be assigned to Document Di, from among the set of terms Ij from which index terms are being assigned, based upon the total scores calculated for the Document Di for the Index Terms Ij.
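Steps (a)-(f) of claim 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the claimed implementation: the claim does not fix a particular likelihood function, so likelihood() below is a hypothetical stand-in (the fraction of distinct terms of Di that also occur in Dk), the total score is assumed to be a plain sum, and the names score_terms and assign_index_terms are invented for the example.

```python
def likelihood(doc_i, doc_k):
    """Hypothetical stand-in for the likelihood function of step (b):
    the fraction of distinct terms in doc_i that also occur in doc_k."""
    terms_i = set(doc_i)
    if not terms_i:
        return 0.0
    return len(terms_i & set(doc_k)) / len(terms_i)


def score_terms(doc_i, indexed_docs, candidate_terms):
    """Steps (a)-(e): compute one total score per candidate index term.

    indexed_docs is a list of (words, index_terms) pairs for documents
    that were already indexed by some other method.
    """
    scores = {}
    for term in candidate_terms:  # steps (a) and (e)
        # Documents Dk that already carry this index term.
        labelled = [words for words, labels in indexed_docs if term in labels]
        # Steps (b)-(c): one likelihood per Dk; step (d): combine them.
        # Summation is an assumption; the claim only says "based upon".
        scores[term] = sum(likelihood(doc_i, doc_k) for doc_k in labelled)
    return scores


def assign_index_terms(doc_i, indexed_docs, candidate_terms):
    """Step (f), using the simplest policy: keep the top-scoring term."""
    scores = score_terms(doc_i, indexed_docs, candidate_terms)
    if not scores:
        return [], scores
    best = max(scores, key=scores.get)
    return [best], scores


indexed_docs = [
    (["stock", "market", "shares"], {"finance"}),
    (["bank", "loan", "market"], {"finance"}),
    (["goal", "match", "league"], {"sports"}),
]
chosen, scores = assign_index_terms(
    ["market", "shares", "bank"], indexed_docs, ["finance", "sports"]
)
print(chosen, scores)
```

In this toy collection the new document shares two of its three terms with each finance document and none with the sports document, so "finance" receives the higher total score.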
-
2. [Claim text and equation not reproduced.]
-
3. The method of claim 2, wherein the total score T (Di, Ij) for the Document Di for the Index Term Ij is:
-
where K0=the number of Documents in the collection assigned Index Term Ij by the other method, W(Dk, Ij)=a weight assigned to Index Term Ij for Document Dk.
-
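The equation for T (Di, Ij) does not appear in this text, but the definitions of K0 and W(Dk, Ij) suggest a combination of the per-document likelihoods weighted by W. The sketch below assumes, without confirmation from the source, a weighted sum over the K0 documents Dk; the function name total_score and the sample numbers are hypothetical.

```python
def total_score(likelihoods, weights):
    """Assumed form of the total score:
    T(Di, Ij) = sum over k = 1..K0 of W(Dk, Ij) * L(Di, Dk),
    where likelihoods[k] is L(Di, Dk) and weights[k] is W(Dk, Ij)."""
    if len(likelihoods) != len(weights):
        raise ValueError("need exactly one weight per document Dk")
    return sum(w * l for w, l in zip(weights, likelihoods))


# Three documents Dk already carry index term Ij, so K0 = 3.
likelihoods = [0.5, 0.25, 0.75]
weighted = total_score(likelihoods, [1.0, 0.5, 2.0])
unit = total_score(likelihoods, [1.0, 1.0, 1.0])
print(weighted, unit)
```

With every weight set to 1.0, as in claim 7, the assumed score reduces to the plain sum of the likelihoods.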
-
4. The method of claim 3, wherein the documents are Web pages.
-
5. The method of claim 3, wherein the documents are Web sites.
-
6. The method of claim 3, wherein the documents in the collection which have previously had index terms assigned to them by another method have had one and only one index term assigned.
-
7. The method of claim 6, wherein the weight W(Dk, Ij) assigned to the index term Ij previously assigned to a document Dk by another method equals 1.0.
-
8. The method of claim 7, wherein the term Ij whose total score T (Di, Ij) calculated for the Document Di is the highest is assigned as an index term for the Document Di.
-
9. The method of claim 3, wherein a fixed number N of terms Ij whose total scores T (Di, Ij) calculated for the Document Di are the highest are assigned as index terms for the Document Di.
-
10. The method of claim 9, wherein the fixed number N is equal to 1.
-
11. The method of claim 3, wherein all terms Ij whose total scores T (Di, Ij) calculated for the Document Di exceed a fixed cutoff score are assigned as index terms for the Document Di.
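Claims 8 to 11 describe two ways of turning total scores into assigned index terms: keep a fixed number N of the highest-scoring terms (with N equal to 1 giving the single best term), or keep every term whose score exceeds a fixed cutoff. A minimal sketch of both policies, assuming the scores are held in a dictionary keyed by term; the helper names top_n and above_cutoff are hypothetical.

```python
def top_n(scores, n=1):
    """Keep the n terms with the highest total scores (claims 8-10)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:n]]


def above_cutoff(scores, cutoff):
    """Keep every term whose total score exceeds the cutoff (claim 11)."""
    return [term for term, score in scores.items() if score > cutoff]


scores = {"finance": 1.5, "sports": 0.25, "politics": 0.75}
print(top_n(scores, 1))
print(top_n(scores, 2))
print(above_cutoff(scores, 0.5))
```

The fixed-N policy always assigns something, even when every score is low, whereas the cutoff policy can assign no terms at all or many terms, depending on how the scores are distributed.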
-
12. A method for assigning index terms to documents in a collection of documents, comprising:
-
(a) manually pre-assigning index terms to a subset of the documents in the collection,
(b) selecting a document Di from among the documents in the collection to which index terms have not yet been assigned, which document Di has not yet been processed,
(c) selecting a term Ij from among a set of terms from which index terms are being assigned, which term Ij has not yet been processed,
(d) calculating a likelihood function for the document Di and a document Dk in the collection to which the term Ij has previously been assigned as an index term manually, which likelihood function is based upon the likelihood that a term occurring in the document Di also occurs in the document Dk,
(e) repeating step (d) for a plurality of other documents Dk in the collection to which the term Ij has previously been assigned as an index term manually,
(f) calculating a total score for the Document Di for the Index Term Ij, which total score is based upon the likelihood functions for the Document Di and the Documents Dk in the collection to which the term Ij has previously been assigned as an index term manually,
(g) repeating steps (c)-(f) for a plurality of other terms Ij from among the set of terms from which index terms are being assigned,
(h) choosing index terms to be assigned to Document Di, from among the set of terms Ij from which index terms are to be assigned, based upon the total scores calculated for the Document Di for the Index Terms Ij, and
(i) repeating steps (b)-(h) for a plurality of other documents in the collection to which index terms have not yet been assigned which have not yet been processed.
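Claim 12 wraps the per-document scoring in an outer loop over every document that has no manually assigned index terms. The sketch below illustrates that loop only. It assumes a weighted sum for the total score and takes the likelihood function as a parameter, since neither is fixed by the claim text; augment_index, overlap_likelihood, and the data layout are all hypothetical.

```python
def overlap_likelihood(doc_i, doc_k):
    """Hypothetical likelihood: share of doc_i's distinct terms found in doc_k."""
    terms_i = set(doc_i)
    return len(terms_i & set(doc_k)) / len(terms_i) if terms_i else 0.0


def augment_index(documents, manual_index, candidate_terms, likelihood, n=1):
    """Assign index terms to every document lacking a manual assignment.

    documents:    dict of doc_id -> list of words
    manual_index: dict of doc_id -> {index_term: weight}, the result of
                  step (a), which is performed by hand
    """
    assigned = {}
    for doc_id, words in documents.items():  # steps (b) and (i)
        if doc_id in manual_index:
            continue  # already indexed manually
        scores = {}
        for term in candidate_terms:  # steps (c) and (g)
            # Steps (d)-(f): only manually indexed documents serve as Dk.
            scores[term] = sum(
                weights[term] * likelihood(words, documents[k])
                for k, weights in manual_index.items()
                if term in weights
            )
        ranked = sorted(scores, key=scores.get, reverse=True)
        assigned[doc_id] = ranked[:n]  # step (h), top-n policy
    return assigned


documents = {
    "d1": ["stock", "market", "shares"],
    "d2": ["goal", "match", "league"],
    "d3": ["market", "shares", "bank"],
    "d4": ["league", "goal", "coach"],
}
manual_index = {"d1": {"finance": 1.0}, "d2": {"sports": 1.0}}
result = augment_index(
    documents, manual_index, ["finance", "sports"], overlap_likelihood
)
print(result)
```

Newly assigned terms are deliberately not fed back into manual_index here, which matches the claim wording that the documents Dk are those indexed manually.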
-
13. [Claim text and equation not reproduced.]
-
14. The method of claim 13, wherein the total score T (Di, Ij) for the Document Di for the Index Term Ij is:
-
where K0=the number of Documents in the collection assigned Index Term Ij manually, W(Dk, Ij)=a weight assigned to Index Term Ij for Document Dk.
-
-
15. The method of claim 14, wherein a fixed number N of terms Ij whose total scores T (Di, Ij) calculated for the Document Di are the highest are assigned as index terms for the Document Di.
-
16. The method of claim 15, wherein the fixed number N is equal to 1.
-
17. The method of claim 14, wherein all terms Ij whose total scores T (Di, Ij) calculated for the Document Di exceed a fixed cutoff score are assigned as index terms for the Document Di.
-
18. The method of claim 14, wherein the documents are Web pages.
-
19. The method of claim 14, wherein the documents are Web sites.
-
20. The method of claim 14, wherein one and only one index term is assigned to the documents in the collection to which index terms are assigned manually.
-
21. The method of claim 20, wherein the weight W(Dk, Ij) assigned to the index term Ij previously assigned to a document Dk manually equals 1.0.
-
22. The method of claim 18, wherein the term Ij whose total score T (Di, Ij) calculated for the Document Di is the highest is assigned as an index term for the Document Di.
-
23. A device for assigning index terms to a document Di in a collection of documents, where other documents in the collection have previously had index terms assigned by another method, comprising:
-
(a) means for selecting a term Ij from among a set of terms from which the index terms are being assigned, which has not yet been processed,
(b) means for calculating a likelihood function for the document Di and a document Dk in the collection to which the term Ij has previously been assigned as an index term by another method, which likelihood function is based upon the likelihood that a term occurring in the document Di also occurs in the document Dk,
(c) means for repeating step (b) for a plurality of other documents Dk in the collection to which the term Ij has previously been assigned as an index term by another method,
(d) means for calculating a total score for the Document Di for the Index Term Ij, which total score is based upon the likelihood functions for the document Di and the documents Dk in the collection to which the term Ij has previously been assigned as an index term by another method,
(e) means for repeating steps (a)-(d) for a plurality of other terms Ij from among the set of terms from which index terms are being assigned, and
(f) means for choosing index terms to be assigned to Document Di, from among the set of terms from which index terms are being assigned, based upon the total scores calculated for the Document Di for the Index Terms Ij.
-
24. [Claim text and equation not reproduced.]
-
25. The device of claim 24, wherein the total score T (Di, Ij) for the Document Di for the Index Term Ij is:
-
where K0=the number of Documents in the collection assigned Index Term Ij by the other method, W(Dk, Ij)=a weight assigned to Index Term Ij for Document Dk.
-
-
26. The device of claim 25, wherein the documents are Web pages.
-
27. The device of claim 25, wherein the documents are Web sites.
-
28. The device of claim 25, wherein the documents in the collection which have previously had index terms assigned to them by another method, have had one and only one index term assigned.
-
29. The device of claim 28, wherein the weight W(Dk, Ij) assigned to the index term Ij previously assigned to a document Dk by another method equals 1.0.
-
30. The device of claim 29, wherein the term Ij whose total score T (Di, Ij) calculated for the Document Di is the highest is assigned as an index term for the Document Di.
-
31. The device of claim 25, wherein a fixed number N of terms Ij whose total scores T (Di, Ij) calculated for the Document Di are the highest are assigned as index terms for the Document Di.
-
32. The device of claim 31, wherein the fixed number N is equal to 1.
-
33. The device of claim 25, wherein all terms Ij whose total scores T (Di, Ij) calculated for the Document Di exceed a fixed cutoff score are assigned as index terms for the Document Di.
-
34. A device for assigning index terms to documents in a collection of documents, comprising:
-
(a) means for manually pre-assigning index terms to a subset of the documents in the collection,
(b) means for selecting a document Di from among the documents in the collection to which index terms have not yet been assigned, which document Di has not yet been processed,
(c) means for selecting a term Ij from among a set of terms from which index terms are being assigned, which term Ij has not yet been processed,
(d) means for calculating a likelihood function for the document Di and a document Dk in the collection to which the term Ij has previously been assigned as an index term manually, which likelihood function is based upon the likelihood that a term occurring in the document Di also occurs in the document Dk,
(e) means for repeating step (d) for a plurality of other documents Dk in the collection to which the term Ij has previously been assigned as an index term manually,
(f) means for calculating a total score for the Document Di for the Index Term Ij, which total score is based upon the likelihood functions for the Document Di and the Documents Dk in the collection to which the term Ij has previously been assigned as an index term manually,
(g) means for repeating steps (c)-(f) for a plurality of other terms Ij from among the set of terms from which index terms are being assigned,
(h) means for choosing index terms to be assigned to Document Di, from among the set of terms from which index terms are to be assigned, based upon the total scores calculated for the Document Di for the Index Terms Ij, and
(i) means for repeating steps (b)-(h) for a plurality of other documents in the collection to which index terms have not yet been assigned which have not yet been processed.
-
35. [Claim text and equation not reproduced.]
-
36. The device of claim 35, wherein the total score T (Di, Ij) for the Document Di for the Index Term Ij is:
-
where K0=the number of Documents in the collection assigned Index Term Ij manually, W(Dk, Ij)=a weight assigned to Index Term Ij for Document Dk.
-
-
37. The device of claim 36, wherein the documents are Web pages.
-
38. The device of claim 36, wherein the documents are Web sites.
-
39. The device of claim 36, wherein one and only one index term is assigned to the documents in the collection to which index terms are assigned manually.
-
40. The device of claim 39, wherein the weight W(Dk, Ij) assigned to the index term Ij previously assigned to a document Dk manually equals 1.0.
-
41. The device of claim 40, wherein the term Ij whose total score T (Di, Ij) calculated for the Document Di is the highest is assigned as an index term for the Document Di.
-
42. The device of claim 36, wherein a fixed number N of terms Ij whose total scores T (Di, Ij) calculated for the Document Di are the highest are assigned as index terms for the Document Di.
-
43. The device of claim 42, wherein the fixed number N is equal to 1.
-
44. The device of claim 36, wherein all terms Ij whose total scores T (Di, Ij) calculated for the Document Di exceed a fixed cutoff score are assigned as index terms for the Document Di.
Specification