Generalized term frequency scores in information retrieval systems
Abstract
Disclosed are methods and systems for selecting electronic documents, such as Web pages or sites, from among documents in a collection, based upon the occurrence of selected terms in segments of the documents. The method may be applied where index terms have previously been assigned to the documents. The method may be used to select supercategories of banner advertisements from which to choose an advertisement to display for a user.
96 Citations
80 Claims
-
1. A method for selecting documents which may be of interest from among documents in a collection, comprising:
(a) choosing terms to be used in selecting documents which may be of interest,
(b) dividing a plurality of documents D in the collection into S0 segments,
(c) determining, for the plurality of documents D in the collection, which of the terms chosen to be used in selecting documents are found in each segment Si of the document D,
(d) calculating, for the plurality of documents D in the collection, a generalized term frequency score SD:
$$S_D = \sum_{S_i=1}^{S_0} \sum_{T=1}^{T_0} TF_{STD}$$
where SD is the total score for the document D, T0 is the number of terms selected to be used in the search, S0 is the number of segments in the document D, and TFSTD is the score for document D based on the occurrence of term T in segment Si of document D, and
(e) selecting documents from among the documents in the collection based upon the scores SD achieved by the documents.
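As an illustration only, the segment-by-segment scoring and selection recited in claim 1 might be sketched in Python as follows. The function names, the representation of a document as a list of term segments, and the `presence_score` stand-in for TFSTD are assumptions made for this sketch; the claim itself does not fix how TFSTD is computed.

```python
from typing import Callable, Dict, List, Sequence


def presence_score(term: str, segment: Sequence[str]) -> float:
    # Hypothetical stand-in for TF_STD: 1.0 if the term occurs in the segment.
    return 1.0 if term in segment else 0.0


def generalized_tf_score(
    segments: Sequence[Sequence[str]],
    terms: Sequence[str],
    tf_score: Callable[[str, Sequence[str]], float] = presence_score,
) -> float:
    # S_D: sum the per-(term, segment) score over all segments and all terms.
    return sum(tf_score(term, seg) for seg in segments for term in terms)


def select_documents(
    docs: Dict[str, List[List[str]]], terms: Sequence[str], top_n: int
) -> List[str]:
    # Score every document, then keep the top_n highest-scoring identifiers.
    # Ties are broken by identifier so the result is deterministic.
    scores = {doc_id: generalized_tf_score(segs, terms) for doc_id, segs in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


docs = {
    "a": [["web", "search"], ["search", "engine"]],
    "b": [["cooking"], ["recipes", "web"]],
    "c": [["music"]],
}
selected = select_documents(docs, ["web", "search"], top_n=2)
```

Any other per-segment scoring function can be passed as `tf_score` without changing the double sum.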
where SD is the total score for the document D, T0 is the number of terms which occur in the collection of terms included in the search, S0 is the number of segments in the document D, and TFSTD = Robertson's generalized term frequency for Term T in Segment Si of Document D
-
16. The method of claim 15, wherein K1=0.5, K2=1.5, K3=0.5, and K4=1.0.
-
17. The method of claim 15, wherein the weights WiD assigned to the ith segment of the documents in the collection are equal.
-
18. The method of claim 17, wherein the weights WSD assigned to the segments of a document D in the collection have the property that
$$\sum_{S_i=1}^{S_0} W_{SD} = 1.$$
-
19. The method of claim 17, wherein the weights WiD are selected specifically for the collection of documents from which documents are to be chosen by carrying out test searches with different weights, and selecting for use the weights which yield the most useful results.
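The weighting options in claims 17 through 19 (equal weights, weights summing to 1, and weights chosen by test searches) could be sketched roughly as below. This is a minimal sketch under stated assumptions: `evaluate` is a hypothetical caller-supplied function that returns a usefulness figure for a weight vector, and the sketch does not model how the weights enter the score, which this text does not reproduce.

```python
from typing import Callable, Iterable, List, Sequence


def equal_weights(n_segments: int) -> List[float]:
    # Equal weight for each of the S0 segments, summing to 1.
    return [1.0 / n_segments] * n_segments


def normalize(weights: Sequence[float]) -> List[float]:
    # Rescale a weight vector so that its entries sum to 1.
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def pick_weights(
    candidates: Iterable[Sequence[float]],
    evaluate: Callable[[List[float]], float],
) -> List[float]:
    # Try each candidate weight vector (normalized) in a test search and
    # keep the one that the hypothetical `evaluate` function rates highest.
    return max((normalize(c) for c in candidates), key=evaluate)
```

In practice `evaluate` would run a set of test searches against the collection and return a quality measure chosen by the implementer.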
-
20. The method of claim 1, wherein
(a) additional terms are assigned to each document D in the collection, beyond the terms that occur in the document D, based upon the characteristics of the document D, without regard to the occurrence of the additional terms in the document D, and
(b) the additional terms are placed in a segment Si of the document D with no other terms.
-
21. The method of claim 15, wherein
(a) additional terms are assigned to each document D in the collection, beyond the terms that occur in the document D, based upon the characteristics of the document D, without regard to the occurrence of the additional terms in the document D, and
(b) the additional terms are placed in a segment Si of the document D with no other terms.
-
22. The method of claim 21, wherein the additional terms are assigned automatically by
(a) creating a search query Q comprised of terms in document D;
(b) applying the search query Q to a collection of documents C0;
(c) selecting the N0 documents from the collection of documents C0 which achieve the highest scores upon application of the search query Q; and
(d) selecting IT terms for automatic assignment from among terms in the N0 documents based upon the co-occurrence of terms in the N0 documents with terms in the document D.
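Steps (a) through (d) above can be sketched as follows. This is an illustrative sketch, not the claimed formula: the co-occurrence measure used here (a count of the query terms that appear alongside a candidate term in each selected document) and the `overlap_score` ranking function are assumptions, since the claim's own co-occurrence expression is not reproduced in this text. The defaults `n0=50` and `it=30` follow the values recited in claims 32 and 38.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set


def overlap_score(query: Set[str], doc: Sequence[str]) -> float:
    # Hypothetical ranking function: number of query terms found in the doc.
    return float(len(query & set(doc)))


def assign_terms(
    doc_terms: Sequence[str],
    collection: Sequence[Sequence[str]],
    score: Callable[[Set[str], Sequence[str]], float] = overlap_score,
    n0: int = 50,
    it: int = 30,
) -> List[str]:
    query = set(doc_terms)  # (a) query built from the terms of document D
    # (b) apply the query to the collection; (c) keep the n0 best documents
    top_docs = sorted(collection, key=lambda d: score(query, d), reverse=True)[:n0]
    cooc: Counter = Counter()
    for d in top_docs:
        present = set(d)
        shared = len(present & query)
        for term in present - query:
            # Assumed co-occurrence measure: query terms sharing a doc with `term`.
            cooc[term] += shared
    # (d) select the `it` candidates with the highest co-occurrence
    ranked = sorted(cooc.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:it]]


collection = [
    ["web", "search", "engine"],
    ["web", "crawler", "search"],
    ["cooking", "recipes"],
    ["search", "ranking"],
]
assigned = assign_terms(["web", "search"], collection, n0=3, it=2)
```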
-
-
23. The method of claim 22, further comprising selecting the IT terms for automatic assignment by
(a) calculating, for terms Tk which occur in the N0 documents selected, the co-occurrence Cn (Tj,Tk) of the term Tk with terms Tj in document D:
-
24. The method of claim 23, wherein WSTD, the weight assigned to term T in segment Si of document D, is fD (Tk) for all terms T automatically assigned.
-
25. The method of claim 24, wherein WSTD, the weight assigned to term T in segment Si of document D, is 1.0 for all terms T which occur in the document D.
-
26. The method of claim 23, wherein the search query Q which is applied comprises all of the terms in document D.
-
27. The method of claim 23, wherein the search query Q which is applied comprises all of the terms in document D with preselected stop terms eliminated.
-
28. The method of claim 23, wherein the search query Q is applied to select documents from among the documents in the collection C0 by calculating for each document D in the collection C0 a score SD based upon the occurrence in the document D of terms in the search query Q.
-
29. The method of claim 28, wherein in applying the search query Q to the collection of documents C0 the total score SD for a document D in the collection C0 is
$$\sum_{T=1}^{T_0} TF_{TD} * IDF_T$$
where T0 is the number of terms in the search query Q, and TFTD is Robertson's term frequency for the Term T in the Document D
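The sum of TFTD * IDFT over the query terms can be illustrated with a short Python sketch. The raw-count `raw_tf` and the logarithmic `idf` below are placeholder assumptions: Robertson's term frequency and its constants K1 through K4 are not reproduced in this text, so the sketch shows only the shape of the sum, not the claimed term-frequency formula.

```python
import math
from typing import Callable, Sequence


def raw_tf(term: str, doc: Sequence[str]) -> float:
    # Placeholder for TF_TD: the number of occurrences of the term in the doc.
    return float(doc.count(term))


def idf(term: str, collection: Sequence[Sequence[str]]) -> float:
    # Assumed IDF_T: log(N / n_T), where n_T documents contain the term.
    n_t = sum(1 for d in collection if term in d)
    return math.log(len(collection) / n_t) if n_t else 0.0


def tf_idf_score(
    query_terms: Sequence[str],
    doc: Sequence[str],
    collection: Sequence[Sequence[str]],
    tf: Callable[[str, Sequence[str]], float] = raw_tf,
) -> float:
    # S_D: sum of TF_TD * IDF_T over the T0 terms of the search query.
    return sum(tf(t, doc) * idf(t, collection) for t in query_terms)


corpus = [["web", "search", "web"], ["search"], ["music"]]
score = tf_idf_score(["web", "search"], corpus[0], corpus)
```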
-
-
30. The method of claim 29, wherein K1 equals 0.5, K2 equals 1.5, K3 equals 0.5, and K4 equals 1.0.
-
31. The method of claim 23, wherein the number N0 of documents chosen by application of the search query Q is predetermined.
-
32. The method of claim 31, wherein the number N0 is 50.
-
33. The method of claim 23, wherein all documents whose scores upon application of the search query Q exceed a given cutoff score are selected.
-
34. The method of claim 23, wherein co-occurrences are calculated for all terms contained in the N0 documents selected.
-
35. The method of claim 23, wherein co-occurrences are calculated for all terms contained in the N0 documents selected, except that preselected stop terms are eliminated.
-
36. The method of claim 23, wherein δ=0.01.
-
37. The method of claim 23, wherein the number IT of terms automatically assigned is predetermined.
-
38. The method of claim 37, wherein the number IT is 30.
-
39. The method of claim 23, wherein all terms whose scores fD (Tk) exceed a given cutoff score are automatically assigned.
-
40. The method of claim 30, wherein the number N0 of documents chosen by application of the search query Q is 50, δ=0.01, and the number IT is 30.
-
41. A device for selecting documents which may be of interest from among documents in a collection, comprising:
(a) means for choosing terms to be used in selecting documents which may be of interest,
(b) means for dividing a plurality of documents D in the collection into S0 segments,
(c) means for determining which of the terms chosen to be used in selecting documents are found in each segment Si of a plurality of documents D in the collection,
(d) means for calculating a generalized term frequency score SD for a plurality of documents D in the collection:
$$S_D = \sum_{S_i=1}^{S_0} \sum_{T=1}^{T_0} TF_{STD}$$
where SD is the total score for the document D, T0 is the number of terms selected to be used in the search, S0 is the number of segments in the document D, and TFSTD is the score for document D based on the occurrence of term T in segment Si of document D, and
(e) means for selecting documents from among the documents in the collection based upon the scores SD achieved by the documents.
where SD is the total score for the document D, T0 is the number of terms which occur in the collection of terms included in the search, S0 is the number of segments in the document D, and TFSTD = Robertson's generalized term frequency for Term T in Segment Si of Document D
-
56. The device of claim 55, wherein K1=0.5, K2=1.5, K3=0.5, and K4=1.0.
-
57. The device of claim 55, wherein the weights WiD assigned to the ith segment of the documents in the collection are equal.
-
58. The device of claim 57, wherein the weights WSD assigned to the segments of a document D in the collection have the property that
$$\sum_{S_i=1}^{S_0} W_{SD} = 1.$$
-
59. The device of claim 57, wherein the weights WiD are selected specifically for the collection of documents from which documents are to be chosen by carrying out test searches with different weights, and selecting for use the weights which yield the most useful results.
-
60. The device of claim 41, further comprising
(a) means for assigning additional terms to each document D in the collection, beyond the terms that occur in the document D, based upon the characteristics of the document D, without regard to the occurrence of the additional terms in the document D, and
(b) means for placing the additional terms in a segment Si of the document D with no other terms.
-
61. The device of claim 55, further comprising
(a) means for assigning additional terms to each document D in the collection, beyond the terms that occur in the document D, based upon the characteristics of the document D, without regard to the occurrence of the additional terms in the document D, and
(b) means for placing the additional terms in a segment Si of the document D with no other terms.
-
62. The device of claim 61, wherein the means for assigning additional terms comprise:
(a) means for creating a search query Q comprised of terms in document D;
(b) means for applying the search query Q to a collection of documents C0;
(c) means for selecting the N0 documents from the collection of documents C0 which achieve the highest scores upon application of the search query Q; and
(d) means for selecting IT terms for automatic assignment from among terms in the N0 documents based upon the co-occurrence of the terms in the N0 documents with the terms in the document D.
-
-
63. The device of claim 62, wherein the means for selecting the IT terms for automatic assignment further comprise
(a) means for calculating, for terms Tk which occur in the N0 documents selected, the co-occurrence Cn (Tj,Tk) of the term Tk with terms Tj in document D:
-
64. The device of claim 63, wherein WSTD, the weight assigned to term T in segment Si of document D, is fD (Tn) for all terms T chosen to be index terms.
-
65. The device of claim 64, wherein WSTD, the weight assigned to term T in segment Si of document D, is 1.0 for all terms T which occur in the document D.
-
66. The device of claim 63, wherein the search query Q which is applied comprises all of the terms in document D.
-
67. The device of claim 63, wherein the search query Q which is applied comprises all of the terms in document D with preselected stop terms eliminated.
-
68. The device of claim 63, wherein the search query Q is applied to select documents from among the documents in the collection C0 by calculating for each document D in the collection C0 a score SD based upon the occurrence in the document D of terms in the search query Q.
-
69. The device of claim 68, wherein in applying the search query Q to the collection of documents C0 the total score SD for a document D in the collection C0 is
$$\sum_{T=1}^{T_0} TF_{TD} * IDF_T$$
where T0 is the number of terms in the search query Q, and TFTD is Robertson's term frequency for the Term T in the Document D
-
-
70. The device of claim 69, wherein K1 equals 0.5, K2 equals 1.5, K3 equals 0.5, and K4 equals 1.0.
-
71. The device of claim 63, wherein the number N0 of documents chosen by application of the search query Q is predetermined.
-
72. The device of claim 71, wherein the number N0 is 50.
-
73. The device of claim 63, wherein all documents whose scores upon application of the search query Q exceed a given cutoff score are selected.
-
74. The device of claim 63, wherein co-occurrences are calculated for all terms contained in the N0 documents selected.
-
75. The device of claim 63, wherein co-occurrences are calculated for all terms contained in the N0 documents selected, except that preselected stop terms are eliminated.
-
76. The device of claim 63, wherein δ=0.01.
-
77. The device of claim 63, wherein the number IT of terms automatically assigned is predetermined.
-
78. The device of claim 77, wherein the number IT is 30.
-
79. The device of claim 63, wherein all terms whose scores fD (Tk) exceed a given cutoff score are automatically assigned.
-
80. The device of claim 70, wherein the number N0 of documents chosen by application of the search query Q is 50, δ=0.01, and the number IT is 30.
Specification