Phrase-based indexing in an information retrieval system
Abstract
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.
20 Claims
1. A method of indexing documents in a document collection, each document having an associated identifier, the method comprising:
providing a list of phrases;
identifying, by operation of a processor adapted to manipulate data within a computer system, for a given document, phrases from the list of phrases that are present in the document;
for each identified phrase in the document:
identifying, by operation of a processor adapted to manipulate data within a computer system, a related phrase also present in the document, wherein for each phrase gj, gk is a related phrase of phrase gj where an information gain I of gk with respect to gj exceeds a predetermined threshold, the information gain I being a function of A(j,k) and E(j,k), where A(j,k) is a measure of an actual co-occurrence rate of gj and gk, and E(j,k) is an expected co-occurrence rate gj and gk; and
indexing, by operation of a processor adapted to manipulate data within a computer system, the document by storing the identifier of the document and an indication of each related phrase gk also present in the document, in a posting list of the identified phrase gj.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.

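The steps of claim 1 can be illustrated with a minimal sketch. It assumes the information gain is the ratio I = A(j,k) / E(j,k), which is only one possible function of A and E (the claim does not fix the form), that co-occurrence is measured at document level, and that phrases are matched by plain substring search. The function names, the input shapes and the threshold value are hypothetical and do not come from the claims.

```python
from collections import defaultdict


def information_gain(actual, expected):
    # Assumed form I = A / E; the claim only requires "a function of" A and E.
    return actual / expected if expected > 0 else 0.0


def related_phrases(phrase_docs, num_docs, threshold):
    """Map each phrase gj to the phrases gk whose gain exceeds the threshold.

    phrase_docs: dict of phrase -> set of ids of documents containing it
    (a hypothetical input, assumed to be computed beforehand).
    """
    related = defaultdict(set)
    for gj, docs_j in phrase_docs.items():
        for gk, docs_k in phrase_docs.items():
            if gj == gk:
                continue
            # A(j,k): observed fraction of documents containing both phrases.
            actual = len(docs_j & docs_k) / num_docs
            # E(j,k): fraction expected if the two phrases were independent.
            expected = (len(docs_j) / num_docs) * (len(docs_k) / num_docs)
            if information_gain(actual, expected) > threshold:
                related[gj].add(gk)
    return related


def index_document(doc_id, text, phrases, related, postings):
    """Store (doc_id, related phrases also present) in each phrase's posting list."""
    present = [p for p in phrases if p in text]  # naive substring matching
    for gj in present:
        co_present = sorted(gk for gk in present if gk in related.get(gj, ()))
        postings[gj].append((doc_id, co_present))
```

Here `postings` is expected to be a `defaultdict(list)` keyed by phrase. Each posting entry pairs the document identifier with the "indication of each related phrase gk also present"; a plain list of phrase strings is used for readability, although a compact encoding such as a bit vector would serve the same purpose.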
10. A method of indexing documents in a document collection, each document having an associated identifier, the method comprising:
providing a list of valid phrases, wherein each phrase on the list appears a minimum number of times in the document collection, and predicts at least one other phrase, wherein for each phrase gj, gk predicts gj where an information gain I of gk with respect to gj exceeds a predetermined threshold, the information gain I being a function of A(j,k) and E(j,k), where A(j,k) is a measure of an actual co-occurrence rate of gj and gk, and E(j,k) is an expected co-occurrence rate gj and gk;
accessing a plurality of documents in the document collection;
for each accessed document, identifying, by operation of a processor adapted to manipulate data within a computer system, phrases from the list of valid phrases that are present in the document; and
for each identified phrase in the document, indexing, by operation of a processor adapted to manipulate data within a computer system, the document by storing the identifier of the document in a posting list of the phrase.
Dependent claims: 11.

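The valid-phrase list of claim 10 combines two tests: a minimum occurrence count and the ability to predict at least one other phrase. The following sketch shows one way to apply both tests and then build posting lists that hold only document identifiers. It assumes the pairwise information gains have already been computed and are supplied as a dictionary keyed by `(gj, gk)`; that input shape, the function names and the substring matching are illustrative assumptions, not details taken from the claims.

```python
def valid_phrases(occurrences, gain, min_count, threshold):
    """Return phrases that are frequent enough and predict another phrase.

    occurrences: dict of phrase -> number of occurrences in the collection.
    gain: dict of (gj, gk) -> information gain I of gk with respect to gj
    (hypothetical precomputed input).
    """
    valid = []
    for gk, count in occurrences.items():
        if count < min_count:
            continue  # fails the minimum-occurrence test
        # Following the claim wording, gk predicts gj when I(gk w.r.t. gj) > threshold.
        predicts_other = any(
            value > threshold
            for (gj, other), value in gain.items()
            if other == gk and gj != gk
        )
        if predicts_other:
            valid.append(gk)
    return valid


def build_postings(documents, valid):
    """Store each document id in the posting list of every valid phrase it contains."""
    postings = {phrase: [] for phrase in valid}
    for doc_id, text in documents.items():
        for phrase in valid:
            if phrase in text:  # naive substring matching
                postings[phrase].append(doc_id)
    return postings
```

Unlike the posting lists of claim 1, these entries carry no related-phrase indication, which matches the narrower storing step recited in claim 10.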
12. A computer-implemented system for indexing documents in a document collection, each document having an associated identifier, comprising:
a phrase-based index comprising a list of phrases and a plurality of phrase posting lists, each phrase posting list associated with a phrase in the list of phrases; and
an indexing system adapted to:
identify for a given document, phrases from the list of phrases that are present in the document;
for each identified phrase in the document:
identify a related phrase also present in the document, wherein for each phrase gj, gk is a related phrase of phrase gj where an information gain I of gk with respect to gj exceeds a predetermined threshold, the information gain I being a function of A(j,k) and E(j,k), where A(j,k) is a measure of an actual co-occurrence rate of gj and gk, and E(j,k) is an expected co-occurrence rate gj and gk, and
index the document by storing the identifier of the document and an indication of each related phrase gk also present in the document, in the phrase-based index, within a posting list associated with the identified phrase.

13. A tangible computer readable storage medium storing a computer program executable by a processor for indexing documents in a document collection, each document having an associated identifier, the operations of the computer program comprising:
providing a list of phrases;
identifying for a given document, phrases from the list of phrases that are present in the document;
for each identified phrase in the document:
identifying a related phrase also present in the document, wherein for each phrase gj, gk is a related phrase of phrase gj where an information gain I of gk with respect to gj exceeds a predetermined threshold, the information gain I being a function of A(j,k) and E(j,k), where A(j,k) is a measure of an actual co-occurrence rate of gj and gk, and E(j,k) is an expected co-occurrence rate gj and gk, and
indexing the document by storing the identifier of the document and an indication of each related phrase also present in the document, in a posting list of the identified phrase.

14. A computer-implemented system for indexing documents in a document collection, each document having an associated identifier, comprising:
a phrase-based index comprising:
a list of valid phrases, each phrase on the list appearing a minimum number of times in the document collection, and predicting at least one other phrase, wherein for each phrase gj, gk predicts gj where an information gain I of gk with respect to gj exceeds a predetermined threshold, the information gain I being a function of A(j,k) and E(j,k), where A(j,k) is a measure of an actual co-occurrence rate of gj and gk, and E(j,k) is an expected co-occurrence rate gj and gk, and
a plurality of phrase posting lists, each phrase posting list associated with a phrase in the list of valid phrases; and
an indexing system adapted to:
access a plurality of documents in the document collection,
for each accessed document, identify phrases from the list of valid phrases that are present in the document, and
for each identified phrase in the document, store the identifier of the document in the phrase-based index, within a posting list of the identified phrase.

15. A tangible computer readable storage medium storing a computer program executable by a processor for indexing documents in a document collection, each document having an associated identifier, the operations of the computer program comprising:
providing a list of valid phrases, wherein each phrase on the list appears a minimum number of times in the document collection, and predicts at least one other phrase, wherein for each phrase gj, gk predicts gj where an information gain I of gk with respect to gj exceeds a predetermined threshold, the information gain I being a function of A(j,k) and E(j,k), where A(j,k) is a measure of an actual co-occurrence rate of gj and gk, and E(j,k) is an expected co-occurrence rate gj and gk;
accessing a plurality of documents in the document collection;
for each accessed document, identifying phrases from the list of valid phrases that are present in the document; and
for each identified phrase in the document, storing the identifier of the document in a posting list of the phrase.

16. A computer-implemented method of indexing a document in an index, the document having an associated identifier, the method comprising:
storing a list of phrases, including related phrases, wherein for each phrase gj, gk is a related phrase of phrase gj where an information gain I of gk with respect to gj exceeds a predetermined threshold, the information gain I being a function of A(j,k) and E(j,k), where A(j,k) is a measure of an actual co-occurrence rate of gj and gk, and E(j,k) is an expected co-occurrence rate gj and gk;
storing a plurality of phrase posting lists in the index, each phrase posting list associated with a phrase in the list of phrases;
identifying, by operation of a processor adapted to manipulate data within a computer system, a phrase in the list of phrases for which there is at least one instance in the document; and
storing an indication of a related phrase for which there is also an instance in the document, in the phrase posting list of the identified phrase.

17. A method of indexing documents in a document collection, each document having an associated identifier, the method comprising:
providing a list of phrases;
identifying, by operation of a processor adapted to manipulate data within a computer system, for a given document, phrases from the list of phrases that are present in the document;
for each identified phrase in the document:
identifying, by operation of a processor adapted to manipulate data within a computer system, a related phrase also present in the document;
indexing, by operation of a processor adapted to manipulate data within a computer system, the document by storing the identifier of the document and an indication of each related phrase also present in the document, in a posting list of the identified phrase;
determining, by operation of a processor adapted to manipulate data within a computer system, whether the identified phrase appears as anchor text for a hyperlink in the document pointing to a target document;
responsive to the identified phrase appearing as anchor text, determining a link score for the identified phrase based upon related phrases of the identified phrase; and
storing the link score in the posting list of the identified phrase and in association with the document.
Dependent claims: 18, 19.

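Claim 17 requires a link score "based upon related phrases of the identified phrase" but does not say how the score is computed. The sketch below uses a purely hypothetical scoring rule: the score is the number of the anchor phrase's related phrases that also appear in the target document. The scoring rule, the exact-match test on normalized anchor text, the nested-dictionary posting layout and all names are assumptions made for illustration only.

```python
def link_score(anchor_phrase, target_text, related):
    # Hypothetical rule: count related phrases of the anchor phrase
    # that are present in the target document's text.
    return sum(1 for gk in related.get(anchor_phrase, ()) if gk in target_text)


def store_link_scores(doc_id, links, phrases, related, targets, postings):
    """Record a link score for each listed phrase that appears as anchor text.

    links: list of (anchor_text, target_id) pairs found in the document.
    targets: dict of target_id -> text of the target document.
    postings: dict of phrase -> {doc_id -> {"link_score": int}} (assumed layout).
    """
    known = set(phrases)
    for anchor_text, target_id in links:
        gj = anchor_text.strip().lower()
        if gj not in known:
            continue  # anchor text is not a phrase on the list
        score = link_score(gj, targets.get(target_id, ""), related)
        # Stored in gj's posting list, in association with this document.
        postings.setdefault(gj, {}).setdefault(doc_id, {})["link_score"] = score
```

Storing the score under both the phrase and the document identifier mirrors the final step of the claim, which places the link score "in the posting list of the identified phrase and in association with the document".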
20. A computer-implemented system for indexing documents in a document collection, each document having an associated identifier, comprising:
a phrase-based index comprising a list of phrases and a plurality of phrase posting lists, each phrase posting list associated with a phrase in the list of phrases; and
an indexing system adapted to:
identify for a given document, phrases from the list of phrases that are present in the document;
for each identified phrase in the document:
identify a related phrase also present in the document,
index the document by storing the identifier of the document and an indication of each related phrase gk also present in the document, in the phrase-based index, within a posting list associated with the identified phrase,
determine whether the identified phrase appears as anchor text for a hyperlink in the document pointing to a target document;
responsive to the identified phrase appearing as anchor text, determine a link score for the identified phrase based upon related phrases of the identified phrase, and
store the link score in the posting list for the identified phrase and in association with the document.

Specification