Selection of atoms for search engine retrieval
Abstract
Methods are provided for populating search indexes with atoms identified in documents. Documents that are to be indexed are identified, and for each document, atoms are identified and are categorized as unigrams, n-grams, and n-tuples. A list of atom/document pairs is generated such that an information metric can be computed for each pair. An information metric represents a ranking of the atom in relation to the particular document. Based on the information metric, some atom/document pairs are discarded and others are indexed.
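The indexing steps summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `identify_atoms`, `atom_document_pairs`, and `select_pairs` are hypothetical, tokenization is a plain whitespace split, an n-tuple is assumed to be an unordered group of words that co-occur in a document, and the information metric is stood in for by a TF-IDF-like value because the text above does not define it.

```python
import math
from collections import Counter
from itertools import combinations


def identify_atoms(text, n=2):
    """Count the atoms of one document, keyed by (kind, terms)."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    atoms = Counter()
    for tok in tokens:
        atoms[("unigram", (tok,))] += 1
    # n-grams: runs of n adjacent words
    for i in range(len(tokens) - n + 1):
        atoms[("ngram", tuple(tokens[i:i + n]))] += 1
    # Assumption: n-tuples are unordered groups of n co-occurring words
    for combo in combinations(sorted(set(tokens)), n):
        atoms[("ntuple", combo)] += 1
    return atoms


def atom_document_pairs(docs):
    """Return (atom, doc_id, metric) for every atom found in every document."""
    per_doc = {doc_id: identify_atoms(text) for doc_id, text in docs.items()}
    # Number of documents each atom appears in
    doc_freq = Counter(atom for atoms in per_doc.values() for atom in atoms)
    n_docs = len(docs)
    pairs = []
    for doc_id, atoms in per_doc.items():
        for atom, count in atoms.items():
            # Stand-in metric (assumption): count weighted by rarity
            metric = count * math.log(n_docs / doc_freq[atom])
            pairs.append((atom, doc_id, metric))
    return pairs


def select_pairs(pairs, threshold):
    """Keep the pairs whose metric meets the threshold; the rest are discarded."""
    return [pair for pair in pairs if pair[2] >= threshold]


if __name__ == "__main__":
    docs = {"d1": "red apple pie", "d2": "green apple tart"}
    pairs = atom_document_pairs(docs)
    kept = select_pairs(pairs, threshold=0.5)  # threshold is illustrative
    print(len(pairs), "pairs generated,", len(kept), "kept for indexing")
```

In this toy corpus the unigram "apple" appears in both documents, so its stand-in metric is zero and both of its atom/document pairs fall below the threshold, while atoms unique to one document are kept.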
20 Claims
1. A method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising:
identifying a set of documents to be indexed in a search index;
for each document in the set of documents, identifying a plurality of atoms, the plurality of atoms comprising one or more unigrams, one or more n-grams, and one or more n-tuples;
based on the identified set of documents and the plurality of atoms, generating a list of atom/document pairs;
computing an information metric for each atom/document pair, wherein the information metric represents a pre-computed ranking of the atom used during a search query in relation to the particular document;
based on the information metric for each atom/document pair, selecting a subset of the atom/document pairs that are most relevant to the particular document from which the atoms were identified;
populating the search index using the subset of the atom/document pairs for the particular document, wherein identifying relevant documents for the search query from the search index is based on a pruning algorithm that computes a preliminary score for each of the documents to select a subset of the set of documents based on the preliminary score, wherein the preliminary score is computed using the information metric pre-computed for each atom/document pair and a simplified scoring function that approximates a final ranking algorithm utilized in identifying the relevant documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. One or more hardware computer-storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising:
identifying a plurality of atoms from a first document of a plurality of documents that are to be indexed;
classifying each of the plurality of atoms as one or more of a unigram, an n-gram, or an n-tuple;
computing an information metric for each of the plurality of atoms in relation to the first document, wherein the information metric for a first atom identified in the first document represents a pre-computed ranking used during a search query for the first atom as to how useful the first atom is in relation to the first document in resolving the search query having the first atom;
determining whether the information metric for each of the plurality of atoms meets a predetermined threshold, wherein the atoms that meet the predetermined threshold are those that are most relevant in relation to the first document;
discarding the atoms that do not meet the predetermined threshold; and
incorporating the atoms that meet the predetermined threshold in relation to the first document into the one or more search indexes, wherein identifying the first document from the one or more search indexes as relevant for the search query is based on a pruning algorithm that computes a preliminary score for the first document to select the first document based on the preliminary score, the first document is selected from a set of documents indexed in the one or more search indexes, wherein the preliminary score is computed using the information metric pre-computed for each atom/document pair and a simplified scoring function that approximates a final ranking algorithm utilized in identifying the first document as relevant for the search query.
Dependent claims: 10, 11, 12, 13, 14, 15, 16, 20
17. One or more hardware computer-storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising:
extracting a plurality of atoms from a document, the plurality of atoms comprising one or more unigrams, one or more n-grams, and one or more n-tuples;
for each of the plurality of atoms, calculating an information metric that represents a pre-computed ranking used in a search query for a particular atom in relation to the document;
determining an information metric threshold, wherein the atom/document pairs whose information metric meets or exceeds the information metric threshold are indexed;
discarding a portion of the atom/document pairs based on the information metrics, wherein the information metrics corresponding to the discarded atom/document pairs are below the information metric threshold;
populating the one or more search indexes by indexing the atom/document pairs whose information metrics meet or exceed the information metric threshold, wherein the unigrams, the n-grams, and the n-tuples are separately indexed each in a different search index; and
accessing the one or more search indexes to identify relevant documents for the atoms in a query, wherein identifying relevant documents is based at least in part on a pruning algorithm that computes a preliminary score for documents of the atom/document pairs, wherein the preliminary score is computed using the information metric and a simplified scoring function that approximates a final ranking algorithm utilized in identifying the relevant documents.
Dependent claims: 18, 19
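The query-time behavior recited in the independent claims (separate indexes per atom type, a preliminary score built from pre-computed information metrics, and pruning before final ranking) can be sketched as below. This is an illustration under assumptions, not the claimed implementation: the function names, the `keep` cutoff, and the choice of a plain sum as the simplified scoring function are all hypothetical, and `final_rank` stands in for whatever full ranking algorithm a real system would use.

```python
from collections import defaultdict


def build_indexes(pairs):
    """Build one index per atom kind from (kind, atom, doc_id, metric) tuples."""
    indexes = {
        "unigram": defaultdict(dict),
        "ngram": defaultdict(dict),
        "ntuple": defaultdict(dict),
    }
    for kind, atom, doc_id, metric in pairs:
        indexes[kind][atom][doc_id] = metric  # the metric is stored with the posting
    return indexes


def preliminary_scores(indexes, query_atoms):
    """Simplified scoring (assumption): sum the stored metrics of matching atoms."""
    scores = defaultdict(float)
    for kind, atom in query_atoms:
        for doc_id, metric in indexes[kind].get(atom, {}).items():
            scores[doc_id] += metric
    return scores


def search(indexes, query_atoms, final_rank, keep=2):
    """Prune to the `keep` best preliminary scores, then apply the final ranker."""
    prelim = preliminary_scores(indexes, query_atoms)
    # Sort by descending preliminary score, breaking ties by document id
    survivors = sorted(prelim, key=lambda d: (-prelim[d], d))[:keep]
    return sorted(survivors, key=lambda d: (-final_rank(d), d))


if __name__ == "__main__":
    pairs = [
        ("unigram", ("apple",), "d1", 0.2),
        ("unigram", ("apple",), "d2", 0.9),
        ("unigram", ("apple",), "d3", 0.5),
        ("ngram", ("apple", "pie"), "d1", 1.5),
        ("ntuple", ("apple", "pie"), "d3", 0.1),
    ]
    indexes = build_indexes(pairs)
    query = [("unigram", ("apple",)), ("ngram", ("apple", "pie"))]
    # Hypothetical final-ranker scores, only for demonstration
    final_scores = {"d1": 0.3, "d2": 0.8, "d3": 0.99}
    print(search(indexes, query, final_scores.get, keep=2))
```

Only documents that survive the preliminary cut reach the final ranker, so the expensive ranking runs on a small candidate set. The trade-off is that a document the final ranker would have favored can be pruned early if the simplified score approximates the final ranking poorly.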
Specification