Selection of atoms for search engine retrieval

US 9,342,582 B2
Filed: 03/10/2011
Issued: 05/17/2016
Est. Priority Date: 11/22/2010
Status: Active Grant


3 Assignments

0 Petitions

Abstract

Methods are provided for populating search indexes with atoms identified in documents. Documents that are to be indexed are identified, and for each document, atoms are identified and are categorized as unigrams, n-grams, and n-tuples. A list of atom/document pairs is generated such that an information metric can be computed for each pair. An information metric represents a ranking of the atom in relation to the particular document. Based on the information metric, some atom/document pairs are discarded and others are indexed.
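
The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `build_index`, the `keep_per_doc` parameter, and the use of relative frequency as the information metric are all assumptions made for the example. The abstract does not say how the metric is computed or how many pairs are kept.

```python
from collections import Counter, defaultdict

def build_index(doc_atoms, keep_per_doc=2):
    """Build a pruned inverted index.

    doc_atoms maps a document id to the list of atoms found in that document.
    Returns a dict mapping each indexed atom to a list of (doc_id, metric).
    """
    index = defaultdict(list)
    for doc_id, atoms in doc_atoms.items():
        counts = Counter(atoms)
        total = sum(counts.values())
        # Assumed stand-in for the information metric: the atom's
        # relative frequency within this document.
        scored = [(count / total, atom) for atom, count in counts.items()]
        # Highest metric first; ties broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[0], str(pair[1])))
        # Only the top pairs are indexed; the rest are discarded.
        for metric, atom in scored[:keep_per_doc]:
            index[atom].append((doc_id, metric))
    return dict(index)

doc_atoms = {
    "d1": ["search", "search", "engine", "index"],
    "d2": ["index", "index", "atom"],
}
index = build_index(doc_atoms, keep_per_doc=2)
```

In this toy run the atom "index" is dropped for `d1` (it loses the tie-break against "engine") but kept for `d2`, which shows that the selection is made per atom/document pair rather than per atom.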


20 Claims

1. A method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising:
- identifying a set of documents to be indexed in a search index;
  
  for each document in the set of documents, identifying a plurality of atoms, the plurality of atoms comprising one or more unigrams, one or more n-grams, and one or more n-tuples;
  
  based on the identified set of documents and the plurality of atoms, generating a list of atom/document pairs;
  
  computing an information metric for each atom/document pair, wherein the information metric represents a pre-computed ranking of the atom used during a search query in relation to the particular document;
  
  based on the information metric for each atom/document pair, selecting a subset of the atom/document pairs that are most relevant to the particular document from which the atoms were identified;
  
  populating the search index using the subset of the atom/document pairs for the particular document, wherein identifying relevant documents for the search query from the search index is based on a pruning algorithm that computes a preliminary score for each of the documents to select a subset of the set of documents based on the preliminary score, wherein the preliminary score is computed using the information metric pre-computed for each atom/document pair and a simplified scoring function that approximates a final ranking algorithm utilized in identifying the relevant documents.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The method of claim 1, wherein the search index comprises one or more search indexes, and wherein the one or more search indexes comprise a unigram index, an n-gram index, and a tuple index.
  - 3. The method of claim 1, wherein a unigram is a single word or symbol.
  - 4. The method of claim 1, wherein an n-gram is a sequence of consecutive or almost consecutive terms extracted from a particular document, wherein n is a quantity of consecutive terms.
  - 5. The method of claim 1, wherein an n-tuple is a set of terms that co-occur in a particular document, wherein an order of the set of terms is independent, and wherein n is a quantity of terms.
  - 6. The method of claim 1, wherein selecting a subset of the atom/document pairs that are most relevant to the particular document further comprises utilizing the pruning algorithm to prune a quantity of the atom/document pairs to a smaller quantity such that the atom/document pairs that are less relevant than other atom/document pairs are not indexed.
  - 7. The method of claim 1, wherein a machine-learning tool is used to compute the information metrics for the atom/document pairs and for selecting the subset of the atom/document pairs that are most relevant to the particular document from which the atoms were identified.
  - 8. The method of claim 1, further comprising:
    - receiving the search query;
      
      reformatting the search query into at least one of one or more unigrams, one or more n-grams, or one or more n-tuples; and
      
      accessing the search index to determine, using the reformatted search query, the most relevant documents for the search query.
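
The atom types defined in claims 3 to 5, and the query reformatting step of claim 8, can be illustrated with a short sketch. Everything here is an assumption for the sake of the example: the helper names, whitespace tokenization, and the choice to build n-tuples from every combination of distinct terms in the text. The sketch covers only strictly consecutive n-grams, not the "almost consecutive" case that claim 4 also allows.

```python
from itertools import combinations

def unigrams(tokens):
    # A unigram is a single word or symbol.
    return set(tokens)

def ngrams(tokens, n):
    # Ordered runs of n consecutive terms.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ntuples(tokens, n):
    # Unordered sets of n distinct terms that co-occur in the text.
    return {frozenset(c) for c in combinations(sorted(set(tokens)), n)}

def extract_atoms(text, n=2):
    """Split text on whitespace and return its atoms grouped by type."""
    tokens = text.lower().split()
    return {
        "unigram": unigrams(tokens),
        "ngram": ngrams(tokens, n),
        "ntuple": ntuples(tokens, n),
    }

doc = extract_atoms("fast search engine")
query = extract_atoms("engine search")

# Atoms of each type that the query and the document have in common.
shared = {kind: doc[kind] & query[kind] for kind in doc}
```

Because the query reverses the word order, it shares no n-gram with the document, but it does share the n-tuple {engine, search}. That is the practical difference between the ordered and unordered atom types.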

9. One or more hardware computer-storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising:
- identifying a plurality of atoms from a first document of a plurality of documents that are to be indexed;
  
  classifying each of the plurality of atoms as one or more of a unigram, an n-gram, or an n-tuple;
  
  computing an information metric for each of the plurality of atoms in relation to the first document, wherein the information metric for a first atom identified in the first document represents a pre-computed ranking used during a search query for the first atom as to how useful the first atom is in relation to the first document in resolving the search query having the first atom;
  
  determining whether the information metric for each of the plurality of atoms meets a predetermined threshold, wherein the atoms that meet the predetermined threshold are those that are most relevant in relation to the first document;
  
  discarding the atoms that do not meet the predetermined threshold; and
  
  incorporating the atoms that meet the predetermined threshold in relation to the first document into the one or more search indexes, wherein identifying the first document from the one or more search indexes as relevant for the search query is based on a pruning algorithm that computes a preliminary score for the first document to select the first document based on the preliminary score, the first document is selected from a set of documents indexed in the one or more search indexes, wherein the preliminary score is computed using the information metric pre-computed for each atom/document pair and a simplified scoring function that approximates a final ranking algorithm utilized in identifying the first document as relevant for the search query.
- Dependent claims (10, 11, 12, 13, 14, 15, 16, 20):
  - 10. The one or more computer-storage media of claim 9, wherein the one or more search indexes comprise a unigram index, an n-gram index, and an n-tuple index.
  - 11. The one or more computer-storage media of claim 9, wherein the information metric is calculated at least by determining whether the terms that comprise the atom have previously been searched by inspecting query logs.
  - 12. The one or more computer-storage media of claim 9, wherein all of the unigrams identified from the first document are incorporated into the one or more search indexes.
  - 13. The one or more computer-storage media of claim 9, wherein a greater percentage of the n-tuples are discarded than the n-grams and the unigrams.
  - 14. The one or more computer-storage media of claim 9, further comprising discarding the n-tuples that are already identified as n-grams.
  - 15. The one or more computer-storage media of claim 9, wherein the computation of the information metric for each of the plurality of atoms is based on one or more of a frequency of the atom in the first document, a proximity of two or more terms of the atom in the first document, a relatedness of the two or more terms of the atom, or whether the two or more terms of the atom have previously been linked together as evidenced by an inspection of query logs.
  - 16. The one or more computer-storage media of claim 9, further comprising:
    - identifying a plurality of atoms from a second document;
      
      classifying each of the plurality of atoms as one or more of a unigram, an n-gram, or an n-tuple;
      
      computing an information metric for each of the plurality of atoms in relation to the second document;
      
      determining whether the information metric for each of the plurality of atoms meets a predetermined threshold, wherein the atoms that meet the predetermined threshold are those that are most relevant in relation to the second document;
      
      discarding the atoms that do not meet the predetermined threshold; and
      
      incorporating the atoms that meet the predetermined threshold in relation to the second document into the one or more search indexes.
  - 20. The one or more computer-storage media of claim 16, wherein less bandwidth is used when identifying most relevant documents for a particular search query and when using an n-gram search index and an n-tuple search index than using only a unigram search index.
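
Claim 15 lists signals that the information metric may draw on (frequency, proximity, relatedness, query-log evidence), and claim 9 discards atoms that fall below a predetermined threshold. The sketch below combines three of those signals in a weighted sum. The weights, the formulas for each signal, the threshold value, and the function names are hypothetical; the claims do not specify them. The proximity term is deliberately crude and looks only at the first occurrence of each term.

```python
def information_metric(terms, tokens, logged_queries, weights=(0.4, 0.4, 0.2)):
    """Score how useful an atom (a tuple of terms) is for one tokenized document."""
    positions = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]
    if not all(positions):
        return 0.0  # at least one term does not occur in the document
    # Frequency: occurrences of the atom's rarest term, relative to document length.
    frequency = min(len(p) for p in positions) / len(tokens)
    # Proximity: 1.0 when the first occurrences are adjacent, lower as they spread out.
    first = [p[0] for p in positions]
    gap = max(0, (max(first) - min(first)) - (len(terms) - 1))
    proximity = 1.0 / (1 + gap)
    # Query-log evidence: 1.0 if the terms appeared together in a past query.
    logged = 1.0 if any(set(terms) <= set(q) for q in logged_queries) else 0.0
    w_freq, w_prox, w_log = weights
    return w_freq * frequency + w_prox * proximity + w_log * logged

def select_atoms(atoms, tokens, logged_queries, threshold):
    """Keep only the atoms whose metric meets the threshold."""
    scored = {a: information_metric(a, tokens, logged_queries) for a in atoms}
    return {a: s for a, s in scored.items() if s >= threshold}

tokens = "cheap flights to oslo cheap hotels".split()
logged_queries = [("cheap", "flights")]
atoms = [("cheap", "flights"), ("flights", "hotels"), ("cheap", "train")]
kept = select_atoms(atoms, tokens, logged_queries, threshold=0.5)
```

With these assumed weights only ("cheap", "flights") survives: its terms are adjacent and were previously searched together, while ("flights", "hotels") is spread out and unlogged, and ("cheap", "train") does not fully occur in the document.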

17. One or more hardware computer-storage media storing computer-useable instructions that, when used by a computing device, cause the computing device to perform a method for populating one or more search indexes with atoms identified in a plurality of documents, the method comprising:
- extracting a plurality of atoms from a document, the plurality of atoms comprising one or more unigrams, one or more n-grams, and one or more n-tuples;
  
  for each of the plurality of atoms, calculating an information metric that represents a pre-computed ranking used in a search query for a particular atom in relation to the document;
  
  determining an information metric threshold, wherein the atom/document pairs whose information metric meets or exceeds the information metric threshold are indexed;
  
  discarding a portion of the atom/document pairs based on the information metrics, wherein the information metrics corresponding to the discarded atom/document pairs are below the information metric threshold;
  
  populating the one or more search indexes by indexing the atom/document pairs whose information metrics meet or exceed the information metric threshold, wherein the unigrams, the n-grams, and the n-tuples are separately indexed each in a different search index, and
  
  accessing the one or more search indexes to identify relevant documents for the atoms in a query, wherein identifying relevant documents is based at least in part on a pruning algorithm that computes a preliminary score for documents of the atom/document pairs, wherein the preliminary score is computed using the information metric and a simplified scoring function that approximates a final ranking algorithm utilized in identifying the relevant documents.
- Dependent claims (18, 19):
  - 18. The one or more computer-storage media of claim 17, wherein the information metric threshold is determined based on previous trial runs of determining most relevant documents for a particular search query.
  - 19. The one or more computer-storage media of claim 17, wherein the atom/document pairs are stored in a dictionary using a priority hash index.
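
The query-time step in claim 17 looks up atoms in separate per-type indexes and uses a pruning algorithm to compute a preliminary score from the stored information metrics. A minimal sketch follows, assuming that the "simplified scoring function" is a plain sum of the pre-computed metrics. That choice, the `keep` cutoff, and the function names are illustrative only; the claim does not define the function.

```python
from collections import defaultdict

def preliminary_scores(query_atoms, indexes):
    """Sum the pre-computed metrics of every matching atom/document pair."""
    scores = defaultdict(float)
    for kind, atoms in query_atoms.items():
        for atom in atoms:
            for doc_id, metric in indexes[kind].get(atom, []):
                scores[doc_id] += metric
    return scores

def prune(query_atoms, indexes, keep=2):
    """Return the ids of the top-scoring documents for the final ranker."""
    scores = preliminary_scores(query_atoms, indexes)
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:keep]

# One index per atom type, each mapping atom -> [(doc_id, metric), ...].
indexes = {
    "unigram": {"search": [("d1", 0.5), ("d2", 0.2), ("d3", 0.1)]},
    "ngram": {("search", "engine"): [("d1", 0.9)]},
    "ntuple": {frozenset({"engine", "fast"}): [("d3", 0.3)]},
}
query_atoms = {
    "unigram": ["search"],
    "ngram": [("search", "engine")],
    "ntuple": [frozenset({"fast", "engine"})],
}
candidates = prune(query_atoms, indexes, keep=2)
```

Only the documents in `candidates` would be passed on to the full (and presumably more expensive) final ranking algorithm. Here `d2` is pruned because it matches the query on a single low-scoring unigram.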


Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Hopcroft, Mike; Risvik, Knut Magne; Bennett, John G.; Kalyanaraman, Karthik; Chilimbi, Trishul
Primary Examiner(s): Shanmugasundaram, Kannan
Application Number: US13/045,278
Publication Number: US 20120130981A1
Time in Patent Office: 1,895 Days
Field of Search: 707/741, 707999001-999005
US Class Current: 1/1
CPC Class Codes:
G06F 16/22   Indexing; Data structures t...
G06F 16/313   Selection or weighting of t...
G06F 16/41   Indexing; Data structures t...
