Automatic taxonomy generation in search results using phrases

US 7,426,507 B1
Filed: 07/26/2004
Issued: 09/16/2008
Est. Priority Date: 07/26/2004
Status: Active Grant

Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results and from the index.

17 Claims

1. A method of presenting documents in response to a search of a document collection, the method comprising:
- retrieving a plurality of documents in response to a query, the query comprising at least one query phrase;
  
  determining related phrases that are related to the query phrase, wherein for each query phrase g_j, g_k is a related phrase of phrase g_j where an information gain I of g_k with respect to g_j exceeds a predetermined threshold, the information gain I being a function of A(j,k) and E(j,k), where A(j,k) is a measure of an actual co-occurrence rate of g_j and g_k, and E(j,k) is an expected co-occurrence rate g_j and g_k;
  
  determining a plurality of clusters, each cluster associated with one of the related phrases, and having a cluster name corresponding to the related phrase; and
  
  for each cluster, presenting a number of documents containing the related phrase associated with the cluster, along with the cluster name.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, further comprising:
    - for each cluster, determining a number of documents containing the related phrase associated with the cluster; and
      
      ordering the clusters in declining order of the number of documents in each cluster.
  - 3. The method of claim 1, wherein presenting a number of documents containing the related phrase associated with the cluster comprises:
    - presenting a fixed number of documents from each cluster.
  - 4. The method of claim 1, wherein presenting a number of documents containing the related phrase associated with the cluster comprises:
    - presenting a number of documents from each cluster proportional to a total number of documents in the cluster relative to a total number of retrieved documents.
  - 5. The method of claim 1, wherein determining related phrases that are related to the query phrase comprises:
    - examining a related phrase bit vector of the query phrase, the related phrase bit vector having an ordered set of bits, each bit indicating whether a corresponding related phrase is present in a given document.
  - 6. The method of claim 5, wherein the related phrase bit vector is stored in a posting list of the query phrase.
  - 7. The method of claim 1, wherein for each cluster, presenting a number of documents containing the related phrase associated with the cluster, along with the cluster name, comprises:
    - presenting the documents containing the related phrase associated with the cluster sequentially and visually associable with the cluster name.
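
As an illustration of claims 1 and 2, the following Python sketch selects related phrases by information gain, forms one cluster per related phrase, and orders the clusters by declining document count. The claims state only that I is a function of A(j,k) and E(j,k); the ratio I = A/E, the independence-based estimate of E, and every function and variable name below are assumptions made for this example rather than anything the claims prescribe.

```python
def related_phrases(query_phrase, cooccur, phrase_count, total_docs, threshold):
    """Return phrases g_k whose information gain w.r.t. query_phrase exceeds threshold.

    Assumptions (not from the claims): A(j,k) is a raw co-occurrence count,
    E(j,k) = count_j * count_k / total_docs, and I = A / E.
    """
    related = []
    for (g_j, g_k), actual in cooccur.items():
        if g_j != query_phrase:
            continue
        expected = phrase_count[g_j] * phrase_count[g_k] / total_docs
        if expected > 0 and actual / expected > threshold:
            related.append(g_k)
    return related


def build_clusters(docs, related):
    """Group retrieved documents by related phrase, largest cluster first.

    docs maps a document id to the set of phrases found in that document.
    Each cluster is named after its related phrase (claim 1) and clusters
    are ordered by declining document count (claim 2).
    """
    clusters = {
        name: [doc_id for doc_id, phrases in docs.items() if name in phrases]
        for name in related
    }
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)
```

A document may appear in more than one cluster in this sketch, since membership depends only on whether the document contains the cluster's related phrase.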

8. A computer readable storage medium storing a computer program executable by a processor for presenting documents in response to a search of a document collection, by performing the operations comprising:
- retrieving a plurality of documents in response to a query, the query comprising at least one query phrase;
  
  determining related phrases that are related to the query phrase, wherein for each query phrase g_j, g_k is a related phrase of phrase g_j where an information gain I of g_k with respect to g_j exceeds a predetermined threshold, the information gain I being a function of A(j,k) and E(j,k), where A(j,k) is a measure of an actual co-occurrence rate of g_j and g_k, and E(j,k) is an expected co-occurrence rate g_j and g_k;
  
  determining a plurality of clusters, each cluster associated with one of the related phrases, and having a cluster name corresponding to the related phrase; and
  
  for each cluster, presenting a number of documents containing the related phrase associated with the cluster, along with the cluster name.
- Dependent claims (9, 10):
  - 9. The computer readable storage medium of claim 8, further comprising computer executable instructions for performing the operations of:
    - for each cluster, determining a number of documents containing the related phrase associated with the cluster; and
      
      ordering the clusters in declining order of the number of documents in each cluster.
  - 10. The computer readable storage medium of claim 8, wherein determining related phrases that are related to the query phrase comprises:
    - examining a related phrase bit vector of the query phrase, the related phrase bit vector having an ordered set of bits, each bit indicating whether a corresponding related phrase is present in a given document.

11. A computer implemented system for presenting documents in response to a search of a document collection, comprising:
- an index stored in a storage medium and comprising related phrase information; and
  
  a query processing system executed by a computer and adapted to:

  retrieve a plurality of documents in response to a query, the query comprising at least one query phrase,

  determine, using the related phrase information, related phrases that are related to the query phrase, wherein for each query phrase g_j, g_k is a related phrase of phrase g_j where an information gain I of g_k with respect to g_j exceeds a predetermined threshold, the information gain I being a function of A(j,k) and E(j,k), where A(j,k) is a measure of an actual co-occurrence rate of g_j and g_k, and E(j,k) is an expected co-occurrence rate g_j and g_k,

  determine a plurality of clusters, each cluster associated with one of the related phrases, and having a cluster name corresponding to the related phrase, and

  for each cluster, present a number of documents containing the related phrase associated with the cluster, along with the cluster name.
- Dependent claims (12, 13):
  - 12. The system of claim 11, wherein the query processing system is further adapted to:
    - for each cluster, determine a number of documents containing the related phrase associated with the cluster; and
      
      order the clusters in declining order of the number of documents in each cluster.
  - 13. The system of claim 11, wherein determining, using the related phrase information, related phrases that are related to the query phrase comprises:
    - examining a related phrase bit vector of the query phrase, the related phrase bit vector having an ordered set of bits, each bit indicating whether a corresponding related phrase is present in a given document.

14. A method of presenting documents in response to a search of a document collection, the method comprising:
- retrieving a plurality of documents in response to a query, the query comprising at least one query phrase;
  
  determining related phrases that are related to the query phrase;
  
  determining a plurality of clusters, each cluster associated with one of the related phrases, and having a cluster name corresponding to the related phrase;
  
  for each cluster, determining a number of documents containing the related phrase associated with the cluster, comprising:

  for each document in the document collection:

  accessing a related phrase bit vector of the query phrase, the related phrase bit vector having an ordered set of bits, each bit indicating whether a corresponding phrase is present in the document, and

  incrementing a count of the number of documents containing the related phrase if and only if a bit is set in the related phrase bit vector at a location corresponding to the related phrase;
  
  ordering the clusters in declining order of the number of documents in each cluster; and
  
  for each cluster, presenting a number of documents containing the related phrase associated with the cluster, along with the cluster name.
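
The counting step of claim 14 can be sketched in Python as follows. The claim does not fix a bit layout or storage format, so this example assumes each document's related phrase bit vector is held as an integer whose least-significant bit corresponds to the first related phrase; the function name and the dictionary of vectors are likewise hypothetical.

```python
def count_cluster_sizes(bit_vectors, related):
    """Count documents per related phrase from per-document bit vectors.

    bit_vectors maps a document id to an int; bit i is set when related[i]
    is present in that document (assumed layout, not given in the claim).
    Returns (cluster_name, count) pairs in declining order of count.
    """
    counts = [0] * len(related)
    for vector in bit_vectors.values():
        for position in range(len(related)):
            # Increment only if the bit at this phrase's position is set.
            if (vector >> position) & 1:
                counts[position] += 1
    return sorted(zip(related, counts), key=lambda pair: pair[1], reverse=True)
```

Because each count comes from a single bit test per document, no document text needs to be rescanned to size the clusters.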

15. A method of presenting documents in response to a search of a document collection, the method comprising:
- retrieving a plurality of documents in response to a query, the query comprising at least one query phrase, the retrieving comprising:

  identifying an incomplete phrase in the query;

  replacing the incomplete phrase with a phrase extension, wherein the phrase extension of the incomplete phrase is a super-sequence of the incomplete phrase that begins with the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the incomplete phrase exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection, the expected co-occurrence rate of the phrase extension and the incomplete phrase being a function of a plurality of occurrences of the phrase extension and the incomplete phrase in the document collection, and

  selecting documents from the document collection containing the phrase extension;
  
  determining related phrases that are related to the query phrase;
  
  determining a plurality of clusters, each cluster associated with one of the related phrases, and having a cluster name corresponding to the related phrase; and
  
  for each cluster, presenting a number of documents containing the related phrase associated with the cluster, along with the cluster name.
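
The phrase-extension step of claim 15 might look like the following Python sketch. The claim requires only that the actual co-occurrence rate exceed the expected rate; choosing the candidate with the highest actual-to-expected ratio, matching on whole words, and all names and data structures here are assumptions made for illustration.

```python
def pick_extension(incomplete, candidates, actual_rate, expected_rate):
    """Return a phrase extension predicted by the incomplete phrase, or None.

    A candidate qualifies when it is longer than the incomplete phrase,
    begins with it word for word, and its actual co-occurrence rate exceeds
    its expected rate. Picking the highest ratio is an assumption.
    """
    prefix = incomplete.split()
    best, best_ratio = None, 1.0
    for candidate in candidates:
        words = candidate.split()
        if len(words) <= len(prefix) or words[:len(prefix)] != prefix:
            continue
        expected = expected_rate.get((incomplete, candidate), 0.0)
        actual = actual_rate.get((incomplete, candidate), 0.0)
        if expected > 0 and actual / expected > best_ratio:
            best, best_ratio = candidate, actual / expected
    return best


def retrieve(query_phrase, candidates, actual_rate, expected_rate, docs):
    """Replace an incomplete query phrase with its extension, then select documents.

    docs maps a document id to the set of phrases found in that document.
    """
    extension = pick_extension(query_phrase, candidates, actual_rate, expected_rate)
    phrase = extension if extension is not None else query_phrase
    return phrase, [doc_id for doc_id, phrases in docs.items() if phrase in phrases]
```

When no candidate qualifies, the sketch falls back to the original query phrase rather than returning nothing.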

16. A computer readable storage medium storing a computer program executable by a processor for presenting documents in response to a search of a document collection, the operations of the computer program comprising:
- retrieving a plurality of documents in response to a query, the query comprising at least one query phrase, the retrieving comprising:

  identifying an incomplete phrase in the query;

  replacing the incomplete phrase with a phrase extension, wherein the phrase extension of the incomplete phrase is a super-sequence of the incomplete phrase that begins with the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the incomplete phrase exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection, the expected co-occurrence rate of the phrase extension and the incomplete phrase being a function of a plurality of occurrences of the phrase extension and the incomplete phrase in the document collection, and

  selecting documents from the document collection containing the phrase extension;
  
  determining related phrases that are related to the query phrase;
  
  determining a plurality of clusters, each cluster associated with one of the related phrases, and having a cluster name corresponding to the related phrase; and
  
  for each cluster, presenting a number of documents containing the related phrase associated with the cluster, along with the cluster name.

17. A computer implemented system for presenting documents in response to a search of a document collection, comprising:
- an index stored in a storage medium and comprising related phrase information; and
  
  a query processing system executed by a computer and adapted to:

  retrieve a plurality of documents in response to a query, the query comprising at least one query phrase, the retrieving comprising:

  identifying an incomplete phrase in the query,

  replacing the incomplete phrase with a phrase extension, wherein the phrase extension of the incomplete phrase is a super-sequence of the incomplete phrase that begins with the incomplete phrase, and wherein the incomplete phrase predicts the phrase extension based on a measure of an actual co-occurrence rate of the phrase extension and the incomplete phrase exceeding an expected co-occurrence rate of the phrase extension and the incomplete phrase in the document collection, the expected co-occurrence rate of the phrase extension and the incomplete phrase being a function of a plurality of occurrences of the phrase extension and the incomplete phrase in the document collection, and

  selecting documents from the document collection containing the phrase extension;
  
  determine, using the related phrase information, related phrases that are related to the query phrase;
  
  determine a plurality of clusters, each cluster associated with one of the related phrases, and having a cluster name corresponding to the related phrase; and
  
  for each cluster, present a number of documents containing the related phrase associated with the cluster, along with the cluster name.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Patterson, Anna Lynn
Primary Examiner(s): Vy; Hung T
Application Number: US10/900,259
Time in Patent Office: 1,513 Days
Field of Search: 707/1-6, 707/102, 707/101, 715/810
US Class Current: 1/1
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99942   Manipulating data structure...
