Query-based snippet clustering for search result grouping

US 7,617,176 B2
Filed: 07/13/2004
Issued: 11/10/2009
Est. Priority Date: 07/13/2004
Status: Active Grant

Abstract

A clustering architecture that dynamically groups the search result documents into clusters labeled by phrases extracted from the search result snippets. Documents related to the same topic usually share a common vocabulary. The words are first clustered based on their co-occurrences and each cluster forms a potentially interesting topic. Keywords are chosen and then clustered by counting co-occurrences of pairs of keywords. Documents are assigned to relevant topics based on the feature vectors of the clusters.
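
The co-occurrence counting mentioned in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization and counts, for each unordered pair of keywords, the number of snippets containing both. The function name `cooccurrence_counts` and the sample snippets are hypothetical and do not come from the patent.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(snippets, keywords):
    """Count how many snippets contain each unordered pair of keywords."""
    vocab = set(keywords)
    counts = Counter()
    for text in snippets:
        # Keywords present in this snippet, sorted so each pair has one canonical order.
        present = sorted(vocab & set(text.lower().split()))
        counts.update(combinations(present, 2))
    return counts

# Illustrative data only.
snippets = [
    "jaguar car dealer",
    "jaguar car price",
    "jaguar cat habitat",
]
pairs = cooccurrence_counts(snippets, ["jaguar", "car", "cat"])
```

Pairs with high counts are candidates for belonging to the same topic; how the counts are turned into clusters is left open here.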

13 Claims

1. A computer implemented system that facilitates clustering of search results, comprising:
- a memory having stored therein computer-executable instructions configured to implement the clustering system, including;
  
  an input component that receives search results, wherein the search results are a ranked list of titles and snippets associated with documents;
  
  an analysis component that;
  
  utilizes a frequent itemset algorithm to extract keywords from the search results and identifies frequently occurring words as keywords, with keywords occurring in titles being weighted more heavily than keywords occurring in snippets,
  
  calculates properties for each keyword, wherein the properties include phrase frequency and inverted document frequency, phrase length, intra-cluster similarity, cluster entropy and phrase independence,
  
  applies a regression model learned from training data that is collected in advance to combine the properties into a salience score for each of the keywords,
  
  ranks the keywords in descending order according to their associated salience scores and
  
  selects the highest ranked keywords as salient keywords; and
  
  a clustering component that;
  
  generates one or more candidate clusters of the search results according to the saliency score, wherein the salient keywords are selected as names of the candidate clusters, the names of the candidate clusters being phrases when the salient keywords are merged with other salient keywords,
  
  merges the candidate clusters into one or more final clusters,
  
  merges a first cluster and a second cluster into a third cluster, when overlap of the first and second clusters exceeds a predetermined threshold,
  
  adjusts cluster names of the one or more final clusters to generate a new cluster name for the third cluster, and
  
  outputs the search results as a ranked list of associated documents.
- Dependent claims (2, 3, 4):
  - 2. The system of claim 1, wherein the ranked list of associated documents is presented in response to selection of a final cluster.
  - 3. The system of claim 1, wherein a document is assigned to the final cluster based upon its similarity to relevant keywords of the final cluster.
  - 4. The system of claim 1, wherein the final cluster is named according to a keyword or phrase that includes multiple keywords.
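
The scoring steps of claim 1 (title-weighted keyword frequency, a regression model combining properties into a salience score, and ranking in descending order) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the weights `TITLE_WEIGHT` and `SNIPPET_WEIGHT`, the property names, the coefficient values, and all function names are hypothetical, and a plain linear combination stands in for whatever regression model is actually trained.

```python
TITLE_WEIGHT = 2.0    # assumed value; the claim only says titles weigh more than snippets
SNIPPET_WEIGHT = 1.0  # assumed value

def weighted_frequency(phrase, results):
    """Sum occurrences of a lowercase phrase, counting title hits more heavily.

    Uses naive substring matching, so "cat" also matches "cats".
    """
    total = 0.0
    for title, snippet in results:
        total += TITLE_WEIGHT * title.lower().count(phrase)
        total += SNIPPET_WEIGHT * snippet.lower().count(phrase)
    return total

def salience(properties, coefficients, intercept=0.0):
    """Combine property values into one score with a linear model."""
    return intercept + sum(coefficients[name] * value
                           for name, value in properties.items())

def rank_phrases(phrase_properties, coefficients):
    """Return phrases sorted by salience score, highest first."""
    scores = {p: salience(props, coefficients)
              for p, props in phrase_properties.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative data only.
results = [
    ("Jaguar cars", "new jaguar models"),
    ("Big cats", "the jaguar is a cat"),
]
jaguar_freq = weighted_frequency("jaguar", results)

coefficients = {"tfidf": 0.5, "length": 0.3}  # would be learned from training data
phrase_properties = {
    "jaguar": {"tfidf": 4.0, "length": 1},
    "big cats": {"tfidf": 2.0, "length": 2},
}
ranking = rank_phrases(phrase_properties, coefficients)
```

In a real system the coefficients would come from the trained regression model, and the property dictionary would also carry the other claimed properties such as cluster entropy and phrase independence.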

5. A processor-executable method for processing search results, comprising:
- employing a processor executing computer executable instructions stored on a computer readable storage medium to implement the following search result processing acts;
  
  receiving a web page of search results from a web search engine, wherein the search results comprise a ranked list of documents;
  
  parsing the search results;
  
  extracting phrases from the search results, wherein the phrases include titles and query-dependent snippets, with phrases occurring in titles being weighted more heavily than keywords occurring in query-dependent snippets;
  
  calculating properties for each phrase, wherein properties include phrase frequency and inverted document frequency, phrase length, cluster entropy and phrase independence;
  
  applying a trained regression model to combine the properties for each phrase into a single salience score;
  
  ranking the phrases by the single salience score in descending order, wherein the top-ranked phrases are salient phrases;
  
  assigning documents to the salient phrases to generate at least one candidate cluster, the name of the at least one candidate cluster being a phrase when a salient keyword is merged with other salient keywords; and
  
  dynamically generating at least one final cluster of the search results by merging the candidate clusters;
  
  generating cluster names for the final clusters based on the salient phrases;
  
  merging a first cluster and a second cluster into a third cluster, when overlap of the first and second clusters exceeds a predetermined threshold; and
  
  adjusting cluster names of the final clusters to generate a new cluster name for the third cluster.
- Dependent claims (6, 7, 8, 9, 10):
  - 6. The method of claim 5, wherein an HTML parser is utilized for parsing the search results.
  - 7. The method of claim 5, wherein the titles and snippets are weighted differently in the extraction process.
  - 8. The method of claim 7, wherein the titles are weighted greater than the snippets.
  - 9. The method of claim 5, further comprising automatically presenting an associated document list of a cluster.
  - 10. The method of claim 9, wherein the salient phrases are highlighted in the document list.
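
The merging step of claim 5 (combining a first and second cluster into a third when their overlap exceeds a predetermined threshold, then generating a new name) can be sketched as below. The claim does not define how overlap is measured or how the new name is formed, so both are assumptions here: overlap is taken as the shared fraction of the smaller cluster, and the merged name simply joins the two original names. The function names and the default threshold of 0.75 are hypothetical.

```python
def overlap(a, b):
    """Fraction of the smaller cluster's documents that also appear in the other."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def merge_clusters(clusters, threshold=0.75):
    """Repeatedly merge any pair of clusters whose overlap exceeds the threshold.

    clusters maps a cluster name to a set of document ids.
    """
    clusters = {name: set(docs) for name, docs in clusters.items()}
    merged = True
    while merged:
        merged = False
        names = list(clusters)
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                if overlap(clusters[first], clusters[second]) > threshold:
                    docs = clusters.pop(first) | clusters.pop(second)
                    # Assumed naming rule: join the two original names.
                    clusters[f"{first} / {second}"] = docs
                    merged = True
                    break
            if merged:
                break
    return clusters

# Illustrative data only.
candidates = {
    "jaguar car": {1, 2, 3, 4},
    "jaguar dealer": {2, 3, 4},
    "jaguar cat": {5, 6},
}
final = merge_clusters(candidates)
```

Restarting the scan after every merge keeps the loop simple at the cost of extra comparisons, which is acceptable for the small number of candidate clusters a single results page produces.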

11. A computer implemented system that facilitates processing of search results, comprising:
- a memory having stored therein computer-executable instructions configured to implement the clustering system, including;
  
  means for receiving search results that contain titles and snippets of information;
  
  means for extracting phrases from the search results, with phrases occurring in titles being weighted more heavily than keywords occurring in snippets;
  
  means for calculating properties for each of the phrases, wherein the properties include phrase frequency and inverted document frequency, phrase length, cluster entropy and phrase independence;
  
  means for applying a regression model learned from previous training data to process the properties into a salient score for each phrase;
  
  means for ranking the phrases in descending order according to respective salient scores, wherein the top-ranked phrases are salient phrases;
  
  means for dynamically generating one or more candidate clusters of the search results, wherein the candidate clusters are labeled by the salient phrases and the names of the candidate clusters being phrases when a salient keyword is merged with other salient keywords;
  
  means for merging candidate clusters to form final clusters;
  
  means for merging a first cluster and a second cluster into a third cluster, when overlap of the first and second clusters exceeds a predetermined threshold; and
  
  means for adjusting cluster names of the final clusters to generate a new cluster name for the third cluster.
- Dependent claims (12, 13):
  - 12. The system of claim 11, wherein the regression model is a linear regression model, a logistic regression model, a trained support vector regression model or an automatic classifier system.
  - 13. The system of claim 11, further comprising means for presenting a list of documents that are associated with the salient phrases in response to selection of the cluster.
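
The claims name phrase independence as one of the properties but do not define it on this page. One plausible reading, offered purely as an assumption, measures how varied the words immediately before and after a phrase are, using the entropy of each context distribution. The sketch below follows that reading; the function names, whitespace tokenization, and the choice to skip occurrences at a text boundary are all hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (natural log) of a count distribution; 0.0 when empty."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def phrase_independence(phrase, texts):
    """Average entropy of the words directly left and right of the phrase."""
    target = phrase.lower().split()
    n = len(target)
    left, right = Counter(), Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                if i > 0:                    # no left neighbor at the text start
                    left[tokens[i - 1]] += 1
                if i + n < len(tokens):      # no right neighbor at the text end
                    right[tokens[i + n]] += 1
    return (entropy(left) + entropy(right)) / 2

# Illustrative data only.
texts = [
    "visit new york city",
    "in new york state",
    "from new york times",
]
ind_full = phrase_independence("new york", texts)
ind_part = phrase_independence("york", texts)
```

Under this reading "new york" scores higher than "york", because "york" is always preceded by "new" and so looks like a fragment of a longer phrase.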

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: He, Qicai; Liu, Guimei; Ma, Wei-Ying; Chen, Zheng; Zhang, Benyu; Zeng, Hua-Jun
Primary Examiner(s): Mofiz, Apu M
Assistant Examiner(s): Le, Jessica N
Application Number: US10/889,841
Publication Number: US 20060026152A1
Time in Patent Office: 1,946 Days
Field of Search: 707/1-7, 707/200, 707/100
US Class Current: 1/1
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
G06F 16/951   Indexing; Web crawling tech...
Y10S 707/99931   Database or file accessing
