Query-based snippet clustering for search result grouping
Abstract
A clustering architecture that dynamically groups the search result documents into clusters labeled by phrases extracted from the search result snippets. Documents related to the same topic usually share a common vocabulary. The words are first clustered based on their co-occurrences and each cluster forms a potentially interesting topic. Keywords are chosen and then clustered by counting co-occurrences of pairs of keywords. Documents are assigned to relevant topics based on the feature vectors of the clusters.
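The co-occurrence counting mentioned in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization and treats two keywords as co-occurring when they appear in the same snippet; the function name `cooccurrence_counts` and the sample snippets are hypothetical and are not taken from the patent.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(snippets, keywords):
    """Count how many snippets contain each unordered pair of keywords."""
    keyword_set = set(keywords)
    counts = Counter()
    for snippet in snippets:
        # Naive whitespace tokenization (an assumption of this sketch).
        present = sorted(keyword_set & set(snippet.lower().split()))
        # Pairs are stored in alphabetical order so (a, b) and (b, a) coincide.
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


snippets = [
    "jaguar car dealer",
    "jaguar cat habitat",
    "jaguar car price",
]
pairs = cooccurrence_counts(snippets, ["jaguar", "car", "cat"])
```

Pairs with high counts are candidates for grouping into one topic; how the counts are turned into keyword clusters is left open here, as the abstract does not say.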
13 Claims
1. A computer implemented system that facilitates clustering of search results, comprising:
a memory having stored therein computer-executable instructions configured to implement the clustering system, including;
  an input component that receives search results, wherein the search results are a ranked list of titles and snippets associated with documents;
  an analysis component that;
    utilizes a frequent itemset algorithm to extract keywords from the search results and identifies frequently occurring words as keywords, with keywords occurring in titles being weighted more heavily than keywords occurring in snippets,
    calculates properties for each keyword, wherein the properties include phrase frequency and inverted document frequency, phrase length, intra-cluster similarity, cluster entropy and phrase independence,
    applies a regression model learned from training data that is collected in advance to combine the properties into a salience score for each of the keywords,
    ranks the keywords in descending order according to their associated salience scores and
    selects the highest ranked keywords as salient keywords; and
  a clustering component that;
    generates one or more candidate clusters of the search results according to the saliency score, wherein the salient keywords are selected as names of the candidate clusters, the names of the candidate clusters being phrases when the salient keywords are merged with other salient keywords,
    merges the candidate clusters into one or more final clusters,
    merges a first cluster and a second cluster into a third cluster, when overlap of the first and second clusters exceeds a predetermined threshold,
    adjusts cluster names of the one or more final clusters to generate a new cluster name for the third cluster, and
    outputs the search results as a ranked list of associated documents.
Dependent claims: 2, 3, 4.
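As an illustration of the analysis component in claim 1, the following sketch computes two of the named properties (a title-weighted phrase frequency combined with inverse document frequency, and phrase length), combines them with a linear model, and ranks the candidates. Several things here are assumptions rather than anything the claim specifies: the title weight of 2.0, the standard `log(N / df)` form of inverse document frequency, the linear form of the regression, and the coefficient values. Intra-cluster similarity, cluster entropy and phrase independence are omitted because the claim does not define them.

```python
import math


def count_phrase(text, phrase):
    """Count token-level occurrences of a phrase in a text."""
    tokens = text.lower().split()
    words = phrase.lower().split()
    n = len(words)
    return sum(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def phrase_properties(phrase, results, title_weight=2.0):
    """Compute properties of a phrase from (title, snippet) pairs."""
    freq = 0.0
    docs_with_phrase = 0
    for title, snippet in results:
        in_title = count_phrase(title, phrase)
        in_snippet = count_phrase(snippet, phrase)
        # Title hits count more than snippet hits (weight is hypothetical).
        freq += title_weight * in_title + in_snippet
        if in_title or in_snippet:
            docs_with_phrase += 1
    idf = math.log(len(results) / docs_with_phrase) if docs_with_phrase else 0.0
    return {"tfidf": freq * idf, "length": len(phrase.split())}


def salience(props, coefficients, intercept=0.0):
    """Combine properties into one score with an assumed linear model."""
    return intercept + sum(coefficients[k] * v for k, v in props.items())


def rank_phrases(candidates, results, coefficients, top_k=2):
    """Return the top_k phrases in descending order of salience."""
    scored = [(salience(phrase_properties(p, results), coefficients), p)
              for p in candidates]
    scored.sort(reverse=True)
    return [p for _, p in scored[:top_k]]


results = [
    ("Jaguar cars official site", "new jaguar cars and prices"),
    ("Jaguar habitat", "the jaguar is a big cat"),
    ("Used jaguar cars", "find used cars near you"),
]
coefficients = {"tfidf": 1.0, "length": 0.5}  # hypothetical, not learned
top = rank_phrases(["jaguar", "jaguar cars", "big cat"], results, coefficients)
```

In this toy input the word "jaguar" appears in every result, so its inverse document frequency is zero and it ranks below the more specific phrases.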
5. A processor-executable method for processing search results, comprising:
employing a processor executing computer executable instructions stored on a computer readable storage medium to implement the following search result processing acts;
  receiving a web page of search results from a web search engine, wherein the search results comprise a ranked list of documents;
  parsing the search results;
  extracting phrases from the search results, wherein the phrases include titles and query-dependent snippets, with phrases occurring in titles being weighted more heavily than keywords occurring in query-dependent snippets;
  calculating properties for each phrase, wherein properties include phrase frequency and inverted document frequency, phrase length, cluster entropy and phrase independence;
  applying a trained regression model to combine the properties for each phrase into a single salience score;
  ranking the phrases by the single salience score in descending order, wherein the top-ranked phrases are salient phrases;
  assigning documents to the salient phrases to generate at least one candidate cluster, the name of the at least one candidate cluster being a phrase when a salient keyword is merged with other salient keywords; and
  dynamically generating at least one final cluster of the search results by merging the candidate clusters;
  generating cluster names for the final clusters based on the salient phrases;
  merging a first cluster and a second cluster into a third cluster, when overlap of the first and second clusters exceeds a predetermined threshold; and
  adjusting cluster names of the final clusters to generate a new cluster name for the third cluster.
Dependent claims: 6, 7, 8, 9, 10.
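The assignment and merging acts of claim 5 can be sketched as follows. The claim does not define "overlap", the threshold value, or how the merged cluster is renamed, so this sketch makes three labeled assumptions: overlap is the number of shared documents divided by the size of the smaller cluster, the threshold is 0.4, and the merged name joins the two original names with " / ". Document membership is decided by a plain substring match.

```python
def assign_documents(salient_phrases, documents):
    """Build candidate clusters: phrase -> set of indices of matching documents."""
    clusters = {}
    for phrase in salient_phrases:
        # Substring match, a simplification of this sketch.
        members = {i for i, text in enumerate(documents)
                   if phrase in text.lower()}
        if members:
            clusters[phrase] = members
    return clusters


def overlap(a, b):
    """Assumed overlap measure: shared documents relative to the smaller cluster."""
    return len(a & b) / min(len(a), len(b))


def merge_clusters(clusters, threshold=0.4):
    """Merge pairs of clusters until no pair's overlap exceeds the threshold."""
    merged = dict(clusters)
    changed = True
    while changed:
        changed = False
        names = list(merged)
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                if overlap(merged[first], merged[second]) > threshold:
                    docs = merged.pop(first) | merged.pop(second)
                    # Hypothetical naming rule for the new third cluster.
                    merged[f"{first} / {second}"] = docs
                    changed = True
                    break
            if changed:
                break  # restart the scan with the updated cluster set
    return merged


documents = [
    "jaguar cars for sale",
    "jaguar cars dealer",
    "jaguar dealer locations",
    "jaguar cat habitat",
]
candidates = assign_documents(["cars", "dealer", "habitat"], documents)
final = merge_clusters(candidates, threshold=0.4)
```

Here "cars" and "dealer" share one of their two documents each (overlap 0.5), so they merge into a third cluster, while "habitat" shares nothing and stays separate.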
11. A computer implemented system that facilitates processing of search results, comprising:
a memory having stored therein computer-executable instructions configured to implement the clustering system, including;
  means for receiving search results that contain titles and snippets of information;
  means for extracting phrases from the search results, with phrases occurring in titles being weighted more heavily than keywords occurring in snippets;
  means for calculating properties for each of the phrases, wherein the properties include phrase frequency and inverted document frequency, phrase length, cluster entropy and phrase independence;
  means for applying a regression model learned from previous training data to process the properties into a salient score for each phrase;
  means for ranking the phrases in descending order according to respective salient scores, wherein the top-ranked phrases are salient phrases;
  means for dynamically generating one or more candidate clusters of the search results, wherein the candidate clusters are labeled by the salient phrases and the names of the candidate clusters being phrases when a salient keyword is merged with other salient keywords;
  means for merging candidate clusters to form final clusters;
  means for merging a first cluster and a second cluster into a third cluster, when overlap of the first and second clusters exceeds a predetermined threshold; and
  means for adjusting cluster names of the final clusters to generate a new cluster name for the third cluster.
Dependent claims: 12, 13.
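Claim 11 refers to a regression model learned from previous training data. One way such a model could be learned is ordinary least squares, sketched below with the normal equations solved by Gauss-Jordan elimination. The linear form of the model, the two-feature layout (a frequency-based score and phrase length), and the training values are all assumptions made for illustration; the claim does not specify the regression type or the training labels.

```python
def fit_linear_regression(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns [intercept, w1, w2, ...]. Assumes X^T X is non-singular.
    """
    rows = [[1.0] + list(row) for row in features]  # prepend intercept term
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def predict(weights, row):
    """Apply the learned model to one phrase's property vector."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


# Hypothetical training data: [frequency score, phrase length] -> labeled salience.
train_x = [[1, 1], [2, 1], [1, 2], [3, 2], [2, 3]]
train_y = [0.8, 1.3, 1.0, 2.0, 1.7]
weights = fit_linear_regression(train_x, train_y)
score = predict(weights, [2, 2])
```

The toy labels were generated from an exactly linear rule, so the fit recovers it; real relevance labels would be noisy, and the learned weights would only approximate them.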
Specification