Method of clustering electronic documents in response to a search query
Abstract
A method of presenting clusters of documents in response to a search query where the documents within a cluster are determined to be related to one another. This relationship is assessed by comparing documents which match one or more terms in the query to determine the extent to which the documents have commonality with respect to terms appearing infrequently in the collection of documents. As a consequence, the cluster of documents represents a response or query result that is split across multiple documents. In a further variation the cluster can be constituted by a structured document and an unstructured document.
23 Claims
1. A method for clustering electronic documents in response to a search query, the method comprising the steps of:
collecting a first set of electronic documents, each containing at least one occurrence of a first keyword from the search query;
collecting a second set of electronic documents each containing at least one occurrence of a second keyword from the search query;
combining said first set of electronic documents and said second set of electronic documents to create a collection of electronic documents, each electronic document in said collection containing at least one occurrence of a keyword from the search query;
analyzing each electronic document in said collection to determine a content characteristic in a predefined neighborhood adjacent to at least one of said keywords from the search query;
comparing content characteristics of each document in said collection of electronic documents to content characteristics of other documents in said collection; and
creating a plurality of clusters of electronic documents, at least one cluster including at least two of said electronic documents in said collection of documents, wherein in a given cluster the electronic documents have overlapping content beyond a commonality of keywords from the search query.
Dependent claims: 2
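
The steps of claim 1 can be illustrated with a minimal Python sketch. It is one possible reading, not the patented implementation: the tokenizer, the token `window` used as the "predefined neighborhood", the `min_shared` overlap rule, and the greedy single-link grouping are all assumptions introduced here, and no stopword filtering is applied.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def neighborhood_terms(text, keywords, window=3):
    """Terms within `window` tokens of any query keyword, keywords excluded."""
    tokens = tokenize(text)
    terms = set()
    for i, tok in enumerate(tokens):
        if tok in keywords:
            terms.update(tokens[max(0, i - window): i + window + 1])
    return terms - keywords

def cluster_documents(docs, query, window=3, min_shared=2):
    """docs maps a document id to its text; returns a list of sets of ids."""
    keywords = set(tokenize(query))
    # One set of documents per keyword, then the combined collection.
    per_keyword = {k: {d for d, text in docs.items() if k in tokenize(text)}
                   for k in keywords}
    collection = set().union(*per_keyword.values())
    # Content characteristic: the terms found near the keywords.
    traits = {d: neighborhood_terms(docs[d], keywords, window) for d in collection}
    # Assumed grouping rule: a document joins (and merges) every cluster that
    # holds a document sharing at least `min_shared` non-keyword terms with it.
    clusters = []
    for d in sorted(collection):
        linked = [c for c in clusters
                  if any(len(traits[d] & traits[o]) >= min_shared for o in c)]
        merged = {d}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return clusters
```

Because the query keywords are removed from each characteristic, two documents land in the same cluster only when they overlap beyond the keywords themselves.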

3. A method for finding responses to a search query that comprises a plurality of keywords, the method comprising the steps of:
collecting a set of documents, each document containing at least one of said plurality of keywords;
analyzing each document in said set to determine a content characteristic in a predefined neighborhood adjacent to at least one of said plurality of keywords in that document;
comparing a content characteristic associated with a document in the set against the content characteristic of other documents in said set and determining a level of similarity of content characteristic for each pair of documents compared; and
providing as query responses those document pairs having a level of similarity of content characteristic determined to be greater than a predetermined threshold.
Dependent claims: 4, 5
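
The comparing and providing steps of claim 3 might be sketched as follows. The claim does not name a similarity measure; Jaccard similarity over term sets and the default `threshold` value are assumptions chosen only for illustration, and `characteristics` is a hypothetical mapping from document id to the set of terms found near the keywords.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two term sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_pairs(characteristics, threshold=0.3):
    """Return (doc_a, doc_b, score) for each pair scoring above `threshold`,
    highest score first."""
    results = []
    for a, b in combinations(sorted(characteristics), 2):
        score = jaccard(characteristics[a], characteristics[b])
        if score > threshold:
            results.append((a, b, score))
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Each returned tuple corresponds to one document pair offered as a query response.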

6. A method for finding responses to a search query that comprises a plurality of keywords, the method comprising the steps of:
collecting a set of documents, each document containing at least one of said plurality of keywords;
for each document in said set, creating a list of terms appearing within a specified distance from one keyword found in the document where each term has associated therewith a ratio of number of appearances in the collection of documents that falls within a predetermined range;
comparing the list of terms generated for each document to discover pairs of lists having commonality exceeding a first threshold; and
presenting as responses to the search query document pairs corresponding to those pairs of lists determined to have commonality exceeding said first threshold.
Dependent claims: 7, 8, 9, 10
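
A hedged sketch of the term-list step in claim 6 follows. It assumes the ratio is the number of appearances of a term in the document divided by its appearances in the whole collection (the reading spelled out in claim 21), measures "distance" in tokens, and treats "commonality" as the count of shared terms; the function names and default values are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_lists(docs, keywords, distance=5, ratio_range=(0.4, 0.6)):
    """For each document, keep terms near a keyword whose
    in-document / in-collection appearance ratio lies in `ratio_range`."""
    keywords = set(keywords)
    tokenized = {d: tokenize(t) for d, t in docs.items()}
    collection_counts = Counter(tok for toks in tokenized.values() for tok in toks)
    low, high = ratio_range
    lists = {}
    for d, toks in tokenized.items():
        doc_counts = Counter(toks)
        near = set()
        for i, tok in enumerate(toks):
            if tok in keywords:
                near.update(toks[max(0, i - distance): i + distance + 1])
        lists[d] = {t for t in near - keywords
                    if low <= doc_counts[t] / collection_counts[t] <= high}
    return lists

def pairs_with_commonality(lists, first_threshold=1):
    """Pairs of documents whose term lists share more than `first_threshold` terms."""
    return [(a, b) for a, b in combinations(sorted(lists), 2)
            if len(lists[a] & lists[b]) > first_threshold]
```

Under this reading, a term spread thinly over many documents has a low ratio and a term unique to one document has a ratio of 1, so a mid-range filter keeps terms concentrated in a small number of documents.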

11. A search engine for finding responses to a search query where responses are split across documents, said search engine comprising computer software operated on a processor and when operated performing the steps of:
collecting a set of documents, each document containing at least one of a plurality of keywords from the search query;
analyzing each document in said set to determine a content characteristic in a predefined neighborhood adjacent to at least one of said plurality of keywords in that document;
comparing a content characteristic associated with a document in the set against the content characteristic of other documents in said set and determining a level of similarity of content characteristic for each pair of documents compared; and
providing as query responses those document pairs having a level of similarity of content characteristic determined to be greater than a predetermined threshold.
Dependent claims: 12, 13

14. A method for obtaining responses to a search query containing a plurality of keywords to a universe of electronic documents where the responses are split across documents, the method comprising the steps of:
collecting a set of documents out of said universe of documents wherein each document matches at least one of the plurality of keywords;
determining similarity between pairs of documents within said set of documents to be able to assign a similarity score to each document pair;
creating document clusters using similarity scores assigned to respective document pairs;
ranking the created document clusters; and
presenting the ranked document clusters as responses to the search query.
Dependent claims: 15, 16
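
The cluster-creation and ranking steps of claim 14 could look like the sketch below. The claim fixes neither the clustering rule nor the ranking criterion; linking documents whose pair score reaches `min_score` (via union-find) and ranking clusters by mean internal pair score are assumptions made here, and `pair_scores` is a hypothetical mapping from a document pair to its similarity score.

```python
def build_clusters(pair_scores, min_score=0.3):
    """Group documents connected by pair scores of at least `min_score`."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        root_a, root_b = find(a), find(b)  # also registers both documents
        if score >= min_score:
            parent[root_a] = root_b

    clusters = {}
    for doc in list(parent):
        clusters.setdefault(find(doc), set()).add(doc)
    return list(clusters.values())

def rank_clusters(clusters, pair_scores):
    """Order clusters by mean score of the pairs inside them (assumed criterion)."""
    def cluster_score(cluster):
        scores = [s for (a, b), s in pair_scores.items()
                  if a in cluster and b in cluster]
        return sum(scores) / len(scores) if scores else 0.0
    return sorted(clusters, key=cluster_score, reverse=True)
```

The ranked list returned by `rank_clusters` stands in for the presentation step, with the most internally similar cluster first.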

17. A method for providing responses to a search query on the worldwide web where the responses are split across electronic documents, the method comprising the steps of:
collecting a set of documents, each document containing at least one of a plurality of keywords from the search query;
analyzing each document in said set to determine a content characteristic in a predefined neighborhood adjacent to at least one of said plurality of keywords in that document;
comparing a content characteristic associated with a document in the set against the content characteristic of other documents in said set and determining a level of similarity of content characteristic for each pair of documents compared; and
providing as query responses URLs for those document pairs having a level of similarity of content characteristic determined to be greater than a predetermined threshold.
Dependent claims: 18, 19

20. A method for finding responses to a search query directed to a universe of documents including a collection of structured documents and a collection of unstructured documents, the method comprising the steps of:
collecting a set of unstructured documents each containing a keyword of the search query;
collecting a set of structured documents each containing an attribute/value pair of the search query;
analyzing each document in the set of unstructured documents to determine a first content characteristic in a predefined neighborhood adjacent to at least one of said keywords from the search query;
analyzing each document in the set of structured documents to determine a second content characteristic in a predefined neighborhood adjacent to at least one of said attribute/value pairs from the search query;
comparing the first content characteristics of each document in the set of unstructured documents to the second content characteristic of every document in the set of structured documents; and
based on the results of the comparing step, joining an unstructured document and a structured document as a response pair when that document pair contains a common keyword.
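
One way to picture the join in claim 20 is sketched below. It rests on several assumptions not stated in the claim: a structured document is modeled as a dictionary of attribute/value fields, its "neighborhood" is taken to be the other fields of a matching record, the unstructured neighborhood is a token window, and a "common keyword" is read as any term shared by the two characteristics.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def text_neighborhood(text, keywords, window=4):
    """First content characteristic: terms within `window` tokens of a keyword."""
    tokens = tokenize(text)
    near = set()
    for i, tok in enumerate(tokens):
        if tok in keywords:
            near.update(tokens[max(0, i - window): i + window + 1])
    return near - keywords

def record_neighborhood(record, attr, value):
    """Second content characteristic: tokens of the other fields of a record
    whose `attr` equals `value`; empty if the record does not match."""
    if str(record.get(attr, "")).lower() != str(value).lower():
        return set()
    return {tok for k, v in record.items() if k != attr
            for tok in tokenize(str(v))}

def join_pairs(texts, records, keywords, attr, value, window=4):
    """Pair each matching text with each matching record that shares a term."""
    keywords = {k.lower() for k in keywords}
    pairs = []
    for text_id, text in sorted(texts.items()):
        if not keywords & set(tokenize(text)):
            continue  # text does not contain a query keyword
        first = text_neighborhood(text, keywords, window)
        for rec_id, record in sorted(records.items()):
            common = first & record_neighborhood(record, attr, value)
            if common:
                pairs.append((text_id, rec_id, sorted(common)))
    return pairs
```

For instance, a review mentioning "recall" near "XJ" would be joined with a record whose `make` is "Jaguar" and whose `model` is "XJ", since "xj" appears in both characteristics.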

21. A method for finding responses to a search query directed to a universe of documents including a collection of structured documents and a collection of unstructured documents, the method comprising the steps of:
collecting a set of unstructured documents each containing a keyword of the search query;
collecting a set of structured documents each containing an attribute/value pair of the search query;
analyzing each document in the set of unstructured documents to determine a first content characteristic in a predefined neighborhood adjacent to at least one of said keywords from the search query;
analyzing each document in the set of structured documents to determine a second content characteristic in a predefined neighborhood adjacent to at least one of said attribute/value pairs from the search query;
comparing the first content characteristics of each document in the set of unstructured documents to the second content characteristic of every document in the set of structured documents; and
based on the results of the comparing step, joining an unstructured document and a structured document as a response pair when the documents in that pair contain terms common to both documents in the pair which terms each have a ratio of number of appearances in the document to the number of appearances in the collection which satisfies a predetermined range.
Dependent claims: 22, 23
Specification