Inferring hierarchical descriptions of a set of documents
Abstract
A method automatically determines groups of words or phrases that are descriptive names of a small set of documents, and infers concepts in that set that are more general and more specific than the descriptive names, without any prior knowledge of the hierarchy or the concepts, in a language-independent manner. The descriptive names and the concepts may not even be explicitly contained in the documents. The primary application of the invention is searching the World Wide Web, but the invention is not limited to the World Wide Web and may be applied to any set of documents. Classes of features are identified in order to promote understanding of a set of documents. Preferably, there are three classes of features. “Self” features or terms describe the cluster as a whole. “Parent” features or terms describe more general concepts. “Child” features or terms describe specializations of the cluster. The self features can be used as a recommended name for a cluster, while parent and child features can be used to place the cluster in the space of a larger collection: parent features suggest a more general concept, and child features suggest concepts that specialize the self feature(s). Automatic discovery of parent, self and child features is useful for several purposes, including automatic labeling of web directories and improving information retrieval.
72 Claims
1. A method of inferring hierarchical descriptions of a set of documents comprising the steps of:
- providing a first histogram of features from a positive set of documents;
- providing a second histogram of features from a collection set of documents; and
- determining whether each feature is a self feature, a parent feature or a child feature based on the fraction of the documents in the positive set containing the feature and the fraction of the documents in the collection set containing the feature.
Dependent claims: 2-16.
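
The steps of claim 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the claim says the decision is based on the two document fractions but gives no thresholds, so the cut-offs `high`, `low`, `rare` and `common`, the whitespace tokenizer, and every function name here are hypothetical. The assumed rule is that a self feature is frequent in the positive set and rare in the collection, a parent feature is frequent in the positive set and not rare in the collection, and a child feature is moderately frequent in the positive set and rare in the collection.

```python
from collections import Counter


def document_histogram(docs):
    """Count, for each feature, how many documents contain it at least once."""
    hist = Counter()
    for doc in docs:
        # Assumption: features are lowercase whitespace-separated words.
        hist.update(set(doc.lower().split()))
    return hist


def classify_feature(pos_frac, col_frac, high=0.5, low=0.1, rare=0.05, common=0.5):
    """Return 'self', 'parent', 'child' or None. All thresholds are illustrative."""
    if col_frac > common:
        return None  # too common everywhere (e.g. stop words)
    if pos_frac >= high:
        return "self" if col_frac <= rare else "parent"
    if pos_frac >= low and col_frac <= rare:
        return "child"
    return None


def infer_descriptions(positive_docs, collection_docs, **thresholds):
    """Classify every feature seen in the positive set."""
    pos_hist = document_histogram(positive_docs)
    col_hist = document_histogram(collection_docs)
    result = {"self": [], "parent": [], "child": []}
    for feature in sorted(pos_hist):
        pos_frac = pos_hist[feature] / len(positive_docs)
        col_frac = col_hist[feature] / len(collection_docs)
        label = classify_feature(pos_frac, col_frac, **thresholds)
        if label is not None:
            result[label].append(feature)
    return result
```

Calling `infer_descriptions(positive_docs, collection_docs)` with two lists of document strings returns a dictionary of feature lists keyed by class. In practice the thresholds would need tuning against the size and breadth of the collection set.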

17. A method of inferring hierarchical descriptions of a set of documents comprising web pages comprising the steps of:
- obtaining a first set of URLs comprising a positive set of documents;
- obtaining a second set of URLs comprising a collection set of documents;
- determining in-bound links for each URL in the first set of URLs and for each URL in the second set of URLs;
- creating a virtual document for each URL in the positive set of documents and a virtual document for each URL in the collection set of documents;
- providing a first histogram of features from the virtual documents associated with the first set of URLs;
- providing a second histogram of features from the virtual documents associated with the second set of URLs; and
- determining whether each feature is a self feature, a parent feature or a child feature based on the fraction of the virtual documents associated with the positive set of documents containing the feature and the fraction of the virtual documents associated with the collection set of documents containing the feature.
Dependent claims: 18-32.
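
The virtual-document step of claim 17 might look like the sketch below. The claim text here does not say what a virtual document contains, so this assumes it is the concatenated anchor text of a URL's in-bound links. The `(source_url, target_url, anchor_text)` link tuples and the function name are likewise hypothetical.

```python
from collections import defaultdict


def build_virtual_documents(urls, links):
    """Map each URL to a virtual document built from its in-bound links.

    Assumption: `links` is an iterable of (source_url, target_url, anchor_text)
    tuples, and a virtual document is the anchor texts joined by spaces.
    """
    inbound = defaultdict(list)
    for _source, target, anchor_text in links:
        inbound[target].append(anchor_text)
    # URLs with no in-bound links get an empty virtual document.
    return {url: " ".join(inbound.get(url, [])) for url in urls}
```

The resulting strings can then be fed to whatever histogram routine is used for ordinary documents, once for the positive set of URLs and once for the collection set.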

33. A method of searching an information retrieval system using inferential hierarchical descriptions of a set of documents comprising the steps of:
- submitting a search query to an information retrieval system;
- retrieving a first set of documents from the information retrieval system responsive to the search query;
- providing a second set of documents;
- determining in-bound links for each retrieved document in the first set of documents and for each document in the second set of documents;
- creating a virtual document for each document in the first set of documents and for each document in the second set of documents;
- creating a first histogram of features in the virtual documents associated with the first set of documents;
- creating a second histogram of features in the virtual documents associated with the second set of documents;
- determining whether each feature is a self feature, a parent feature or a child feature based on the fraction of the virtual documents associated with the positive set of documents containing the feature and the fraction of the virtual documents associated with the collection set of documents containing the feature; and
- changing the search query responsive to the classification of a feature.
Dependent claims: 34-52.
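
The final step of claim 33, changing the query in response to a feature's classification, could be sketched as below. The claim does not prescribe how the query changes, so the policy here is an assumption: append a child feature to narrow the search, or replace the query with a parent feature to broaden it. The function name and the `classified` dictionary layout are hypothetical.

```python
def refine_query(query, classified, direction):
    """Return a modified query string.

    Assumption: `classified` maps 'parent' and 'child' to ranked feature lists.
    direction='narrow' appends the first unused child feature;
    direction='broaden' replaces the query with the first parent feature.
    """
    terms = query.split()
    if direction == "narrow":
        for feature in classified.get("child", []):
            if feature not in terms:
                return " ".join(terms + [feature])
        return query  # nothing to add
    if direction == "broaden":
        parents = classified.get("parent", [])
        return parents[0] if parents else query
    raise ValueError("direction must be 'narrow' or 'broaden'")
```

For a query such as `"java"` with child features like `"applet"`, narrowing would yield `"java applet"`. A real system might instead offer the candidate refinements to the user rather than applying one automatically.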

53. A method of labeling a document directory using inferential descriptions of sets of documents comprising the steps of:
(a) providing a hierarchy of sets of documents;
(b) providing a collection set of documents;
(c) determining in-bound links for each document in a set of documents in the hierarchy and for each document in the collection set of documents;
(d) creating a virtual document for each document in the set of documents in the hierarchy and for each document in the collection set of documents;
(e) creating a first histogram of features from the virtual documents associated with the set of documents in the hierarchy;
(f) creating a second histogram of features from the virtual documents associated with the collection set of documents;
(g) determining whether each feature is a self feature, a parent feature or a child feature based on the fraction of the virtual documents associated with the set of documents in the hierarchy containing the feature and the fraction of virtual documents associated with the collection set of documents containing the feature;
(h) repeating steps (c) to (g) for each set of documents in the hierarchy of sets of documents; and
(i) labeling the hierarchy of sets of documents responsive to the determining of each feature.
Dependent claims: 54.

55. A method of labeling a document directory using inferential descriptions of sets of documents comprising the steps of:
(a) providing a hierarchy of sets of documents;
(b) providing a collection set of documents;
(c) creating a first histogram of features from each set of documents in the hierarchy;
(d) creating a second histogram of features from the collection set of documents;
(e) determining whether each feature is a self feature, a parent feature or a child feature based on the fraction of documents associated with the set of documents in the hierarchy containing the feature and the fraction of documents associated with the collection set of documents containing the feature;
(f) repeating steps (c) to (e) for each set of documents in the hierarchy of sets of documents; and
(g) labeling the hierarchy of sets of documents responsive to the determining of each feature.
Dependent claims: 56.
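
The loop over the hierarchy in claims 53 and 55 can be illustrated with the sketch below, which labels each set of documents with its top self features. Several things are assumptions rather than part of the claims: the hierarchy is represented as a flat dictionary from a node name to its list of documents, only self features are used for the label, and the thresholds `high` and `rare`, the tokenizer, and the function names are all hypothetical.

```python
from collections import Counter


def doc_fractions(docs):
    """Fraction of documents containing each feature (whitespace tokens)."""
    if not docs:
        return {}
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return {feature: count / len(docs) for feature, count in counts.items()}


def label_hierarchy(hierarchy, collection_docs, high=0.5, rare=0.05, max_terms=2):
    """Label every node with its most frequent self features.

    Assumption: a self feature occurs in at least `high` of the node's
    documents and in at most `rare` of the collection documents.
    """
    col = doc_fractions(collection_docs)
    labels = {}
    for name, docs in hierarchy.items():
        pos = doc_fractions(docs)
        self_features = [
            f for f, p in pos.items() if p >= high and col.get(f, 0.0) <= rare
        ]
        # Most frequent first; ties broken alphabetically for determinism.
        self_features.sort(key=lambda f: (-pos[f], f))
        labels[name] = " ".join(self_features[:max_terms])
    return labels
```

The collection fractions are computed once and reused for every node, since the same collection set serves as the background for each set of documents in the hierarchy.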

57. A method of inferring hierarchical descriptions of a set of documents comprising the steps of:
- obtaining a first set of documents comprising a positive set of documents;
- obtaining a second set of documents comprising a collection set of documents;
- determining in-bound links for each document in the first set of documents and for each document in the second set of documents;
- creating a virtual document for each document in the positive set of documents and a virtual document for each document in the collection set of documents;
- providing a first histogram of features from the virtual documents associated with the first set of documents;
- providing a second histogram of features from the virtual documents associated with the second set of documents; and
- determining whether each feature is a self feature, a parent feature or a child feature based on the fraction of the virtual documents associated with the positive set of documents containing the feature and the fraction of the virtual documents associated with the collection set of documents containing the feature.
Dependent claims: 58-72.

Specification