Method and apparatus for characterizing documents based on clusters of related words
First Claim
1. A computer-implemented method comprising:
receiving a resource that includes a set of words;
identifying a set of candidate clusters from a probabilistic model that are classified as likely to be active in generating the set of words;
generating a vector that characterizes the resource, wherein, for each cluster of the set of candidate clusters, a component of the vector indicates a degree to which the cluster was active in generating the set of words; and
using the vector in performing an operation related to the resource.
Abstract
One embodiment of the present invention provides a system that characterizes a document with respect to clusters of conceptually related words. Upon receiving a document containing a set of words, the system selects “candidate clusters” of conceptually related words that are related to the set of words. These candidate clusters are selected using a model that explains how sets of words are generated from clusters of conceptually related words. Next, the system constructs a set of components to characterize the document, wherein the set of components includes a component for each candidate cluster. Each component in the set indicates the degree to which the corresponding candidate cluster is related to the set of words.
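The flow described above (select candidate clusters, then build one component per candidate) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the toy `MODEL` table of word probabilities, the additive `cluster_score`, the `threshold` cutoff, and the normalization of components so they sum to 1. The patent's actual probabilistic model and inference procedure are not specified in this section.

```python
# Hypothetical toy model (illustrative only): for each cluster, the assumed
# probability that the cluster, when active, generates a given word.
MODEL = {
    "cooking": {"recipe": 0.6, "oven": 0.5, "flour": 0.4},
    "travel": {"flight": 0.6, "hotel": 0.5, "oven": 0.01},
    "finance": {"stock": 0.6, "bond": 0.5, "market": 0.4},
}


def cluster_score(words, word_probs):
    """Sum the cluster's word probabilities over the distinct document words.

    This additive score is a stand-in for real probabilistic inference.
    """
    return sum(word_probs.get(word, 0.0) for word in set(words))


def candidate_clusters(words, model, threshold=0.3):
    """Keep only clusters whose score reaches the (assumed) threshold."""
    scores = {name: cluster_score(words, probs) for name, probs in model.items()}
    return {name: score for name, score in scores.items() if score >= threshold}


def characterize(words, model, threshold=0.3):
    """Return a sparse vector: one component per candidate cluster.

    Components are normalized to sum to 1 so that each value reads as the
    relative degree to which that cluster explains the document's words.
    """
    candidates = candidate_clusters(words, model, threshold)
    total = sum(candidates.values())
    if total == 0:
        return {}
    return {name: score / total for name, score in candidates.items()}


if __name__ == "__main__":
    print(characterize(["recipe", "oven", "flour"], MODEL))
    print(characterize(["oven", "hotel", "flight"], MODEL))
```

Representing the vector as a dictionary keyed by cluster name keeps it sparse: clusters that are not candidates simply have no entry, which mirrors the idea that only candidate clusters contribute components.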
21 Claims
1. A computer-implemented method comprising:
receiving a resource that includes a set of words;
identifying a set of candidate clusters from a probabilistic model that are classified as likely to be active in generating the set of words;
generating a vector that characterizes the resource, wherein, for each cluster of the set of candidate clusters, a component of the vector indicates a degree to which the cluster was active in generating the set of words; and
using the vector in performing an operation related to the resource.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system comprising:
a non-transitory computer readable medium having instructions stored thereon; and
data processing apparatus programmed to execute the instructions to perform operations comprising:
receiving a resource that includes a set of words;
identifying a set of candidate clusters from a probabilistic model that are classified as likely to be active in generating the set of words;
generating a vector that characterizes the resource, wherein, for each cluster of the set of candidate clusters, a component of the vector indicates a degree to which the cluster was active in generating the set of words; and
using the vector in performing an operation related to the resource.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A non-transitory computer-readable medium having instructions stored thereon that, when executed by a processor, cause the processor to perform operations comprising:
receiving a resource that includes a set of words;
identifying a set of candidate clusters from a probabilistic model that are classified as likely to be active in generating the set of words;
generating a vector that characterizes the resource, wherein, for each cluster of the set of candidate clusters, a component of the vector indicates a degree to which the cluster was active in generating the set of words; and
using the vector in performing an operation related to the resource.
Dependent claims: 16, 17, 18, 19, 20, 21
Specification