Extracting semantic classes and instances from text
First Claim
1. A method performed by data processing apparatus, the method comprising:
receiving a collection of text;
identifying a first collection of instance-class pairs for the collection of text, wherein the first collection of instance-class pairs are identified by applying one or more template patterns to a collection of documents;
clustering a collection of semantically similar phrases using the collection of text;
determining, for each class in the first collection of instance-class pairs:
whether a threshold number of instances within a cluster in the semantically similar phrase clusters are labeled by the class, and whether a threshold number of clusters in the semantically similar phrase clusters include at least one instance that is labeled by the class;
in response to determining that a threshold number of instances within a cluster are labeled by a class and a threshold number of clusters in the semantically similar phrase clusters include at least one instance that is labeled by the class, selecting each instance in the first collection of instance-class pairs that are labeled by the class to be included in a second collection of instance-class pairs; and
storing the second collection of instance-class pairs for use in information retrieval.
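The determining and selecting steps of the claim can be illustrated with a small sketch. This is one possible reading, not the patentee's implementation: it assumes both thresholds are minimum counts, that "within a cluster" is satisfied by any single cluster, and that clusters are already available as sets of phrases. The names `filter_pairs`, `min_in_cluster`, and `min_clusters` are hypothetical.

```python
from collections import defaultdict


def filter_pairs(pairs, clusters, min_in_cluster=2, min_clusters=1):
    """Keep every (instance, class) pair whose class passes both checks.

    pairs    -- iterable of (instance, class_label) tuples (first collection)
    clusters -- list of sets of semantically similar phrases
    Both thresholds are treated as minimum counts (an assumption).
    """
    # Group the labeled instances by class label.
    instances_by_class = defaultdict(set)
    for instance, label in pairs:
        instances_by_class[label].add(instance)

    kept = set()
    for label, instances in instances_by_class.items():
        # How many instances of this class fall in each cluster.
        overlaps = [len(instances & cluster) for cluster in clusters]
        # Check 1: some cluster holds enough instances labeled by the class.
        dense_cluster = any(n >= min_in_cluster for n in overlaps)
        # Check 2: enough clusters hold at least one such instance.
        clusters_touched = sum(1 for n in overlaps if n >= 1)
        if dense_cluster and clusters_touched >= min_clusters:
            # Select each instance labeled by the class (second collection).
            kept.update((inst, label) for inst in instances)
    return kept


# Made-up example data, for illustration only.
pairs = [("paris", "city"), ("london", "city"), ("apple", "city"), ("apple", "fruit")]
clusters = [{"paris", "london", "berlin"}, {"apple", "pear"}]
second_collection = filter_pairs(pairs, clusters, min_in_cluster=2, min_clusters=2)
```

With these inputs, the class "city" passes both checks, so all three of its pairs are kept, while "fruit" has no cluster containing two of its instances and is dropped.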
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for extracting semantic classes and corresponding instances from a collection of text. One aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving a collection of text; identifying an initial collection of instance-class pairs for the collection of text; clustering a collection of semantically similar phrases using the collection of text; generating, using one or more processors, an extracted collection of instance-class pairs using the initial collection of instance-class pairs and the semantically similar phrase clusters; and storing the extracted collection of instance-class pairs for use in information retrieval.
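The claims specify that the initial instance-class pairs are identified by applying template patterns to documents. The sketch below assumes a single hypothetical template, "<class> such as <instance>, <instance> and <instance>", and single-word instances; the listing does not say which patterns are actually used, and the names `SUCH_AS` and `extract_pairs` are illustrative only.

```python
import re

# Hypothetical template: a lowercase class word, "such as", then a list of
# single-word instances joined by commas, "and", or "or".
SEPARATOR = r", and |, or |, | and | or "
SUCH_AS = re.compile(
    r"\b([a-z]+) such as ([A-Za-z]+(?:(?:" + SEPARATOR + r")[A-Za-z]+)*)"
)


def extract_pairs(text):
    """Return a set of (instance, class_label) pairs found by the template."""
    pairs = set()
    for match in SUCH_AS.finditer(text):
        label = match.group(1)
        # Split the matched instance list on the same separators.
        for instance in re.split(SEPARATOR, match.group(2)):
            pairs.add((instance.lower(), label))
    return pairs


# Made-up sentence, for illustration only.
initial_pairs = extract_pairs(
    "He visited cities such as Paris, London and Berlin last year."
)
```

The class label is kept exactly as it appears in the text (here the plural "cities"); a fuller pipeline would likely normalize it, but that step is outside what the listing describes.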
Citations
18 Claims
1. A method performed by data processing apparatus, the method comprising:
receiving a collection of text;
identifying a first collection of instance-class pairs for the collection of text, wherein the first collection of instance-class pairs are identified by applying one or more template patterns to a collection of documents;
clustering a collection of semantically similar phrases using the collection of text;
determining, for each class in the first collection of instance-class pairs:
whether a threshold number of instances within a cluster in the semantically similar phrase clusters are labeled by the class, and whether a threshold number of clusters in the semantically similar phrase clusters include at least one instance that is labeled by the class;
in response to determining that a threshold number of instances within a cluster are labeled by a class and a threshold number of clusters in the semantically similar phrase clusters include at least one instance that is labeled by the class, selecting each instance in the first collection of instance-class pairs that are labeled by the class to be included in a second collection of instance-class pairs; and
storing the second collection of instance-class pairs for use in information retrieval.
Dependent claims: 2, 3, 4, 5, 6
7. A computer-readable storage device storing instructions that, when executed by data processing apparatus, cause the data processing apparatus to perform operations comprising:
receiving a collection of text;
identifying a first collection of instance-class pairs for the collection of text, wherein the first collection of instance-class pairs are identified by applying one or more template patterns to a collection of documents;
clustering a collection of semantically similar phrases using the collection of text;
determining, for each class in the first collection of instance-class pairs:
whether a threshold number of instances within a cluster in the semantically similar phrase clusters are labeled by the class, and whether a threshold number of clusters in the semantically similar phrase clusters include at least one instance that is labeled by the class;
in response to determining that a threshold number of instances within a cluster are labeled by a class and a threshold number of clusters in the semantically similar phrase clusters include at least one instance that is labeled by the class, selecting each instance in the first collection of instance-class pairs that are labeled by the class to be included in a second collection of instance-class pairs; and
storing the second collection of instance-class pairs for use in information retrieval.
Dependent claims: 8, 9, 10, 11, 12
13. A system comprising:
one or more processors configured to perform operations comprising:
receiving a collection of text;
identifying a first collection of instance-class pairs for the collection of text, wherein the first collection of instance-class pairs are identified by applying one or more template patterns to a collection of documents;
clustering a collection of semantically similar phrases using the collection of text;
determining, for each class in the first collection of instance-class pairs:
whether a threshold number of instances within a cluster in the semantically similar phrase clusters are labeled by the class, and whether a threshold number of clusters in the semantically similar phrase clusters include at least one instance that is labeled by the class;
in response to determining that a threshold number of instances within a cluster are labeled by a class and a threshold number of clusters in the semantically similar phrase clusters include at least one instance that is labeled by the class, selecting each instance in the first collection of instance-class pairs that are labeled by the class to be included in a second collection of instance-class pairs; and
storing the second collection of instance-class pairs for use in information retrieval.
Dependent claims: 14, 15, 16, 17, 18
Specification