LEVERAGING CROSS-DOCUMENT CONTEXT TO LABEL ENTITY
First Claim
1. A method of classifying entities, the method comprising:
recognizing occurrences of an entity in a plurality of documents;
identifying a plurality of features in contexts of said occurrences, a first one of the features being derived from a first context of a first one of the occurrences in a first one of the documents, and a second one of the features being derived from a second context of a second one of the occurrences in a second one of the documents;
calculating a sum of a plurality of weights, wherein each of the features is associated with one of the weights;
making a first determination that said sum exceeds a first threshold;
making a second determination that a label applies to said entity based on said first determination; and
storing or communicating a fact that said label applies to said entity.
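The steps of this claim can be illustrated with a minimal sketch. The feature names, weight values, and threshold below are hypothetical and chosen only for illustration; the claim does not specify them. The sketch also assumes that a feature observed more than once for the same entity is counted once, which the claim does not state.

```python
from collections import defaultdict

# Hypothetical weights for the label "person"; values are illustrative only.
PERSON_WEIGHTS = {
    "preceded_by:Dr.": 1.5,
    "followed_by:said": 0.8,
    "in_list:first_names": 1.2,
}
THRESHOLD = 2.0  # assumed value for the claim's "first threshold"


def collect_features(observations):
    """Group features by entity across documents.

    observations: iterable of (doc_id, entity, feature) tuples.
    """
    by_entity = defaultdict(set)
    for _doc_id, entity, feature in observations:
        by_entity[entity].add(feature)  # assumption: duplicates count once
    return by_entity


def label_applies(features, weights, threshold):
    """Return True when the summed weights of the observed features exceed the threshold."""
    total = sum(weights.get(f, 0.0) for f in features)
    return total > threshold


# Features for "Alice Smith" come from contexts in two different documents.
observations = [
    ("doc1", "Alice Smith", "preceded_by:Dr."),
    ("doc2", "Alice Smith", "followed_by:said"),
    ("doc2", "Acme", "followed_by:said"),
]
features = collect_features(observations)
labels = {e: label_applies(f, PERSON_WEIGHTS, THRESHOLD) for e, f in features.items()}
```

In this sketch neither document alone pushes "Alice Smith" over the threshold; only the combined evidence from both contexts does, which is the cross-document aspect the claim recites.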
Abstract
Entities, such as people, places and things, are labeled based on information collected across a possibly large number of documents. One or more documents are scanned to recognize the entities, and features are extracted from the context in which those entities occur in the documents. Observed entity-feature pairs are stored either in an in-memory store or an external store. A store manager optimizes use of the limited amount of space for an in-memory store by determining which store to put an entity-feature pair in, and when to evict features from the in-memory store to make room for new pairs. Features that may be observed in an entity's context may take forms such as specific word sequences or membership in a particular list.
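One way to picture the store manager is a two-tier store with a bounded in-memory tier. The sketch below is a toy under stated assumptions, not the patented design: the class name `StoreManager`, the use of raw observation counts as the frequency estimate, and the rule of promoting a pair once it has been seen at least as often as the least-frequent resident pair are all hypothetical. A plain dictionary stands in for the external store.

```python
class StoreManager:
    """Toy two-tier store: a bounded in-memory dict plus a dict standing in for an external store."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = {}    # (entity, feature) -> observed count
        self.external = {}  # stand-in for a disk- or database-backed store

    def add(self, entity, feature):
        """Record one observation and return which store now holds the pair."""
        pair = (entity, feature)
        if pair in self.memory:
            self.memory[pair] += 1
            return "memory"
        if len(self.memory) < self.capacity:
            # Room is available: carry over any count kept externally.
            self.memory[pair] = self.external.pop(pair, 0) + 1
            return "memory"
        # Memory is full: compare against the least frequently seen resident pair.
        coldest = min(self.memory, key=self.memory.get)
        seen_before = self.external.get(pair, 0)
        if self.memory[coldest] <= seen_before:
            # Evict the coldest resident to the external store and promote the new pair.
            self.external[coldest] = self.external.get(coldest, 0) + self.memory.pop(coldest)
            self.memory[pair] = self.external.pop(pair) + 1
            return "memory"
        self.external[pair] = seen_before + 1
        return "external"


manager = StoreManager(capacity=2)
placements = [
    manager.add("A", "f1"),
    manager.add("A", "f1"),
    manager.add("B", "f2"),
    manager.add("C", "f3"),  # memory full, pair is new: goes to the external store
    manager.add("C", "f3"),  # now as frequent as ("B", "f2"): promoted, ("B", "f2") evicted
]
```

The placement decision here depends on the estimated frequency of a pair already held in memory, which loosely mirrors the store-manager language of claim 9; a real system would likely use a cheaper frequency estimate than exact counts.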
20 Claims
1. A method of classifying entities, the method comprising:
recognizing occurrences of an entity in a plurality of documents;
identifying a plurality of features in contexts of said occurrences, a first one of the features being derived from a first context of a first one of the occurrences in a first one of the documents, and a second one of the features being derived from a second context of a second one of the occurrences in a second one of the documents;
calculating a sum of a plurality of weights, wherein each of the features is associated with one of the weights;
making a first determination that said sum exceeds a first threshold;
making a second determination that a label applies to said entity based on said first determination; and
storing or communicating a fact that said label applies to said entity.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system that classifies entities, the system comprising:
an entity recognizer that identifies occurrence of a first entity in a document;
a feature extractor that identifies a feature that occurs in a context of said occurrence of said first entity;
a first entity-feature store that is located in a volatile random-access memory and that stores entity-feature pairs;
a second entity-feature store that is not located in said memory and that stores entity-feature pairs;
a store manager that receives a first entity-feature pair obtained from a document processed by collective action of said entity recognizer and said feature extractor, and that determines whether to store said first entity-feature pair in said first entity-feature store or said second entity-feature store based on a first estimated frequency of a second entity-feature pair that is stored in said first entity-feature store.
Dependent claims: 10, 11, 12, 13, 14
15. One or more computer-readable storage media that store executable instructions to perform a method of classifying entities, the method comprising:
recognizing occurrences of an entity in a plurality of documents;
identifying a plurality of first features in contexts of said occurrences;
assigning a label to said entity, based on weights of said first features, when probability of correctness of said label reaches a threshold, there being at least some second features in said document that are associated with said entity but that are not included among said first features; and
verifying correctness of said label based on one or more of said second features.
Dependent claims: 16, 17, 18, 19, 20
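The assign-then-verify flow of claim 15 can be sketched as follows. The claim does not say how the probability of correctness is computed or how verification is performed, so both rules below are assumptions made for illustration: the probability is a logistic function of the summed weights, and verification passes when the held-back second features are not net-negative evidence. All feature names and weights are hypothetical.

```python
import math


def probability(features, weights):
    """Map the summed feature weights to (0, 1) with a logistic function (an assumption)."""
    return 1.0 / (1.0 + math.exp(-sum(weights.get(f, 0.0) for f in features)))


def label_and_verify(first_features, second_features, weights, threshold):
    """Assign a label from first_features, then check it against held-back second_features.

    Returns (assigned, verified). The verification rule (held-back evidence must
    not be net-negative) is a hypothetical choice, not taken from the claim.
    """
    assigned = probability(first_features, weights) >= threshold
    if not assigned:
        return False, False
    verified = sum(weights.get(f, 0.0) for f in second_features) >= 0.0
    return True, verified


# Hypothetical weights for the label "person".
WEIGHTS = {"preceded_by:Dr.": 2.0, "in_list:first_names": 1.5, "followed_by:Inc.": -3.0}
first = {"preceded_by:Dr.", "in_list:first_names"}

result_ok = label_and_verify(first, set(), WEIGHTS, 0.9)
result_bad = label_and_verify(first, {"followed_by:Inc."}, WEIGHTS, 0.9)
```

In both calls the label is assigned from the first features alone; the second call shows a held-back feature that contradicts the label, so verification fails even though assignment succeeded.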
Specification