Leveraging cross-document context to label entity
First Claim
1. A method of classifying entities, the method comprising:
- using a processor to perform acts comprising:
recognizing occurrences of an entity in a plurality of documents;
identifying a plurality of features in contexts of said occurrences, a first one of the features being derived from a first context of a first one of the occurrences in a first one of the documents, and a second one of the features being derived from a second context of a second one of the occurrences in a second one of the documents;
calculating a sum of a plurality of weights, wherein each of the features is associated with one of the weights;
making a first determination that said sum exceeds a first threshold;
making a second determination that a label applies to said entity based on said first determination; and
storing or communicating a fact that said label applies to said entity, wherein said features comprise membership in a list, and wherein said acts further comprise:
choosing a subset of members of said members of said list based on members in said subset being estimated to occur more frequently in said documents than other members of said list; and
comparing a string that occurs in at least one of said contexts with members of said subset; and
making a third determination that said string represents a member of said list, said third determination being made (a) based on said string's being among said subset, and (b) without use of a filter that accepts all strings that are members of said list and accepts at least one string that is not a member of said list.
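The acts recited above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the feature name `in_list`, the sample phrases, and the weights are all hypothetical, a plain token count stands in for the estimated frequency, and each distinct feature is assumed to contribute its weight once. Membership is tested with an exact set lookup over the frequent subset, so a string that is not on the list is never accepted, while rare list members outside the subset are simply missed.

```python
from collections import Counter


def frequent_subset(list_members, documents, k):
    """Keep the k list members that occur most often in the documents.

    Assumption: a plain token count stands in for the estimated frequency.
    """
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        for member in list_members:
            counts[member] += tokens.count(member)
    return {member for member, _ in counts.most_common(k)}


def label_applies(contexts, weights, subset, threshold):
    """Sum the weights of features observed in contexts drawn from several documents."""
    observed = set()
    for context in contexts:
        text = context.lower()
        # List-membership feature: exact lookup in the subset, so a string
        # that is not on the list can never be accepted.
        if any(token in subset for token in text.split()):
            observed.add("in_list")
        # Word-sequence features: hypothetical phrases keyed in `weights`.
        for feature in weights:
            if feature != "in_list" and feature in text:
                observed.add(feature)
    total = sum(weights[f] for f in observed)
    return total > threshold, total


# Hypothetical data: a list of cities and three sample documents.
cities = ["paris", "london", "springfield", "ouagadougou"]
docs = ["She flew from Paris to London", "London and Paris are capitals", "the mayor of Paris"]
subset = frequent_subset(cities, docs, k=2)

# Two contexts of the same entity, taken from two different documents.
contexts = ["Acme is headquartered in Paris", "shares of Acme rose sharply"]
weights = {"headquartered in": 0.5, "shares of": 0.5, "in_list": 0.25}
applies, total = label_applies(contexts, weights, subset, threshold=1.0)
```

In this sketch no single context carries enough weight to pass the threshold; the label is applied only because evidence from both documents is combined.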
Abstract
Entities, such as people, places and things, are labeled based on information collected across a possibly large number of documents. One or more documents are scanned to recognize the entities, and features are extracted from the context in which those entities occur in the documents. Observed entity-feature pairs are stored either in an in-memory store or an external store. A store manager optimizes use of the limited amount of space for an in-memory store by determining which store to put an entity-feature pair in, and when to evict features from the in-memory store to make room for new pairs. Features that may be observed in an entity's context may take forms such as specific word sequences or membership in a particular list.
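One way to picture the extraction step described in the abstract is the sketch below. It is an assumption-laden illustration: entity recognition is reduced to matching tokens against a known set, the context is a fixed token window, and the feature naming scheme (`prev:`, `next:`, `near_list:`) is invented here for readability.

```python
def extract_pairs(document, entities, lists, window=3):
    """Return (entity, feature) pairs observed around each entity occurrence."""
    tokens = document.split()
    pairs = []
    for i, token in enumerate(tokens):
        if token not in entities:  # stand-in for a real entity recognizer
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        # Word-sequence features: up to two tokens on each side of the entity.
        if left:
            pairs.append((token, "prev:" + " ".join(left[-2:])))
        if right:
            pairs.append((token, "next:" + " ".join(right[:2])))
        # List-membership features: some context token belongs to a named list.
        for list_name, members in lists.items():
            if any(t.lower() in members for t in left + right):
                pairs.append((token, "near_list:" + list_name))
    return pairs


pairs = extract_pairs(
    "Dr. Smith visited Paris yesterday",
    entities={"Smith"},
    lists={"city": {"paris"}},
)
```

Each returned pair is what a store manager would then place in either the in-memory store or the external store.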
17 Claims
1. A method of classifying entities, the method comprising:
- using a processor to perform acts comprising:
recognizing occurrences of an entity in a plurality of documents;
identifying a plurality of features in contexts of said occurrences, a first one of the features being derived from a first context of a first one of the occurrences in a first one of the documents, and a second one of the features being derived from a second context of a second one of the occurrences in a second one of the documents;
calculating a sum of a plurality of weights, wherein each of the features is associated with one of the weights;
making a first determination that said sum exceeds a first threshold;
making a second determination that a label applies to said entity based on said first determination; and
storing or communicating a fact that said label applies to said entity, wherein said features comprise membership in a list, and wherein said acts further comprise:
choosing a subset of members of said members of said list based on members in said subset being estimated to occur more frequently in said documents than other members of said list; and
comparing a string that occurs in at least one of said contexts with members of said subset; and
making a third determination that said string represents a member of said list, said third determination being made (a) based on said string's being among said subset, and (b) without use of a filter that accepts all strings that are members of said list and accepts at least one string that is not a member of said list.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system that classifies entities, the system comprising:
a volatile random-access memory;
an entity recognizer that identifies occurrence of a first entity in a document;
a feature extractor that identifies a feature that occurs in a context of said occurrence of said first entity;
a first entity-feature store that is located in said memory and that stores entity-feature pairs;
a second entity-feature store that is not located in said memory and that stores entity-feature pairs; and
a store manager that receives a first entity-feature pair obtained from a document processed by collective action of said entity recognizer and said feature extractor, wherein said store manager maintains a first list of entity-feature pairs and defines a subset of said first list as being those entity feature pairs that are estimated to occur at least with a first estimated frequency, a second entity-feature pair being stored in said first entity-feature store and being in said subset, and wherein said store manager determines whether to store said first entity-feature pair in said first entity-feature store or said second entity-feature store based on a first estimated frequency of said second entity-feature pair.
Dependent claims: 9, 10, 11, 12, 13
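The store manager's placement decision might look roughly like the sketch below. It assumes, for illustration only, that the in-memory store is a dict mapping pairs to estimated frequencies, that it has a fixed capacity, and that the "second entity-feature pair" used for comparison is the least frequent pair currently resident in memory. The name `choose_store` is hypothetical.

```python
def choose_store(est_freq, memory_store, capacity):
    """Decide where a newly observed entity-feature pair should go.

    memory_store maps resident pairs to their estimated frequencies.
    Returns "memory" or "external"; nothing is mutated here.
    """
    if len(memory_store) < capacity:
        return "memory"  # free space, no comparison needed
    # Compare against the estimated frequency of a resident pair
    # (assumption: the least frequent one).
    resident_floor = min(memory_store.values())
    return "memory" if est_freq > resident_floor else "external"


resident = {("Smith", "prev:Dr."): 5, ("Acme", "next:Inc."): 2}
decision_common = choose_store(3, resident, capacity=2)
decision_rare = choose_store(1, resident, capacity=2)
```

Keeping the decision separate from any mutation of the stores makes it straightforward to swap in a different frequency estimator.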
14. One or more computer-readable storage media that store executable instructions to perform a method of classifying entities, the method comprising:
recognizing occurrences of an entity in a plurality of documents;
identifying a plurality of first features in contexts of said occurrences;
assigning a label to said entity, based on weights of said first features, when probability of correctness of said label reaches a threshold, there being at least some second features in said document that are associated with said entity but that are not included among said first features;
verifying correctness of said label based on one or more of said second features;
storing a first pair that comprise said entity and one of said first features in either a first entity-feature store that is in volatile random access memory, or a second entity feature store that is not in said memory;
choosing whether to store said first pair in said first entity-feature store based on a first estimate of said frequency of said first pair in said plurality of documents;
determining that said first estimate of said frequency is higher than an estimated frequency of a second pair that is stored in said first entity-feature store; and
evicting said second pair from said first entity-feature store.
Dependent claims: 15, 16, 17
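The storing and evicting steps of this claim can be illustrated with the following sketch. Several details are assumptions rather than claim language: both stores are modeled as dicts from pairs to estimated frequencies, the eviction victim is the least frequent resident pair, and the evicted pair is moved to the external store rather than discarded. The function name is hypothetical.

```python
def store_with_eviction(pair, est_freq, memory_store, external_store, capacity):
    """Store a pair, evicting a less frequent resident pair if memory is full.

    Returns the evicted pair, or None if nothing was evicted.
    """
    if pair in memory_store or len(memory_store) < capacity:
        memory_store[pair] = est_freq
        return None
    # Candidate victim: the resident pair with the lowest estimated frequency.
    victim = min(memory_store, key=memory_store.get)
    if est_freq > memory_store[victim]:
        # Assumption: the evicted pair is kept in the external store.
        external_store[victim] = memory_store.pop(victim)
        memory_store[pair] = est_freq
        return victim
    external_store[pair] = est_freq
    return None


memory = {("Smith", "prev:Dr."): 5, ("Acme", "next:Inc."): 2}
external = {}
evicted = store_with_eviction(("Paris", "near_list:city"), 3, memory, external, capacity=2)
```

After the call, the less frequent resident pair has been moved out of memory and the new pair has taken its place.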
Specification