Entity analysis system
First Claim
1. A computer-implemented method of learning related entities, the method comprising:
receiving a set of entities, the set of entities including a plurality of entities and each entity in the set of entities relating to a first concept;
receiving training content that includes textual content that is organized and that includes the plurality of entities of the set of entities; and
learning additional entities that are related to the first concept by iteratively performing the following steps;
identifying one or more potential word templates from the training content based on occurrences of one or more words in the training content with an entity of the set of entities, wherein each potential word template is one or more words, and wherein each potential word template is tagged with a part-of-speech tag based on grammatical use of the one or more words in the training content;
identifying one or more word templates from the one or more potential word templates based on a frequency of occurrence of the one or more potential word templates and based on the part-of-speech tag of the one or more potential word templates compared to part-of-speech tags of word templates of a set of word templates, wherein the one or more identified word templates are added to the set of word templates;
identifying, for each identified word template, one or more part-of-speech tags of the identified word templates;
adjusting, for each identified word template, a confidence score of the identified word template when the one or more part of speech tags of the identified word template is similar to the part-of-speech tags of word templates of a set of word templates;
adjusting, for each identified word template, the confidence score of the identified word template when the identified word template is identified as being a false positive;
comparing, for each identified word template, the confidence score of the identified word template to a threshold value;
removing the identified word template from the set of word templates when the confidence score of the identified word template is outside the threshold value;
identifying one or more candidate entities that relate to the first concept based on occurrences of each of the one or more candidate entities in the training content with at least one of the word templates of the set of word templates, wherein the one or more candidate entities are added to a set of candidate entities;
identifying a part-of-speech tag for each candidate entity;
removing a candidate entity from the set of candidate entities when the part-of-speech tag of the candidate entity is different from a part-of-speech tag of the set of entities;
receiving an external input selecting candidate entities for removal if the selected candidate entities do not relate to the first concept from the set of candidate entities;
removing candidate entities from the set of candidate entities based on the received external input;
adding one or more candidate entities remaining in the set of candidate entities to the set of entities; and
storing the set of entities in association with the first concept.
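The template scoring and removal steps recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation. The `Template` class, the `pos_similarity` measure, and all numeric values (starting score, boost, penalty, threshold) are hypothetical choices made for illustration, and the claim does not specify how similarity is computed or in which direction a score is adjusted.

```python
from dataclasses import dataclass


@dataclass
class Template:
    words: tuple
    pos_tags: tuple
    confidence: float = 0.5  # hypothetical starting score


def pos_similarity(a, b):
    """Fraction of aligned positions whose tags match (an assumed measure)."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def score_and_prune(identified, accepted, false_positives,
                    threshold=0.5, boost=0.2, penalty=0.4, min_similarity=0.6):
    """Adjust each template's confidence, then keep those at or above threshold."""
    kept = []
    for t in identified:
        # Raise the score when the tag pattern resembles an accepted template.
        if any(pos_similarity(t.pos_tags, a.pos_tags) >= min_similarity
               for a in accepted):
            t.confidence += boost
        # Lower the score when the template was flagged as a false positive.
        if t.words in false_positives:
            t.confidence -= penalty
        # Compare against the threshold and drop templates that fall below it.
        if t.confidence >= threshold:
            kept.append(t)
    return kept


accepted = [Template(("was", "elected", "president"), ("VBD", "VBN", "NN"), 0.9)]
identified = [
    Template(("was", "sworn", "in"), ("VBD", "VBN", "IN")),
    Template(("said", "on", "Tuesday"), ("VBD", "IN", "NNP")),
    Template(("was", "named", "chairman"), ("VBD", "VBN", "NN")),
]
false_positives = {("said", "on", "Tuesday"), ("was", "named", "chairman")}

kept = score_and_prune(identified, accepted, false_positives)
print([t.words for t in kept])
```

In this sketch a template that resembles accepted templates can still be removed if the false-positive penalty outweighs the similarity boost, which is one possible reading of how the two adjustments interact.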
Abstract
A method for building a factual database of concepts and entities that are related to the concepts through a learning process. Training content (e.g., news articles, books) and a set of entities (e.g., Bill Clinton and Barack Obama) that are related to a concept (e.g., Presidents) are received. Groups of words that co-occur frequently in the textual content in conjunction with the entities are identified as templates. Templates may also be identified by analyzing part-of-speech patterns of the templates. Entities that co-occur frequently in the textual content in conjunction with the templates are identified as additional related entities (e.g., Ronald Reagan and Richard Nixon). To eliminate erroneous results, the identified entities may be presented to a user who removes any false positives. The entities are then stored in association with the concept.
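One round of the learning process described in the abstract might be sketched as follows. This is a simplified illustration under stated assumptions, not the patented method: a template is taken to be the fixed window of words that follows a known entity, a candidate entity is taken to be a run of capitalized words, and the function names, window size, and minimum count are hypothetical.

```python
import re
from collections import Counter


def find_templates(sentences, entities, window=3, min_count=2):
    """Collect word groups that frequently follow a known entity."""
    counts = Counter()
    for sentence in sentences:
        for entity in entities:
            idx = sentence.find(entity)
            if idx == -1:
                continue
            following = sentence[idx + len(entity):].split()[:window]
            if len(following) == window:
                counts[" ".join(following)] += 1
    return {t for t, c in counts.items() if c >= min_count}


def find_candidates(sentences, templates, known):
    """Find capitalized word runs that appear directly before a template."""
    found = set()
    for template in templates:
        pattern = re.compile(r"((?:[A-Z][a-z]+\s)+)" + re.escape(template))
        for sentence in sentences:
            for match in pattern.finditer(sentence):
                name = match.group(1).strip()
                if name not in known:
                    found.add(name)
    return found


sentences = [
    "Bill Clinton was elected president in 1992.",
    "Barack Obama was elected president in 2008.",
    "Ronald Reagan was elected president in 1980.",
    "Richard Nixon was elected president in 1968.",
    "The weather was mild in 1992.",
]
seeds = {"Bill Clinton", "Barack Obama"}  # entities related to "Presidents"

templates = find_templates(sentences, seeds)
candidates = find_candidates(sentences, templates, seeds)
print(templates)
print(sorted(candidates))
```

In a fuller system the candidates would then be shown to a user for removal of false positives before being merged into the entity set and stored with the concept.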
15 Citations
20 Claims
3. A computer-implemented method of learning related entities, the method comprising:
receiving a set of entities, the set of entities including a plurality of entities and each entity in the set of entities relating to a first concept;
receiving training content that includes textual content that is organized and that includes the plurality of entities of the set of entities; and
learning additional entities that are related to the first concept by iteratively performing the following steps;
identifying one or more word templates from the training content based on occurrences of one or more words in the training content with an entity of the set of entities, wherein each word template is one or more words, and wherein the one or more identified word templates are added to a set of word templates;
identifying, for each identified word template, one or more part-of-speech tags of the identified word templates;
adjusting, for each identified word template, a confidence score of the identified word template when the one or more part of speech tags of the identified word template is similar to the part-of-speech tags of word templates of a set of word templates;
adjusting, for each identified word template, the confidence score of the identified word template when the identified word template is identified as being a false positive;
comparing, for each identified word template, the confidence score of the identified word template to a threshold value;
removing the identified word template from the set of word templates when the confidence score of the identified word template is outside the threshold value;
identifying one or more candidate entities that relate to the first concept based on occurrences of each of the one or more candidate entities in the training content with at least one of the word templates of the set of word templates, wherein the one or more identified candidate entities are added to a set of candidate entities;
identifying a part-of-speech tag for each candidate entity;
removing a candidate entity from the set of candidate entities when the part-of-speech tag of the candidate entity is different from the part-of-speech tag of the set of entities;
receiving an external input selecting candidate entities for removal if the selected candidate entities do not relate to the first concept from the set of candidate entities;
removing candidate entities from the set of candidate entities based on the received external input; and
adding one or more candidate entities remaining in the set of candidate entities to the set of entities.
20. A computer product for learning related entities, the computer product comprising a non-transitory computer-readable medium containing computer program code for performing the method comprising:
receiving a set of entities, the set of entities including a plurality of entities and each entity in the set of entities relating to a first concept;
receiving training content that includes textual content that is organized and that includes the plurality of entities of the set of entities; and
learning additional entities that are related to the first concept by iteratively performing the following steps;
identifying one or more word templates from the training content based on occurrences of one or more words in the training content with an entity of the set of entities, wherein each word template is one or more words, and wherein the one or more identified word templates are added to a set of word templates;
identifying, for each identified word template, one or more part-of-speech tags of the identified word templates;
adjusting, for each identified word template, a confidence score of the identified word template when the one or more part of speech tags of the identified word template is similar to the part-of-speech tags of word templates of a set of word templates;
adjusting, for each identified word template, the confidence score of the identified word template when the identified word template is identified as being a false positive;
comparing, for each identified word template, the confidence score of the identified word template to a threshold value;
removing the identified word template from the set of word templates when the confidence score of the identified word template is outside the threshold value;
identifying one or more candidate entities that relate to the first concept based on occurrences of each of the one or more candidate entities in the training content with at least one of the word templates of the set of word templates, wherein the one or more identified candidate entities are added to a set of candidate entities;
identifying a part-of-speech tag for each candidate entity;
removing a candidate entity from the set of candidate entities when the part-of-speech tag of the candidate entity is different from a part-of-speech tag of the set of entities;
receiving an external input selecting candidate entities for removal if the selected candidate entities do not relate to the first concept from the set of candidate entities;
removing candidate entities from the set of candidate entities based on the received external input; and
adding the one or more candidate entities of the set of candidate entities to the set of entities.
Specification