Method and apparatus for extracting entity names and their relations

US 8,745,093 B1
Filed: 09/28/2000
Issued: 06/03/2014
Est. Priority Date: 09/28/2000
Status: Expired due to Fees



Abstract

According to one embodiment of the invention, a method includes generating a person-name Information Gain (IG)-Tree and a relation IG-Tree from annotated data. The method also includes tagging and partial parsing of an input document. The names of the persons are extracted within the input document using the person-name IG-tree. Additionally, names of organizations are extracted within the input document. The method also includes extracting entity names that are not names of persons and organizations within the input document. Further, the relations between the identified entity names are extracted using the relation-IG-tree.
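
The abstract does not say how an IG-Tree is built. As a rough illustration only, the sketch below follows the general IGTree idea from the memory-based learning literature: rank features by information gain, then store the training instances in a tree that tests features in that order and keeps a default class at every node. The function names, the three toy features (previous word, capitalization, next word), and the PERSON/OTHER labels are assumptions made for this example, not details taken from the patent.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(instances, labels, feature_index):
    """Entropy reduction obtained by splitting on one feature."""
    by_value = defaultdict(list)
    for instance, label in zip(instances, labels):
        by_value[instance[feature_index]].append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in by_value.values())
    return entropy(labels) - remainder


def build_ig_tree(instances, labels):
    """Return (feature order, root node); features are tested in order of decreasing gain."""
    order = sorted(
        range(len(instances[0])),
        key=lambda i: information_gain(instances, labels, i),
        reverse=True,
    )

    def grow(rows, depth):
        node = {
            "default": Counter(label for _, label in rows).most_common(1)[0][0],
            "children": {},
        }
        # Expand only while features remain and the node is still ambiguous.
        if depth < len(order) and len({label for _, label in rows}) > 1:
            groups = defaultdict(list)
            for instance, label in rows:
                groups[instance[order[depth]]].append((instance, label))
            for value, subset in groups.items():
                node["children"][value] = grow(subset, depth + 1)
        return node

    return order, grow(list(zip(instances, labels)), 0)


def classify(tree, instance):
    """Follow matching feature values; fall back to the default of the last node reached."""
    order, node = tree
    for feature_index in order:
        child = node["children"].get(instance[feature_index])
        if child is None:
            break
        node = child
    return node["default"]


# Hypothetical toy data: (previous word, capitalization, next word) -> label.
INSTANCES = [
    ("Mr.", "cap", "said"),
    ("Dr.", "cap", "said"),
    ("the", "cap", "said"),
    ("the", "lower", "of"),
    ("Mr.", "cap", "of"),
]
LABELS = ["PERSON", "PERSON", "OTHER", "OTHER", "PERSON"]

tree = build_ig_tree(INSTANCES, LABELS)
print(classify(tree, ("Mr.", "cap", "visited")))  # PERSON
print(classify(tree, ("the", "cap", "said")))     # OTHER
print(classify(tree, ("Prof.", "cap", "said")))   # PERSON (root default, unseen value)
```

In this toy set the previous word separates the classes completely, so it is tested first and the tree stops after one level. An unseen value falls back to the default class stored at the last node reached.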

15 Claims

1. A method comprising:
- receiving annotated data;
- parsing, at least partially, the annotated data, wherein parsing includes identifying syntactic structure of sentences within the annotated data; and
- extracting training sets from the parsed annotated data, wherein the training sets are based on a plurality of features, wherein extracting comprises at least one of tagging the annotated data for marking words, and defining and segmenting words based on languages, wherein extracting further comprises extracting entity names and relations between entity names based on the information sets, and wherein extracting further comprises identifying information sets using memory-based Information Gain (IG)-Trees, wherein the IG-Trees are generated based on the plurality of features, wherein the plurality of features comprise one or more of words, phrases, sentences, and objects, and wherein each information set is identified based on a corresponding memory-based IG-Tree including one or more of a person-name IG-Tree, an entity-name IG-Tree, a noun phrase IG-Tree, and a relation IG-Tree.
  - 2. The method of claim 1, wherein the plurality of features comprise local context features that are extracted from syntactical relationship between two or more words of a set of words, and wherein the local context features are further extracted from a semantic feature relating to a semantic category of one or more headwords of the set of words, wherein the plurality of features further comprise one or more of global context features, surface linguistic features, and deep linguistic features, wherein the global context features comprise a broader view of each of the set of words in relation with content of an entire document that contains the set of words.
  - 3. The method of claim 1, wherein memory-based IG-Trees are based on memory-based learning including classification-based supervised learning, wherein the memory-based learning includes one or more of similarity-based learning, example-based learning, analogy-based learning, case-based learning, instance-based learning, and lazy learning.
  - 4. The method of claim 1, wherein the training sets comprise one or more of a first training set including words in names, a second training set including entity names, a third training set including phrases, and a fourth training set including relationships amongst the entities.
  - 5. The method of claim 1, wherein the marked words comprise one or more of nouns, verbs, proper nouns, pronouns, adverbs, and adjectives, wherein the marked words are compared with previously-identified marked words to identify missing marked words.
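
Claim 3 ties the IG-Trees to memory-based (instance-based, lazy) learning. As a minimal sketch of that learning style, and not the patented procedure, the classifier below simply stores every training instance and labels a new one by majority vote among its nearest stored neighbors under an overlap distance. The class name, the optional per-feature `weights` (which could be information-gain values), and the relation labels are hypothetical.

```python
from collections import Counter


class MemoryBasedClassifier:
    """Lazy learner: training only stores instances; all work happens at query time."""

    def __init__(self, weights=None):
        self.memory = []        # list of (feature tuple, label)
        self.weights = weights  # optional per-feature weights (assumption)

    def fit(self, instances, labels):
        self.memory = list(zip(instances, labels))
        return self

    def _distance(self, a, b):
        # Overlap metric: add a feature's weight whenever the two values differ.
        weights = self.weights or [1.0] * len(a)
        return sum(w for w, x, y in zip(weights, a, b) if x != y)

    def predict(self, instance, k=1):
        ranked = sorted(self.memory, key=lambda item: self._distance(instance, item[0]))
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]


# Hypothetical relation examples: (first entity type, connecting word, second entity type).
examples = [
    (("PERSON", "joined", "ORG"), "EMPLOYEE_OF"),
    (("PERSON", "works_for", "ORG"), "EMPLOYEE_OF"),
    (("ORG", "based_in", "LOC"), "LOCATED_IN"),
]
clf = MemoryBasedClassifier().fit([x for x, _ in examples], [y for _, y in examples])
print(clf.predict(("ORG", "moved_to", "LOC")))  # LOCATED_IN
```

An IG-Tree can be seen as a compressed index over the same stored instances: it trades the exhaustive neighbor search shown here for a fixed feature order.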

6. A system having a storage device to store instructions, and a processing device to execute the instructions, wherein the execution of the instructions cause the processing device to perform one or more operations comprising:
- receiving annotated data;
- parsing, at least partially, the annotated data, wherein parsing includes identifying syntactic structure of sentences within the annotated data; and
- extracting training sets from the parsed annotated data, wherein the training sets are based on a plurality of features, wherein extracting comprises at least one of tagging the annotated data for marking words, and defining and segmenting words based on languages, wherein extracting further comprises extracting entity names and relations between entity names based on the information sets, and wherein extracting further comprises identifying information sets using memory-based Information Gain (IG)-Trees, wherein the IG-Trees are generated based on the plurality of features, wherein the plurality of features comprise one or more of words, phrases, sentences, and objects, and wherein each information set is identified based on a corresponding memory-based IG-Tree including one or more of a person-name IG-Tree, an entity-name IG-Tree, a noun phrase IG-Tree, and a relation IG-Tree.
  - 7. The system of claim 6, wherein the plurality of features comprise local context features that are extracted from syntactical relationship between two or more words of a set of words, and wherein the local context features are further extracted from a semantic feature relating to a semantic category of one or more headwords of the set of words, wherein the plurality of features further comprise one or more of global context features, surface linguistic features, and deep linguistic features, wherein the global context features comprise a broader view of each of the set of words in relation with content of an entire document that contains the set of words.
  - 8. The system of claim 6, wherein memory-based IG-Trees are based on memory-based learning including classification-based supervised learning, wherein the memory-based learning includes one or more of similarity-based learning, example-based learning, analogy-based learning, case-based learning, instance-based learning, and lazy learning.
  - 9. The system of claim 6, wherein the training sets comprise one or more of a first training set including words in names, a second training set including entity names, a third training set including phrases, and a fourth training set including relationships amongst the entities.
  - 10. The system of claim 6, wherein the marked words comprise one or more of nouns, verbs, proper nouns, pronouns, adverbs, and adjectives, wherein the marked words are compared with previously-identified marked words to identify missing marked words.
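
Claims 2, 7, and 12 name local context, global context, and surface linguistic features without fixing a representation. The sketch below shows one plausible encoding for a single token of a tagged sentence: neighboring words and tags as local context, capitalization as a surface feature, and the word's frequency across the whole document as a global feature. The feature names, the sentence-boundary markers, and the Penn Treebank style tags are assumptions chosen for illustration.

```python
from collections import Counter


def extract_features(tagged_tokens, index):
    """Build a feature dict for the token at `index`.

    tagged_tokens is a list of (word, pos_tag) pairs covering the document.
    """
    word, tag = tagged_tokens[index]
    # Local context: the immediate left and right neighbors, with boundary markers.
    prev_word, prev_tag = tagged_tokens[index - 1] if index > 0 else ("<s>", "<s>")
    if index + 1 < len(tagged_tokens):
        next_word, next_tag = tagged_tokens[index + 1]
    else:
        next_word, next_tag = ("</s>", "</s>")
    # Global context: how often the same word form occurs in the whole document.
    doc_counts = Counter(w.lower() for w, _ in tagged_tokens)
    return {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),  # surface feature
        "pos": tag,
        "prev_word": prev_word.lower(),
        "prev_pos": prev_tag,
        "next_word": next_word.lower(),
        "next_pos": next_tag,
        "doc_frequency": doc_counts[word.lower()],
    }


# Hypothetical tagged input.
tokens = [
    ("Mr.", "NNP"), ("Smith", "NNP"), ("joined", "VBD"), ("Intel", "NNP"), (".", "."),
    ("Smith", "NNP"), ("left", "VBD"), (".", "."),
]
features = extract_features(tokens, 1)
print(features["prev_word"], features["next_word"], features["doc_frequency"])  # mr. joined 2
```

Each feature dict could then be flattened into a fixed-order tuple and used as one training instance for a tree or memory-based classifier.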

11. A machine-readable medium having stored thereon instructions which when executed by a processing device, cause the computing device to perform one or more operations comprising:
- receiving annotated data;
- parsing, at least partially, the annotated data, wherein parsing includes identifying syntactic structure of sentences within the annotated data;
- extracting training sets from the parsed annotated data, wherein the training sets are based on a plurality of features, wherein extracting comprises at least one of tagging the annotated data for marking words, and defining and segmenting words based on languages, wherein extracting further comprises extracting entity names and relations between entity names based on the information sets, and wherein extracting further comprises identifying information sets using memory-based Information Gain (IG)-Trees, wherein the IG-Trees are generated based on the plurality of features, wherein the plurality of features comprise one or more of words, phrases, sentences, and objects, and wherein each information set is identified based on a corresponding memory-based IG-Tree including one or more of a person-name IG-Tree, an entity-name IG-Tree, a noun phrase IG-Tree, and a relation IG-Tree.
  - 12. The machine-readable medium of claim 11, wherein the plurality of features comprise local context features that are extracted from syntactical relationship between two or more words of a set of words, and wherein the local context features are further extracted from a semantic feature relating to a semantic category of one or more headwords of the set of words, wherein the plurality of features further comprise one or more of global context features, surface linguistic features, and deep linguistic features, wherein the global context features comprise a broader view of each of the set of words in relation with content of an entire document that contains the set of words.
  - 13. The machine-readable medium of claim 11, wherein memory-based IG-Trees are based on memory-based learning including classification-based supervised learning, wherein the memory-based learning includes one or more of similarity-based learning, example-based learning, analogy-based learning, case-based learning, instance-based learning, and lazy learning.
  - 14. The machine-readable medium of claim 11, wherein the training sets comprise one or more of a first training set including words in names, a second training set including entity names, a third training set including phrases, and a fourth training set including relationships amongst the entities.
  - 15. The machine-readable medium of claim 11, wherein the marked words comprise one or more of nouns, verbs, proper nouns, pronouns, adverbs, and adjectives, wherein the marked words are compared with previously-identified marked words to identify missing marked words.
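
Claims 5, 10, and 15 say that marked words are compared with previously identified marked words to find missing ones, but the claims give no procedure. Under one possible reading, sketched below, a word counts as "missing" when it was marked in an earlier pass, appears in the current text, and did not receive a tag of interest this time. The function name, the choice of `NNP` as the tag of interest, and the sample data are all assumptions.

```python
def find_missing_marked_words(tagged_tokens, previously_marked, tags_of_interest=frozenset({"NNP"})):
    """Return previously marked words that occur in the text but were not marked this time."""
    marked_now = {word for word, tag in tagged_tokens if tag in tags_of_interest}
    words_in_text = {word for word, _ in tagged_tokens}
    return sorted((set(previously_marked) & words_in_text) - marked_now)


# Hypothetical example: the tagger labeled "Smith" as a common noun in this sentence.
current = [("Smith", "NN"), ("joined", "VBD"), ("Intel", "NNP")]
known_names = {"Smith", "Jones"}
print(find_missing_marked_words(current, known_names))  # ['Smith']
```

Words recovered this way could be handed back to the extraction step as additional name candidates.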


Current Assignee: Intel Corporation
Original Assignee: Intel Corporation
Inventors: Zhang, Yimin; Zhou, Joe F.
Primary Examiner(s): BURKE, JEFF A
Application Number: US10/019,879
Time in Patent Office: 4,996 Days
Field of Search: 706/20, 706/25, 706/26, 706/16-18, 707/755, 707/797
US Class Current: 707/797
CPC Class Codes:
- G06F 16/313   Selection or weighting of t...
- G06F 16/3344   using natural language anal...
- G06F 16/81   Indexing, e.g. XML tags; Da...
- G06F 16/94   Hypermedia Hyperlinking G06...
- G06N 20/00   Machine learning
