System and method for extracting information from text using text annotation and fact extraction
Abstract
A fact extraction tool set (“FEX”) finds and extracts targeted pieces of information from text using linguistic and pattern matching technologies, and in particular, text annotation and fact extraction. Text annotation tools break a text, such as a document, into its base tokens and annotate those tokens or patterns of tokens with orthographic, syntactic, semantic, pragmatic and other attributes. A user-defined “Annotation Configuration” controls which annotation tools are used in a given application. XML is used as the basis for representing the annotated text. A tag uncrossing tool resolves conflicting (crossed) annotation boundaries in an annotated text to produce well-formed XML from the results of the individual annotators. The fact extraction tool is a pattern matching language which is used to write scripts that find and match patterns of attributes that correspond to targeted pieces of information in the text, and extract that information.
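The tag uncrossing step described in the abstract can be illustrated with a minimal sketch. It assumes annotations are half-open, non-empty token-index spans `(start, end, label)` and that a crossed span is repaired by splitting it at the boundary of the span it crosses. The function names `uncross` and `to_xml`, the `<doc>` and `<t>` element names, and the splitting strategy are illustrative assumptions, not details taken from the patent.

```python
import heapq
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def uncross(spans):
    """Split crossing spans so that every pair is either nested or disjoint.

    spans: iterable of (start, end, label) half-open token-index ranges.
    Returns spans ordered by start, with outer spans before inner ones.
    """
    # Earlier start first; for equal starts, the longer (outer) span first.
    heap = [(s, -e, label) for s, e, label in spans]
    heapq.heapify(heap)
    open_ends = []  # stack of end offsets of spans that are still open
    result = []
    while heap:
        start, neg_end, label = heapq.heappop(heap)
        end = -neg_end
        # Close every span that ends at or before this span's start.
        while open_ends and open_ends[-1] <= start:
            open_ends.pop()
        # If this span runs past the innermost open span, it crosses it:
        # keep the inside part now and re-queue the remainder.
        if open_ends and end > open_ends[-1]:
            cut = open_ends[-1]
            heapq.heappush(heap, (cut, -end, label))
            end = cut
        result.append((start, end, label))
        open_ends.append(end)
    return result


def to_xml(tokens, nested):
    """Serialize tokens plus properly nested spans as one XML string."""
    out, stack, pending = [], [], list(nested)
    for i in range(len(tokens) + 1):
        # Close spans ending here, innermost first.
        while stack and stack[-1][1] == i:
            out.append(f"</{stack.pop()[2]}>")
        # Open spans starting here, outermost first.
        while pending and pending[0][0] == i:
            span = pending.pop(0)
            out.append(f"<{span[2]}>")
            stack.append(span)
        if i < len(tokens):
            out.append(f"<t>{escape(tokens[i])}</t>")
    return "<doc>" + "".join(out) + "</doc>"


tokens = ["New", "York", "Times", "reported"]
# Two independent annotators disagree: "New York" vs. "York Times".
crossed = [(0, 2, "city"), (1, 3, "org")]
xml = to_xml(tokens, uncross(crossed))
ET.fromstring(xml)  # raises ParseError if the result is not well-formed
print(xml)
```

Splitting the later-starting span keeps both annotators' labels on every token they covered, at the cost of one annotation appearing as two elements. Other repair policies, such as dropping the lower-priority span, would also yield well-formed XML.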
113 Citations
21 Claims
1. A fact extraction tool set for extracting information from a document, implemented using a client-server hardware architecture, wherein the document includes text, comprising:
- means for breaking the text into tokens;
- a plurality of independent means for annotating the text with token attributes, constituent attributes, links, and tree-based attributes, using XML as a basis for representing the annotated text, wherein each of the means for annotating has at least one specific annotating function;
- means for resolving conflicting annotation boundaries in the annotated text, to produce a single XML-based representation of the document with well-formed XML, wherein the conflicting annotation boundaries result from annotating the text using a plurality of independent means for annotating; and
- means for extracting facts from the single XML-based representation of the document using text pattern recognition rules, wherein each text pattern recognition rule comprises a pattern that describes text of interest, a label that names the pattern for testing and debugging purposes, and an action that indicates what should be done in response to a matching of the pattern, wherein the text pattern recognition rules independently identify constituents by use of regular expression-based functionality, tree traversal functionality based on a language that can navigate XML representations of text, and user-defined matching functionality, and wherein the regular expression-based functionality identifies sequential constituents, and the tree traversal functionality identifies non-contiguous constituents that are distinct from the sequential constituents identified by the regular expression-based functionality.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
10. A non-transitory computer usable storage medium storing computer readable program code, which is executed by a processor, where the computer readable program code includes a rule-based information extraction language for use in identifying and extracting potentially interesting pieces of information in aligned annotations in a single XML-based representation of a document including text, the language comprising a plurality of text pattern recognition rules that query for at least one of literal text, attributes, and relationships found in the aligned annotations to define the facts to be extracted, wherein each of the text pattern recognition rules comprises:
- a pattern that describes text of interest;
- a label that names the pattern for testing and debugging purposes; and
- an action that indicates what should be done in response to a matching of the pattern; and
- wherein the text pattern recognition rules independently identify constituents by use of regular expression-based functionality, tree traversal functionality based on a language that can navigate XML representations of text, and user-defined matching functionality, and wherein the regular expression-based functionality identifies sequential constituents, and the tree traversal functionality identifies non-contiguous constituents that are distinct from the sequential constituents identified by the regular expression-based functionality.
Dependent claims: 11.
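The three-part rule structure recited in claim 10 (pattern, label, action) and its three matching styles can be sketched as follows. This is only an illustration under stated assumptions: the `Rule` class, the matcher functions, the element names (`s`, `np`, `vp`, `t`), and the `pos` and `role` attributes are all hypothetical, and the tree traversal uses the XPath subset supported by Python's `xml.etree.ElementTree` rather than whatever navigation language the patent's tool set uses.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Any, Callable, Iterable

# Hypothetical annotated document; tag and attribute names are assumptions.
ANNOTATED = """<doc><s>
<np role="subj"><t pos="NNP">Acme</t></np>
<vp><t pos="VBD">acquired</t>
<np role="obj"><t pos="NNP">Widgets</t><t pos="NNP">Inc</t></np></vp>
</s></doc>"""


@dataclass
class Rule:
    label: str                                      # names the rule for testing and debugging
    pattern: Callable[[ET.Element], Iterable[Any]]  # finds text of interest
    action: Callable[[Any], dict]                   # what to do with each match


def words(elem):
    """Join the token texts under an element in document order."""
    return " ".join(t.text for t in elem.iter("t"))


def proper_noun_runs(root):
    """Regex matching: sequential runs of tokens tagged NNP."""
    tagged = " ".join(f"{t.text}/{t.get('pos')}" for t in root.iter("t"))
    for m in re.finditer(r"(?:\b\w+/NNP\s?)+", tagged):
        yield re.sub(r"/\w+", "", m.group()).strip()


def subject_object_pairs(root):
    """Tree traversal: subject and object of a sentence, not adjacent in the text."""
    for sentence in root.iter("s"):
        subj = sentence.find("np[@role='subj']")
        obj = sentence.find(".//np[@role='obj']")
        if subj is not None and obj is not None:
            yield words(subj), words(obj)


def corporate_suffix(root):
    """User-defined matching: any predicate the rule author writes in code."""
    for np in root.iter("np"):
        toks = [t.text for t in np.iter("t")]
        if toks and toks[-1] in {"Inc", "Corp", "Ltd"}:
            yield " ".join(toks)


RULES = [
    Rule("ProperNounRun", proper_noun_runs, lambda m: {"name": m}),
    Rule("SubjectObject", subject_object_pairs,
         lambda m: {"subject": m[0], "object": m[1]}),
    Rule("CorporateSuffix", corporate_suffix, lambda m: {"company": m}),
]


def extract(rules, root):
    """Apply each rule independently and tag every fact with the rule's label."""
    return [{"rule": r.label, **r.action(m)} for r in rules for m in r.pattern(root)]


for fact in extract(RULES, ET.fromstring(ANNOTATED)):
    print(fact)
```

Carrying the label into each extracted fact is one way to serve the "testing and debugging" purpose the claim assigns to it: every output record can be traced back to the rule that produced it.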
12. A text annotation tool implemented using a client-server hardware architecture, comprising:
- means for breaking text into its base tokens;
- a plurality of independent annotators executed by the client-server hardware architecture for annotating the text with token attributes, constituent attributes, links, and tree-based attributes, using XML as a basis for representing the annotated text, wherein each of the annotators has at least one specific annotation function;
- means for enabling a user to specify which of the annotators to use and the order of their use;
- means for associating all annotations assigned to a particular piece of text with the base tokens for that particular piece of text to generate aligned annotations; and
- means for resolving conflicting annotation boundaries in the annotated text resulting from two or more conflicting independent annotators to produce a single XML-based representation of the document with well-formed XML, wherein the conflicting annotation boundaries result from annotating the text using a plurality of independent means for annotating.
Dependent claims: 13, 14.
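Two elements of claim 12, the user-specified choice and order of annotators and the alignment of all annotations to base tokens, can be sketched roughly as below. Everything here is an assumption made for illustration: the `Token` class, the two toy annotators, the `ANNOTATORS` registry, and the representation of the configuration as an ordered list of names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Token:
    index: int
    text: str
    # Every annotator writes here, so all annotations stay aligned to the token.
    annotations: dict = field(default_factory=dict)


def tokenize(text):
    """Break text into base tokens: word runs and single punctuation marks."""
    return [Token(i, m.group())
            for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))]


def orthography(tokens):
    """Token attribute: record capitalization for each token."""
    for t in tokens:
        if not t.text[0].isalpha():
            t.annotations["case"] = "other"
        elif t.text[0].isupper():
            t.annotations["case"] = "cap"
        else:
            t.annotations["case"] = "lower"


def names(tokens):
    """Constituent attribute: mark runs of two or more capitalized tokens.

    Relies on the "case" attribute, so it must run after orthography().
    """
    run, run_id = [], 0
    for t in tokens + [None]:  # the None sentinel flushes the final run
        if t is not None and t.annotations.get("case") == "cap":
            run.append(t)
            continue
        if len(run) >= 2:
            for member in run:
                member.annotations["name"] = run_id
            run_id += 1
        run = []


# Hypothetical registry of available annotators.
ANNOTATORS = {"orthography": orthography, "names": names}


def annotate(text, config):
    """Run the annotators named in config, in the order the user listed them."""
    tokens = tokenize(text)
    for name in config:
        ANNOTATORS[name](tokens)
    return tokens


for tok in annotate("Jane Doe joined Acme.", ["orthography", "names"]):
    print(tok.index, tok.text, tok.annotations)
```

Because `names` reads an attribute that `orthography` writes, reversing the configuration to `["names", "orthography"]` yields no name constituents. That dependency is one reason a tool of this kind would let the user control annotator order, not just selection.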
15. A method of extracting information from a document, implemented using a client-server hardware architecture, wherein the document includes text, comprising the steps of:
- breaking the text into tokens, using the client-server hardware architecture;
- annotating the text with token attributes, constituent attributes, links, and tree-based attributes, using XML as a basis for representing the annotated text, using a plurality of independent annotators executed by the client-server hardware architecture, each of the annotators having at least one specific annotation function;
- resolving conflicting annotation boundaries in the annotated text to produce a single XML-based representation of the document with well-formed XML, using the client-server hardware architecture, wherein the conflicting annotation boundaries result from annotating the text using a plurality of independent annotators; and
- extracting facts from the annotated text using text pattern recognition rules written in a rule-based information extraction language, using the client-server hardware architecture, wherein each text pattern recognition rule comprises a pattern that describes text of interest, a label that names the pattern for testing and debugging purposes, and an action that indicates what should be done in response to a matching of the pattern, and wherein the text pattern recognition rules independently identify constituents by use of regular expression-based functionality, tree traversal functionality based on a language that can navigate XML representations of text, and user-defined matching functionality, and wherein the regular expression-based functionality identifies sequential constituents, and the tree traversal functionality identifies non-contiguous constituents that are distinct from the sequential constituents identified by the regular expression-based functionality.
Dependent claims: 16, 17, 18, 19, 20, 21.
Specification