System and method for extracting information from text using text annotation and fact extraction

US 7,912,705 B2
Filed: 01/19/2010
Issued: 03/22/2011
Est. Priority Date: 11/19/2003
Status: Expired due to Term

Abstract

A fact extraction tool set (“FEX”) finds and extracts targeted pieces of information from text using linguistic and pattern matching technologies, and in particular, text annotation and fact extraction. Text annotation tools break a text, such as a document, into its base tokens and annotate those tokens or patterns of tokens with orthographic, syntactic, semantic, pragmatic and other attributes. A user-defined “Annotation Configuration” controls which annotation tools are used in a given application. XML is used as the basis for representing the annotated text. A tag uncrossing tool resolves conflicting (crossed) annotation boundaries in an annotated text to produce well-formed XML from the results of the individual annotators. The fact extraction tool is a pattern matching language which is used to write scripts that find and match patterns of attributes that correspond to targeted pieces of information in the text, and extract that information.
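
The abstract does not say how the tag uncrossing tool resolves crossed boundaries. As a minimal sketch of one possible strategy (an assumption, not the patented method), a span that starts inside another span but ends after it can be cut at the enclosing span's end, so that the resulting pieces nest properly. The function names `uncross` and `to_xml`, the token-offset span format, and the `<doc>` wrapper element are all hypothetical.

```python
import heapq
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def uncross(spans):
    """Split (start, end, tag) token spans so that no two spans cross.

    `end` is exclusive. A span that starts inside an open span but ends
    after it is cut at the open span's end; the remainder is re-queued.
    """
    # Order by start, then longest first, so enclosing spans open first.
    heap = [(start, -end, tag) for start, end, tag in spans]
    heapq.heapify(heap)
    open_ends = []  # end offsets of currently open spans, outermost first
    result = []
    while heap:
        start, neg_end, tag = heapq.heappop(heap)
        end = -neg_end
        # Drop spans that have already closed before this one starts.
        while open_ends and open_ends[-1] <= start:
            open_ends.pop()
        # Crossing case: cut at the innermost open span's end.
        if open_ends and end > open_ends[-1]:
            cut = open_ends[-1]
            heapq.heappush(heap, (cut, -end, tag))
            end = cut
        open_ends.append(end)
        result.append((start, end, tag))
    return result


def to_xml(tokens, spans):
    """Serialize tokens plus uncrossed spans as a well-formed XML string."""
    opens, closes = {}, {}
    for start, end, tag in uncross(spans):
        opens.setdefault(start, []).append(tag)    # outermost opens first
        closes.setdefault(end, []).insert(0, tag)  # innermost closes first
    parts = ["<doc>"]
    for i, token in enumerate(tokens):
        parts.extend(f"</{tag}>" for tag in closes.get(i, []))
        if i:
            parts.append(" ")
        parts.extend(f"<{tag}>" for tag in opens.get(i, []))
        parts.append(escape(token))
    parts.extend(f"</{tag}>" for tag in closes.get(len(tokens), []))
    parts.append("</doc>")
    return "".join(parts)


# "filed on May" (VP) crosses "May 5" (DATE): the two spans overlap on "May".
tokens = ["Smith", "filed", "on", "May", "5"]
spans = [(1, 4, "VP"), (3, 5, "DATE")]
xml_text = to_xml(tokens, spans)
ET.fromstring(xml_text)  # raises ParseError if the output were not well-formed
```

Cutting a span loses the fact that its pieces were one annotation. A fuller version would probably give the pieces a shared identifier attribute so they can be rejoined later.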

21 Claims

1. A fact extraction tool set for extracting information from a document, implemented using a client-server hardware architecture, wherein the document includes text, comprising:
- means for breaking the text into tokens;
  
  a plurality of independent means for annotating the text with token attributes, constituent attributes, links, and tree-based attributes, using XML as a basis for representing the annotated text, wherein each of the means for annotating has at least one specific annotating function;
  
  means for resolving conflicting annotation boundaries in the annotated text, to produce a single XML-based representation of the document with well-formed XML, wherein the conflicting annotation boundaries result from annotating the text using a plurality of independent means for annotating; and
  
  means for extracting facts from the single XML-based representation of the document using text pattern recognition rules, wherein each text pattern recognition rule comprises a pattern that describes text of interest, a label that names the pattern for testing and debugging purposes, and an action that indicates what should be done in response to a matching of the pattern, wherein the text pattern recognition rules independently identify constituents by use of regular expression-based functionality, tree traversal functionality based on a language that can navigate XML representations of text, and user-defined matching functionality, and wherein the regular expression-based functionality identifies sequential constituents, and the tree traversal functionality identifies non-contiguous constituents that are distinct from the sequential constituents identified by the regular expression-based functionality.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The fact extraction tool set of claim 1, wherein the attributes include tokenization, text normalization, part of speech tags, sentence boundaries, parse trees, and semantic attribute tagging.
  - 3. The fact extraction tool set of claim 1, wherein the means for annotating the text comprises a plurality of independent annotators, wherein each of the annotators has at least one specific annotation function, and wherein the fact extraction tool set further comprises user-implemented means for specifying which of the annotators to use and the order of their use.
  - 4. The fact extraction tool set of claim 1, wherein:
    - the token attributes have a one-per-base-token alignment, where for the attribute type represented, there is an attempt to assign an attribute to each base token;
      
      the constituent attributes are assigned yes-no values, where the entire pattern of each base token is considered to be a single constituent with respect to some annotation value; and
      
      the links assign common identifiers to coreferring and other related patterns of base tokens.
  - 5. The fact extraction tool set of claim 1, wherein the pattern recognition rules query for at least one of literal text, attributes, and relationships found in the annotated text to define facts to be extracted.
  - 6. The fact extraction tool set of claim 1, wherein the means for extracting uses XPath for traversing XML-based tree representations in the annotated text.
  - 7. The fact extraction tool set of claim 1, further comprising:
    - means for associating all annotations assigned to a particular piece of text with the base tokens for that text to generate aligned annotations; and
      
      wherein each text pattern recognition rule queries for at least one of literal text, attributes, and relationships found in the aligned annotations to define the facts to be extracted.
  - 8. The fact extraction tool set of claim 7, wherein the user-defined matching functions are used to name and define a fragment of a pattern.
  - 9. The fact extraction tool set of claim 7, wherein the means for extracting uses XPath for traversing XML-based tree representations in the annotated text.
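
Claims 6 and 9 name XPath as the tree traversal language. The sketch below shows how a tree query can pick out constituents that are not adjacent in the text, here a subject and an object separated by a verb and an adverb. It uses the limited XPath subset supported by Python's `xml.etree.ElementTree`. The element names (`S`, `NP`, `VP`, `tok`), the `role` attribute, and the sample sentence are assumptions made for illustration and are not taken from the patent.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence: the subject and object noun phrases
# are separated by a verb and an adverb, so they are not contiguous.
ANNOTATED = """
<doc>
  <S>
    <NP role="subj"><tok pos="NNP">Acme</tok></NP>
    <VP>
      <tok pos="VBD">acquired</tok>
      <ADVP><tok pos="RB">recently</tok></ADVP>
      <NP role="obj"><tok pos="NNP">Widgetco</tok></NP>
    </VP>
  </S>
</doc>
"""


def text_of(elem):
    """Join the text of all base tokens under an element."""
    return " ".join(tok.text for tok in elem.iter("tok"))


def subject_object_pairs(root):
    """Return (subject, object) text pairs, one per sentence that has both."""
    pairs = []
    for sent in root.iter("S"):
        subj = sent.find("NP[@role='subj']")    # direct child of the sentence
        obj = sent.find(".//NP[@role='obj']")   # any depth below the sentence
        if subj is not None and obj is not None:
            pairs.append((text_of(subj), text_of(obj)))
    return pairs


root = ET.fromstring(ANNOTATED)
pairs = subject_object_pairs(root)
```

A regular expression over the token sequence would have to spell out everything that may appear between the two noun phrases. The tree query only states their structural relationship.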

10. A non-transitory computer usable storage medium storing computer readable program code, which is executed by a processor, where the computer readable program code includes a rule-based information extraction language for use in identifying and extracting potentially interesting pieces of information in aligned annotations in a single XML-based representation of a document including text, the language comprising a plurality of text pattern recognition rules that query for at least one of literal text, attributes, and relationships found in the aligned annotations to define facts to be extracted, wherein each of the text pattern recognition rules comprises:
- a pattern that describes text of interest;
  
  a label that names the pattern for testing and debugging purposes; and
  
  an action that indicates what should be done in response to a matching of the pattern; and
  
  wherein the text pattern recognition rules independently identify constituents by use of regular expression-based functionality, tree traversal functionality based on a language that can navigate XML representations of text, and user-defined matching functionality, and wherein the regular expression-based functionality identifies sequential constituents, and the tree traversal functionality identifies non-contiguous constituents that are distinct from the sequential constituents identified by the regular expression-based functionality.
- Dependent claims (11)
  - 11. The non-transitory computer usable storage medium of claim 10, wherein the user-defined matching functions are used to name and define a fragment of a pattern.
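
Claims 10 and 11 describe a rule as a pattern, a label, and an action, with user-defined matching functions that name a fragment of a pattern. The sketch below is one way to picture that structure. It is not the patent's rule language: the `word_TAG` token encoding, the `{NAME}` fragment syntax, and the names `FRAGMENTS`, `expand`, `words`, and `Rule` are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical named fragments, standing in for user-defined matching
# functions that "name and define a fragment of a pattern".
FRAGMENTS = {
    "PROPER": r"(?:\S+_NNP(?:\s|$))+",  # one or more proper-noun tokens
    "NUMBER": r"(?:\S+_CD(?:\s|$))+",   # one or more cardinal-number tokens
}


def expand(pattern: str) -> str:
    """Replace each {NAME} placeholder with its fragment definition."""
    return re.sub(r"\{(\w+)\}", lambda m: FRAGMENTS[m.group(1)], pattern)


def words(chunk: str) -> str:
    """Strip the _TAG suffix from each token in a matched chunk."""
    return " ".join(tok.rsplit("_", 1)[0] for tok in chunk.split())


@dataclass
class Rule:
    label: str                          # names the pattern for testing and debugging
    pattern: str                        # describes the text of interest
    action: Callable[[re.Match], dict]  # what to do when the pattern matches

    def apply(self, tagged: str) -> list:
        regex = re.compile(expand(self.pattern))
        # Record the rule label with each fact so a match can be traced back.
        return [{**self.action(m), "rule": self.label} for m in regex.finditer(tagged)]


payment = Rule(
    label="payment",
    pattern=r"(?P<payer>{PROPER})paid_VBD (?P<amount>{NUMBER})",
    action=lambda m: {"payer": words(m["payer"]), "amount": words(m["amount"])},
)

# Tokens paired with part-of-speech attributes, flattened to one string.
tagged = "Acme_NNP Corp_NNP paid_VBD 5_CD million_CD dollars_NNS"
facts = payment.apply(tagged)
```

Storing the label on every extracted fact is what makes it useful for debugging: when an extraction looks wrong, the label shows which rule produced it.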

12. A text annotation tool implemented using a client-server hardware architecture, comprising:
- means for breaking text into its base tokens;
  
  a plurality of independent annotators executed by the client-server hardware architecture for annotating the text with token attributes, constituent attributes, links, and tree-based attributes, using XML as a basis for representing the annotated text, wherein each of the annotators has at least one specific annotation function;
  
  means for enabling a user to specify which of the annotators to use and the order of their use;
  
  means for associating all annotations assigned to a particular piece of text with the base tokens for that particular piece of text to generate aligned annotations; and
  
  means for resolving conflicting annotation boundaries in the annotated text resulting from two or more conflicting independent annotators to produce a single XML-based representation of the document with well-formed XML, wherein the conflicting annotation boundaries result from annotating the text using a plurality of independent means for annotating.
- Dependent claims (13, 14)
  - 13. The text annotation tool of claim 12, wherein the attributes include tokenization, text normalization, part of speech tags, sentence boundaries, parse trees, and semantic attribute tagging.
  - 14. The text annotation tool of claim 12, wherein:
    - the token attributes have a one-per-base-token alignment, where for the attribute type represented, there is an attempt to assign an attribute to each base token;
      
      the constituent attributes are assigned yes-no values, where the entire pattern of each base token is considered to be a single constituent with respect to some annotation value; and
      
      where the links assign common identifiers to coreferring and other related patterns of base tokens.
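
Claims 12 to 14 distinguish token attributes (one value per base token), constituent attributes (a yes-no property of a pattern of tokens), and links (a common identifier shared by related patterns), and let the user choose which annotators run and in what order. The sketch below shows one way those pieces could fit together. The three toy annotators, the `ANNOTATORS` registry, and the `aligned` helper are hypothetical and far simpler than real annotators would be.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedText:
    tokens: list
    token_attrs: dict = field(default_factory=dict)   # name -> one value per base token
    constituents: list = field(default_factory=list)  # (start, end, name); presence means "yes"
    links: list = field(default_factory=list)         # (start, end, link_id)


def annotate_case(doc):
    """Token attribute: attempt to assign a case value to every base token."""
    doc.token_attrs["case"] = ["cap" if t[:1].isupper() else "lower" for t in doc.tokens]


def annotate_company(doc):
    """Constituent attribute: mark '<word> Corp' as a single company constituent."""
    for i, token in enumerate(doc.tokens):
        if token == "Corp" and i > 0:
            doc.constituents.append((i - 1, i + 1, "company"))


def annotate_coref(doc):
    """Links: naively give the first company and every 'it' a common identifier."""
    companies = [c for c in doc.constituents if c[2] == "company"]
    if not companies:
        return
    start, end, _ = companies[0]
    doc.links.append((start, end, "e1"))
    for i, token in enumerate(doc.tokens):
        if token.lower() == "it":
            doc.links.append((i, i + 1, "e1"))


ANNOTATORS = {"case": annotate_case, "company": annotate_company, "coref": annotate_coref}


def run(tokens, config):
    """Apply the annotators named in `config`, in the order the user listed them."""
    doc = AnnotatedText(list(tokens))
    for name in config:
        ANNOTATORS[name](doc)
    return doc


def aligned(doc, i):
    """Collect every annotation that covers base token `i`."""
    return {
        "token": doc.tokens[i],
        "attrs": {name: values[i] for name, values in doc.token_attrs.items()},
        "constituents": [n for s, e, n in doc.constituents if s <= i < e],
        "links": [link for s, e, link in doc.links if s <= i < e],
    }


doc = run(["Acme", "Corp", "said", "it", "won"], config=["case", "company", "coref"])
```

Order matters in this sketch: `annotate_coref` reads the output of `annotate_company`, so listing `"coref"` first in the configuration would produce no links.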

15. A method of extracting information from a document, implemented using a client-server hardware architecture, wherein the document includes text, comprising the steps of:
- breaking the text into tokens, using the client-server hardware architecture;
  
  annotating the text with token attributes, constituent attributes, links, and tree-based attributes, using XML as a basis for representing the annotated text, using a plurality of independent annotators executed by the client-server hardware architecture, each of the annotators having at least one specific annotation function;
  
  resolving conflicting annotation boundaries in the annotated text to produce a single XML-based representation of the document with well-formed XML, using the client-server hardware architecture, wherein the conflicting annotation boundaries result from annotating the text using a plurality of independent annotators; and
  
  extracting facts from the annotated text using text pattern recognition rules written in rule-based information extraction language, using the client-server hardware architecture, wherein each text pattern recognition rule comprises a pattern that describes text of interest, a label that names the pattern for testing and debugging purposes, and an action that indicates what should be done in response to a matching of the pattern, and wherein the text pattern recognition rules independently identify constituents by use of regular expression-based functionality, tree traversal functionality based on a language that can navigate XML representation of text, and user-defined matching functionality, and wherein the regular expression-based functionality identifies sequential constituents, and the tree traversal functionality identifies non-contiguous constituents that are distinct from the sequential constituents identified by the regular expression-based functionality.
- Dependent claims (16, 17, 18, 19, 20, 21)
  - 16. The method of claim 15, wherein, in the annotating step, the attributes include orthographic, syntactic, semantic, pragmatic and dictionary-based attributes.
  - 17. The method of claim 15, wherein the annotating step is carried out by a plurality of independent annotators executed by the client-server hardware architecture, wherein each of the annotators has at least one specific annotation function, and wherein the method further comprises the step of allowing a user to specify which of the annotators to use and the order of their use.
  - 18. The method of claim 15, wherein:
    - the token attributes have a one-per-base-token alignment, where for the attribute type represented, there is an attempt to assign an attribute to each base token;
      
      the constituent attributes are assigned yes-no values, where the entire pattern of each base token is considered to be a single constituent with respect to some annotation value; and
      
      the links assign common identifiers to coreferring and other related patterns of base tokens.
  - 19. The method of claim 15, wherein the annotating step includes associating all annotations assigned to a particular piece of text with the base tokens for that text to generate aligned annotations, using the client-server hardware architecture.
  - 20. The method of claim 15, wherein the text pattern recognition rules query for at least one of literal text, attributes, and relationships found in the aligned annotations to define the facts to be extracted.
  - 21. The method of claim 15, wherein, in the extracting step, XPath is used for traversing XML-based tree representations in the annotated text.
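
The first step of the method in claim 15, breaking the text into tokens, might be sketched as follows, with the base tokens written out as XML elements that later annotators could decorate. The tokenization regex, the `tok` element name, and the `id`, `start`, and `end` attributes are assumptions for illustration. The claims do not prescribe a tokenization scheme.

```python
import re
import xml.etree.ElementTree as ET

# Assumed scheme: runs of word characters, or any single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return (token, start, end) triples with character offsets into `text`."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def tokens_to_xml(text):
    """Wrap each base token in a <tok> element that records its position."""
    root = ET.Element("doc")
    for i, (token, start, end) in enumerate(tokenize(text)):
        elem = ET.SubElement(root, "tok", id=str(i), start=str(start), end=str(end))
        elem.text = token
    return root


root = tokens_to_xml("Acme paid $5.")
xml_text = ET.tostring(root, encoding="unicode")
```

Keeping character offsets on each base token lets independently produced annotations be aligned back to the same tokens, which is the kind of alignment that claim 19 relies on.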

Current Assignee: RELX Inc. (RELX PLC)
Original Assignee: LexisNexis Group Inc. (RELX PLC)
Inventors: Wiltshire, James S. Jr.; Templar, Valentina; Wasson, Mark D.; Koutsomitopoulou, Eleni; Xu, Steve; Chen, Shian-jung Dick; Loritz, Donald
Primary Examiner(s): Wozniak, James S
Assistant Examiner(s): Shah, Paras D
Application Number: US12/689,629
Publication Number: US 20100195909A1
Time in Patent Office: 427 Days
Field of Search: 704/9, 707/3, 715/230, 715/256
US Class Current: 704/9
CPC Class Codes:
- G06F 16/313: Selection or weighting of t...
- G06F 16/367: Ontology
- G06F 40/169: Annotation, e.g. comment da...
- G06F 40/289: Phrasal analysis, e.g. fini...
