System, Method, and Apparatus for Information Extraction of Textual Documents

US 20100169309A1
Filed: 12/30/2008
Published: 07/01/2010
Est. Priority Date: 12/30/2008
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A method and system for extraction of text from a set of text documents. A plurality of document segments are identified within a given text document. For each document segment, at least one structured annotation is embedded within the document and associated with the given segment. The structured annotation specifies the start and end of the given document segment and a rhetorical relation associated with the given segment. The structured annotations are processed to generate a plurality of variables that represent document segments and associated rhetorical relations as specified by the structured annotations, and such variables are stored in a data repository. A user interacts with a computer to define query input that specifies at least one rhetorical relation of interest. The query input specified by the user is processed to query the variables stored in the data repository to identify zero or more document segments that are associated with a rhetorical relation that matches the at least one rhetorical relation of interest specified by the query input. Information corresponding to the zero or more matching document segments is returned to the user. In the preferred embodiment, the rhetorical relations represented by the user-supplied query input as well as the variables stored in the data repository include a set of RST relations whose meaning is dictated by nuclearity of the associated text. Such RST relations can include a plurality of mononuclear RST relations each having a nucleus and a satellite and a plurality of multinuclear RST relations each having a plurality of nuclei. The rhetorical relations represented by the user-supplied query input as well as the variables stored in the data repository can also include a set of Speech Act relations whose meaning extends beyond the situational semantics of the associated text.
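
The flow described in the abstract (embedded annotations, then variables, then a repository that is queried by rhetorical relation) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the patented implementation: the `<seg>` element, its `rel` attribute, the `SegmentVar` record, and the sample sentences are all hypothetical names and data chosen here. The source only states that an annotation marks the start and end of a segment and names its rhetorical relation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Hypothetical markup: each <seg> element delimits one document segment
# (its open and close tags mark the start and end), and the "rel"
# attribute names the rhetorical relation of that segment.
ANNOTATED = (
    '<doc id="d1">'
    '<seg rel="Evidence">Sales rose 12 percent last quarter.</seg> '
    '<seg rel="Concession">Costs also increased.</seg>'
    '</doc>'
)


@dataclass(frozen=True)
class SegmentVar:
    """One extracted variable: a segment plus its rhetorical relation."""
    doc_id: str
    relation: str
    text: str


def extract_variables(xml_text: str) -> List[SegmentVar]:
    """Turn the embedded annotations into variables."""
    root = ET.fromstring(xml_text)
    doc_id = root.get("id", "")
    return [
        SegmentVar(doc_id, seg.get("rel", ""), "".join(seg.itertext()).strip())
        for seg in root.iter("seg")
    ]


def query(repo: List[SegmentVar], relation: str) -> List[SegmentVar]:
    """Return the zero or more segments whose relation matches."""
    return [v for v in repo if v.relation == relation]


# The repository is a plain list in this sketch (an assumption).
repository = extract_variables(ANNOTATED)
for hit in query(repository, "Evidence"):
    print(hit.doc_id, hit.text)
```

A query for a relation that no segment carries simply returns an empty list, which corresponds to the "zero or more" wording in the abstract.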

29 Citations

32 Claims

1. A method for extraction of text from a set of text documents, the method comprising the steps of:
- a) identifying a plurality of document segments within a given text document;
  
  b) for each given document segment identified in a), generating and storing at least one structured annotation embedded within the document and associated with the given segment, the at least one structured annotation specifying the start and end of the given document segment and a rhetorical relation associated with the given segment;
  
  c) processing the structured annotations generated and stored in b) to generate a plurality of variables that represent document segments and associated rhetorical relations as specified by the structured annotations;
  
  d) storing the variables generated in c) in a repository;
  
  e) receiving query input from a user that specifies at least one rhetorical relation of interest; and
  
  f) in response to receipt of said query input, querying the variables stored in the repository to identify zero or more document segments that are associated with a rhetorical relation that matches the at least one rhetorical relation of interest specified by said query input for output to the user.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15):
  - 2. A method according to claim 1, wherein:
    - the rhetorical relations include a set of RST relations whose meaning is dictated by nuclearity of the associated text.
  - 3. A method according to claim 2, wherein:
    - said set of RST relations includes a plurality of mononuclear RST relations each having a nucleus and a satellite.
  - 4. A method according to claim 2, wherein:
    - said set of RST relations include a plurality of multinuclear RST relations each having a plurality of nucleus.
  - 5. A method according to claim 1, wherein:
    - the rhetorical relations include a set of Speech Act relations whose meaning extends beyond the situational semantics of the associated text.
  - 6. A method according to claim 1, wherein:
    - said at least one structured annotation generated and stored in b) is an XML tag.
  - 7. A method according to claim 1, wherein:
    - the repository include first and second sets of variables, said first set of variables representing document segments specified by the structured annotations generated and stored in b), and said second set of variables representing rhetorical relations specified by the structured annotations generated and stored in b) and linked to variables of the first set.
  - 8. A method according to claim 1, wherein:
    - the repository stores ancillary data linked to a given text document;
      
      the query input received from the user specifies ancillary data of interest; and
      
      the querying of f) filters the matched document segments to identify those document segments belonging to a text document linked to ancillary data corresponding to the ancillary data of interest.
  - 9. A method according to claim 1, wherein:
    - the repository stores variables representing one of an actor and role and linked to document segments;
      
      the query input received from the user specifies an actor or role of interest; and
      
      the querying of f) filters the matched document segments to identify those document segments linked to variables representing an actor or role corresponding to the actor or role of interest.
  - 10. A method according to claim 1, wherein:
    - the query input received from the user specifies additional search terms; and
      
      the querying of f) filters the matched document segments to identify those document segments that satisfy the additional search terms.
  - 11. A method according to claim 10, wherein:
    - the additional search terms comprise one or more key word terms.
  - 12. A method according to claim 1, wherein:
    - the query input received from the user specifies a goal or need of the user; and
      
      the method further comprising analyzing matched document segments in accordance with the goal or need specified by the user and outputting the results of such analysis to the user.
  - 13. A method according to claim 1, wherein:
    - the query input received from the user specifies at least one sorting parameter; and
      
      the method further comprises sorting the matched documents in accordance with the at least one sort parameter specified by the query input and outputting the results in order as sorted to the user.
  - 14. A method according to claim 1, wherein:
    - the operations of a) and b) are performed by an automated tool or at least in part by a human operator.
  - 15. A method according to claim 1, further comprising:
    - presenting output of the querying to a user in a view that presents document segments that are connected to a particular document segment by a relation of interest.
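
The query-side claims (the two linked sets of variables in claim 7, the filters in claims 8 to 11, and the sorting in claim 13) can be sketched as a single function over an in-memory repository. Everything named below is an assumption made for illustration: the `Segment`, `Relation`, and `Repository` classes, the `run_query` signature, the use of a callable as the sorting parameter, and the sample data. The claims do not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Segment:
    """First set of variables: one per annotated document segment."""
    seg_id: int
    doc_id: str
    text: str
    actors: List[str] = field(default_factory=list)


@dataclass
class Relation:
    """Second set of variables, linked to the first set by seg_id."""
    seg_id: int
    name: str


@dataclass
class Repository:
    segments: Dict[int, Segment]
    relations: List[Relation]
    ancillary: Dict[str, Dict[str, str]]  # doc_id -> document-level data


def run_query(repo: Repository, relation: str,
              keywords: Sequence[str] = (),
              actor: Optional[str] = None,
              ancillary: Optional[Dict[str, str]] = None,
              sort_key: Optional[Callable[[Segment], object]] = None) -> List[Segment]:
    # Match on the rhetorical relation of interest first.
    hits = [repo.segments[r.seg_id] for r in repo.relations if r.name == relation]
    # Then narrow the matches with the optional filters.
    if keywords:
        hits = [s for s in hits
                if all(k.lower() in s.text.lower() for k in keywords)]
    if actor is not None:
        hits = [s for s in hits if actor in s.actors]
    if ancillary:
        hits = [s for s in hits
                if all(repo.ancillary.get(s.doc_id, {}).get(k) == v
                       for k, v in ancillary.items())]
    if sort_key is not None:
        hits = sorted(hits, key=sort_key)
    return hits


# Hypothetical sample data.
repo = Repository(
    segments={
        1: Segment(1, "d1", "Encrypt data if it leaves the network.", ["vendor"]),
        2: Segment(2, "d1", "Logs are kept in order to support audits.", ["auditor"]),
        3: Segment(3, "d2", "Notify the client if a breach occurs.", ["vendor"]),
    },
    relations=[Relation(1, "Condition"), Relation(2, "Purpose"), Relation(3, "Condition")],
    ancillary={"d1": {"author": "Legal"}, "d2": {"author": "Security"}},
)

breach_hits = run_query(repo, "Condition", keywords=["breach"])
legal_hits = run_query(repo, "Condition", ancillary={"author": "Legal"})
```

Keeping relations in their own list, keyed back to segments, mirrors the first and second sets of variables in claim 7 and lets one segment carry more than one relation.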

16. A system for extraction of text from a set of text documents comprising:
- text document annotation means for identifying a plurality of document segments within a given text document and, for each given document segment, generating and storing at least one structured annotation embedded within the document and associated with the given segment, the at least one structured annotation specifying the start and end of the given document segment and a rhetorical relation associated with the given segment;
  
  annotation processing means for processing the structured annotations generated and stored by the text document annotation means to generate a plurality of variables that represent document segments and associated rhetorical relations as specified by the structured annotations;
  
  a repository storing the variables generated by the annotation processing means;
  
  user input query means for receiving query input from a user that specifies at least one rhetorical relation of interest; and
  
  query processing logic, operably coupled to the user input query means and the repository, that utilizes said query input to query the variables stored in the repository to identify zero or more document segments that are associated with a rhetorical relation that matches the at least one rhetorical relation of interest specified by said query input for output to the user.
- Dependent claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32):
  - 17. A system according to claim 16, wherein:
    - the rhetorical relations include a set of RST relations whose meaning is dictated by nuclearity of the associated text.
  - 18. A system according to claim 17, wherein:
    - said set of RST relations includes a plurality of mononuclear RST relations each having a nucleus and a satellite.
  - 19. A system according to claim 17, wherein:
    - said set of RST relations include a plurality of multinuclear RST relations each having a plurality of nucleus.
  - 20. A system according to claim 16, wherein:
    - the rhetorical relations include a set of Speech Act relations whose meaning extends beyond the situational semantics of the associated text.
  - 21. A system according to claim 16, wherein:
    - said at least one structured annotation generated and stored by the text document annotation means is an XML tag.
  - 22. A system according to claim 16, wherein:
    - the repository include first and second sets of variables, said first set of variables representing document segments specified by the structured annotations generated and stored by the text document annotation means, and said second set of variables representing rhetorical relations specified by the structured annotations generated and stored by the text document annotation means and linked to variables of the first set.
  - 23. A system according to claim 16, wherein:
    - the repository stores ancillary data linked to a given text document;
      
      the query input received by the user input query means specifies ancillary data of interest; and
      
      the query processing logic filters the matched document segments to identify those document segments belonging to a text document linked to ancillary data corresponding to the ancillary data of interest.
  - 24. A system according to claim 16, wherein:
    - the repository stores variables representing one of an actor and role and linked to document segments;
      
      the query input received by the user input query means specifies an actor or role of interest; and
      
      the query processing logic filters the matched document segments to identify those document segments linked to variables representing an actor or role corresponding to the actor or role of interest.
  - 25. A system according to claim 16, wherein:
    - the query input received by the user input query means specifies additional search terms; and
      
      the query processing logic filters the matched document segments to identify those document segments that satisfy the additional search terms.
  - 26. A system according to claim 25, wherein:
    - the additional search terms comprise one or more key word terms.
  - 27. A system according to claim 16, wherein:
    - the query input received from the user specifies a goal or need of the user; and
      
      the system further comprises result presentation logic for analyzing matched document segments in accordance with the goal or need specified by the user and outputting the results of such analysis to the user.
  - 28. A system according to claim 16, wherein:
    - the query input received from the user specifies at least one sorting parameter; and
      
      the query processing logic sorts the matched documents in accordance with the at least one sort parameter specified by the query input and outputs the results in order as sorted to the user.
  - 29. A system according to claim 16, wherein:
    - the text document annotation means comprises an automated tool or carries out operations that are performed at least in part by a human operator.
  - 30. A system according to claim 16, wherein:
    - the user input query means and query processing logic are realized by a server coupled to users over a network.
  - 31. A system according to claim 16, wherein:
    - the user input query means and query processing logic are realized by a computer processing system accessible by one or more users.
  - 32. A system according to claim 16, further comprising:
    - result presentation logic for presenting output of the query processing logic to a user in a view that presents document segments that are connected to a particular document segment by a relation of interest.
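
The nuclearity distinction in claims 17 to 19 and the connected-segment view in claim 32 can be sketched with two small record types. The names `Mononuclear`, `Multinuclear`, and `connected_view`, the integer segment ids, and the sample relations are hypothetical; the claims only require that a mononuclear relation has a nucleus and a satellite and that a multinuclear relation has several nuclei.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Mononuclear:
    """Relation with one nucleus and one satellite (for example Evidence)."""
    name: str
    nucleus: int
    satellite: int


@dataclass(frozen=True)
class Multinuclear:
    """Relation joining several nuclei of equal weight (for example Contrast)."""
    name: str
    nuclei: Tuple[int, ...]


RstRelation = Union[Mononuclear, Multinuclear]


def members(rel: RstRelation) -> Tuple[int, ...]:
    """Segment ids that take part in the relation."""
    if isinstance(rel, Mononuclear):
        return (rel.nucleus, rel.satellite)
    return rel.nuclei


def connected_view(relations: List[RstRelation], segment_id: int,
                   relation_name: str) -> List[int]:
    """Ids of segments connected to segment_id by the relation of interest."""
    connected: List[int] = []
    for rel in relations:
        if rel.name != relation_name:
            continue
        ids = members(rel)
        if segment_id in ids:
            for other in ids:
                if other != segment_id and other not in connected:
                    connected.append(other)
    return connected


# Hypothetical sample relations over segment ids 1 to 4.
RELATIONS: List[RstRelation] = [
    Mononuclear("Evidence", nucleus=1, satellite=2),
    Multinuclear("Contrast", nuclei=(1, 3)),
    Mononuclear("Evidence", nucleus=1, satellite=4),
]

evidence_for_1 = connected_view(RELATIONS, 1, "Evidence")
```

A result presentation layer could render the returned ids next to the chosen segment; how that view is laid out is not specified in the claims.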

Current Assignee: SSCM, LLC
Original Assignee: Complyon Inc.
Inventors: Barrett, Leslie A.; Mackof, Morton D.

Granted Patent: US 7,937,386 B2
US Class Current: 707/736
CPC Class Codes:
G06F 16/3331 Query processing
G06F 16/951 Indexing; Web crawling tech...
