System, Method, and Apparatus for Information Extraction of Textual Documents
First Claim
1. A method for extraction of text from a set of text documents, the method comprising the steps of:
a) identifying a plurality of document segments within a given text document;
b) for each given document segment identified in a), generating and storing at least one structured annotation embedded within the document and associated with the given segment, the at least one structured annotation specifying the start and end of the given document segment and a rhetorical relation associated with the given segment;
c) processing the structured annotations generated and stored in b) to generate a plurality of variables that represent document segments and associated rhetorical relations as specified by the structured annotations;
d) storing the variables generated in c) in a repository;
e) receiving query input from a user that specifies at least one rhetorical relation of interest; and
f) in response to receipt of said query input, querying the variables stored in the repository to identify zero or more document segments that are associated with a rhetorical relation that matches the at least one rhetorical relation of interest specified by said query input for output to the user.
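Steps a) through f) can be illustrated with a minimal sketch. It assumes a hypothetical inline annotation format, `<seg rel="...">...</seg>`, which the claim does not prescribe. The names `SEG_PATTERN`, `extract_variables`, and `Repository` are illustrative only, and the in-memory dictionary stands in for whatever repository an implementation would actually use.

```python
import re
from collections import defaultdict

# Hypothetical embedded annotation (assumption, not from the claim):
#   <seg rel="RELATION">segment text</seg>
SEG_PATTERN = re.compile(r'<seg rel="([^"]+)">(.*?)</seg>', re.DOTALL)


def extract_variables(doc_id, annotated_text):
    """Step c): convert embedded annotations into records ("variables").

    start/end are character offsets of the segment text inside the
    annotated document, i.e. they include the length of preceding tags.
    """
    records = []
    for match in SEG_PATTERN.finditer(annotated_text):
        records.append({
            "doc": doc_id,
            "start": match.start(2),
            "end": match.end(2),
            "relation": match.group(1),
            "text": match.group(2),
        })
    return records


class Repository:
    """Step d): in-memory store of records, indexed by rhetorical relation."""

    def __init__(self):
        self._by_relation = defaultdict(list)

    def store(self, records):
        for record in records:
            self._by_relation[record["relation"]].append(record)

    def query(self, relations_of_interest):
        """Steps e) and f): return zero or more segments whose relation matches."""
        return [
            record
            for relation in relations_of_interest
            for record in self._by_relation.get(relation, [])
        ]


# Usage: annotate, process, store, then query by relation.
doc = ('Sales fell. <seg rel="cause">Rain closed the stores.</seg> '
       '<seg rel="evidence">Receipts dropped sharply.</seg>')
repo = Repository()
repo.store(extract_variables("d1", doc))
matches = repo.query(["cause"])
```

A query for a relation that no segment carries returns an empty list, which corresponds to the "zero or more document segments" wording of step f).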
Abstract
A method and system for extraction of text from a set of text documents. A plurality of document segments are identified within a given text document. For each document segment, at least one structured annotation is embedded within the document and associated with the given segment. The structured annotation specifies the start and end of the given document segment and a rhetorical relation associated with the given segment. The structured annotations are processed to generate a plurality of variables that represent document segments and associated rhetorical relations as specified by the structured annotations, and such variables are stored in a data repository. A user interacts with a computer to define query input that specifies at least one rhetorical relation of interest. The query input specified by the user is processed to query the variables stored in the data repository to identify zero or more document segments that are associated with a rhetorical relation that matches the at least one rhetorical relation of interest specified by the query input. Information corresponding to the zero or more matching document segments is returned to the user. In the preferred embodiment, the rhetorical relations represented by the user-supplied query input as well as the variables stored in the data repository include a set of RST relations whose meaning is dictated by nuclearity of the associated text. Such RST relations can include a plurality of mononuclear RST relations each having a nucleus and a satellite and a plurality of multinuclear RST relations each having a plurality of nuclei. The rhetorical relations represented by the user-supplied query input as well as the variables stored in the data repository can also include a set of Speech Act relations whose meaning extends beyond the situational semantics of the associated text.
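The distinction between mononuclear and multinuclear RST relations can be modeled as two record types. The sketch below is one possible data model, not the representation used by the patent. The class names, the `nuclei_of` helper, and the relation names in the usage ("evidence", "contrast") are assumptions drawn from general RST usage rather than from this document.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Span:
    """Start and end offsets of a document segment."""
    start: int
    end: int


@dataclass(frozen=True)
class MononuclearRelation:
    """One nucleus (the central segment) and one satellite (the supporting segment)."""
    name: str
    nucleus: Span
    satellite: Span


@dataclass(frozen=True)
class MultinuclearRelation:
    """Two or more nuclei of equal standing and no satellite."""
    name: str
    nuclei: Tuple[Span, ...]

    def __post_init__(self):
        if len(self.nuclei) < 2:
            raise ValueError("a multinuclear relation needs at least two nuclei")


def nuclei_of(relation):
    """Return the nucleus spans of either kind of relation as a tuple."""
    if isinstance(relation, MononuclearRelation):
        return (relation.nucleus,)
    return relation.nuclei


# Usage with illustrative relation names (assumed, not taken from the patent).
evidence = MononuclearRelation("evidence", nucleus=Span(0, 10), satellite=Span(11, 30))
contrast = MultinuclearRelation("contrast", nuclei=(Span(0, 5), Span(6, 12)))
```

Keeping the two kinds as separate types lets a query layer treat a satellite differently from a nucleus when deciding which segment text to return.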
29 Citations
32 Claims
1. A method for extraction of text from a set of text documents, the method comprising the steps of:
a) identifying a plurality of document segments within a given text document;
b) for each given document segment identified in a), generating and storing at least one structured annotation embedded within the document and associated with the given segment, the at least one structured annotation specifying the start and end of the given document segment and a rhetorical relation associated with the given segment;
c) processing the structured annotations generated and stored in b) to generate a plurality of variables that represent document segments and associated rhetorical relations as specified by the structured annotations;
d) storing the variables generated in c) in a repository;
e) receiving query input from a user that specifies at least one rhetorical relation of interest; and
f) in response to receipt of said query input, querying the variables stored in the repository to identify zero or more document segments that are associated with a rhetorical relation that matches the at least one rhetorical relation of interest specified by said query input for output to the user.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
16. A system for extraction of text from a set of text documents comprising:
text document annotation means for identifying a plurality of document segments within a given text document and, for each given document segment, generating and storing at least one structured annotation embedded within the document and associated with the given segment, the at least one structured annotation specifying the start and end of the given document segment and a rhetorical relation associated with the given segment;
annotation processing means for processing the structured annotations generated and stored by the text document annotation means to generate a plurality of variables that represent document segments and associated rhetorical relations as specified by the structured annotations;
a repository storing the variables generated by the annotation processing means;
user input query means for receiving query input from a user that specifies at least one rhetorical relation of interest; and
query processing logic, operably coupled to the user input query means and the repository, that utilizes said query input to query the variables stored in the repository to identify zero or more document segments that are associated with a rhetorical relation that matches the at least one rhetorical relation of interest specified by said query input for output to the user.
Dependent claims: 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32
Specification