System, method, and apparatus for information extraction of textual documents
First Claim
1. A method for extraction of text from a set of text documents, the method comprising the steps of:
a) identifying a plurality of document segments within a given text document;
b) for each given document segment identified in a), generating and storing at least one structured annotation embedded within the document and associated with the given segment, the at least one structured annotation specifying the start and end of the given document segment and a rhetorical relation associated with the given segment;
c) processing the structured annotations generated and stored in b) to generate a plurality of variables that represent document segments and associated rhetorical relations as specified by the structured annotations;
d) storing the variables generated in c) in a repository;
e) receiving query input from a user that specifies at least one rhetorical relation of interest; and
f) in response to receipt of said query input, querying the variables stored in the repository to identify zero or more document segments that are associated with a rhetorical relation that matches the at least one rhetorical relation of interest specified by said query input for output to the user.
Abstract
A method and system for text extraction employs structured annotations that are embedded within a text document and specify the start and end of a document segment and an associated rhetorical relation. The structured annotations are processed to generate and store variables that represent document segments and associated rhetorical relations. A user interacts with a computer to define query input that specifies at least one rhetorical relation of interest. The query input is processed to query the stored variables to identify document segments associated with a rhetorical relation that matches the rhetorical relation of interest and to return to the user information pertaining to the matching document segments. The rhetorical relation of interest as well as the stored variables can include RST relations whose meaning is dictated by nuclearity of the associated text as well as Speech Act relations whose meaning extends beyond the situational semantics of the associated text.
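The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. It assumes a hypothetical inline annotation syntax, `<seg rel="...">...</seg>`, and an in-memory repository. The names `SegmentVariable`, `Repository`, and `process_annotations`, and the sample relation labels, are introduced here purely for illustration. Nested annotations are not handled.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical embedded-annotation syntax (an assumption, not from the patent):
#   <seg rel="RELATION">segment text</seg>
SEG_PATTERN = re.compile(r'<seg rel="([^"]+)">(.*?)</seg>', re.DOTALL)


@dataclass(frozen=True)
class SegmentVariable:
    """One stored variable: a document segment plus its rhetorical relation."""
    doc_id: str
    start: int      # offset of the first character of the segment text
    end: int        # offset one past the last character of the segment text
    relation: str   # rhetorical relation named by the annotation
    text: str


class Repository:
    """In-memory stand-in for the repository of stored variables."""

    def __init__(self):
        self._by_relation = defaultdict(list)

    def store(self, variable):
        # Index by lower-cased relation so queries are case-insensitive.
        self._by_relation[variable.relation.lower()].append(variable)

    def query(self, relations):
        """Return zero or more segments whose relation matches the query."""
        hits = []
        for relation in relations:
            hits.extend(self._by_relation.get(relation.lower(), []))
        return hits


def process_annotations(doc_id, annotated_text):
    """Turn embedded annotations into variables (segment span + relation)."""
    variables = []
    for match in SEG_PATTERN.finditer(annotated_text):
        variables.append(
            SegmentVariable(
                doc_id=doc_id,
                start=match.start(2),
                end=match.end(2),
                relation=match.group(1),
                text=match.group(2),
            )
        )
    return variables


# Usage: annotate, process, store, then query by relation of interest.
doc = ('<seg rel="Evidence">Sales rose 12% last quarter.</seg> '
       '<seg rel="Promise">We will ship the update in May.</seg>')

repo = Repository()
for variable in process_annotations("doc-1", doc):
    repo.store(variable)

matches = repo.query(["promise"])
for m in matches:
    print(m.doc_id, m.start, m.end, m.relation, m.text)
```

Here "Evidence" stands in for an RST-style relation and "Promise" for a Speech Act relation. A query for a relation that no segment carries returns an empty list, which corresponds to the "zero or more document segments" wording of the claims.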
24 Citations
32 Claims
1. A method for extraction of text from a set of text documents, the method comprising the steps of:
a) identifying a plurality of document segments within a given text document;
b) for each given document segment identified in a), generating and storing at least one structured annotation embedded within the document and associated with the given segment, the at least one structured annotation specifying the start and end of the given document segment and a rhetorical relation associated with the given segment;
c) processing the structured annotations generated and stored in b) to generate a plurality of variables that represent document segments and associated rhetorical relations as specified by the structured annotations;
d) storing the variables generated in c) in a repository;
e) receiving query input from a user that specifies at least one rhetorical relation of interest; and
f) in response to receipt of said query input, querying the variables stored in the repository to identify zero or more document segments that are associated with a rhetorical relation that matches the at least one rhetorical relation of interest specified by said query input for output to the user.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
16. A system for extraction of text from a set of text documents comprising:
text document annotation means for identifying a plurality of document segments within a given text document and, for each given document segment, generating and storing at least one structured annotation embedded within the document and associated with the given segment, the at least one structured annotation specifying the start and end of the given document segment and a rhetorical relation associated with the given segment;
annotation processing means for processing the structured annotations generated and stored by the text document annotation means to generate a plurality of variables that represent document segments and associated rhetorical relations as specified by the structured annotations;
a repository storing the variables generated by the annotation processing means;
user input query means for receiving query input from a user that specifies at least one rhetorical relation of interest; and
query processing logic, operably coupled to the user input query means and the repository, that utilizes said query input to query the variables stored in the repository to identify zero or more document segments that are associated with a rhetorical relation that matches the at least one rhetorical relation of interest specified by said query input for output to the user.
Dependent claims: 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32
Specification