Architecture of a framework for information extraction from natural language documents
Abstract
A framework for information extraction from natural language documents is application independent and provides a high degree of reusability. The framework integrates different natural language and machine learning techniques, such as parsing and classification. The architecture of the framework is integrated in an easy-to-use access layer. The framework performs general information extraction, classification/categorization of natural language documents, processing and routing of automated electronic data transmissions (e.g., e-mail and facsimile), and plain parsing. Inside the framework, requests for information extraction are passed to the actual extractors. The framework handles both pre- and post-processing of the application data, controls the extractors, and enriches the information they extract. The framework can also suggest necessary actions the application should take on the data. To achieve the goal of easy integration and extension, the framework provides an integration (outside) application program interface (API) and an extractor (inside) API.
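The split between an integration (outside) API and an extractor (inside) API can be illustrated with a minimal sketch. Every name here (`Extractor`, `KeywordClassifier`, `Framework`, `register`, `process`) is hypothetical and chosen only for illustration; the source does not specify an implementation, and the keyword lookup stands in for a real parsing or classification technique.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Extractor(Protocol):
    """Hypothetical extractor (inside) API: anything with an extract() method."""

    def extract(self, text: str) -> dict[str, str]:
        ...


@dataclass
class KeywordClassifier:
    """Toy extractor: assigns the category of the first keyword found."""

    keywords: dict[str, str]

    def extract(self, text: str) -> dict[str, str]:
        lowered = text.lower()
        for word, category in self.keywords.items():
            if word in lowered:
                return {"category": category}
        return {"category": "unknown"}


@dataclass
class Framework:
    """Hypothetical integration (outside) API called by the application."""

    extractors: dict[str, Extractor] = field(default_factory=dict)

    def register(self, name: str, extractor: Extractor) -> None:
        self.extractors[name] = extractor

    def process(self, document: str, extractor_name: str) -> dict[str, str]:
        # Pre-processing: collapse runs of whitespace.
        text = " ".join(document.split())
        # Control: hand the request to the selected extractor.
        result = dict(self.extractors[extractor_name].extract(text))
        # Post-processing: enrich the result with bookkeeping data.
        result["extractor"] = extractor_name
        result["length"] = str(len(text))
        return result


framework = Framework()
framework.register("mail", KeywordClassifier({"invoice": "billing"}))
result = framework.process("Please find the   INVOICE attached", "mail")
```

In this sketch the application only ever calls `process`, so a new extractor can be registered without touching application code, which is the kind of reuse the abstract describes.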
13 Claims
-
1. An information extraction architecture capable of extracting information from natural language documents, comprising:
application program interfacing means for receiving a natural language document from an application program and converting the natural language document into raw data, said interfacing means being configurable without altering a source code of said application program;
interfacing extraction means for receiving the raw data from the application program interfacing means and providing the raw data to an extractor, whereby the extractor extracts text and information from the raw data; and
action means for providing an application independent external action on the extracted data and outputting the application independent external action extracted data to the application program.
initializing means for initializing a framework in response to the application program;
storing means for storing the extracted information; and
retrieving means for retrieving the stored extracted information.
-
3. The architecture of claim 2, further comprising terminating means for terminating the extraction processes and clearing free memory.
-
4. The architecture of claim 1, wherein the action means includes:
action execution means for executing a desired action, the action execution means being application dependent.
-
5. The architecture of claim 1, further comprising:
storing means for storing incoming text data received from the application program interfacing means prior to the text data being extracted and processed.
-
6. The architecture of claim 1, wherein the application program interfacing means is an input access layer that is capable of interfacing with application programs.
-
7. The architecture of claim 1, wherein the interfacing extraction means includes preprocessing means for cleaning the raw data.
-
8. The architecture of claim 7, wherein the preprocessing means includes at least one of (i) stripping means for stripping irrelevant pieces of text, (ii) filter means for filtering out special characters of tags, and (iii) converting means for converting between different character sets.
-
9. The architecture of claim 1, further comprising controlling means for controlling the interfacing extractor means during selection of a desired extractor.
-
10. The architecture of claim 1, further comprising recording means for providing a record of all information gathered during extraction of the raw data.
-
11. The architecture of claim 1, further comprising library means associated with the application program interfacing means, the library means providing a library of application programs so that the application program interfacing means interfaces between the application programs and the interfacing extracting means.
-
12. The architecture of claim 1, wherein the application program interfacing means is modular.
-
13. A method of extracting information from natural language documents, comprising:
interfacing an application program with a framework for information extraction from natural language documents, said framework being configurable without altering a source code of the application program;
interfacing extracted information from raw data, received in response to interfacing the application program with an extractor, the extracted information being processed text and information representation; and
providing an application independent action specification, associated with the extracted information, and outputting the application independent action specification to the application program for performing an application dependent implementation of the action specification.
-
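The preprocessing steps named in claim 8 and the application independent action specification of claim 13 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: all function and class names are hypothetical, "irrelevant text" is assumed to mean quoted reply lines, the source encoding is assumed to be Latin-1, and the keyword rule in `suggest_action` stands in for a real extractor.

```python
import re
from dataclasses import dataclass
from typing import Callable


def convert_charset(raw: bytes, source_encoding: str = "latin-1") -> str:
    """Converting step: decode raw bytes into a Unicode string."""
    return raw.decode(source_encoding)


def filter_tags(text: str) -> str:
    """Filtering step: replace anything that looks like a markup tag."""
    return re.sub(r"<[^>]+>", " ", text)


def strip_irrelevant(text: str) -> str:
    """Stripping step: drop quoted reply lines, then normalize whitespace."""
    kept = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    return " ".join(" ".join(kept).split())


def preprocess(raw: bytes) -> str:
    """Run the three cleaning steps in sequence."""
    return strip_irrelevant(filter_tags(convert_charset(raw)))


@dataclass(frozen=True)
class ActionSpec:
    """Application independent description of what should happen."""

    name: str
    argument: str


def suggest_action(text: str) -> ActionSpec:
    """Toy rule standing in for an extractor plus action logic."""
    if "ship" in text.lower():
        return ActionSpec("route", "logistics")
    return ActionSpec("archive", "inbox")


def run_action(spec: ActionSpec, handlers: dict[str, Callable[[str], str]]) -> str:
    """Application dependent part: the application supplies the handlers."""
    return handlers[spec.name](spec.argument)


raw_mail = b"<p>Caf\xe9 order</p>\n> old quoted text\nPlease ship today."
text = preprocess(raw_mail)
spec = suggest_action(text)
handlers = {
    "route": lambda dept: f"forwarded to {dept}",
    "archive": lambda folder: f"filed in {folder}",
}
outcome = run_action(spec, handlers)
```

The framework side produces only an `ActionSpec`; what "route" actually does is decided by the handlers the application passes in, which mirrors the split between the application independent specification and its application dependent implementation.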
Specification