Architecture of a framework for information extraction from natural language documents

US 6,553,385 B2
Filed: 09/01/1998
Issued: 04/22/2003
Est. Priority Date: 09/01/1998
Status: Expired due to Fees

Abstract

A framework for information extraction from natural language documents is application independent and provides a high degree of reusability. The framework integrates different Natural Language/Machine Learning techniques, such as parsing and classification. The architecture of the framework is integrated in an easy-to-use access layer. The framework performs general information extraction, classification/categorization of natural language documents, automated electronic data transmission (e.g., E-mail and facsimile) processing and routing, and plain parsing. Inside the framework, requests for information extraction are passed to the actual extractors. The framework can handle both pre- and postprocessing of the application data, control the extractors, and enrich the information they extract. The framework can also suggest necessary actions the application should take on the data. To achieve the goal of easy integration and extension, the framework provides an integration (outside) application program interface (API) and an extractor (inside) API.
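The two-API split described in the abstract can be pictured with a minimal Python sketch. It is an illustration only, not the patented implementation: the names `Extractor`, `KeywordClassifier`, `Framework`, `register`, and `process`, along with the whitespace cleanup, the `length` enrichment, and the `route`/`review` suggestions, are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Extractor(Protocol):
    """Extractor (inside) API: any object with an extract() method can be plugged in."""

    def extract(self, raw: str) -> dict: ...


class KeywordClassifier:
    """Toy extractor (hypothetical) that assigns a category by keyword lookup."""

    def __init__(self, keywords: dict[str, str]) -> None:
        self.keywords = keywords

    def extract(self, raw: str) -> dict:
        lowered = raw.lower()
        for word, category in self.keywords.items():
            if word in lowered:
                return {"category": category}
        return {"category": "unknown"}


@dataclass
class Framework:
    """Integration (outside) API: the application only calls register() and process()."""

    extractors: dict[str, Extractor] = field(default_factory=dict)

    def register(self, name: str, extractor: Extractor) -> None:
        self.extractors[name] = extractor

    def process(self, document: str, extractor_name: str) -> dict:
        # Preprocessing (assumed): collapse runs of whitespace into single spaces.
        raw = " ".join(document.split())
        # Pass the request to the selected extractor.
        info = self.extractors[extractor_name].extract(raw)
        # Enrichment (assumed): add data the extractor did not produce itself.
        info["length"] = len(raw)
        # Suggest an action; carrying it out is left to the application.
        info["suggested_action"] = "route" if info["category"] != "unknown" else "review"
        return info


framework = Framework()
framework.register("mail", KeywordClassifier({"invoice": "billing", "password": "support"}))
result = framework.process("Please  resend my\ninvoice", "mail")
```

In this sketch a new extractor is added by registering another object with an `extract()` method, so the application code that calls `process()` does not change.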

198 Citations

13 Claims

1. An information extraction architecture capable of extracting information from natural language documents, comprising:
- application program interfacing means for receiving a natural language document from an application program and converting the natural language document into raw data, said interfacing means being configurable without altering a source code of said application program;
- interfacing extraction means for receiving the raw data from the application program interfacing means and providing the raw data to an extractor, whereby the extractor extracts text and information from the raw data; and
- action means for providing an application independent external action on the extracted data and outputting the application independent external action extracted data to the application program.
2. The architecture of claim 1, further comprising:
3. The architecture of claim 2, further comprising terminating means for terminating the extraction processes and clearing free memory.
4. The architecture of claim 1, wherein the action means includes:
- action execution means for executing a desired action, the action execution means being application dependent.
5. The architecture of claim 1, further comprising storing means for storing incoming text data received from the application program interfacing means prior to the text data being extracted and processed.
6. The architecture of claim 1, wherein the application program interfacing means is an input access layer that is capable of interfacing with application programs.
7. The architecture of claim 1, wherein the interfacing extraction means includes preprocessing means for cleaning the raw data.
8. The architecture of claim 7, wherein the preprocessing means includes at least one of (i) stripping means for stripping irrelevant pieces of text, (ii) filter means for filtering out special characters of tags and (iii) converting means for converting between different character sets.
9. The architecture of claim 1, further comprising controlling means for controlling the interfacing extractor means during selection of a desired extractor.
10. The architecture of claim 1, further comprising recording means for providing a record of all information gathered during extraction of the raw data.
11. The architecture of claim 1, further comprising library means associated with the application program interfacing means, the library means providing a library of application programs so that the application program interfacing means interfaces between the application programs and the interfacing extracting means.
12. The architecture of claim 1, wherein the application program interfacing means is modular.

13. A method of extracting information from natural language documents, comprising:
- interfacing an application program with a framework for information extraction from natural language documents, said framework being configurable without altering a source code of the application program;
- interfacing extracted information from raw data, received in response to interfacing the application program with an extractor, the extracted information being processed text and information representation; and
- providing an application independent action specification, associated with the extracted information, and outputting the application independent action specification to the application program for performing an application dependent implementation of the action specification.
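
Two ideas from the claims lend themselves to a short illustration: the preprocessing steps of claim 8 (stripping, filtering, character-set conversion) and the split in claim 13 between an application independent action specification and its application dependent implementation. The Python sketch below is one possible reading, not the patent's implementation. Every name in it (`preprocess`, `ActionSpec`, `suggest_action`, `run_action`, the queue names) is hypothetical, and treating quoted-reply lines as "irrelevant text" is an assumption made for the example.

```python
import html
import re
from dataclasses import dataclass
from typing import Callable


def convert_charset(data: bytes, source_encoding: str = "latin-1") -> str:
    """Converting step: decode bytes from the source character set into text."""
    return data.decode(source_encoding)


def strip_irrelevant(text: str) -> str:
    """Stripping step (assumed rule): drop quoted-reply lines that start with '>'."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith(">")]
    return "\n".join(kept)


def filter_tags(text: str) -> str:
    """Filtering step: remove markup tags, then decode character entities."""
    return html.unescape(re.sub(r"<[^>]+>", "", text))


def preprocess(data: bytes, source_encoding: str = "latin-1") -> str:
    """Chain the three cleaning steps before the text reaches an extractor."""
    text = convert_charset(data, source_encoding)
    return filter_tags(strip_irrelevant(text)).strip()


@dataclass(frozen=True)
class ActionSpec:
    """Application independent description of what should happen next."""

    name: str
    argument: str


def suggest_action(category: str) -> ActionSpec:
    """Map an extracted category to an action specification (hypothetical table)."""
    routes = {"billing": "accounts-queue", "support": "helpdesk-queue"}
    if category in routes:
        return ActionSpec("route", routes[category])
    return ActionSpec("review", "manual")


def run_action(spec: ActionSpec, handlers: dict[str, Callable[[str], str]]) -> str:
    """Application dependent side: the application supplies the handlers."""
    return handlers[spec.name](spec.argument)


cleaned = preprocess("> old reply\n<b>Gr\xfc\xdfe</b> &amp; thanks".encode("latin-1"))
handlers = {"route": lambda queue: f"sent to {queue}", "review": lambda _: "held for review"}
outcome = run_action(suggest_action("billing"), handlers)
```

The framework side only produces an `ActionSpec`; what "route" means in practice (a mail forward, a database write) stays in the handlers that each application provides.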

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Hampp-Bahnmueller, Thomas; Johnson, David E.
Primary Examiner(s): Metjahic, Safet
Assistant Examiner(s): Le, Uyen T
Application Number: US09/145,408
Publication Number: US 20020007358A1
Time in Patent Office: 1,694 Days
Field of Search: 707/1, 707/2, 707/3-6, 707/104.1, 706/45, 704/9, 709/328
US Class Current: 1/1
CPC Class Codes:
G06F 40/268   Morphological analysis
G06F 40/284   Lexical analysis, e.g. toke...
Y10S 707/99945   Object-oriented database st...
Y10S 707/99948   Application of database or ...
