Electronic document source ingestion for natural language processing systems

US 9,053,086 B2
Filed: 12/12/2012
Issued: 06/09/2015
Est. Priority Date: 12/10/2012
Status: Expired due to Fees

First Claim

1. A method, comprising:

receiving a plurality of electronic documents, wherein each electronic document is arranged according to a different, respective format comprising a plurality of headers;

identifying a properties file associated with one of the electronic documents, the properties file defining a particular header of the respective format in the one electronic document, an action corresponding to a text portion associated with the particular header, and an extension class;

instantiating a preprocessor for parsing the one electronic document based on the extension class, wherein the preprocessor is configured to parse only electronic documents arranged using the respective format;

parsing the one electronic document to identify the particular header using one or more processors and the preprocessor;

upon identifying the text portion associated with the particular header, performing the action to the text portion by assigning the text portion to a formatting element of a normalized format; and

storing the text portion into a natural language processing (NLP) object based on the formatting element of the normalized format, wherein text in the NLP object is arranged based on the normalized format.
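
The steps of the claim can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `key = value` properties syntax, the `extension.class` and `header.*` keys, the `TAG: text` document format, and the `JournalPreprocessor` and `NLPObject` classes.

```python
from dataclasses import dataclass, field

# Hypothetical properties file; the keys and values are illustrative only.
PROPERTIES_TEXT = """\
extension.class = JournalPreprocessor
header.TI = title
header.AB = section
"""


def load_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


@dataclass
class NLPObject:
    """Stand-in for the NLP object: text grouped by normalized element."""
    elements: dict = field(default_factory=dict)

    def store(self, element, text):
        self.elements.setdefault(element, []).append(text)


class JournalPreprocessor:
    """Parses only the assumed 'TAG: text' journal format."""

    def __init__(self, props):
        # Map each configured header to its normalized formatting element.
        self.actions = {
            key[len("header."):]: value
            for key, value in props.items()
            if key.startswith("header.")
        }

    def parse(self, document, nlp_object):
        for line in document.splitlines():
            header, sep, text = line.partition(":")
            if sep and header.strip() in self.actions:
                # The "action": assign the text to a normalized element.
                nlp_object.store(self.actions[header.strip()], text.strip())


# Lookup from extension class name to preprocessor type (assumed mechanism).
PREPROCESSORS = {"JournalPreprocessor": JournalPreprocessor}


def ingest(document, properties_text, nlp_object):
    props = load_properties(properties_text)
    preprocessor = PREPROCESSORS[props["extension.class"]](props)
    preprocessor.parse(document, nlp_object)
    return nlp_object


doc = "TI: Parsing at scale\nAU: A. Author\nAB: We describe a parser."
cas = ingest(doc, PROPERTIES_TEXT, NLPObject())
print(cas.elements)
# {'title': ['Parsing at scale'], 'section': ['We describe a parser.']}
```

The `AU` line is dropped because the sample properties define no action for that header, so only the configured headers reach the normalized object.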

Abstract

The data store for a natural-language computing system may include information that originates from a plurality of different data sources (e.g., journals, websites, magazines, reference books, and the like). In one embodiment, the information or text from the data sources is converted into a single, shared format and stored as objects in a data store. In order to ingest the different documents with their respective formats, a natural language processing system may perform preprocessing to change the different formats into a normalized format. When a new text document is received, the text may be correlated to a particular properties file which includes instructions specifying how the preprocessor should interpret the received text. Based on these instructions, a preprocessor identifies relevant portions of the text document and assigns these portions to formatting elements in the normalized format. The text may then be stored in the objects based on this assignment.
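
The abstract does not say how a received document is correlated to its properties file. One possible approach, sketched below purely as an assumption, is to pick the properties entry whose headers start the most lines of the document. The `PROPERTIES` dictionary, its header strings, and the `correlate` function are hypothetical.

```python
# Hypothetical in-memory "properties files", one per source format.
PROPERTIES = {
    "journal": {"headers": {"TI:": "title", "AB:": "section"}},
    "wiki": {"headers": {"== Title ==": "title", "== Body ==": "section"}},
}


def count_hits(document, entry):
    """Count the lines that begin with one of the entry's headers."""
    return sum(
        1
        for line in document.splitlines()
        if any(line.startswith(header) for header in entry["headers"])
    )


def correlate(document, properties):
    """Return the name of the best-matching properties entry.

    Assumed heuristic: the entry whose headers start the most lines wins.
    """
    best = max(properties, key=lambda name: count_hits(document, properties[name]))
    if count_hits(document, properties[best]) == 0:
        raise ValueError("no properties file matches this document")
    return best


journal_doc = "TI: Parsing at scale\nAB: We describe a parser."
wiki_doc = "== Title ==\nIngestion\n== Body ==\nSome text."
print(correlate(journal_doc, PROPERTIES))  # journal
print(correlate(wiki_doc, PROPERTIES))     # wiki
```

A real system might instead key the lookup on the data source or file name; the header-matching rule here is only one way to make the correlation concrete.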

96 Citations

7 Claims

- 2. The method of claim 1, wherein the properties file is one of a plurality of properties files, wherein each properties file is associated with one of the respective formats of the electronic documents.
- 3. The method of claim 1, wherein the property file includes a plurality of formatting elements of the respective format, the plurality of formatting elements comprises a title and a section in the one electronic document.
- 4. The method of claim 1, wherein the NLP object comprises text portions retrieved from other ones of the plurality of electronic documents, wherein the text portions are assigned to the formatting element of the normalized format.
- 5. The method of claim 4, wherein the NLP object is a common analysis system (CAS) data structure.
- 6. The method of claim 1, further comprising:
  - annotating the text in the NLP object for use in a natural-language computing system where the natural-language computing system uses the annotated text to communicate with a user.
- 7. The method of claim 1, wherein instantiating the preprocessor comprises:
  - selecting a type of preprocessor based on the extension class, wherein each type of preprocessor corresponds to a different data source transmitting the plurality of electronic documents.
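
Claims 4 and 7 together suggest one preprocessor type per data source, all feeding text into a shared normalized object. The sketch below illustrates that idea under stated assumptions: the registry decorator, both preprocessor classes, and the two line formats are hypothetical, and a plain dictionary stands in for the NLP object (the CAS data structure of claim 5 is not modeled).

```python
from collections import defaultdict

REGISTRY = {}


def register(cls):
    """Record a preprocessor type under its class name (the 'extension class')."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
class ColonTagPreprocessor:
    """Hypothetical source A: lines look like 'TITLE: some text'."""

    def parse(self, document):
        for line in document.splitlines():
            tag, sep, text = line.partition(":")
            if sep:
                yield tag.strip().lower(), text.strip()


@register
class BracketTagPreprocessor:
    """Hypothetical source B: lines look like '[title] some text'."""

    def parse(self, document):
        for line in document.splitlines():
            if line.startswith("[") and "]" in line:
                tag, _, text = line[1:].partition("]")
                yield tag.strip().lower(), text.strip()


def ingest_all(batch):
    """Select a preprocessor per document and merge text into one object."""
    nlp_object = defaultdict(list)
    for extension_class, document in batch:
        preprocessor = REGISTRY[extension_class]()
        for element, text in preprocessor.parse(document):
            nlp_object[element].append(text)
    return dict(nlp_object)


batch = [
    ("ColonTagPreprocessor", "TITLE: Alpha\nSECTION: First body"),
    ("BracketTagPreprocessor", "[title] Beta\n[section] Second body"),
]
merged = ingest_all(batch)
print(merged)
# {'title': ['Alpha', 'Beta'], 'section': ['First body', 'Second body']}
```

Text from both source formats ends up under the same `title` and `section` elements, which is the point of normalizing before storage.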

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Dubbels, Joel C.
Primary Examiner(s): Cao, Phuong Thao
Application Number: US13/711,788
Publication Number: US 20140164408A1
Time in Patent Office: 909 Days
Field of Search: 707/755, 707/756
US Class Current: 1/1
CPC Class Codes:
G06F 40/154 Tree transformation for tre...
G06F 40/205 Parsing
