ELECTRONIC DOCUMENT SOURCE INGESTION FOR NATURAL LANGUAGE PROCESSING SYSTEMS
First Claim
1. A method, comprising:
receiving a plurality of electronic documents, wherein the electronic documents are arranged according to different, respective formats;
identifying a properties file associated with one of the electronic documents, the properties file defining a formatting element of the respective format in the one electronic document and an action corresponding to a text portion associated with the formatting element;
parsing the one electronic document to identify the formatting element using one or more processors;
upon identifying the text portion associated with the identified formatting element, performing the action to the text portion by assigning the text portion to a formatting element of a normalized format; and
storing the text portion into a natural language processing (NLP) object based on the formatting element of the normalized format, wherein text in the NLP object is arranged based on the normalized format.
1 Assignment
0 Petitions
Abstract
The data store for a natural-language computing system may include information that originates from a plurality of different data sources (e.g., journals, websites, magazines, reference books, and the like). In one embodiment, the information or text from the data sources is converted into a single, shared format and stored as objects in a data store. To ingest the different documents with their respective formats, a natural language processing system may perform preprocessing to change the different formats into a normalized format. When a new text document is received, the text may be correlated to a particular properties file, which includes instructions specifying how the preprocessor should interpret the received text. Based on these instructions, a preprocessor identifies relevant portions of the text document and assigns these portions to formatting elements in the normalized format. The text may then be stored in the objects based on this assignment.
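The preprocessing flow described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the properties files are inlined as a dictionary named `PROPERTIES`, formatting elements are recognized with regular expressions, the normalized format has only `title` and `body` elements, and `NLPObject` and `ingest` are hypothetical names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class NLPObject:
    """Hypothetical container: text grouped by normalized formatting element."""
    fields: dict = field(default_factory=dict)

    def store(self, element: str, text: str) -> None:
        self.fields.setdefault(element, []).append(text)


# Assumed stand-in for per-format properties files. Each rule pairs a regex
# that identifies a formatting element of the source format (the capture group
# is the associated text portion) with the normalized element to assign it to.
PROPERTIES = {
    "html": [
        (r"<h1>(.*?)</h1>", "title"),
        (r"<p>(.*?)</p>", "body"),
    ],
    "markdown": [
        (r"^# (.+)$", "title"),
        (r"^(?!#)(\S.*)$", "body"),  # any non-empty line that is not a heading
    ],
}


def ingest(document: str, source_format: str) -> NLPObject:
    """Parse one document using the rules for its format and store the
    matched text portions under normalized elements."""
    rules = PROPERTIES[source_format]  # identify the properties for this format
    nlp_object = NLPObject()
    for pattern, normalized_element in rules:
        for match in re.finditer(pattern, document, flags=re.MULTILINE):
            # The "action" here is simply assignment to the normalized element.
            nlp_object.store(normalized_element, match.group(1).strip())
    return nlp_object


html_doc = "<h1>Cats</h1><p>Cats are mammals.</p>"
markdown_doc = "# Cats\nCats are mammals."

html_obj = ingest(html_doc, "html")
markdown_obj = ingest(markdown_doc, "markdown")
```

In this sketch, two documents in different source formats end up with the same normalized arrangement, so later processing would only need to understand the `title` and `body` elements rather than each source format.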
92 Citations
7 Claims
Specification