Preprocessing of string inputs in natural language processing
Abstract
Natural language processing of raw text data for optimal sentence boundary placement. Raw text is extracted from a document and subject to cleaning. The extracted raw text is examined to identify preliminary sentence boundaries, which are used to identify potential sentences in the raw text. One or more potential sentences are assigned a well-formedness score. A value of the score correlates to whether the potential sentence is a truncated/ill-formed sentence or a well-formed sentence. One or more preliminary sentence boundaries are optimized depending on the value of the score of the potential sentence(s). Accordingly, the processing herein is an optimization that creates a sentence boundary optimized output.
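The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the regular expression, the three-check scoring heuristic, and the 0.7 threshold are all assumptions made for the example, since the abstract does not say how the well-formedness score is computed.

```python
import re

def preliminary_sentences(text):
    """Split on sentence-final punctuation or line breaks (preliminary boundaries)."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]

def well_formedness(candidate):
    """Toy score in [0, 1] (assumed heuristic): capitalized start,
    terminal punctuation, and at least three words."""
    checks = [
        candidate[:1].isupper(),
        candidate.endswith((".", "!", "?")),
        len(candidate.split()) >= 3,
    ]
    return sum(checks) / len(checks)

def optimize_boundaries(text, threshold=0.7):
    """Greedy pass: if the previous candidate scores below the threshold,
    drop the boundary after it and merge the next candidate into it."""
    out = []
    for cand in preliminary_sentences(text):
        if out and well_formedness(out[-1]) < threshold:
            out[-1] = out[-1] + " " + cand
        else:
            out.append(cand)
    return out

raw = "The patient was seen\nin the clinic today. She reported no pain."
print(optimize_boundaries(raw))
```

Here the line break inside the first sentence produces two low-scoring fragments, and the merge removes that preliminary boundary while leaving the boundary before the second sentence in place.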
20 Claims
1. A computer system comprising:
a processing unit in communication with a memory; and
a functional unit in communication with the processing unit having a tool for natural language processing, the tool to;
determine optimal sentence boundary placement with a received string input comprising;
identify two or more preliminary sentence boundaries within the input;
identify two or more first potential sentences within the input utilizing the two or more preliminary sentence boundaries;
assign a first score to each first potential sentence, wherein each assigned first score corresponds to a probability of each potential sentence of the two or more first potential sentences being an actual sentence; and
selectively identify a grouping comprising at least two adjacent potential sentences based on a relationship to the assigned first scores;
categorize each of the two adjacent sentences as one of ill-formed prose (IFP) and semi-structure entity constructs (SSECs);
upon determining there are IFPs for further processing;
merge the at least two adjacent first potential sentences to create a second potential sentence; and
iteratively assign a second score to the created second potential sentence and merge at least one additional sentence adjacent to the created second potential sentence until there are no further IFPs to process, any SSECs are normalized, and a sentence boundary optimized output is created as a function of the iteratively assigned second score; and
output the sentence boundary optimized output to replace the adjacent first and second potential sentences.
Dependent claims: 2, 3, 4, 13, 14, 15, 16
5. A computer program product for natural language processing, the computer program product comprising a computer readable storage medium having program code embodied therewith, the program code executable by a processor to:
determine optimal sentence boundary placement with a received string input comprising;
identify two or more preliminary sentence boundaries within the input;
identify two or more first potential sentences within the input utilizing the two or more preliminary sentence boundaries;
determine a confidence score for each potential sentence using a parser;
assign a first score to each first potential sentence, wherein each assigned first score corresponds to a probability of each potential sentence of the two or more first potential sentences being an actual sentence; and
selectively identify a grouping comprising at least two adjacent potential sentences based on a relationship to the assigned first scores;
categorize each of the two adjacent sentences as one of ill-formed prose (IFP) and semi-structure entity constructs (SSECs);
upon determining there are IFPs for further processing;
merge the at least two adjacent first potential sentences to create a second potential sentence; and
iteratively assign a second score to the created second potential sentence and merge at least one additional sentence adjacent to the created second potential sentence until there are no further IFPs to process, any SSECs are normalized, and a sentence boundary optimized output is created as a function of the iterative second score; and
output the sentence boundary optimized output to replace the adjacent first and second potential sentences.
Dependent claims: 6, 7, 8, 17, 18
9. A method for natural language processing comprising:
determining optimal sentence boundary placement with a received string input comprising;
identifying two or more preliminary sentence boundaries within the input;
identifying two or more first potential sentences within the input utilizing the two or more preliminary sentence boundaries;
determine a confidence score for each potential sentence using a parser;
assigning a first score to each first potential sentence, wherein each assigned first score corresponds to a probability of each potential sentence of the two or more first potential sentences being an actual sentence; and
selectively identifying a grouping comprising at least two adjacent potential sentences based on a relationship to the assigned first scores;
categorizing each of the two adjacent sentences as one of ill-formed prose (IFP) and semi-structure entity constructs (SSECs); and
upon determining there are IFPs for further processing;
merging the at least two adjacent first potential sentences to create a second potential sentence; and
iteratively assigning a second score to the created second potential sentence and merging at least one additional sentence adjacent to the created second potential sentence until there are no further IFPs to process, any SSECs are normalized, and a sentence boundary optimized output is created as a function of the iterative second score; and
outputting the sentence boundary optimized output to replace the adjacent first and second potential sentences.
Dependent claims: 10, 11, 12, 19, 20
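As a reading aid for the categorize, merge, and rescore steps recited in the independent claims, the following sketch shows one way such a loop might look. Everything here is hypothetical: the claims do not define how an SSEC is recognized or normalized, so the "label: value" pattern, the `normalize_ssec` helper, the stand-in `score` function (in place of a parser confidence score), and the 0.7 threshold are assumptions chosen only to make the example runnable.

```python
import re

def score(candidate):
    """Stand-in for the first/second score (assumed heuristic, not a parser)."""
    checks = [
        candidate[:1].isupper(),
        candidate.endswith((".", "!", "?")),
        len(candidate.split()) >= 3,
    ]
    return sum(checks) / len(checks)

def looks_like_ssec(candidate):
    """Assumption: a 'label: value' fragment is a semi-structured construct."""
    return re.match(r"^[\w /]+:\s*\S", candidate) is not None

def normalize_ssec(candidate):
    """Assumed normalization: tidy the whitespace around the first colon."""
    label, value = candidate.split(":", 1)
    return f"{label.strip()}: {value.strip()}"

def resolve(candidates, threshold=0.7):
    items = list(candidates)
    i = 0
    while i < len(items):
        if score(items[i]) >= threshold:
            i += 1                                  # already well formed
        elif looks_like_ssec(items[i]):
            items[i] = normalize_ssec(items[i])     # SSEC: normalize, do not merge
            i += 1
        elif i + 1 < len(items) and not looks_like_ssec(items[i + 1]):
            # IFP: merge with the adjacent candidate; the merged text is
            # rescored on the next pass because i is not advanced.
            items[i:i + 2] = [items[i] + " " + items[i + 1]]
        else:
            i += 1                                  # nothing suitable to merge with
    return items

fragments = ["BP :  120/80", "Patient denies", "chest pain",
             "or shortness of breath.", "Follow up in two weeks."]
print(resolve(fragments))
```

The loop terminates because each pass either advances the index or shortens the list by one merge.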
Specification