Preprocessing of string inputs in natural language processing

US 10,372,816 B2
Filed: 12/13/2016
Issued: 08/06/2019
Est. Priority Date: 12/13/2016
Status: Active Grant

Abstract

Natural language processing of raw text data for optimal sentence boundary placement. Raw text is extracted from a document and subject to cleaning. The extracted raw text is examined to identify preliminary sentence boundaries, which are used to identify potential sentences in the raw text. One or more potential sentences are assigned a well-formedness score. A value of the score correlates to whether the potential sentence is a truncated/ill-formed sentence or a well-formed sentence. One or more preliminary sentence boundaries are optimized depending on the value of the score of the potential sentence(s). Accordingly, the processing herein is an optimization that creates a sentence boundary optimized output.
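
The flow described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the use of line breaks as preliminary sentence boundaries, and the toy surface-level score are hypothetical stand-ins, not the scoring or merging method of the patent.

```python
def preliminary_sentences(text):
    """Treat line breaks as preliminary sentence boundaries (an assumption)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def well_formedness(candidate):
    """Toy score in [0, 1]: the fraction of three surface checks that pass.

    This is a hypothetical stand-in for a real well-formedness score.
    """
    checks = [
        candidate[:1].isupper(),              # starts with a capital letter
        candidate.endswith((".", "!", "?")),  # ends with terminal punctuation
        len(candidate.split()) >= 3,          # has at least three words
    ]
    return sum(checks) / len(checks)


def optimize_boundaries(text, threshold=1.0):
    """Merge a low-scoring candidate with the next one when the score does not drop."""
    output = []
    for candidate in preliminary_sentences(text):
        if output and well_formedness(output[-1]) < threshold:
            merged = output[-1] + " " + candidate
            if well_formedness(merged) >= well_formedness(output[-1]):
                output[-1] = merged  # drop the boundary between the two candidates
                continue
        output.append(candidate)
    return output


raw = "The patient was admitted\nto the hospital on Monday.\nShe was discharged Friday."
print(optimize_boundaries(raw))
```

In this example the first two lines are merged because the first line lacks terminal punctuation, while the third line is kept as its own sentence.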

20 Claims

1. A computer system comprising:
- a processing unit in communication with a memory; and
  
  a functional unit in communication with the processing unit having a tool for natural language processing, the tool to;
  
  determine optimal sentence boundary placement with a received string input comprising;
  
  identify two or more preliminary sentence boundaries within the input;
  
  identify two or more first potential sentences within the input utilizing the two or more preliminary sentence boundaries;
  
  assign a first score to each first potential sentence, wherein each assigned first score corresponds to a probability of each potential sentence of the two or more first potential sentences being an actual sentence; and
  
  selectively identify a grouping comprising at least two adjacent potential sentences based on a relationship to the assigned first scores;
  
  categorize each of the two adjacent sentences as one of ill-formed prose (IFP) and semi-structure entity constructs (SSECs);
  
  upon determining there are IFPs for further processing;
  
  merge the at least two adjacent first potential sentences to create a second potential sentence; and
  
  iteratively assign a second score to the created second potential sentence and merge at least one additional sentence adjacent to the created second potential sentence until there are no further IFPs to process, any SSECs are normalized, and a sentence boundary optimized output is created as a function of the iteratively assigned second score; and
  
  output the sentence boundary optimized output to replace the adjacent first and second potential sentences.
- Dependent claims (2, 3, 4, 13, 14, 15, 16)
  - 2. The system of claim 1, wherein the merge of potential sentences further comprising the tool to:
    - create the second potential sentence utilizing movement of at least one of the preliminary sentence boundaries.
  - 3. The system of claim 2, further comprising the tool to:
    - correspond the second score to a probability of the second potential sentence being an actual sentence as determined through rules-based grammar usage; and
      
      determine the assigned second score of the created second potential sentence is greater than the assigned first score of the potential sentences utilized to create the second potential sentence.
  - 4. The system of claim 1, further comprising the tool to:
    - determine a quantity of consecutive potential sentences in the grouping, wherein the creation of the sentence boundary optimized output utilizes the determined quantity.
  - 13. The system of claim 1, wherein the probability is within a numerical range corresponding to rules-based grammar usage, the numerical range further corresponds to ill-formed prose and well-formed prose as a function of the grammar usage rules.
  - 14. The system of claim 1, wherein replacing the adjacent first and second potential sentences with the sentence optimized output comprises optimization of sentence boundary output, thereby increasing the efficiency of downstream processing of raw text data.
  - 15. The computer system of claim 1, wherein iterative assignment of the second score comprises a determination that the assigned score exceeds a predetermined threshold, thereby creating the sentence boundary optimized output.
  - 16. The computer system of claim 1, wherein iterative assignment of the second score comprises a determination that a predetermined number of iterations are performed and the iteration with the highest score is selected as the sentence boundary optimized output.
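
Claims 15 and 16 name two ways the iterative merge can end: the score exceeds a predetermined threshold, or a predetermined number of iterations is performed and the highest-scoring iteration is selected. The sketch below combines both in one loop. The function names, the default values, and the toy score are assumptions made for illustration only.

```python
def toy_score(candidate):
    """Hypothetical score in [0, 1] based on three surface checks."""
    checks = [
        candidate[:1].isupper(),
        candidate.endswith((".", "!", "?")),
        len(candidate.split()) >= 3,
    ]
    return sum(checks) / len(checks)


def merge_until_done(fragments, score=toy_score, threshold=0.9, max_iterations=3):
    """Merge adjacent fragments one at a time and return (text, score).

    Stops early once a merged candidate scores above `threshold`; otherwise
    returns the highest-scoring candidate seen within `max_iterations` merges.
    """
    current = fragments[0]
    best_text, best_score = current, score(current)
    if best_score > threshold:
        return best_text, best_score
    for following in fragments[1:1 + max_iterations]:
        current = current + " " + following
        current_score = score(current)
        if current_score > threshold:
            return current, current_score  # threshold criterion
        if current_score > best_score:
            best_text, best_score = current, current_score
    return best_text, best_score  # best of the performed iterations


print(merge_until_done(["The report was", "filed on", "Tuesday."]))
```

With the default settings the loop stops at the second merge, when the combined text first scores above the threshold.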

5. A computer program product for natural language processing, the computer program product comprising a computer readable storage medium having program code embodied therewith, the program code executable by a processor to:
- determine optimal sentence boundary placement with a received string input comprising;
  
  identify two or more preliminary sentence boundaries within the input;
  
  identify two or more first potential sentences within the input utilizing the two or more preliminary sentence boundaries;
  
  determine a confidence score for each potential sentence using a parser;
  
  assign a first score to each first potential sentence, wherein each assigned first score corresponds to a probability of each potential sentence of the two or more first potential sentences being an actual sentence; and
  
  selectively identify a grouping comprising at least two adjacent potential sentences based on a relationship to the assigned first scores;
  
  categorize each of the two adjacent sentences as one of ill-formed prose (IFP) and semi-structure entity constructs (SSECs);
  
  upon determining there are IFPs for further processing;
  
  merge the at least two adjacent first potential sentences to create a second potential sentence; and
  
  iteratively assign a second score to the created second potential sentence and merge at least one additional sentence adjacent to the created second potential sentence until there are no further IFPs to process, any SSECs are normalized, and a sentence boundary optimized output is created as a function of the iterative second score; and
  
  output the sentence boundary optimized output to replace the adjacent first and second potential sentences.
- Dependent claims (6, 7, 8, 17, 18)
  - 6. The computer program product of claim 5, wherein the merge of potential sentences includes program code to:
    - create the second potential sentence utilizing movement of at least one of the preliminary sentence boundaries.
  - 7. The computer program product of claim 6, further comprising program code to:
    - correspond the second score to a probability of the second potential sentence being an actual sentence as determined through rules-based grammar usage; and
      
      determine the assigned second score of the created second potential sentence is greater than the assigned first score of the potential sentences utilized to create the second potential sentence.
  - 8. The computer program product of claim 5, further comprising program code to:
    - determine a quantity of consecutive potential sentences in the grouping, wherein the creation of the sentence boundary optimized output utilizes the determined quantity.
  - 17. The computer program product of claim 5, wherein iterative assignment of the second score comprises a determination that the assigned score exceeds a predetermined threshold, thereby creating the sentence boundary optimized output.
  - 18. The computer program product of claim 5, wherein, iterative assignment of the second score comprises a determination that a predetermined number of iterations are performed and the iteration with the highest score is selected as the sentence boundary optimized output.

9. A method for natural language processing comprising:
- determining optimal sentence boundary placement with a received string input comprising;
  
  identifying two or more preliminary sentence boundaries within the input;
  
  identifying two or more first potential sentences within the input utilizing the two or more preliminary sentence boundaries;
  
  determine a confidence score for each potential sentence using a parser;
  
  assigning a first score to each first potential sentence, wherein each assigned first score corresponds to a probability of each potential sentence of the two or more first potential sentences being an actual sentence; and
  
  selectively identifying a grouping comprising at least two adjacent potential sentences based on a relationship to the assigned first scores;
  
  categorizing each of the two adjacent sentences as one of ill-formed prose (IFP) and semi-structure entity constructs (SSECs); and
  
  upon determining there are IFPs for further processing;
  
  merging the at least two adjacent first potential sentences to create a second potential sentence; and
  
  iteratively assigning a second score to the created second potential sentence and merging at least one additional sentence adjacent to the created second potential sentence until there are no further IFPs to process, any SSECs are normalized, and a sentence boundary optimized output is created as a function of the iterative second score; and
  
  outputting the sentence boundary optimized output to replace the adjacent first and second potential sentences.
- Dependent claims (10, 11, 12, 19, 20)
  - 10. The method of claim 9, wherein the merging of potential sentences includes:
    - creating the second potential sentence utilizing movement of at least one of the preliminary sentence boundaries.
  - 11. The method of claim 10, further comprising:
    - corresponding the second score to a probability of the second potential sentence being an actual sentence as determined through rules-based grammar usage; and
      
      determining the assigned second score of the created second potential sentence is greater than the assigned first score of the potential sentence utilized to create the second potential sentence.
  - 12. The method of claim 9, further comprising:
    - determining a quantity of consecutive potential sentences in the grouping, wherein the creating of the sentence boundary optimized output utilizes the determined quantity.
  - 19. The method of claim 9, wherein iteratively assigning a second score comprises determining that the assigned score exceeds a predetermined threshold, thereby creating the sentence boundary optimized output.
  - 20. The method of claim 9, wherein iteratively assigning the second score comprises determining a predetermined number of iterations are performed and the iteration with the highest score is selected as the sentence boundary optimized output.
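
Claims 2, 6, and 10 describe creating the second potential sentence by moving a preliminary sentence boundary. One possible way to picture this, assuming boundaries are stored as character offsets into the input string, is shown below. The representation and the function names are hypothetical, and dropping a boundary is used here as the simplest case of moving it.

```python
def spans_from_boundaries(text, boundaries):
    """Cut the text at each boundary offset and return the non-empty pieces."""
    cuts = [0] + sorted(boundaries) + [len(text)]
    pieces = (text[start:end].strip() for start, end in zip(cuts, cuts[1:]))
    return [piece for piece in pieces if piece]


def merge_across(boundaries, index):
    """Drop the boundary at position `index`, joining the spans on either side."""
    ordered = sorted(boundaries)
    return ordered[:index] + ordered[index + 1:]


text = "The patient was admitted to the hospital. She left."
# Two preliminary boundaries: one mid-sentence (a bad break), one after the period.
boundaries = [len("The patient was admitted"), text.index(" She")]

print(spans_from_boundaries(text, boundaries))
print(spans_from_boundaries(text, merge_across(boundaries, 0)))
```

Removing the first boundary joins the first two spans into a single candidate sentence, while the boundary after the period is left in place.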

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Beller, Charles E.; Ding, Chengmin; Ginsberg, Allen; Shek, Elinna
Primary Examiner(s): He, Jialong
Application Number: US15/376,923
Publication Number: US 20180165270A1
Time in Patent Office: 966 Days
Field of Search: None

CPC Class Codes

G06F 40/205   Parsing

G06F 40/253   Grammatical analysis; Style...

G06F 40/289   Phrasal analysis, e.g. fini...

G06F 40/30   Semantic analysis
