Pre-processing for identifying nonsense passages in documents being ingested into a corpus of a natural language processing system

US 9,842,096 B2
Filed: 05/12/2016
Issued: 12/12/2017
Est. Priority Date: 05/12/2016
Status: Expired due to Fees

Abstract

A mechanism is provided in a data processing system for identifying nonsense passages in documents being ingested into a corpus. A natural language processing pipeline configured to execute in the data processing system receives an input document to be ingested into a corpus. The natural language processing pipeline divides the input document into a plurality of input passages. A filter component of the natural language processing pipeline identifies whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts. The natural language processing pipeline filters each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages. The natural language processing pipeline adds the filtered plurality of input passages into the corpus.
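
The flow described in the abstract (divide, annotate, count, compute a metric, compare, filter, add) can be sketched roughly as follows. This is only an illustrative sketch, not the patented implementation: the toy lexicon `TOY_LEXICON`, the function names, the choice of metric (share of tokens the toy annotator cannot tag), the blank-line passage splitting, and the 0.5 threshold are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical toy lexicon standing in for a real linguistic annotator.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "pipeline": "NOUN", "document": "NOUN", "corpus": "NOUN",
    "receives": "VERB", "adds": "VERB", "divides": "VERB",
}

def split_passages(document):
    """Divide a document into passages (assumed here: blank-line separated)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]

def annotate(passage):
    """Tag each alphabetic token; tokens missing from the lexicon get 'UNK'."""
    tokens = re.findall(r"[A-Za-z]+", passage.lower())
    return [(tok, TOY_LEXICON.get(tok, "UNK")) for tok in tokens]

def count_features(annotated):
    """Count the number of instances of each feature type."""
    return Counter(tag for _, tag in annotated)

def metric_value(counts):
    """Assumed metric: fraction of tokens with no recognized tag."""
    total = sum(counts.values())
    # A passage with no tokens at all is scored as maximally suspicious.
    return counts["UNK"] / total if total else 1.0

def is_nonsense(passage, threshold=0.5):
    """Compare the metric value to a (hypothetical) model threshold."""
    return metric_value(count_features(annotate(passage))) > threshold

def ingest(document, corpus, threshold=0.5):
    """Filter out nonsense passages and add the remainder to the corpus."""
    kept = [p for p in split_passages(document) if not is_nonsense(p, threshold)]
    corpus.extend(kept)
    return kept

corpus = []
doc = "The pipeline receives a document.\n\nxq zzv 0x1f qwrt blorp"
kept = ingest(doc, corpus)
```

In this sketch the second passage is dropped because none of its tokens are recognized, while the first is added to `corpus`. A real system would replace the lexicon lookup with a proper annotator and tune the threshold on labeled data.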

37 Citations

20 Claims

1. A method, in a data processing system, for identifying nonsense passages in documents being ingested into a corpus, the method comprising:
- receiving, by a natural language processing pipeline configured to execute in the data processing system, an input document to be ingested into a corpus;
  
  dividing, by the natural language processing pipeline, the input document into a plurality of input passages;
  
  identifying, by a filter component of the natural language processing pipeline, whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises:
  
  annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage;
  
  counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts;
  
  determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and
  
  comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold;
  
  filtering, by the natural language processing pipeline, each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and
  
  adding, by the natural language processing pipeline, the filtered plurality of input passages into the corpus.
- - 2. The method of claim 1, wherein filtering each input passage comprises removing a given input passage responsive to the given input passage being identified as a nonsense passage.
  - 3. The method of claim 1, wherein filtering each input passage comprises marking a given input passage responsive to the given input passage being identified as a nonsense passage.
  - 4. The method of claim 1, wherein annotating the input passage comprises annotating the input passage for linguistic part-of-speech features.
  - 5. The method of claim 1, wherein the metric comprises a ratio of a number of instances of a first part-of-speech to a number of instances of a second part-of-speech in the input passage.
  - 6. The method of claim 1, wherein the metric and the predetermined model threshold are defined in a profile data structure.
  - 7. The method of claim 1, further comprising:
    - responsive to receiving a candidate evidence passage for a candidate answer in a question answering system, determining whether the candidate evidence passage is a nonsense passage; and
      
      filtering the candidate evidence passage out of natural language processing responsive to determining the candidate evidence passage is a nonsense passage.
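
As a rough illustration of dependent claims 2, 3, 5, and 6, the sketch below keeps the metric definition and threshold in a small profile structure, computes a part-of-speech ratio, and either removes or marks flagged passages. The `Profile` class, its field names, the tag labels, the threshold value, and the "ratio above threshold means nonsense" direction are all assumptions for the example; the claims do not specify them.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    """Hypothetical profile data structure holding the metric and threshold."""
    first_pos: str          # numerator part-of-speech
    second_pos: str         # denominator part-of-speech
    threshold: float        # predetermined model threshold
    action: str = "remove"  # "remove" or "mark" flagged passages

def ratio_metric(counts, profile):
    """Ratio of first-POS instances to second-POS instances."""
    denom = counts[profile.second_pos]
    if denom == 0:
        # Assumption: an undefined ratio is infinite when the numerator is
        # non-zero, and 0.0 when both counts are zero.
        return float("inf") if counts[profile.first_pos] else 0.0
    return counts[profile.first_pos] / denom

def apply_filter(tagged_passages, profile):
    """Remove or mark passages whose ratio exceeds the profile threshold.

    `tagged_passages` is a list of (text, tags) pairs that an annotator
    is assumed to have produced already.
    """
    result = []
    for text, tags in tagged_passages:
        nonsense = ratio_metric(Counter(tags), profile) > profile.threshold
        if nonsense and profile.action == "remove":
            continue
        result.append({"text": text, "nonsense": nonsense})
    return result

passages = [
    ("The cat sat on the mat", ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]),
    ("Smith Jones Lee Patel Kim", ["NOUN"] * 5),
]
removed = apply_filter(passages, Profile("NOUN", "VERB", 3.0, "remove"))
marked = apply_filter(passages, Profile("NOUN", "VERB", 3.0, "mark"))
```

With the "remove" action only the first passage survives; with "mark" both are kept and the second carries a `nonsense` flag, which leaves the decision to a later stage such as evidence scoring.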

8. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program comprises a natural language processing pipeline configured to execute on a data processing system to:
- receive an input document to be ingested into a corpus;
  
  divide the input document into a plurality of input passages;
  
  identify whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises:
  
  annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage;
  
  counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts;
  
  determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and
  
  comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold;
  
  filter each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and
  
  add the filtered plurality of input passages into the corpus.
- - 9. The computer program product of claim 8, wherein filtering each input passage comprises removing a given input passage responsive to the given input passage being identified as a nonsense passage.
  - 10. The computer program product of claim 8, wherein filtering each input passage comprises marking a given input passage responsive to the given input passage being identified as a nonsense passage.
  - 11. The computer program product of claim 8, wherein annotating the input passage comprises annotating the input passage for linguistic part-of-speech features.
  - 12. The computer program product of claim 8, wherein the metric comprises a ratio of a number of instances of a first part-of-speech to a number of instances of a second part-of-speech in the input passage.
  - 13. The computer program product of claim 8, wherein the metric and the predetermined model threshold are defined in a profile data structure.
  - 14. The computer program product of claim 8, wherein the natural language processing pipeline further causes the data processing system to:
    - responsive to receiving a candidate evidence passage for a candidate answer in a question answering system, determine whether the candidate evidence passage is a nonsense passage; and
      
      filter the candidate evidence passage out of natural language processing responsive to determining the candidate evidence passage is a nonsense passage.

15. An apparatus comprising:
- a processor, and a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to:
  
  receive, by a natural language processing pipeline configured to execute in the data processing system, an input document to be ingested into a corpus;
  
  divide, by the natural language processing pipeline, the input document into a plurality of input passages;
  
  identify, by a filter component of the natural language processing pipeline, whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises:
  
  annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage;
  
  counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts;
  
  determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and
  
  comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold;
  
  filter, by the natural language processing pipeline, each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and
  
  add, by the natural language processing pipeline, the filtered plurality of input passages into the corpus.
- - 16. The apparatus of claim 15, wherein filtering each input passage comprises removing a given input passage responsive to the given input passage being identified as a nonsense passage.
  - 17. The apparatus of claim 15, wherein filtering each input passage comprises marking a given input passage responsive to the given input passage being identified as a nonsense passage.
  - 18. The apparatus of claim 15, wherein annotating the input passage comprises annotating the input passage for linguistic part-of-speech features.
  - 19. The apparatus of claim 15, wherein the metric comprises a ratio of a number of instances of a first part-of-speech to a number of instances of a second part-of-speech in the input passage.
  - 20. The apparatus of claim 15, wherein the metric and the predetermined model threshold are defined in a profile data structure.

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Beller, Charles E.; Drzewucki, Michael; Phipps, Christopher; Summers, Kristen M.; Yu, Julie T.
Primary Examiner(s): RIES, LAURIE ANNE
Application Number: US15/152,826
Publication Number: US 20170329754A1
Time in Patent Office: 579 Days

CPC Class Codes:
G06F 16/313   Selection or weighting of t...
G06F 16/3329   Natural language query form...
G06F 16/367   Ontology
G06F 40/169   Annotation, e.g. comment da...
G06F 40/268   Morphological analysis
G06F 40/30   Semantic analysis
G06F 40/40   Processing or translation o...
