Pre-processing for identifying nonsense passages in documents being ingested into a corpus of a natural language processing system
Abstract
A mechanism is provided in a data processing system for identifying nonsense passages in documents being ingested into a corpus. A natural language processing pipeline configured to execute in the data processing system receives an input document to be ingested into a corpus. The natural language processing pipeline divides the input document into a plurality of input passages. A filter component of the natural language processing pipeline identifies whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts. The natural language processing pipeline filters each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages. The natural language processing pipeline adds the filtered plurality of input passages into the corpus.
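The flow summarized in the abstract (receive a document, divide it into passages, filter, add to the corpus) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the patent: `split_into_passages` assumes passages are separated by blank lines, the corpus is modeled as a plain list, `no_letters` is only a stand-in predicate, and the sketch assumes that passages identified as nonsense are the ones dropped.

```python
from typing import Callable, List


def split_into_passages(document: str) -> List[str]:
    """Divide a document into passages.

    Assumption: passages are separated by blank lines. The abstract does
    not say how the division is performed.
    """
    blocks = document.split("\n\n")
    return [block.strip() for block in blocks if block.strip()]


def ingest(
    document: str,
    corpus: List[str],
    is_nonsense: Callable[[str], bool],
) -> List[str]:
    """Divide, filter, and add the surviving passages to the corpus."""
    passages = split_into_passages(document)
    # Assumption: "filtering" means dropping passages flagged as nonsense.
    kept = [p for p in passages if not is_nonsense(p)]
    corpus.extend(kept)
    return kept


def no_letters(passage: str) -> bool:
    """Stand-in predicate (hypothetical): flag passages with no letters."""
    return not any(ch.isalpha() for ch in passage)


corpus: List[str] = []
doc = (
    "The engine converts fuel into motion.\n\n"
    "#### 4411 ---- 0000\n\n"
    "Regular maintenance extends its life."
)
kept = ingest(doc, corpus, no_letters)
```

The predicate is passed in as a parameter so that the filter component can be swapped without touching the dividing or adding steps.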
20 Claims
1. A method, in a data processing system, for identifying nonsense passages in documents being ingested into a corpus, the method comprising:
receiving, by a natural language processing pipeline configured to execute in the data processing system, an input document to be ingested into a corpus;
dividing, by the natural language processing pipeline, the input document into a plurality of input passages;
identifying, by a filter component of the natural language processing pipeline, whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises:
annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage;
counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts;
determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and
comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold;
filtering, by the natural language processing pipeline, each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and
adding, by the natural language processing pipeline, the filtered plurality of input passages into the corpus.
Dependent claims: 2, 3, 4, 5, 6, 7
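The four identification sub-steps recited in claim 1 (annotating, counting, determining a metric value, comparing to a threshold) can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the feature types (`word`, `number`, `symbol`), the metric (fraction of tokens tagged `word`), the threshold value 0.5, and the rule that a value below the threshold means nonsense are all hypothetical. The claim itself specifies none of them.

```python
import re
from collections import Counter
from typing import List, Tuple

TOKEN_RE = re.compile(r"\S+")
EDGE_PUNCT = ".,;:!?\"'()"

# Hypothetical value; the claim only calls for a predetermined model threshold.
MODEL_THRESHOLD = 0.5


def annotate(passage: str) -> List[Tuple[str, str]]:
    """Annotator: tag each whitespace-delimited token with a feature type.

    The three feature types are illustrative stand-ins for linguistic
    features; a real annotator would be far richer.
    """
    annotated = []
    for token in TOKEN_RE.findall(passage):
        if token.strip(EDGE_PUNCT).isalpha():
            feature = "word"
        elif any(ch.isdigit() for ch in token):
            feature = "number"
        else:
            feature = "symbol"
        annotated.append((token, feature))
    return annotated


def count_features(annotated: List[Tuple[str, str]]) -> Counter:
    """Metric counters: count the instances of each feature type."""
    return Counter(feature for _, feature in annotated)


def metric_value(counts: Counter) -> float:
    """Metric (assumed): fraction of tokens annotated as 'word'."""
    total = sum(counts.values())
    return counts["word"] / total if total else 0.0


def is_nonsense(passage: str, threshold: float = MODEL_THRESHOLD) -> bool:
    """Comparator: flag the passage when the metric is below the threshold.

    The direction of the comparison is an assumption.
    """
    counts = count_features(annotate(passage))
    return metric_value(counts) < threshold


prose_counts = count_features(annotate("The engine converts fuel into motion."))
noise_counts = count_features(annotate("#### 4411 ---- 0000 a1b2"))
```

With these assumptions, the prose sentence yields six `word` tokens and is kept, while the second string yields only `number` and `symbol` tokens and is flagged.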
8. A computer program product comprising a computer readable storage medium having a computer readable program stored therein, wherein the computer readable program comprises a natural language processing pipeline configured to execute on a data processing system to:
receive an input document to be ingested into a corpus;
divide the input document into a plurality of input passages;
identify whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises:
annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage;
counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts;
determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and
comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold;
filter each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and
add the filtered plurality of input passages into the corpus.
Dependent claims: 9, 10, 11, 12, 13, 14
15. An apparatus comprising:
a processor, and a memory coupled to the processor, wherein the memory comprises instructions which, when executed by the processor, cause the processor to:
receive, by a natural language processing pipeline configured to execute in the data processing system, an input document to be ingested into a corpus;
divide, by the natural language processing pipeline, the input document into a plurality of input passages;
identify, by a filter component of the natural language processing pipeline, whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises:
annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage;
counting, by metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts;
determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and
comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold;
filter, by the natural language processing pipeline, each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and
add, by the natural language processing pipeline, the filtered plurality of input passages into the corpus.
Dependent claims: 16, 17, 18, 19, 20
Specification