Pre-processing for identifying nonsense passages in documents being ingested into a corpus of a natural language processing system

  • US 9,842,096 B2
  • Filed: 05/12/2016
  • Issued: 12/12/2017
  • Est. Priority Date: 05/12/2016
  • Status: Expired due to Fees
First Claim

1. A method, in a data processing system, for identifying nonsense passages in documents being ingested into a corpus, the method comprising:

    receiving, by a natural language processing pipeline configured to execute in the data processing system, an input document to be ingested into a corpus;

    dividing, by the natural language processing pipeline, the input document into a plurality of input passages;

    identifying, by a filter component of the natural language processing pipeline, whether each input passage is a nonsense passage based on a value of a metric determined according to a set of feature counts, wherein identifying whether a given input passage is a nonsense passage comprises:

    annotating, by an annotator in the natural language processing pipeline, the given input passage within the plurality of input passages with linguistic features to form an annotated passage;

    counting, by a metric counters component in the natural language processing pipeline, a number of instances of each type of linguistic feature in the annotated passage to form a set of feature counts;

    determining, by the metric counters component of the natural language processing pipeline, a value for a metric based on the set of feature counts; and

    comparing, by a comparator component of the natural language processing pipeline, the value for the metric to a predetermined model threshold;

    filtering, by the natural language processing pipeline, each input passage in the plurality of input passages based on whether the input passage is identified as a nonsense passage or not identified as a nonsense passage to form a filtered plurality of input passages; and

    adding, by the natural language processing pipeline, the filtered plurality of input passages into the corpus.
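
The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim does not say which linguistic features are annotated, how the metric is computed, or what the threshold value is, so everything specific here is an assumption made for illustration: the feature labels, the `COMMON_WORDS` set, the metric (the share of tokens that look like ordinary words), the threshold of 0.6, and all function names.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in vocabulary; a real annotator would use a parser or tagger.
COMMON_WORDS = {"the", "a", "an", "on", "in", "of", "and", "to", "is"}


def split_passages(document):
    """Divide the input document into passages (here: blank-line separated blocks)."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def annotate(passage):
    """Annotate each token with a crude, hypothetical linguistic feature label."""
    annotated = []
    for token in passage.split():
        word = token.strip(".,;:!?()\"'").lower()
        if word in COMMON_WORDS:
            label = "function_word"
        elif word.isalpha():
            label = "word"
        elif word.isdigit():
            label = "number"
        else:
            label = "other"
        annotated.append((token, label))
    return annotated


def count_features(annotated):
    """Count the instances of each feature type to form the set of feature counts."""
    return Counter(label for _, label in annotated)


def metric_value(counts):
    """Assumed metric: fraction of tokens labelled as words or function words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["word"] + counts["function_word"]) / total


def is_nonsense(passage, threshold=0.6):
    """Compare the metric to a threshold; below it, treat the passage as nonsense."""
    return metric_value(count_features(annotate(passage))) < threshold


def ingest(document, corpus, threshold=0.6):
    """Filter the passages and add the ones not identified as nonsense to the corpus."""
    kept = [p for p in split_passages(document) if not is_nonsense(p, threshold)]
    corpus.extend(kept)
    return kept


if __name__ == "__main__":
    corpus = []
    doc = "The cat sat on the mat.\n\nx9$# 0x7f @@ 4f!! zz__1 %%"
    print(ingest(doc, corpus))
```

In this sketch the comparison direction (a low metric means nonsense) is also an assumption; the claim only requires that the metric value be compared to a predetermined model threshold.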
