Document data classification using a noise-to-content ratio

  • US 10,275,523 B1
  • Filed: 08/03/2017
  • Issued: 04/30/2019
  • Est. Priority Date: 09/13/2012
  • Status: Active Grant
First Claim

1. A method comprising:

    classifying, based on user input, a plurality of portions of each markup language document in a training set of markup language documents, wherein each portion of each markup language document is classified as one of substantive content or noise;

    applying a machine learning algorithm to the plurality of portions of each markup language document in the training set;

    determining, based on an outcome of the machine learning algorithm, a model for classifying document portions comprising a set of features of the portions that are indicative of substantive content and a set of features of the portions that are indicative of noise;

    classifying, using the model, a second plurality of portions of a markup language document that is not part of the training set, wherein each of the second plurality of portions is classified as one of substantive content or noise;

    determining a noise-to-content ratio of each of the second plurality of portions of the markup language document; and

    removing a first portion of the second plurality of portions that is classified as noise and has a higher noise-to-content ratio than other portions of the second plurality of portions.
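
The steps recited in claim 1 can be illustrated with a minimal Python sketch. The claim does not specify the machine learning algorithm, the features, or how the noise-to-content ratio is computed, so everything below is an assumption made for illustration: the feature extractor (tag names and words), the count-based "learning" step, the ratio formula (noise-indicative features divided by content-indicative features), and all function names are hypothetical and are not taken from the patent.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")

def features(portion):
    """Hypothetical features: opening tag names and visible words of a portion."""
    tags = {"tag:" + t.lower() for t in re.findall(r"<\s*([a-zA-Z0-9]+)", portion)}
    text = TAG_RE.sub(" ", portion)  # drop markup, keep visible text
    words = {"word:" + w.lower() for w in re.findall(r"[A-Za-z]+", text)}
    return tags | words

def train(labelled):
    """Build a model from (portion, label) pairs labelled by a user.

    Assumption: a simple count comparison stands in for the unspecified
    machine learning algorithm. The model is two feature sets, one
    indicative of substantive content and one indicative of noise.
    """
    counts = {"content": Counter(), "noise": Counter()}
    for portion, label in labelled:
        counts[label].update(features(portion))
    content = {f for f, c in counts["content"].items() if c > counts["noise"][f]}
    noise = {f for f, c in counts["noise"].items() if c > counts["content"][f]}
    return {"content": content, "noise": noise}

def classify(model, portion):
    """Label a portion 'noise' if it matches more noise features than content features."""
    f = features(portion)
    return "noise" if len(f & model["noise"]) > len(f & model["content"]) else "content"

def noise_to_content_ratio(model, portion):
    """Assumed ratio: noise-indicative features over content-indicative features."""
    f = features(portion)
    return len(f & model["noise"]) / max(1, len(f & model["content"]))

def remove_noisiest(model, portions):
    """Remove the noise-classified portion with the highest noise-to-content ratio."""
    noisy = [p for p in portions if classify(model, p) == "noise"]
    if not noisy:
        return list(portions)
    worst = max(noisy, key=lambda p: noise_to_content_ratio(model, p))
    return [p for p in portions if p is not worst]

# Illustrative training set (invented data, labelled by hand).
training = [
    ("<p>The court ruled on the patent appeal today</p>", "content"),
    ("<p>Researchers published the study results</p>", "content"),
    ("<div><a href='/login'>Login</a> <a href='/share'>Share</a></div>", "noise"),
    ("<div><a href='/ads'>Sponsored</a> links</div>", "noise"),
]
model = train(training)

# Portions of a document that is not part of the training set.
document = [
    "<p>The appeal court published results</p>",
    "<div><a href='/share'>Share</a> the study</div>",
    "<div><a href='/login'>Login</a></div>",
]
cleaned = remove_noisiest(model, document)
```

In this sketch only one portion is removed per call, mirroring the claim's "removing a first portion"; a real system might repeat the step or apply a threshold, which the claim text quoted here does not state.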
