Document data classification using a noise-to-content ratio
Abstract
A method and system for classifying document data are described. The method may include classifying a first portion of an electronic document as substantive content or noise, classifying a second portion of the electronic document as substantive content or noise, determining a first feature of the first portion of the electronic document indicative of substantive content using a machine learning algorithm, and determining a second feature of the second portion of the electronic document indicative of noise using the machine learning algorithm.
20 Claims
1. A method comprising:
- classifying, based on user input, a plurality of portions of each markup language document in a training set of markup language documents, wherein each portion of each markup language document is classified as one of substantive content or noise;
- applying a machine learning algorithm to the plurality of portions of each markup language document in the training set;
- determining, based on an outcome of the machine learning algorithm, a model for classifying document portions comprising a set of features of the portions that are indicative of substantive content and a set of features of the portions that are indicative of noise;
- classifying, using the model, a second plurality of portions of a markup language document that is not part of the training set, wherein each of the second plurality of portions is classified as one of substantive content or noise;
- determining a noise-to-content ratio of each of the second plurality of portions of the markup language document; and
- removing a first portion of the second plurality of portions that is classified as noise and has a higher noise-to-content ratio than other portions of the second plurality of portions.
Dependent claims: 2, 3, 4, 5
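The training and classification steps of claim 1 can be illustrated with a minimal Python sketch. The claim does not name a particular machine learning algorithm or feature set, so everything concrete here is an assumption: the `features` extractor (tag names, `class` values, and a text-length bucket), the naive Bayes-style scoring with add-one smoothing, and the function names `train`, `indicative`, and `classify` are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

LABELS = ("content", "noise")

def features(portion):
    """Hypothetical feature extractor (an assumption, not taken from the claims):
    opening-tag names, class attribute values, and a coarse text-length bucket."""
    feats = {"tag:" + t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", portion)}
    feats |= {"class:" + c for c in re.findall(r'class="([^"]+)"', portion)}
    text = re.sub(r"<[^>]+>", "", portion)
    feats.add("text:long" if len(text.split()) >= 8 else "text:short")
    return feats

def train(labeled):
    """Build a model from (portion, label) pairs labeled by a user.
    The model is per-label feature counts plus per-label portion counts."""
    feat_counts = {label: Counter() for label in LABELS}
    label_counts = Counter()
    for portion, label in labeled:
        label_counts[label] += 1
        feat_counts[label].update(features(portion))
    return feat_counts, label_counts

def indicative(model, label):
    """Features seen more often under `label` than under the other label."""
    feat_counts, _ = model
    other = "noise" if label == "content" else "content"
    return {f for f, c in feat_counts[label].items() if c > feat_counts[other][f]}

def classify(model, portion):
    """Score a new portion with a naive Bayes-style log score over the
    features present in it (add-one smoothing) and return the best label."""
    feat_counts, label_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label in LABELS:
        n = label_counts[label]
        score = math.log((n + 1) / (total + 2))
        for f in features(portion):
            score += math.log((feat_counts[label][f] + 1) / (n + 2))
        scores[label] = score
    return max(scores, key=scores.get)

# Tiny illustrative training set; the labels stand in for the user input.
training = [
    ('<div class="nav"><a href="/">Home</a> <a href="/about">About</a></div>', "noise"),
    ('<div class="footer"><a href="/terms">Terms</a></div>', "noise"),
    ("<p>The quarterly report shows that revenue grew in every region during the period.</p>", "content"),
    ("<p>Researchers describe a method for separating article text from page furniture automatically.</p>", "content"),
]
model = train(training)
print(classify(model, '<div class="nav"><a href="/news">News</a></div>'))
print(classify(model, "<p>This paragraph carries the main article body and runs for quite a few words.</p>"))
```

In this sketch the "model" is simply the pair of feature sets and their counts, which mirrors the claim language about features indicative of substantive content and features indicative of noise. A real system would likely use richer features and a trained classifier from an established library.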
6. A system comprising:
- a memory device to store an electronic document;
- a processing device operatively coupled to the memory device, the processing device to:
  - classify a first portion of the electronic document as substantive content;
  - classify a second portion of the electronic document as noise;
  - determine, using a machine learning algorithm, a first feature of the first portion of the electronic document indicative of substantive content;
  - determine, using the machine learning algorithm, a second feature of the second portion of the electronic document indicative of noise;
  - classify, using a model comprising the first feature and the second feature, a plurality of portions of a second electronic document, wherein each of the plurality of portions of the second electronic document is classified as one of substantive content or noise;
  - determine a noise-to-content ratio of each of the plurality of portions of the second electronic document; and
  - remove a first portion of the plurality of portions that is classified as noise and has a higher noise-to-content ratio than other portions of the plurality of portions.
Dependent claims: 7, 8, 9, 10, 11, 12, 13, 14
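The last two steps of claim 6 (computing a noise-to-content ratio per portion and removing the noisiest noise-classified portion) can be sketched as follows. The claims do not define how the ratio is computed. The definition assumed here, markup characters plus link-text characters divided by non-link visible text characters, is only one plausible reading, and the names `noise_to_content_ratio` and `strip_noisiest` are hypothetical.

```python
import math
import re

def _visible_chars(html):
    """Count non-whitespace characters left after stripping tags."""
    return len("".join(re.sub(r"<[^>]*>", "", html).split()))

def noise_to_content_ratio(portion):
    """ASSUMED definition (not from the claims):
    (characters inside tags + characters of link text) / characters of other visible text.
    Returns infinity when a portion has no non-link text at all."""
    markup_chars = sum(len(tag) for tag in re.findall(r"<[^>]*>", portion))
    without_links = re.sub(r"<a\b[^>]*>.*?</a>", "", portion, flags=re.S | re.I)
    content_chars = _visible_chars(without_links)
    link_chars = _visible_chars(portion) - content_chars
    noise_chars = markup_chars + link_chars
    return noise_chars / content_chars if content_chars else math.inf

def strip_noisiest(portions, labels):
    """Remove the single portion that is labeled "noise" and has the highest ratio
    among the noise-labeled portions. Portions labeled "content" are never removed."""
    noise_idx = [i for i, label in enumerate(labels) if label == "noise"]
    if not noise_idx:
        return list(portions)
    worst = max(noise_idx, key=lambda i: noise_to_content_ratio(portions[i]))
    return [p for i, p in enumerate(portions) if i != worst]

nav = '<div class="nav"><a href="/">Home</a> <a href="/about">About</a></div>'
body = "<p>The quarterly report shows that revenue grew in every region.</p>"
footer = '<div class="footer">Copyright notice for the site. <a href="/terms">Terms</a></div>'

# Labels would come from the trained model; they are hard-coded for illustration.
cleaned = strip_noisiest([nav, body, footer], ["noise", "content", "noise"])
print(cleaned)
```

Under this assumed definition, a navigation bar made up entirely of links has no non-link text, so its ratio is infinite and it is the first portion removed.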
15. A non-transitory computer readable storage medium including instructions that, when executed by a processing device, cause the processing device to:
- classify a first portion of a first electronic document as substantive content or noise;
- classify a second portion of the first electronic document as substantive content or noise;
- classify a third portion of the first electronic document as substantive content or noise;
- determine a first noise-to-content ratio for the first portion;
- determine a second noise-to-content ratio for the second portion;
- determine a third noise-to-content ratio for the third portion;
- sort a list of the first portion, the second portion, and the third portion in order from a highest noise-to-content ratio to a lowest noise-to-content ratio using a machine learning algorithm; and
- remove at least one of the first portion, the second portion, and the third portion from the first electronic document, starting with the highest noise-to-content ratio in the list.
Dependent claims: 16, 17, 18, 19, 20
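The sort-then-remove steps of claim 15 can be sketched in a few lines of Python. This sketch takes the ratios as given and uses an ordinary sort. The claim recites that the sorting uses a machine learning algorithm, which is not modeled here. The function name `prune_by_ratio` and the `max_removed` parameter are hypothetical.

```python
def prune_by_ratio(portions, ratios, max_removed=1):
    """Rank portions from highest to lowest noise-to-content ratio, then remove
    up to `max_removed` portions starting from the top of that ranking.

    Returns (kept, ranked): `kept` preserves the original document order,
    `ranked` is the full list sorted from highest ratio to lowest."""
    order = sorted(range(len(portions)), key=lambda i: ratios[i], reverse=True)
    removed = set(order[:max_removed])
    kept = [p for i, p in enumerate(portions) if i not in removed]
    ranked = [portions[i] for i in order]
    return kept, ranked

# Illustrative portions and ratios; the numbers are made up for the example.
portions = ["nav", "article", "sidebar"]
ratios = [4.0, 0.2, 1.5]

kept, ranked = prune_by_ratio(portions, ratios)
print(ranked)
print(kept)
```

Keeping `kept` in document order rather than ranked order is a design choice of the sketch, so that the remaining portions can be reassembled into a document without reordering.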
Specification