Document data classification using a noise-to-content ratio
First Claim
1. A system comprising:
at least one memory;
at least one processing device, operatively coupled to the at least one memory, to:
receive an electronic document having a first portion, a second portion, and a third portion;
determine a first content-to-noise ratio for the first portion;
determine a second content-to-noise ratio for the second portion;
determine a third content-to-noise ratio for the third portion;
sort a list of the first portion, the second portion, and the third portion in order from a highest content-to-noise ratio to a lowest content-to-noise ratio using the first content-to-noise ratio, the second content-to-noise ratio, and the third content-to-noise ratio; and
remove, from the electronic document, a predetermined number of the first portion, the second portion, and the third portion starting with the highest content-to-noise ratio in the list.
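The receive, determine, sort, and remove steps recited above can be illustrated with a minimal Python sketch. The claim does not define how a content-to-noise ratio is computed, so `content_to_noise_ratio` below is a hypothetical stand-in (alphabetic characters counted as content, other non-space characters counted as noise), and `remove_top_portions` is an illustrative name rather than anything taken from the patent. The sketch follows the claim wording literally: removal starts at the head of the list sorted from highest to lowest ratio.

```python
from typing import Callable, List, Tuple


def content_to_noise_ratio(portion: str) -> float:
    """Hypothetical metric, not defined by the claim.

    Content = alphabetic characters; noise = other non-space characters.
    """
    content = sum(ch.isalpha() for ch in portion)
    noise = sum((not ch.isalpha()) and (not ch.isspace()) for ch in portion)
    return content / (noise + 1)  # +1 avoids division by zero


def remove_top_portions(
    portions: List[str],
    count: int,
    ratio: Callable[[str], float] = content_to_noise_ratio,
) -> Tuple[List[str], List[str]]:
    """Sort portions by ratio and remove `count` of them from the top.

    Returns (sorted_list, remaining_document), where the remaining
    portions keep their original document order.
    """
    # Rank portion indices from highest to lowest content-to-noise ratio.
    ranked = sorted(
        range(len(portions)),
        key=lambda i: ratio(portions[i]),
        reverse=True,
    )
    # Remove a predetermined number, starting with the highest ratio,
    # as the claim recites.
    to_remove = set(ranked[:count])
    remaining = [p for i, p in enumerate(portions) if i not in to_remove]
    return [portions[i] for i in ranked], remaining


if __name__ == "__main__":
    document = [
        "Plain readable sentence here",
        "<<<>>> ||| ###",
        "Some text, with: punctuation!!",
    ]
    sorted_list, remaining = remove_top_portions(document, count=1)
    print(sorted_list)
    print(remaining)
```

Passing the scoring function as a parameter keeps the sort-and-remove logic independent of whichever metrics an implementation actually uses.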
Abstract
A method and system for classifying document data is described. An exemplary method includes identifying a markup language document having a plurality of portions, determining a set of substantive content metrics and a set of noise metrics for each of the plurality of portions, calculating a noise-to-content ratio for each of the plurality of portions based on a corresponding set of substantive content metrics and a corresponding set of noise metrics, and removing noise from the markup language document using the noise-to-content ratio.
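As a rough illustration of the method summarized in the abstract, the sketch below scores markup fragments and drops the noisy ones. Everything specific in it is an assumption: the abstract does not say which substantive content metrics or noise metrics are used, so the sketch takes word count as the only content metric and the counts of start tags and `<a>` links as the noise metrics. The names `noise_to_content_ratio` and `remove_noise` and the `threshold` value are hypothetical as well.

```python
from html.parser import HTMLParser
from typing import List


class _MetricCollector(HTMLParser):
    """Collects illustrative metrics from one markup fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = 0   # noise metric (assumed): number of start tags
        self.links = 0  # noise metric (assumed): number of <a> tags
        self.words = 0  # content metric (assumed): words of visible text

    def handle_starttag(self, tag, attrs):
        self.tags += 1
        if tag == "a":
            self.links += 1

    def handle_data(self, data):
        self.words += len(data.split())


def noise_to_content_ratio(fragment: str) -> float:
    """Ratio of summed noise metrics to summed content metrics."""
    collector = _MetricCollector()
    collector.feed(fragment)
    collector.close()
    noise = collector.tags + collector.links
    content = collector.words
    return noise / max(content, 1)  # guard against empty fragments


def remove_noise(fragments: List[str], threshold: float = 0.5) -> List[str]:
    """Keep fragments whose ratio does not exceed the assumed threshold."""
    return [f for f in fragments if noise_to_content_ratio(f) <= threshold]


if __name__ == "__main__":
    nav = '<ul><li><a href="/a">Home</a></li><li><a href="/b">About</a></li></ul>'
    article = (
        "<p>The quick brown fox jumps over the lazy dog "
        "near the river bank today.</p>"
    )
    print(remove_noise([nav, article]))
```

In this toy input the navigation list has many tags and links per word, so its ratio is high and it is dropped, while the paragraph of running text is kept.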
20 Claims
1. A system comprising:
at least one memory;
at least one processing device, operatively coupled to the at least one memory, to:
receive an electronic document having a first portion, a second portion, and a third portion;
determine a first content-to-noise ratio for the first portion;
determine a second content-to-noise ratio for the second portion;
determine a third content-to-noise ratio for the third portion;
sort a list of the first portion, the second portion, and the third portion in order from a highest content-to-noise ratio to a lowest content-to-noise ratio using the first content-to-noise ratio, the second content-to-noise ratio, and the third content-to-noise ratio; and
remove, from the electronic document, a predetermined number of the first portion, the second portion, and the third portion starting with the highest content-to-noise ratio in the list.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A non-transitory computer readable storage medium including instructions that, when executed by at least one processing device, cause the at least one processing device to perform operations comprising:
receiving, by the at least one processing device, an electronic document having a first portion, a second portion, and a third portion;
determining, by the at least one processing device, a first content-to-noise ratio for the first portion;
determining, by the at least one processing device, a second content-to-noise ratio for the second portion;
determining, by the at least one processing device, a third content-to-noise ratio for the third portion;
sorting a list of the first portion, the second portion, and the third portion in order from a highest content-to-noise ratio to a lowest content-to-noise ratio using the first content-to-noise ratio, the second content-to-noise ratio, and the third content-to-noise ratio; and
removing, from the electronic document, a predetermined number of the first portion, the second portion, and the third portion starting with the highest content-to-noise ratio in the list.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A method comprising:
receiving, by at least one processing device, an electronic document having a first portion, a second portion, and a third portion;
determining, by the at least one processing device, a first content-to-noise ratio for the first portion;
determining, by the at least one processing device, a second content-to-noise ratio for the second portion;
determining, by the at least one processing device, a third content-to-noise ratio for the third portion;
sorting, by the at least one processing device, a list of the first portion, the second portion, and the third portion in order from a highest content-to-noise ratio to a lowest content-to-noise ratio using the first content-to-noise ratio, the second content-to-noise ratio, and the third content-to-noise ratio; and
removing, from the electronic document, a predetermined number of the first portion, the second portion, and the third portion starting with the highest content-to-noise ratio in the list.
Dependent claims: 16, 17, 18, 19, 20
Specification