Document data classification using a noise-to-content ratio

US 10,275,523 B1
Filed: 08/03/2017
Issued: 04/30/2019
Est. Priority Date: 09/13/2012
Status: Active Grant

Abstract

A method and system for classifying document data is described. The method may include classifying a first portion of an electronic document as substantive content or noise, classifying a second portion of the electronic document as substantive content or noise, determining a first feature of the first portion of the electronic document indicative of substantive content using a machine learning algorithm, and determining a second feature of the second portion of the electronic document indicative of noise using the machine learning algorithm.

20 Claims

1. A method comprising:
- classifying, based on user input, a plurality of portions of each markup language document in a training set of markup language documents, wherein each portion of each markup language document is classified as one of substantive content or noise;
  
  applying a machine learning algorithm to the plurality of portions of each markup language document in the training set;
  
  determining, based on an outcome of the machine learning algorithm, a model for classifying document portions comprising a set of features of the portions that are indicative of substantive content and a set of features of the portions that are indicative of noise;
  
  classifying, using the model, a second plurality of portions of a markup language document that is not part of the training set, wherein each of the second plurality of portions is classified as one of substantive content or noise;
  
  determining a noise-to-content ratio of each of the second plurality of portions of the markup language document; and
  
  removing a first portion of the second plurality of portions that is classified as noise and has a higher noise-to-content ratio than other portions of the second plurality of portions.
  - 2. The method of claim 1, wherein each portion of each markup language document is a node or a sub-tree in a document object model (DOM) tree of a respective markup language document, the sub-tree having a parent node and one or more child nodes, and wherein applying the machine learning algorithm comprises applying the machine learning algorithm to a respective node or sub-tree in the DOM tree of the respective markup language document.
  - 3. The method of claim 1, further comprising:
    - selecting, from the set of features that are indicative of noise, a subset of features that are significant for noise classification, wherein each feature in the subset of features that are significant for noise classification was encountered in at least a threshold number of markup language document portions classified as substantive content; and
      
      selecting, from the set of features that are indicative of substantive content, a subset of features that are significant for substantive content classification, wherein each feature in the subset of features that are significant for substantive content classification was encountered in at least a threshold number of markup language document portions classified as noise.
  - 4. The method of claim 3, wherein the subset of features that are significant for noise classification comprises at least one of:
    - a node visibility during rendering of the markup document, a node comprising an advertisement, a node associated with a comment provider, a ratio of a number of link nodes in a respective portion to a number of text nodes in the respective portion, a number of characters in links within the respective portion, a number of images in the respective portion, wherein the images comprise advertisements, a number of link nodes in the respective portion, wherein the link nodes are links to nodes comprising advertisements, a number of valid containers in the respective portion, or a number of small images in the respective portion, wherein a small image is an image that has a size smaller than a threshold.
  - 5. The method of claim 3, wherein the subset of features that are significant for substantive content classification comprises at least one of:
    - a number of plain text characters in a respective portion, a number of inline tag nodes in the respective portion, a number of lines in the respective portion, a number of images in the respective portion, or a number of paragraphs in the respective portion.
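
The features named in claims 4 and 5 (characters in links, plain text characters) can be illustrated with a short sketch. The claims do not define how the noise-to-content ratio is computed, so the formula below (link characters per plain-text character) is an assumption made for illustration only, as are the names `BLOCK_TAGS`, `PortionFeatures`, `extract_portions`, and `noise_to_content_ratio`. The sketch treats each outermost block element as one portion, a simplification of the DOM node or sub-tree in claim 2.

```python
from html.parser import HTMLParser

# Assumption: these tags mark the boundaries of a "portion".
BLOCK_TAGS = {"div", "p", "ul", "nav", "section"}


class PortionFeatures(HTMLParser):
    """Collect link-text and plain-text character counts per outermost block."""

    def __init__(self):
        super().__init__()
        self.portions = []  # one feature dict per outermost block element
        self.depth = 0      # current nesting depth inside block tags
        self.in_link = 0    # current nesting depth inside <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            if self.depth == 0:
                # A new outermost block starts a new portion.
                self.portions.append({"tag": tag, "link_chars": 0, "text_chars": 0})
            self.depth += 1
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1
        elif tag == "a" and self.in_link > 0:
            self.in_link -= 1

    def handle_data(self, data):
        n = len(data.strip())  # ignore surrounding whitespace
        if self.depth > 0 and n:
            key = "link_chars" if self.in_link else "text_chars"
            self.portions[-1][key] += n


def extract_portions(html):
    """Return one feature dict per outermost block element in the markup."""
    parser = PortionFeatures()
    parser.feed(html)
    parser.close()
    return parser.portions


def noise_to_content_ratio(portion):
    # Assumed definition, not taken from the claims: link characters per
    # plain-text character. The +1 avoids division by zero.
    return portion["link_chars"] / (portion["text_chars"] + 1)


page = (
    '<nav><a href="/">Home</a> <a href="/about">About</a></nav>'
    "<p>This paragraph carries the article text.</p>"
)
for portion in extract_portions(page):
    print(portion["tag"], noise_to_content_ratio(portion))
```

Under this assumed definition, a navigation bar made entirely of links scores far higher than a paragraph of plain text, which is the ordering that the removal step of claim 1 relies on.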

6. A system comprising:
- a memory device to store an electronic document;
  
  a processing device operatively coupled to the memory device, the processing device to:
  
  classify a first portion of the electronic document as substantive content;
  
  classify a second portion of the electronic document as noise;
  
  determine, using a machine learning algorithm, a first feature of the first portion of the electronic document indicative of substantive content;
  
  determine, using the machine learning algorithm, a second feature of the second portion of the electronic document indicative of noise;
  
  classify, using a model comprising the first feature and the second feature, a plurality of portions of a second electronic document, wherein each of the plurality of portions of the second electronic document is classified as one of substantive content or noise;
  
  determine a noise-to-content ratio of each of the plurality of portions of the second electronic document; and
  
  remove a first portion of the plurality of portions that is classified as noise and has a higher noise-to-content ratio than other portions of the plurality of portions.
  - 7. The system of claim 6, wherein the machine learning algorithm is a decision tree algorithm, a random forest algorithm, or a support vector machine (SVM) algorithm.
  - 8. The system of claim 6, wherein, to determine the first feature of the first portion, the processing device is further to:
    - determine that the first portion of the electronic document is invisible during rendering of the electronic document;
      
      determine that the first portion of the electronic document refers to an advertisement provider; or
      
      determine that the first portion of the electronic document refers to a comment provider.
  - 9. The system of claim 6, wherein the processing device is further to:
    - determine the first portion includes a first threshold number of features that are the first feature; and
      
      determine the second portion includes a second threshold number of features that are the second feature.
  - 10. The system of claim 9, wherein the first threshold number is 90 percent of features indicative of substantive content are the first feature, and wherein the second threshold number is 50 percent of features indicative of noise are the second feature.
  - 11. The system of claim 6, wherein the electronic document is part of a set of training documents for classifying a plurality of features of electronic documents as substantive content or noise.
  - 12. The system of claim 6, wherein the first feature indicates a number of words in the first portion, wherein the first feature indicates a number of plain text characters in the first portion, wherein the first feature indicates a number of lines or paragraphs in the first portion, or wherein the first feature indicates an average length of lines in the first portion.
  - 13. The system of claim 6, wherein the second feature indicates the second portion is invisible when the electronic document is displayed by a display, wherein the second feature indicates the second portion is an advertisement, wherein the second feature indicates the second portion is a comment, wherein the second feature indicates the second portion is a link to text, wherein the second feature indicates the second portion is a link to plain text, wherein the second feature indicates the second portion is an image for an advertisement, wherein the second feature indicates the second portion is a link to an advertisement provider, wherein the second feature indicates the second portion is a set of images, or wherein the second feature indicates the second portion is text that is unrelated to a type of the electronic document.
  - 14. The system of claim 6, wherein the processing device is further to:
    - determine a first ranking of the first feature among a first set of features indicative of substantive content; and
      
      determine a second ranking of the second feature among a second set of features indicative of noise.
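
Claim 7 lists a decision tree among the permitted learning algorithms. As a rough illustration of how labeled portions could yield a feature that separates substantive content from noise, the sketch below fits a one-level decision tree (a decision stump) in plain Python. It is a stand-in chosen for brevity, not the patented method. The feature names `link_ratio` and `text_chars`, the training values, and the functions `train_stump` and `classify` are all hypothetical.

```python
def train_stump(samples, labels):
    """Fit a one-level decision tree on labeled portions.

    samples: list of dicts mapping a feature name to a number
    labels:  list of "noise" / "content" strings, one per sample
    Returns the feature, threshold, and label assignment with the fewest
    training errors.
    """
    best = None
    for feature in samples[0]:
        values = sorted({s[feature] for s in samples})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for above in ("noise", "content"):
                below = "content" if above == "noise" else "noise"
                errors = sum(
                    (above if s[feature] > threshold else below) != y
                    for s, y in zip(samples, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, above, below)
    if best is None:
        raise ValueError("need at least two distinct values for some feature")
    _, feature, threshold, above, below = best
    return {"feature": feature, "threshold": threshold, "above": above, "below": below}


def classify(model, sample):
    """Label a portion from another document using the fitted stump."""
    if sample[model["feature"]] > model["threshold"]:
        return model["above"]
    return model["below"]


# Hypothetical training portions, labeled by a user.
training = [
    {"link_ratio": 0.9, "text_chars": 40},
    {"link_ratio": 0.8, "text_chars": 25},
    {"link_ratio": 0.1, "text_chars": 900},
    {"link_ratio": 0.0, "text_chars": 30},
]
training_labels = ["noise", "noise", "content", "content"]

model = train_stump(training, training_labels)
print(model["feature"], classify(model, {"link_ratio": 0.7, "text_chars": 10}))
```

On this toy data the link ratio separates the two classes without error, while text length alone misclassifies the short content paragraph, so the stump selects the link ratio.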

15. A non-transitory computer readable storage medium including instructions that, when executed by a processing device, cause the processing device to:
- classify a first portion of a first electronic document as substantive content or noise;
  
  classify a second portion of the first electronic document as substantive content or noise;
  
  classify a third portion of the first electronic document as substantive content or noise;
  
  determine a first noise-to-content ratio for the first portion;
  
  determine a second noise-to-content ratio for the second portion;
  
  determine a third noise-to-content ratio for the third portion;
  
  sort a list of the first portion, the second portion, and the third portion in order from a highest noise-to-content ratio to a lowest noise-to-content ratio using a machine learning algorithm; and
  
  remove at least one of the first portion, the second portion, and the third portion from the first electronic document, starting with the highest noise-to-content ratio in the list.
  - 16. The non-transitory computer readable storage medium of claim 15, wherein the processing device is further to:
    - determine a first set of features of the first portion of the first electronic document indicative of substantive content using the machine learning algorithm; and
      
      determine a second set of features of the second portion of the first electronic document indicative of noise using the machine learning algorithm.
  - 17. The non-transitory computer readable storage medium of claim 16 wherein the processing device is further to:
    - generate, in view of the first set of features, a first decision tree predicting a first combination of features that indicates whether a second electronic document includes substantive content; and
      
      generate, in view of the second set of features, a second decision tree predicting a second combination of features that indicates whether the second electronic document includes noise.
  - 18. The non-transitory computer readable storage medium of claim 15, wherein the machine learning algorithm is a decision tree algorithm, a random forest algorithm, or a support vector machine (SVM) algorithm.
  - 19. The non-transitory computer readable storage medium of claim 15, wherein the first portion and the second portion are stored as nodes within a document object model (DOM) tree.
  - 20. The non-transitory computer readable storage medium of claim 15, wherein the processing device is further to remove one or more parts of the second portion that are indicative of noise.
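
The ordering and removal steps of claim 15 can be sketched as follows. The claim does not say when removal stops, so the `cutoff` parameter is an assumption, as are the function name `prune_noisiest` and the sample ratios. The claim also recites sorting "using a machine learning algorithm"; this sketch uses an ordinary sort instead.

```python
def prune_noisiest(portions, cutoff=1.0):
    """Remove portions in order of decreasing noise-to-content ratio.

    portions: list of (name, ratio) pairs
    cutoff:   assumed stopping rule; removal ends at the first portion
              whose ratio is not above this value
    Returns (kept, removed), with kept in original document order and
    removed in order of removal.
    """
    ranked = sorted(portions, key=lambda p: p[1], reverse=True)
    removed = []
    for name, ratio in ranked:
        if ratio <= cutoff:
            break  # everything after this point is less noisy
        removed.append(name)
    kept = [name for name, _ in portions if name not in removed]
    return kept, removed


# Hypothetical portions with precomputed ratios.
kept, removed = prune_noisiest([("article", 0.05), ("sidebar", 4.0), ("footer", 1.5)])
print(kept, removed)
```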

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Wolkerstorfer, Bernhard; Li, Lei; Parihar, Narendra S.
Primary Examiner(s): Hong, Stephen S
Assistant Examiner(s): Robinson, Marshon L
Application Number: US15/668,537
Time in Patent Office: 635 Days

CPC Class Codes
G06F 16/353   into predefined classes
G06F 18/24   Classification techniques
G06F 18/28   Determining representative ...
G06F 40/258   Heading extraction; Automat...
