System and method for electronic document classification
Abstract
A system and method for electronic document classification are provided. A method in accordance with an embodiment of the present invention includes: converting a candidate electronic document comprising character data to a candidate image; obtaining a representation of a degree of visual similarity of the candidate image to a reference image, the reference image having been obtained by identifying a reference electronic document containing character data representative of a specified classification; and converting the reference electronic document to a reference image.
18 Claims
1. A method of classifying electronic documents, comprising:
converting a hypertext markup language (HTML) candidate electronic document comprising character data to a single candidate image, the converting including extracting a body section of the HTML candidate electronic document and converting the entire body section of the HTML candidate electronic document into the single candidate image;
scaling the entire single candidate image to a size substantially smaller than an original size of the candidate image to provide a single scaled candidate image;
obtaining a representation of a degree of visual similarity of the entire single scaled candidate image to a reference image by performing a single comparison of the entire single scaled candidate image to the entire reference image, the reference image having been obtained by identifying a reference electronic document containing character data representative of a specified classification;
automatically classifying the candidate electronic document under the specified classification when the degree of visual similarity exceeds a predetermined threshold and, in response to the degree of visual similarity exceeding the predetermined threshold, converting the reference electronic document to a reference image; and
determining an efficiency of the classifying by comparing a number of candidate electronic documents that are automatically classified under the specified classification to a number of candidate electronic documents that a user classifies under the specified classification.
Dependent claims: 2, 3, 4, 5, 6, 7, 8.
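The scaling, single-comparison, and threshold steps recited in claim 1 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the patent: images are modeled as 2D lists of grayscale values in [0, 255], scaling is done by block averaging, similarity is one minus the normalized mean absolute pixel difference, the reference image is assumed to already have the scaled dimensions, and the names `downscale`, `visual_similarity`, and `classify` are hypothetical.

```python
from typing import List

Image = List[List[float]]  # grayscale pixels in [0, 255], row-major (assumed model)


def downscale(image: Image, factor: int) -> Image:
    """Shrink an image by averaging each factor x factor block of pixels."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    return [
        [
            sum(
                image[r * factor + dr][c * factor + dc]
                for dr in range(factor)
                for dc in range(factor)
            ) / factor ** 2
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def visual_similarity(a: Image, b: Image) -> float:
    """Return 1.0 for identical images and 0.0 for maximally different ones."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("images must have the same dimensions")
    total = sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return 1.0 - total / (255.0 * len(a) * len(a[0]))


def classify(candidate: Image, reference: Image, factor: int, threshold: float) -> bool:
    """Scale the whole candidate once, compare it once, and apply the threshold."""
    scaled = downscale(candidate, factor)
    return visual_similarity(scaled, reference) > threshold


# Hypothetical 4x4 candidate and 2x2 reference.
reference = [[0.0, 255.0], [255.0, 0.0]]
candidate = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
print(classify(candidate, reference, factor=2, threshold=0.9))
```

Comparing at the reduced size keeps the comparison cheap and makes it insensitive to small pixel-level differences; a different similarity measure could be substituted without changing the overall flow.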
9. A computer program product loaded on a non-transitory computer readable medium, which when executed, classifies electronic documents, comprising program code for:
converting a hypertext markup language (HTML) candidate electronic document comprising character data to a single candidate image, the converting including extracting a body section of the HTML candidate electronic document and converting the entire body section of the HTML candidate electronic document into the single candidate image;
scaling the entire single candidate image to a size substantially smaller than an original size of the candidate image to provide a single scaled candidate image;
obtaining a representation of a degree of visual similarity of the single entire scaled candidate image to a reference image by performing a single comparison of the entire single scaled candidate image to the entire reference image, the reference image having been obtained by identifying a reference electronic document containing character data representative of a specified classification;
automatically classifying the candidate electronic document under the specified classification when the degree of visual similarity exceeds a predetermined threshold and, in response to the degree of visual similarity exceeding the predetermined threshold, converting the reference electronic document to a reference image; and
determining an efficiency of the classifying by comparing a number of candidate electronic documents that are automatically classified under the specified classification to a number of candidate electronic documents that a user classifies under the specified classification.
Dependent claims: 10.
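The body-extraction part of the converting step recited in claim 9 might look like the following sketch. It uses Python's standard `html.parser` module to collect the markup between `<body>` and `</body>`; the class name `BodyExtractor` and the function `extract_body` are hypothetical, and the claims do not say how extraction is performed. Rendering the extracted body to an image requires an HTML rendering engine and is deliberately left out.

```python
from html.parser import HTMLParser


class BodyExtractor(HTMLParser):
    """Collect the markup found between <body> and </body>."""

    def __init__(self):
        # Keep character references unconverted so they can be re-emitted as written.
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.in_body:
            self.parts.append(f"&#{name};")


def extract_body(html: str) -> str:
    """Return the inner markup of the document's body section."""
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical input document.
page = '<html><head><title>T</title></head><body><p class="x">Hi &amp; bye</p><br/></body></html>'
print(extract_body(page))
```

Restricting the conversion to the body section leaves head content such as the title and metadata out of the rendered image, so only the visible document content contributes to the comparison.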
11. A computer-implemented method for classifying electronic documents comprising:
converting a hypertext markup language (HTML) candidate electronic document comprising character data to a single candidate image, the converting including extracting a body section of the HTML candidate electronic document and converting the entire body section of the HTML candidate electronic document into the single candidate image;
scaling the entire single candidate image to a size substantially smaller than an original size of the candidate image to provide a single scaled candidate image;
obtaining a representation of a degree of visual similarity of the single entire scaled candidate image to a reference image by performing a single comparison of the entire single scaled candidate image to the entire reference image, the reference image having been obtained by identifying a reference electronic document containing character data representative of a specified classification;
automatically classifying the candidate electronic document under the specified classification when the degree of visual similarity exceeds a predetermined threshold and, in response to the degree of visual similarity exceeding the predetermined threshold, converting the reference electronic document to a reference image; and
determining an efficiency of the classifying by comparing a number of candidate electronic documents that are automatically classified under the specified classification to a number of candidate electronic documents that a user classifies under the specified classification.
Dependent claims: 12, 13, 14, 15, 16, 17, 18.
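The efficiency-determining step recited in claim 11 compares two counts without specifying the form of the comparison. One possible reading, shown here purely as an assumption, is a ratio of the automatically classified count to the user-classified count; the function name `classification_efficiency` is hypothetical.

```python
def classification_efficiency(auto_classified: int, user_classified: int) -> float:
    """Ratio of automatically classified documents to user-classified documents.

    The ratio is an assumed form of the comparison; the claim only requires
    that the two counts be compared.
    """
    if auto_classified < 0 or user_classified < 0:
        raise ValueError("counts must be non-negative")
    if user_classified == 0:
        raise ValueError("no user-classified documents to compare against")
    return auto_classified / user_classified


# Hypothetical counts: 3 documents classified automatically, 4 by the user.
print(classification_efficiency(3, 4))
```

Under this reading, a value near 1.0 would mean the automatic classifier places about as many documents under the classification as the user does; other comparisons, such as a difference of the counts, would fit the claim language equally well.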
Specification