Method and system for using OCR data for grouping and classifying documents
Abstract
A document template for classifying documents is created for each document class. The document template includes a set of keywords and the spatial relations of the keywords. A document to be classified is received. The spatial relations of the template keywords of a template are compared with the spatial relations of corresponding words in the document. If the spatial relations are the same, the document may be classified in the document class of the template.
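The comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say how a spatial relation is encoded, so everything here is an assumption made for illustration: words are represented by bounding boxes `(x, y, width, height)` with the origin at the top-left, a relation is reduced to a coarse left/right and above/below label between box centers, and the helper names `relation` and `same_spatial_relations` are hypothetical.

```python
from itertools import combinations


def relation(a, b):
    """Coarse relation of box b relative to box a (hypothetical encoding).

    Boxes are (x, y, width, height) tuples with the origin at the top-left.
    """
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2  # center of a
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2  # center of b
    horiz = "right" if bx > ax else "left" if bx < ax else "same"
    vert = "below" if by > ay else "above" if by < ay else "same"
    return horiz, vert


def same_spatial_relations(template_boxes, document_boxes):
    """Return True if every keyword pair has the same relation in both layouts.

    Both arguments map keyword text to a bounding box. A keyword missing
    from the document counts as a mismatch (an assumption of this sketch).
    """
    if not all(k in document_boxes for k in template_boxes):
        return False
    return all(
        relation(template_boxes[k1], template_boxes[k2])
        == relation(document_boxes[k1], document_boxes[k2])
        for k1, k2 in combinations(sorted(template_boxes), 2)
    )
```

Under this encoding, a document whose keywords are all shifted by the same offset still matches its template, while a document in which two keywords swap vertical order does not.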
18 Claims
1. A system for classifying digitized documents, the system comprising:
- a processor-based document management system executed on a computer system and configured to:
create and store a plurality of templates associated with a plurality of document classes, each template comprising a plurality of keywords;
receive a digitized document to be classified;
compare each template with the digitized document to be classified, wherein the comparison comprises:
comparing a first area value associated with a template with a second area value associated with the digitized document, the first area value associated with a keyword indicating an area occupied by the keyword in the template, and the second area value that indicates an area occupied by a word in the digitized document to be classified;
determine that a difference between the first and second area values is below a threshold value; and
upon the determination that a difference is below a threshold value, identify the keyword as being a keyword for a word pair, and identify the word in the digitized document to be classified as being a corresponding word for the word pair.
Dependent claims: 2, 3, 4, 5, 6, 7
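The area comparison recited in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the area of a word is taken to be the width times the height of its bounding box, the difference is taken as an absolute value, and the names `area` and `find_word_pairs` are hypothetical.

```python
def area(box):
    """Area of a bounding box given as (x, y, width, height)."""
    _, _, width, height = box
    return width * height


def find_word_pairs(template_keywords, document_words, threshold):
    """Pair template keywords with document words of similar area.

    Both inputs are lists of (text, box) tuples. A keyword and a word form
    a word pair when the absolute difference between their areas is below
    the threshold (using the absolute value is an assumption of this sketch).
    """
    pairs = []
    for keyword_text, keyword_box in template_keywords:
        first_area = area(keyword_box)       # area occupied in the template
        for word_text, word_box in document_words:
            second_area = area(word_box)     # area occupied in the document
            if abs(first_area - second_area) < threshold:
                pairs.append((keyword_text, word_text))
    return pairs
```

As in the claim, the pairing in this sketch depends on area alone; a practical system would presumably combine it with further checks such as the recognized text.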
8. A method comprising:
- creating and storing a plurality of templates associated with a plurality of document classes, each template comprising a plurality of keywords;
receiving a digitized document to be classified;
comparing each template with the digitized document to be classified, wherein the comparison comprises:
comparing a first area value associated with a template with a second area value associated with the digitized document, the first area value associated with a keyword indicating an area occupied by the keyword in the template, and the second area value that indicates an area occupied by a word in the digitized document to be classified;
determine that a difference between the first and second area values is below a threshold value; and
upon the determination that a difference is below a threshold value, identify the keyword as being a keyword for a word pair, and identify the word in the digitized document to be classified as being a corresponding word for the word pair.
Dependent claims: 9, 10, 11, 12, 13, 14
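The outer steps of the method (storing one template per document class, receiving a document, and comparing each template with it) might be organized as in the sketch below. The claim does not state how a class is finally chosen, so picking the template with the most word pairs is purely an assumption of this example, as are the names `Template`, `TemplateStore`, `count_word_pairs`, and `classify`.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    document_class: str
    keyword_areas: dict  # keyword text -> area occupied in the template


@dataclass
class TemplateStore:
    templates: list = field(default_factory=list)

    def add(self, template):
        self.templates.append(template)


def count_word_pairs(template, word_areas, threshold):
    """Count keywords whose area is within the threshold of some document word."""
    return sum(
        any(abs(keyword_area - word_area) < threshold for word_area in word_areas)
        for keyword_area in template.keyword_areas.values()
    )


def classify(store, word_areas, threshold):
    """Compare every stored template with the document and pick a class.

    Assumption: the class of the template with the most word pairs wins;
    None is returned when no template produces any word pair.
    """
    best_class, best_count = None, 0
    for template in store.templates:
        count = count_word_pairs(template, word_areas, threshold)
        if count > best_count:
            best_class, best_count = template.document_class, count
    return best_class
```

A call such as `classify(store, [725, 470, 100], threshold=20)` takes the areas of the words found in the received document and returns the name of the best-matching class, if any.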
15. A computer program product, comprising a non-transitory computer-readable medium having a computer-readable program code embodied therein, the computer-readable program code adapted to be executed by one or more processors, the program code including instructions to:
- create and store a plurality of templates associated with a plurality of document classes, each template comprising a plurality of keywords;
receive a digitized document to be classified;
compare each template with the digitized document to be classified, wherein the comparison comprises:
comparing a first area value associated with a template with a second area value associated with the digitized document, the first area value associated with a keyword indicating an area occupied by the keyword in the template, and the second area value that indicates an area occupied by a word in the digitized document to be classified;
determine that a difference between the first and second area values is below a threshold value; and
upon the determination that a difference is below a threshold value, identify the keyword as being a keyword for a word pair, and identify the word in the digitized document to be classified as being a corresponding word for the word pair.
Dependent claims: 16, 17, 18
Specification