System and method for identifying document genres
Abstract
A system, a computer readable storage medium including instructions, and method for generating genre models used to identify genres of a document. For each document image in a set of document images that are associated with one or more genres, the document image is segmented into a plurality of tiles, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable, and features of the document image and the plurality of tiles are computed. At least one genre classifier is trained to classify document images as being associated with one or more genres based on the features of the document images in the set of document images, the features of the plurality of tiles of the set of documents images, and the one or more genres associated with each document image in the set of documents images.
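The segmentation and feature computation described in the abstract can be illustrated with a minimal sketch. The grid size, the pixel representation (a 2-D list of grayscale values), and the two features used here (mean intensity and fraction of dark pixels) are assumptions chosen for illustration; the text above does not specify them.

```python
from statistics import mean

def segment_into_tiles(image, rows, cols):
    """Split a 2-D list of pixel values into rows x cols rectangular tiles."""
    height, width = len(image), len(image[0])
    tiles = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            tiles.append([row[left:right] for row in image[top:bottom]])
    return tiles

def region_features(region, ink_threshold=128):
    """Hypothetical features: mean intensity and fraction of dark ("ink") pixels."""
    pixels = [p for row in region for p in row]
    ink = sum(1 for p in pixels if p < ink_threshold) / len(pixels)
    return [mean(pixels), ink]

def document_features(image, rows=2, cols=2):
    """Compute features of the whole page and of each tile."""
    page = region_features(image)
    tiles = [region_features(t) for t in segment_into_tiles(image, rows, cols)]
    return page, tiles

# Toy 4x4 "page": dark top-left quadrant, white elsewhere.
image = [[0, 0, 255, 255],
         [0, 0, 255, 255],
         [255, 255, 255, 255],
         [255, 255, 255, 255]]
page_feats, tile_feats = document_features(image)
```

In a real system the tile size would be chosen so that page-level structures (columns, headers, tables) remain recognizable within a tile, as the abstract requires; the 2 x 2 grid here only keeps the example small.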
26 Citations
23 Claims
1. A computer-implemented method for generating genre models used to identify genres of a document, comprising:
on a computer system having one or more processors executing one or more programs stored on memory of the computer system;
for each document image in a set of document images that are associated with one or more genres,
  segmenting the document image into a plurality of tiles, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable; and
  computing features of the document image and the plurality of tiles; and
training at least one genre classifier to classify document images as being associated with one or more genres based on the features of the document images in the set of document images, the features of the plurality of tiles of the set of documents images, and the one or more genres associated with each document image in the set of documents images, wherein training the at least one genre classifier to classify document images as being associated with a respective genre in the one or more genres includes;
  training a first genre classifier corresponding to the respective genre based on the features of a first subset of the set of document images and the features of the plurality of tiles associated with the first subset of the set of document images;
  tuning parameters of the first genre classifier using a second subset of the set of document images, wherein the first subset and the second subset of the set of document images are mutually-exclusive sets of document images;
  training a second genre classifier corresponding to the respective genre based on the features of a second subset of the set of document images and the features of the plurality of tiles associated with the second subset of the set of document images; and
  tuning parameters of the second genre classifier using the first subset of the set of document images.
Dependent claims: 3, 4, 5, 6, 7, 8, 9, 10
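The train-and-tune swap recited in claim 1 can be sketched as follows. The classifier itself (a nearest-centroid scorer with a tunable decision threshold) is a stand-in chosen for brevity and is not taken from the claim; `GenreClassifier`, `train_and_cross_tune`, and the toy feature vectors are hypothetical names and data. Each document is assumed to be represented by one flat feature vector that already combines page and tile features.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

class GenreClassifier:
    """Toy one-genre classifier (an assumption, not the claimed model)."""
    def __init__(self):
        self.pos = None
        self.neg = None
        self.threshold = 0.0

    def train(self, vectors, labels):
        # Learn one centroid for documents in the genre and one for the rest.
        self.pos = centroid([v for v, y in zip(vectors, labels) if y])
        self.neg = centroid([v for v, y in zip(vectors, labels) if not y])

    def score(self, vector):
        # Positive when the vector is closer to the in-genre centroid.
        return sq_dist(vector, self.neg) - sq_dist(vector, self.pos)

    def tune(self, vectors, labels):
        # Pick the threshold that classifies the held-out subset best.
        scores = [self.score(v) for v in vectors]
        ordered = sorted(set(scores))
        candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])] or ordered
        def correct(t):
            return sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        self.threshold = max(candidates, key=correct)

    def predict(self, vector):
        return self.score(vector) >= self.threshold

def train_and_cross_tune(subset_a, subset_b):
    """Train on one subset, tune on the other, then swap the roles."""
    first, second = GenreClassifier(), GenreClassifier()
    first.train(*subset_a)
    first.tune(*subset_b)
    second.train(*subset_b)
    second.tune(*subset_a)
    return first, second

# Two mutually exclusive toy subsets: (feature vectors, in-genre labels).
subset_a = ([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]], [0, 1, 0, 1])
subset_b = ([[0.0, 0.3], [1.0, 0.7], [0.3, 0.0], [0.7, 1.0]], [0, 1, 0, 1])
first, second = train_and_cross_tune(subset_a, subset_b)
```

Because neither classifier is tuned on the data it was trained on, the tuned parameters are less likely to overfit the training subset, and every labeled document is still used once for training and once for tuning.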
2. A computer-implemented method for generating genre models used to identify genres of a document, comprising:
on a computer system having one or more processors executing one or more programs stored on memory of the computer system;
for each document image in a set of document images that are associated with one or more genres,
  segmenting the document image into a plurality of tiles, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable; and
  computing features of the document image and the plurality of tiles; and
training at least one genre classifier to classify document images as being associated with one or more genres based on the features of the document images in the set of document images, the features of the plurality of tiles of the set of documents images, and the one or more genres associated with each document image in the set of documents images, wherein training the at least one genre classifier to classify document images as being associated with a respective genre in the one or more genres includes;
  for each genre in at least a subset of genres associated with the document images in the set of document images,
    selecting a subset of tiles from the set of document images, wherein each tile in the subset of tiles is associated with the genre; and
    clustering tiles in the subset of tiles based on the features of the tiles; and
    generating a probability model for the genre, wherein the probability model for the genre indicates a likelihood that a respective feature of a respective tile is a member of a cluster of the genre, wherein the probability model is included in a set of probability models, each of which corresponds to a genre in the subset of genres;
  for at least a subset of document images in the set of document images, applying probability models to the subset of document images and the plurality of tiles associated with the subset of document images to produce a set of probabilities that respective document images in the subset of document images are members of one or more genres; and
  training the respective genre classifier to classify a respective document image as being associated with the respective genre based on the set of probabilities and the one or more genres associated with each document image in the subset of document images.
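The clustering and probability-model steps of claim 2 can be sketched under several assumptions that the claim does not fix: k-means as the clustering method, an isotropic Gaussian-style likelihood around the nearest cluster center, and the mean tile likelihood as the per-document probability. The function names, the `sigma` parameter, and the toy tile features are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    return [sum(col) / len(col) for col in zip(*vectors)]

def kmeans(points, k, iterations=10):
    """Plain k-means with a deterministic start (first k points) for the sketch."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers

def make_probability_model(genre_tiles, k=2, sigma=0.5):
    """Return a function giving the likelihood that a tile belongs to a cluster of the genre."""
    centers = kmeans(genre_tiles, k)
    def tile_likelihood(tile):
        d2 = min(sq_dist(tile, c) for c in centers)
        return math.exp(-d2 / (2 * sigma ** 2))
    return tile_likelihood

def document_probabilities(doc_tiles, models):
    """Apply every genre's model to a document's tiles; average per genre."""
    return {genre: sum(model(t) for t in doc_tiles) / len(doc_tiles)
            for genre, model in models.items()}

# Toy tile features grouped by the genre of the document they came from.
tiles_by_genre = {
    "invoice": [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]],
    "photo": [[5.0, 5.0], [5.1, 5.0], [6.0, 6.0], [6.1, 6.0]],
}
models = {g: make_probability_model(t) for g, t in tiles_by_genre.items()}
probs = document_probabilities([[0.05, 0.0], [1.0, 0.9]], models)
```

The resulting per-genre probabilities would then serve as the inputs on which the respective genre classifier is trained, together with the known genre labels of the training documents.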
11. A computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions to:
for each document image in a set of document images that are associated with one or more genres,
  segment the document image into a plurality of tiles, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable; and
  compute features of the document image and the plurality of tiles; and
train at least one genre classifier to classify document images as being associated with one or more genres based on the features of the document images in the set of document images, the features of the plurality of tiles of the set of documents images, and the one or more genres associated with each document image in the set of documents images, wherein the instructions to train the at least one genre classifier to classify document images as being associated with a respective genre in the one or more genres include instructions to;
  train a first genre classifier corresponding to the respective genre based on the features of a first subset of the set of document images and the features of the plurality of tiles associated with the first subset of the set of document images;
  tune parameters of the first genre classifier using a second subset of the set of document images, wherein the first subset and the second subset of the set of document images are mutually-exclusive sets of document images;
  train a second genre classifier corresponding to the respective genre based on the features of a second subset of the set of document images and the features of the plurality of tiles associated with the second subset of the set of document images; and
  tune parameters of the second genre classifier using the first subset of the set of document images.
12. A computer-implemented method for identifying genres of a document, comprising:
on a computer system having one or more processors executing one or more programs stored on memory of the computer system;
receiving a document image of the document;
segmenting the document image into a plurality of tiles of the document image, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable;
computing features of the document image and the plurality of tiles; and
identifying one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles, wherein identifying the one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles of the document image includes;
  applying a first set of genre classifiers to the features of the document image and the features of the plurality of tiles associated with the document image to produce a first set of scores, wherein the first set of genre classifiers is trained based on a first subset of training document images, and wherein parameters of the first set of genre classifiers are tuned based on a second subset of the training document images;
  applying a second set of genre classifiers to the features of the document image and the plurality of tiles associated with the document image to produce a second set of scores, wherein the second set of genre classifiers is trained based on the second subset of the training document images and wherein parameters of the second set of genre classifiers are tuned based on the first subset of the training document images;
  combining the first set of scores and the second set of scores to produce a combined set of scores; and
  identifying the one or more genres associated with the document image based on the combined set of scores.
Dependent claims: 13, 14, 15, 16, 17, 18
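The score-combination step of claim 12 can be sketched as below. The claim does not say how the two score sets are combined or how genres are selected from the combined scores; averaging per genre and keeping genres whose combined score exceeds a threshold of zero are assumptions made for illustration, as are the function names and the toy scores.

```python
def combine_scores(first_scores, second_scores):
    """Average the two classifiers' scores for each genre (assumed combination rule)."""
    return {genre: (first_scores[genre] + second_scores[genre]) / 2
            for genre in first_scores}

def identify_genres(first_scores, second_scores, threshold=0.0):
    """Return every genre whose combined score exceeds the threshold."""
    combined = combine_scores(first_scores, second_scores)
    return sorted(g for g, s in combined.items() if s > threshold)

# Hypothetical per-genre scores from the first and second sets of classifiers.
first_scores = {"invoice": 0.8, "letter": -0.4, "flyer": 0.1}
second_scores = {"invoice": 0.6, "letter": -0.2, "flyer": -0.3}
genres = identify_genres(first_scores, second_scores)
```

Here the two classifier sets disagree about "flyer"; averaging resolves the disagreement in favor of the stronger negative score, so only "invoice" is reported. A document may still be assigned several genres when more than one combined score clears the threshold.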
19. A computer readable storage medium storing one or more programs configured for execution by a computer, the one or more programs comprising instructions to:
receive a document image of the document;
segment the document image into a plurality of tiles of the document image, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable;
compute features of the document image and the plurality of tiles of the document image; and
identify one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles of the document image, wherein the instructions to identify the one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles of the document image include instructions to;
  apply a first set of genre classifiers to the features of the document image and the features of the plurality of tiles associated with the document image to produce a first set of scores, wherein the first set of genre classifiers is trained based on a first subset of training document images, and wherein parameters of the first set of genre classifiers are tuned based on a second subset of the training document images;
  apply a second set of genre classifiers to the features of the document image and the plurality of tiles associated with the document image to produce a second set of scores, wherein the second set of genre classifiers is trained based on the second subset of the training document images, and wherein parameters of the second set of genre classifiers are tuned based on the first subset of the training document images;
  combine the first set of scores and the second set of scores to produce a combined set of scores; and
  identify the one or more genres associated with the document image based on the combined set of scores.
20. An imaging system, comprising:
one or more processors;
memory; and
one or more programs stored in the memory, the one or more programs comprising instructions to;
  receive a document image of a document;
  segment the document image into a plurality of tiles of the document image, wherein the tiles in the plurality of tiles are sized so that document page features are identifiable;
  compute features of the document image and the plurality of tiles of the document image; and
  identify one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles of the document image, wherein the instructions to identify the one or more genres associated with the document image based on the features of the document image and the features of the plurality of tiles of the document image include instructions to;
    apply a first set of genre classifiers to the features of the document image and the features of the plurality of tiles associated with the document image to produce a first set of scores, wherein the first set of genre classifiers is trained based on a first subset of training document images, and wherein parameters of the first set of genre classifiers are tuned based on a second subset of the training document images;
    apply a second set of genre classifiers to the features of the document image and the plurality of tiles associated with the document image to produce a second set of scores, wherein the second set of genre classifiers is trained based on the second subset of the training document images, and wherein parameters of the second set of genre classifiers are tuned based on the first subset of the training document images;
    combine the first set of scores and the second set of scores to produce a combined set of scores; and
    identify the one or more genres associated with the document image based on the combined set of scores.
Dependent claims: 21, 22, 23
Specification