Apparatus, method, and computer program for analyzing document layout

US 7,627,176 B2
Filed: 07/05/2005
Issued: 12/01/2009
Est. Priority Date: 03/04/2005
Status: Expired due to Fees

Abstract

A document layout analysis program is provided that can extract an appropriate set of text blocks from a given document image even when the document layout is so complicated that conventional extraction methods using a single extraction condition would not work well. A plurality of different extraction conditions are stored in an extraction condition memory for use in extracting text blocks from a given document image. In accordance with those extraction conditions, a text block extractor extracts a plurality of sets of text blocks from the document image. A text block consolidator produces a consolidated set of text blocks by performing character recognition on each extracted text block, evaluating the validity of each text block based on a result of the character recognition, and selecting the most valid text blocks from among the plurality of sets of text blocks.

17 Claims

1. A computer-readable medium storing a program for analyzing layout of text on a document image to extract text blocks for character recognition purposes, the program causing a computer to function as:
- an extraction condition memory storing a plurality of extraction conditions for use in extracting text blocks from a given document image;
  
  a text block extractor to extract a first set of non-overlapping text blocks from the given document image in accordance with one of the extraction conditions stored in said extraction condition memory, the text block extractor to also extract a second set of non-overlapping text blocks from the same document image in a different way from the first set, in accordance with another of the extraction conditions; and
  
  a text block consolidator to produce a consolidated set of text blocks by performing character recognition on each text block extracted by said text block extractor, evaluating validity of each text block based on a result of the character recognition, creating a consolidation source set by finding a text block of the first set which overlaps with a text block of the second set, adding both of those text blocks to the consolidation source set, and repeating operations of finding a text block of the first and second sets which overlaps with any of the text blocks belonging to the consolidated set and adding the found text block to the consolidation source set, and selecting a most valid combination of non-overlapping text blocks from among the text blocks belonging to the consolidation source set, based on the validity of each text block that has been evaluated.
  - 2. The computer-readable medium according to claim 1, wherein said text block consolidator finds a text block more valid if the character recognition performed on that text block exhibits a higher recognition accuracy.
  - 3. The computer-readable medium according to claim 1, wherein said text block consolidator finds a text block more valid if the character recognition performed on that text block has produced text that sounds more natural from linguistic perspectives.
  - 4. The computer-readable medium according to claim 1, wherein said text block consolidator forms a plurality of combinations of non-overlapping text blocks from among the text blocks belonging to the consolidation source set, then evaluates the validity of each of the combinations, based on the result of the character recognition, and then selects text blocks belonging to one of the combinations that exhibits highest validity among others.
  - 5. The computer-readable medium according to claim 4, wherein said text block consolidator evaluates the validity of each combination in terms of a normalized sum of recognition accuracy and linguistic naturalness, the recognition accuracy representing accuracy of the result of the character recognition, and the linguistic naturalness representing naturalness of the result of the character recognition from linguistic perspectives.
  - 6. The computer-readable medium according to claim 4, wherein:
    - the validity of each of the combinations is represented in numerical form;
      
      said text block consolidator compares the validity numbers of every two combinations and gives a point to a superior combination whose validity number exceeds the other combination's validity number by a predetermined difference; and
      
      said text block consolidator selects text blocks belonging to one of the combinations that has earned a highest total point.
  - 7. The computer-readable medium according to claim 1, wherein:
    - separators are defined as blank areas on the given document image that separate one text block from another; and
      
      the extraction conditions stored in said extraction condition memory include a minimum size of the separators.
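
The consolidation recited in claims 1 and 4 can be sketched in Python as follows. This is an illustrative sketch only, not the patented implementation: the `Block` class, the rectangle overlap test, the function names, and the use of summed per-block validity as the score of a combination are all assumptions made for the example. Unlike the claim, the sketch also puts blocks that overlap nothing into single-member groups so that every block is handled uniformly.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Block:
    """Axis-aligned text block (hypothetical type for this sketch)."""
    name: str
    x0: int
    y0: int
    x1: int
    y1: int
    validity: float  # assumed to be derived from character recognition


def overlaps(a: Block, b: Block) -> bool:
    """True if the two rectangles share a region of positive area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def consolidation_source_sets(first: list[Block], second: list[Block]) -> list[set[Block]]:
    """Group blocks of both sets into clusters connected by overlap."""
    pool = list(first) + list(second)
    seen: set[Block] = set()
    groups: list[set[Block]] = []
    for seed in pool:
        if seed in seen:
            continue
        group, frontier = {seed}, [seed]
        while frontier:  # keep adding blocks that overlap any current member
            current = frontier.pop()
            for other in pool:
                if other not in group and overlaps(current, other):
                    group.add(other)
                    frontier.append(other)
        seen |= group
        groups.append(group)
    return groups


def best_combination(group: set[Block]) -> list[Block]:
    """Exhaustively pick the non-overlapping subset with the highest summed validity."""
    blocks = sorted(group, key=lambda b: b.name)
    best: list[Block] = []
    best_score = float("-inf")
    for size in range(1, len(blocks) + 1):
        for combo in combinations(blocks, size):
            if any(overlaps(a, b) for a, b in combinations(combo, 2)):
                continue  # not a valid combination: two members overlap
            score = sum(b.validity for b in combo)  # assumed scoring rule
            if score > best_score:
                best, best_score = list(combo), score
    return best


# One wide block from the first extraction versus two narrower ones from the second.
first_set = [Block("A", 0, 0, 10, 10, 0.9)]
second_set = [Block("B1", 0, 0, 10, 4, 0.6), Block("B2", 0, 5, 10, 10, 0.5)]
for source_set in consolidation_source_sets(first_set, second_set):
    print([b.name for b in best_combination(source_set)])
```

The exhaustive search grows exponentially with the size of a consolidation source set, so it is only practical here because each set contains just the handful of blocks that mutually overlap.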

8. A document layout analyzing apparatus for analyzing layout of text on a document image to extract text blocks for character recognition purposes, the apparatus comprising:
- an extraction condition memory to store a plurality of extraction conditions for use in extracting text blocks from a given document image;
  
  a text block extractor to extract a first set of non-overlapping text blocks from the given document image in accordance with one of the extraction conditions stored in said extraction condition memory, as well as extracting a second set of non-overlapping text blocks from the same document image in a different way from the first set, in accordance with another of the extraction conditions; and
  
  a text block consolidator to produce a consolidated set of text blocks by performing character recognition on each text block extracted by said text block extractor, evaluating validity of each text block based on a result of the character recognition, creating a consolidation source set by finding a text block of the first set which overlaps with a text block of the second set, adding both of those text blocks to the consolidation source set, and repeating operations of finding a text block of the first and second sets which overlaps with any of the text blocks belonging to the consolidated set and adding the found text block to the consolidation source set, and selecting a most valid combination of non-overlapping text blocks from among the text blocks belonging to the consolidation source set, based on the validity of each text block that has been evaluated.
  - 9. The document layout analyzing apparatus according to claim 8, wherein said text block consolidator finds a text block more valid if the character recognition performed on that text block exhibits a higher recognition accuracy.
  - 10. The document layout analyzing apparatus according to claim 8, wherein said text block consolidator finds a text block more valid if the character recognition performed on that text block has produced text that sounds more natural from linguistic perspectives.
  - 11. The document layout analyzing apparatus according to claim 8, wherein said text block consolidator forms a plurality of combinations of non-overlapping text blocks from among the text blocks belonging to the consolidation source set, then evaluates the validity of each of the combinations, based on the result of the character recognition, and then selects text blocks belonging to one of the combinations that exhibits highest validity among others.
  - 12. The document layout analyzing apparatus according to claim 8, wherein:
    - separators are defined as blank areas on the given document image that separate one text block from another; and
      
      the extraction conditions stored in said extraction condition memory include a minimum size of the separators.
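
To illustrate how a minimum separator size (claims 7 and 12) can act as an extraction condition, the sketch below extracts two different block sets from the same input. It is a deliberately simplified assumption, not the method of the patent: the page is reduced to a one-dimensional profile saying whether each pixel row contains ink, and the function name `extract_blocks` and the parameter `min_separator` are hypothetical.

```python
def extract_blocks(row_has_ink: list[bool], min_separator: int) -> list[tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive.

    A run of blank rows splits two blocks only if it is at least
    `min_separator` rows long; shorter gaps stay inside one block.
    """
    blocks: list[tuple[int, int]] = []
    start = None     # first inked row of the block being built
    last_ink = None  # most recent inked row seen so far
    for i, ink in enumerate(row_has_ink):
        if not ink:
            continue
        if start is None:
            start = i
        elif i - last_ink - 1 >= min_separator:
            blocks.append((start, last_ink + 1))  # gap is a separator: close the block
            start = i
        last_ink = i
    if start is not None:
        blocks.append((start, last_ink + 1))
    return blocks


# Same profile, two extraction conditions, two different block sets.
profile = [bool(v) for v in [1, 1, 0, 1, 1, 0, 0, 0, 1]]
fine_set = extract_blocks(profile, min_separator=1)
coarse_set = extract_blocks(profile, min_separator=2)
print(fine_set)
print(coarse_set)
```

A small minimum separator size tends to over-segment the page, while a large one tends to merge neighboring blocks; the two resulting sets are the kind of input the consolidator then has to reconcile.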

13. A document layout analyzing method for analyzing layout of text on a document image to extract text blocks for character recognition purposes, comprising:
- storing a plurality of extraction conditions;
  
  extracting a first set of non-overlapping text blocks from the document image in accordance with one of the stored extraction conditions;
  
  extracting a second set of non-overlapping text blocks from the same document image in a different way from the first set, in accordance with another of the extraction conditions;
  
  performing character recognition on each extracted text block of the first and second sets;
  
  evaluating validity of each text block of the first and second sets, based on a result of the character recognition;
  
  creating a consolidation source set by finding a text block of the first set which overlaps with a text block of the second set, adding both of those text blocks to the consolidation source set, and repeating operations of finding a text block of the first and second sets which overlaps with any of the text blocks belonging to the consolidated set and adding the found text block to the consolidation source set; and
  
  producing a consolidated set of text blocks by selecting a most valid combination of non-overlapping text blocks from among the text blocks belonging to the consolidation source set, based on the validity of each text block that has been evaluated.
  - 14. The document layout analyzing method according to claim 13, wherein said evaluating finds a text block more valid if the character recognition performed on that text block exhibits a higher recognition accuracy.
  - 15. The document layout analyzing method according to claim 13, wherein said evaluating finds a text block more valid if the character recognition performed on that text block has produced text that sounds more natural from linguistic perspectives.
  - 16. The document layout analyzing method according to claim 13, wherein said producing comprises:
    - forming a plurality of combinations of non-overlapping text blocks from among the text blocks belonging to the consolidation source set and then evaluating the validity of each of the combinations, based on the result of the character recognition; and
      
      selecting text blocks belonging to one of the combinations that exhibits highest validity among others.
  - 17. The document layout analyzing method according to claim 13, wherein:
    - separators are defined as blank areas on the given document image that separate one text block from another; and
      
      the extraction conditions include a minimum size of the separators.
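
The scoring of whole combinations described in claims 5, 6, and 16 can be sketched as below. The claims do not say how the sum is normalized or how ties are resolved, so several details are assumptions for illustration: accuracy and naturalness are taken to lie in [0, 1], the sum is divided by twice the number of blocks, "exceeds by a predetermined difference" is read as a difference of at least `margin`, and the function names are hypothetical.

```python
from itertools import combinations


def normalized_validity(blocks: list[tuple[float, float]]) -> float:
    """Average of (accuracy + naturalness) / 2 over the blocks of one combination.

    Each entry is an (accuracy, naturalness) pair; the normalization is an assumption.
    """
    return sum(acc + nat for acc, nat in blocks) / (2 * len(blocks))


def vote(validities: dict[str, float], margin: float) -> tuple[str, dict[str, int]]:
    """Compare every pair of combinations and award a point to a clear winner."""
    points = {name: 0 for name in validities}
    for a, b in combinations(validities, 2):
        diff = validities[a] - validities[b]
        if diff >= margin:
            points[a] += 1
        elif -diff >= margin:
            points[b] += 1
        # differences smaller than the margin award no point to either side
    winner = max(points, key=points.get)  # first-listed combination wins a tie
    return winner, points


candidates = {
    "wide": [(0.9, 0.8)],
    "split": [(0.6, 0.5), (0.7, 0.6)],
    "mixed": [(0.8, 0.8), (0.8, 0.6)],
}
scores = {name: normalized_validity(blocks) for name, blocks in candidates.items()}
winner, points = vote(scores, margin=0.05)
print(winner, points)
```

Requiring a margin means that near-equal combinations do not earn points from each other, so small differences in recognition scores do not decide the outcome on their own.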

Current Assignee: Fujitsu Limited
Original Assignee: Fujitsu Limited
Inventors: Takebe, Hiroaki; Naoi, Satoshi; Fujimoto, Katsuhito
Primary Examiner(s): Mehta; Bhavesh M
Assistant Examiner(s): VANCHY JR, MICHAEL J
Application Number: US11/175,127
Publication Number: US 20060204096A1
Time in Patent Office: 1,610 Days
Field of Search: 382/190, 382/180
US Class Current: 382/180
CPC Class Codes: G06V 30/414 Extracting the geometrical ...
