Automatic method of identifying sentence boundaries in a document image
Abstract
A method of automatically identifying sentence boundaries in a document image without performing character recognition to generate an ASCII representation of the document text. The identification process begins by selecting a connected component from the multiplicity of connected components of a text line. Next, it is determined whether the selected connected component might represent a period based upon its shape. If the selected connected component is dot shaped, then it is determined whether the selected connected component might represent a colon. Finally, if the selected connected component is dot shaped and not part of a colon, the selected connected component is labeled as a sentence boundary.
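The flow described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the `Box` type, the function names, the use of an estimated x-height, and every numeric threshold are hypothetical choices made for this example, since the abstract does not say how "dot shaped" or "part of a colon" is measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Bounding box of one connected component (hypothetical type); y grows downward."""
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height


def is_dot_shaped(box: Box, x_height: float) -> bool:
    """Assumed shape test: small relative to the x-height and roughly square."""
    if box.width <= 0 or box.height <= 0:
        return False
    small = box.width <= 0.4 * x_height and box.height <= 0.4 * x_height
    aspect = box.width / box.height
    return small and 0.5 <= aspect <= 2.0


def is_part_of_colon(box: Box, components: List[Box], x_height: float) -> bool:
    """Assumed colon test: another dot-shaped component stacked directly above or below."""
    for other in components:
        if other is box or not is_dot_shaped(other, x_height):
            continue
        overlaps_horizontally = other.left < box.right and box.left < other.right
        vertical_gap = max(other.top - box.bottom, box.top - other.bottom)
        if overlaps_horizontally and 0 < vertical_gap <= x_height:
            return True
    return False


def sentence_boundaries(components: List[Box], x_height: float) -> List[Box]:
    """Label a component as a boundary if it is dot shaped and not part of a colon."""
    return [
        c for c in components
        if is_dot_shaped(c, x_height) and not is_part_of_colon(c, components, x_height)
    ]
```

With a text line holding one letter-sized box, one isolated dot, and two vertically stacked dots, only the isolated dot is returned. The sketch covers only the period and colon tests named in the abstract, so other small marks (for example the dot of an "i") would need further checks.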
24 Claims
1. A method of identifying sentence boundaries within a document image without performing character recognition, the document image including a multiplicity of connected components, each connected component having a bounding box, the method being implemented by a processor coupled to a memory storing instructions representing the method, the method comprising the steps of:
a) selecting a connected component from the multiplicity of connected components;
b) determining whether the selected connected component might represent a period based upon a shape of the selected connected component;
c) determining whether the selected connected component might represent a colon; and
d) labeling the selected connected component as a sentence boundary if the selected connected component might be a period and is not part of a colon.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
16. A method of identifying sentence boundaries within a document image without performing character recognition, the document image including a multiplicity of connected components, each connected component having a bounding box, the method being implemented by a processor coupled to a memory storing instructions representing the method, the method comprising the steps of:
a) selecting a connected component from the multiplicity of connected components;
b) determining whether the selected connected component might be a period based upon a shape of the selected connected component;
c) determining whether the selected connected component is part of a colon;
d) determining whether the selected connected component is part of an ellipsis;
e) determining whether the selected connected component is part of an exclamation mark or a question mark;
f) determining whether the selected connected component is part of an intra-sentence abbreviation; and
g) labeling the selected connected component as a sentence boundary if the selected connected component might be a period, is not part of a colon, is not part of an ellipsis, is not part of a question mark or exclamation mark, and is not part of an intra-sentence abbreviation.
Dependent claims: 17, 18, 19, 20, 21, 22, 23
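Steps a) through g) of claim 16 amount to one shape test followed by a chain of exclusion tests. The Python sketch below shows only that control flow. The function name `label_sentence_boundaries` and the idea of passing each test in as a callable are assumptions made for illustration; the claim does not state how the colon, ellipsis, exclamation or question mark, and abbreviation tests are computed, so they are left as caller-supplied placeholders.

```python
from typing import Any, Callable, Iterable, List

Predicate = Callable[[Any], bool]


def label_sentence_boundaries(
    components: Iterable[Any],
    might_be_period: Predicate,
    exclusions: Iterable[Predicate],
) -> List[Any]:
    """Return the components labeled as sentence boundaries.

    `exclusions` holds the hypothetical tests for steps c) to f):
    part of a colon, an ellipsis, an exclamation or question mark,
    or an intra-sentence abbreviation.
    """
    exclusion_tests = list(exclusions)
    boundaries = []
    for component in components:                 # step a): select a component
        if not might_be_period(component):       # step b): shape test
            continue
        if any(test(component) for test in exclusion_tests):  # steps c) to f)
            continue
        boundaries.append(component)             # step g): label as boundary
    return boundaries
```

Keeping the tests as separate predicates mirrors the way the claim lists them as separate determinations, and lets each one be replaced without touching the labeling loop.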
24. An article of manufacture comprising:
a) a memory; and
b) instructions stored in the memory for a method of identifying sentence boundaries within a document image without performing character recognition, the document image including a multiplicity of connected components, each connected component representing a character and having a bounding box, the method being implemented by a processor coupled to the memory, the method including the steps of:
1) selecting a connected component from the multiplicity of connected components, each connected component;
2) determining whether the selected connected component might be a period based upon a shape of the selected connected component;
3) determining whether the selected connected component is part of a colon;
4) determining whether the selected connected component is part of an ellipsis;
5) determining whether the selected connected component is part of an exclamation mark or a question mark;
6) determining whether the selected connected component is part of an intra-sentence abbreviation; and
7) labeling the selected connected component as a sentence boundary if the connected component might be a period, is not part of a colon, is not part of an ellipsis, is not part of a question mark or exclamation mark, and is not part of an intra-sentence abbreviation.
Specification