Natural-language information processor with association searches limited within blocks
Abstract
A block finder operates on one-dimensional digitized text, to group together those portions which are likely to contain related subject matter, and to mark the grouped portions so that, at a later step of natural-language processing, only that block within which a word is found is searched for a related word or words. The block finder (210) extracts two-dimensional symbols, such as paragraph symbols (314), from the text, and stores the text in a grid of cells of a “two-dimensional” memory. All of the grid locations are classified as text (T) or space (white or W). Prefiltering (316) of the text deletes white spaces between sentences. Boundaries of the block are identified by examining each cell, and at least some of its adjoining cells, and identifying as an edge those boundaries between cells in which a transition between T and W occurs. This results in a list of unit-length boundary edges with top, bottom, left or right attributes. The unit-length boundary edges of each attribute are formed into longer edges identified by attribute and end points. Closed regions are formed by joining each top edge with those two left and right boundaries having identical end points, and those in turn are joined with top or bottom boundaries having common end points. The result is a closed region, which may have jagged edges. The block is simplified by determining two corner locations of a bounding box. Searching is performed for associated entities (words or phrases) within the same bounding box.
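The grid classification and unit-edge steps summarized above might be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the newline is the only two-dimensional symbol extracted, that any non-whitespace character is a text (T) cell, and that cells outside the grid count as white (W). The names `to_grid` and `unit_edges` are hypothetical.

```python
from typing import List, Tuple

# An edge is (attribute, start_point, end_point). Points are (x, y) grid
# corners, with x growing to the right and y growing downward.
Edge = Tuple[str, Tuple[int, int], Tuple[int, int]]


def to_grid(text: str) -> List[List[bool]]:
    """Lay the text out row by row; True marks a T cell, False a W cell."""
    lines = text.split("\n")  # assumption: newline is the only spatial symbol
    width = max((len(line) for line in lines), default=0)
    return [
        [c < len(line) and not line[c].isspace() for c in range(width)]
        for line in lines
    ]


def unit_edges(grid: List[List[bool]]) -> List[Edge]:
    """List a unit-length edge wherever a T cell borders a W cell."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0

    def is_text(r: int, c: int) -> bool:
        # Assumption: anything outside the grid is treated as white space.
        return 0 <= r < rows and 0 <= c < cols and grid[r][c]

    edges: List[Edge] = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c]:
                continue  # only text cells receive edge attributes
            if not is_text(r, c - 1):
                edges.append(("left", (c, r), (c, r + 1)))
            if not is_text(r, c + 1):
                edges.append(("right", (c + 1, r), (c + 1, r + 1)))
            if not is_text(r - 1, c):
                edges.append(("top", (c, r), (c + 1, r)))
            if not is_text(r + 1, c):
                edges.append(("bottom", (c, r + 1), (c + 1, r + 1)))
    return edges
```

Each text cell receives between zero and four edges, so an isolated character yields four and a cell fully surrounded by text yields none.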
9 Claims
-
1. An information extracting block finding method for determining the structure of documents represented by one-dimensional text files, said method comprising the steps of:
-
extracting from said text files at least some symbols representing two-dimensional spatial information;
using said spatial information, at least temporarily storing said text files in a memory having a two-dimensional structure defining multiple grid cells;
for at least some of said grid cells, examining at least two grid cells orthogonally adjacent to the grid cell under examination, and assigning from 0 to 4 of (a) left, (b) right, (c) top, and (d) bottom edge attributes to those boundaries between the grid cell under examination in which one of (a) the grid cell under examination includes a text symbol and the adjacent cell to the left lacks a text symbol, (b) the grid cell under examination includes a text symbol and the adjacent cell to the right lacks a text symbol, (c) the grid cell under examination includes a text symbol and the adjacent cell above the cell under examination lacks a text symbol, and (d) the grid cell under examination includes a text symbol, and the adjacent cell below the cell under examination lacks a text symbol, respectively, to thereby generate a list of cell edges, each defined by its edge attribute and its end locations;
combining cell edges having the same left, right, top, or bottom edge attributes and an identical end location, to thereby form left, right, top and bottom block edges, respectively, defined by at least one of said left, right, top, and bottom attributes, and having locations defined by their end points;
associating each top and bottom block edge with those left and right edges having common end points therewith, to form closed two-dimensional regions; and
determining the spatial coordinates of a bounding box about each of said closed two-dimensional regions.
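The combining step of claim 1 (joining cell edges of the same attribute that share an end location) could be sketched roughly as below. The edge representation, a tuple of attribute, start point and end point with (x, y) grid-corner coordinates, is an assumption made for illustration, as is the function name `merge_edges`; the claim itself does not prescribe a data layout.

```python
from collections import defaultdict


def merge_edges(edges):
    """Join same-attribute, collinear unit edges that share an end point.

    `edges` is a list of (attribute, (x0, y0), (x1, y1)) tuples
    (hypothetical layout). Returns longer block edges in the same form.
    """
    # Group spans by attribute and by the grid line they lie on.
    groups = defaultdict(list)
    for attr, (x0, y0), (x1, y1) in edges:
        if y0 == y1:  # horizontal edge (top/bottom): fixed y, spans x
            groups[(attr, "h", y0)].append((x0, x1))
        else:         # vertical edge (left/right): fixed x, spans y
            groups[(attr, "v", x0)].append((y0, y1))

    merged = []
    for (attr, orient, fixed), spans in groups.items():
        spans.sort()
        runs = []
        start, end = spans[0]
        for s, e in spans[1:]:
            if s == end:      # shares an end location: extend the run
                end = e
            else:             # gap: close the current run, start another
                runs.append((start, end))
                start, end = s, e
        runs.append((start, end))
        for s, e in runs:
            if orient == "h":
                merged.append((attr, (s, fixed), (e, fixed)))
            else:
                merged.append((attr, (fixed, s), (fixed, e)))
    return merged
```

Edges on the same grid line that do not touch stay separate, which is what allows a closed region to keep jagged outlines before the bounding box is taken.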
selecting (i) that point which represents the spatial coordinates of the intersection of (a) the projection of the topmost upper block edge of each of said closed regions with (b) the projection of the leftmost of said left block edges of said closed region, and (ii) that point which represents the spatial coordinates of the intersection of (a) the projection of the lowermost lower block edge of said closed region with (b) the projection of the rightmost of said right block edges of said closed region, to thereby define said coordinates of said bounding box.
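The two-corner selection described above can be illustrated with a short sketch. It assumes block edges are tuples of attribute, start point and end point in (x, y) grid-corner coordinates with y growing downward, so that "topmost" means smallest y; the name `bounding_box` is hypothetical.

```python
def bounding_box(block_edges):
    """Return ((left, top), (right, bottom)) for one closed region.

    `block_edges` holds (attribute, (x0, y0), (x1, y1)) tuples
    (hypothetical layout). A closed region is assumed to contain at
    least one edge of each attribute; otherwise min()/max() raise
    ValueError.
    """
    # Topmost top edge and lowermost bottom edge fix the vertical extent.
    top = min(p0[1] for attr, p0, _ in block_edges if attr == "top")
    bottom = max(p0[1] for attr, p0, _ in block_edges if attr == "bottom")
    # Leftmost left edge and rightmost right edge fix the horizontal extent.
    left = min(p0[0] for attr, p0, _ in block_edges if attr == "left")
    right = max(p0[0] for attr, p0, _ in block_edges if attr == "right")
    # The two corners are the intersections of the projected edges.
    return (left, top), (right, bottom)
```

For an L-shaped region the box extends past the jagged outline, since only the extreme edge of each attribute contributes to a corner.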
-
-
3. A method according to claim 1, further comprising the step of:
when searching in said text for an information element which is associated with a particular other information element, performing said search in that one of said bounding boxes which contains said other information element.
-
4. A method according to claim 3, wherein said step of limiting said search includes the step of limiting the search for the information element associated with a pronoun to that one of said bounding boxes including said pronoun.
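A block-limited search of the kind recited in claims 3 and 4 might look like the sketch below. The token layout (a word paired with its (x, y) cell position), the half-open treatment of the box's right and bottom sides, and the function names are all assumptions made for illustration; how candidates are then ranked is outside the sketch.

```python
def box_containing(boxes, point):
    """Return the first bounding box that contains `point`, else None.

    Boxes are ((left, top), (right, bottom)) in grid-corner coordinates,
    so a cell at (x, y) is inside when left <= x < right and
    top <= y < bottom (assumed convention).
    """
    x, y = point
    for box in boxes:
        (left, top), (right, bottom) = box
        if left <= x < right and top <= y < bottom:
            return box
    return None


def candidates_in_same_block(tokens, boxes, anchor_index):
    """List the words sharing a bounding box with the anchor token.

    `tokens` is a list of (word, (x, y)) pairs; `anchor_index` selects
    the token, for example a pronoun, whose associate is being sought.
    """
    box = box_containing(boxes, tokens[anchor_index][1])
    if box is None:
        return []
    return [
        word
        for i, (word, pos) in enumerate(tokens)
        if i != anchor_index and box_containing(boxes, pos) == box
    ]
```

With two side-by-side blocks, a pronoun in the left block would only be matched against words in the left block, even if a word in the right block sits on the same text line.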
-
5. A method according to claim 1, further comprising, before said step of examining at least two grid cells and assigning edge attributes, the step of:
prefiltering said text to eliminate spaces between sentences.
-
6. A method according to claim 5, wherein said step of prefiltering further comprises, before said step of examining the four grid cells and assigning edge attributes, the step of:
deeming to be cells occupied by text all grid cells which are occupied by space symbols and which are (a) bounded on the left by grid cells including text characters and (b) bounded on the right by grid cells including space symbols.
-
7. A method according to claim 6, further comprising the step of:
deeming to be a set of four text cells all right-left space symbol grid cell pairs bounded to the right and left by text characters.
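One literal reading of the prefiltering rules in claims 6 and 7 is sketched below for a single grid row. The interpretation is an assumption: the first rule fills a space cell that has text on its left and a space on its right, and the second treats a text, space, space, text run as four text cells. The name `prefilter_row` and the boolean row layout (True for text, False for space) are hypothetical.

```python
def prefilter_row(row):
    """Return a copy of `row` with sentence-gap spaces deemed text.

    `row` is a list of booleans, True for a text cell and False for a
    space cell. Both rules are evaluated against the original row.
    """
    out = list(row)
    n = len(row)
    # Claim 6 (as read here): space with text on the left, space on the right.
    for c in range(1, n - 1):
        if not row[c] and row[c - 1] and not row[c + 1]:
            out[c] = True
    # Claim 7 (as read here): a pair of spaces bounded by text on both
    # sides becomes, with its neighbours, a run of four text cells.
    for c in range(1, n - 2):
        if row[c - 1] and not row[c] and not row[c + 1] and row[c + 2]:
            out[c] = True
            out[c + 1] = True
    return out
```

Under this reading the two-space gap that commonly follows a sentence-ending period no longer produces left and right edges in the middle of a paragraph, while wider gaps, such as those between columns, remain white.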
-
8. A method according to claim 1, further comprising the step of reordering tokens to put multi-column text into logical reading order.
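The reordering of claim 8 could be approximated as in the following sketch, which is not taken from the specification. It assumes that blocks are read left to right by their left edge (ties broken top to bottom), that tokens within a block are read row by row, and that tokens falling outside every box are placed last; `reading_order` and the (word, (x, y)) token layout are hypothetical.

```python
def reading_order(tokens, boxes):
    """Return words reordered block by block instead of line by line.

    `tokens` is a list of (word, (x, y)) pairs in raw file order;
    `boxes` is a list of ((left, top), (right, bottom)) bounding boxes.
    """
    # Assumed block order: leftmost column first, then top to bottom.
    ranked = sorted(boxes, key=lambda b: (b[0][0], b[0][1]))

    def rank(pos):
        x, y = pos
        for i, ((left, top), (right, bottom)) in enumerate(ranked):
            if left <= x < right and top <= y < bottom:
                return i
        return len(ranked)  # tokens outside every box go last

    ordered = sorted(tokens, key=lambda t: (rank(t[1]), t[1][1], t[1][0]))
    return [word for word, _ in ordered]
```

For two-column text, raw file order interleaves the columns line by line; sorting on the block rank first restores each column as a contiguous run.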
-
9. A method according to claim 1, further comprising the step of ending sentence and paragraph structure components when an end of block structure component is reached.
Specification