Method and apparatus for identifying words described in a portable electronic document
First Claim
1. A method for identifying words in a document comprising:
(a) retrieving a text segment including its x,y position from a portable electronic document that has a page including a plurality of characters that have been identified as characters but not identified as words and a plurality of text segments and associated position data;
(b) creating a text object from each text segment and entering the text object into a linked list of text objects;
(c) identifying words from the linked list by analyzing the text object for word breaks and by analyzing a gap between the text object with a prior text object using the associated position data;
(d) adding identified words to a word list; and
(e) repeating steps (a) to (e) until the end of the page is reached.
Abstract
A method and apparatus for identifying words stored in a portable electronic document. A digital computation apparatus stores a page of a document including characters in text segments that have not been identified as words. A word identifying mechanism analyzes the text segments of the page and stores the text segments as text objects in a linked list. The word identifying mechanism identifies words from the text objects in the linked list by analyzing the text objects for word breaks and by analyzing gaps between text objects using position data associated with the text segments. The identified words are stored in a word list and are sorted if necessary. A method of the present invention receives a text segment from a page of a document having multiple text segments and associated position data, including x and y coordinates for each text segment. A text object is created for each text segment, and the text objects are entered into a linked list. Words are then identified from the linked list by analyzing the text objects for word breaks and by analyzing gaps between text objects using the associated position data. Words that are identified in the text objects are added to a word list. The above steps are repeated until the end of the page is reached. The method and apparatus can be used for searching for words in a portable electronic document.
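The steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `TextObject` fields, the `width` value, and the `gap_threshold` and `line_tolerance` parameters are hypothetical names chosen for illustration, and the sketch assumes the segments arrive in left-to-right reading order.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class TextObject:
    """One text segment plus its position; a node in a singly linked list."""
    text: str
    x: float       # left edge of the segment
    y: float       # baseline of the segment
    width: float   # rendered width (assumed to be available)
    next: Optional["TextObject"] = None


def build_linked_list(
    segments: Iterable[Tuple[str, float, float, float]]
) -> Optional[TextObject]:
    """Create a text object per segment and append it to a linked list."""
    head = tail = None
    for text, x, y, width in segments:
        node = TextObject(text, x, y, width)
        if head is None:
            head = tail = node
        else:
            tail.next = node
            tail = node
    return head


def identify_words(
    head: Optional[TextObject],
    gap_threshold: float = 2.0,   # hypothetical: largest gap still inside a word
    line_tolerance: float = 1.0,  # hypothetical: max baseline drift on one line
) -> List[str]:
    """Walk the list, splitting on whitespace and on positional gaps."""
    words: List[str] = []
    current = ""
    prev = None
    node = head
    while node is not None:
        if prev is not None:
            same_line = abs(node.y - prev.y) <= line_tolerance
            gap = node.x - (prev.x + prev.width)
            # A new line or a wide horizontal gap ends the word in progress.
            if not same_line or gap > gap_threshold:
                if current:
                    words.append(current)
                current = ""
        for ch in node.text:
            if ch.isspace():
                # An explicit word break inside the segment.
                if current:
                    words.append(current)
                current = ""
            else:
                current += ch
        prev = node
        node = node.next
    if current:
        words.append(current)
    return words


segments = [
    ("Hel", 10.0, 100.0, 15.0),
    ("lo wor", 25.0, 100.0, 30.0),
    ("ld", 55.5, 100.0, 10.0),
    ("next", 10.0, 88.0, 20.0),
]
word_list = identify_words(build_linked_list(segments))
```

In this example the word "Hello" spans two segments and "world" spans two more; they are joined because the gaps between adjacent segments are below the threshold, while the change in y position separates "next".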
231 Citations
22 Claims
1. A method for identifying words in a document comprising:
(a) retrieving a text segment including its x,y position from a portable electronic document that has a page including a plurality of characters that have been identified as characters but not identified as words and a plurality of text segments and associated position data;
(b) creating a text object from each text segment and entering the text object into a linked list of text objects;
(c) identifying words from the linked list by analyzing the text object for word breaks and by analyzing a gap between the text object with a prior text object using the associated position data;
(d) adding identified words to a word list; and
(e) repeating steps (a) to (e) until the end of the page is reached.
Dependent claims: 2, 3, 4, 5
6. A computer program product for programming a data processing apparatus to identify words on a page of a portable electronic document, comprising instructions to:
read from the portable electronic document a description of the intended appearance of a page of the document when rendered, the description including text segments having one or more characters to be rendered on the page, each text segment associated with position information on the page, the position information of the text segments defining a display order that is independent of a storage order of the text segments;
use the position information of the text segments to distinguish characters that are part of the same word from characters that are not part of the same word; and
collect characters that are part of the same word.
Dependent claims: 7, 8, 9
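Claim 6 turns on the display order being independent of the storage order. A minimal Python sketch of recovering display order from position information follows. It is an illustration under stated assumptions, not the claimed implementation: the `Segment` tuple layout and the `line_tolerance` parameter are hypothetical, and the page is assumed to use coordinates in which y grows upward.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (text, x, y); hypothetical layout


def display_order(
    segments: List[Segment], line_tolerance: float = 1.0
) -> List[Segment]:
    """Reorder segments top to bottom, then left to right within a line."""
    # Assumption: larger y is higher on the page, so sort by descending y.
    by_height = sorted(segments, key=lambda s: -s[2])

    # Group segments whose y values lie within the tolerance of the
    # first segment placed on the current line.
    lines: List[List[Segment]] = []
    for seg in by_height:
        if lines and abs(lines[-1][0][2] - seg[2]) <= line_tolerance:
            lines[-1].append(seg)
        else:
            lines.append([seg])

    # Within each line, order by x position.
    ordered: List[Segment] = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda s: s[1]))
    return ordered


# Storage order differs from the order a reader would see on the page.
stored = [("world", 40.0, 100.0), ("second", 10.0, 88.0), ("Hello", 10.0, 100.3)]
shown = [text for text, _, _ in display_order(stored)]
```

Once the segments are in display order, characters can be grouped into words by comparing the positions of neighboring segments.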
10. A system for identifying words in a page of a portable electronic document, comprising:
a data processing apparatus operable to store a page of a portable electronic document as a file including a plurality of characters that have not been identified as words, wherein each character is at least part of a text segment that has associated position information indicating where the text segment is to be displayed; and
a word identifying program implemented on the data processing apparatus for analyzing the text segments together with their position data to distinguish characters that are part of the same word from characters that are not part of the same word to create a list of words in the page.
Dependent claims: 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22
Specification