Retrieving electronic documents by converting them to synthetic text
Abstract
The present invention relies on the two-dimensional information in documents and encodes two-dimensional structures into a one-dimensional synthetic language such that two-dimensional documents can be searched at text search speed. The system comprises an indexing module, a retrieval module, an encoder, a quantization module, a retrieval engine and a control module coupled by a bus. A number of electronic documents are first indexed by the indexing module and stored as a synthetic text library. The retrieval module then converts an input image to synthetic text and searches for matches to the synthetic text in the synthetic text library. The matches can in turn be used to retrieve the corresponding electronic documents. A plurality of matches and corresponding electronic documents may be retrieved and ranked in order according to the similarity of the synthetic text. In one or more embodiments, the present invention includes a method for indexing documents by converting them to synthetic text, and a method for retrieving documents by converting an image to synthetic text and comparing the synthetic text to documents that have been converted to synthetic text for a match.
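The index-then-retrieve flow described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: the `SyntheticTextLibrary` class, the synthetic word strings, and in particular the similarity measure (a count of shared synthetic words), which the abstract does not specify.

```python
from collections import Counter, defaultdict


class SyntheticTextLibrary:
    """Toy inverted index from synthetic words to document ids (illustrative only)."""

    def __init__(self):
        # synthetic word -> set of document ids that contain it
        self._postings = defaultdict(set)

    def index(self, doc_id, synthetic_words):
        """Associate each synthetic word of a document with that document."""
        for word in synthetic_words:
            self._postings[word].add(doc_id)

    def retrieve(self, query_words, top_k=5):
        """Rank documents by how many distinct query words they share.

        Assumption: shared-word count stands in for the unspecified
        similarity measure.
        """
        scores = Counter()
        for word in set(query_words):
            for doc_id in self._postings.get(word, ()):
                scores[doc_id] += 1
        # Highest score first; ties broken by document id for determinism.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


library = SyntheticTextLibrary()
library.index("doc-a", ["x3y1-a2-w4", "x5y1-a0-w2", "x1y2-a7-w3"])
library.index("doc-b", ["x3y1-a2-w4", "x9y9-a1-w1"])

# Synthetic words that would be derived from an input image (made up here).
ranked = library.retrieve(["x3y1-a2-w4", "x5y1-a0-w2"])
print(ranked)
```

Because lookup is a plain string match against an inverted index, the two-dimensional comparison reduces to ordinary text search, which is the speed advantage the abstract points to.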
19 Claims
1. A computer system for generating synthetic words for parts of documents, the computer system comprising:
- an indexing module for receiving an electronic document comprising a first part and a second part, identifying a first bounding box for the first part, identifying a second bounding box for the second part, generating a synthetic word of the first bounding box, the synthetic word including a location of the first bounding box, a first quantized angle measured between a line originating from the first bounding box and a path joining a center of the first bounding box to a center of the second bounding box and a second quantized angle defining a width of the second bounding box, and producing an association between the electronic document and the synthetic word; and
- a data storage for storing the electronic document, the synthetic word and the association between the electronic document and the synthetic word, wherein the first part and the second part are at least one of text, a character and an image.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A method for generating synthetic words for parts of documents, the method comprising:
using a computer to perform steps comprising:
- receiving an electronic document comprising a first part and a second part;
- identifying a first bounding box for the first part;
- identifying a second bounding box for the second part;
- generating a synthetic word of the first bounding box, the synthetic word including a location of the first bounding box, a first quantized angle measured between a line originating from the first bounding box and a path joining a center of the first bounding box to a center of the second bounding box and a second quantized angle defining a width of the second bounding box; and
- producing an association between the electronic document and the synthetic word, wherein the first part and the second part are at least one of text, a character and an image.
Dependent claims: 16, 17, 18, 19
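The synthetic word recited in claims 1 and 15 can be sketched as follows. This is one possible reading under stated assumptions, not the patented encoding: the reference line is taken to be a horizontal line through the center of the first bounding box, the second angle is taken to be the angle subtended at that center by the width of the second bounding box, and the `Box` type, the bin count, the grid size and the output string format are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

    @property
    def center(self):
        return (self.x + self.w / 2.0, self.y + self.h / 2.0)


def quantize(value, span, bins):
    """Map a value in [0, span] to an integer bin in [0, bins - 1]."""
    return min(int(value / span * bins), bins - 1)


def synthetic_word(first, second, bins=8, grid=10.0):
    """Encode the first box and its relation to the second box as one token."""
    cx1, cy1 = first.center
    cx2, cy2 = second.center

    # First angle (assumed): between a horizontal line through the first
    # center and the path joining the two centers, normalized to [0, 2*pi).
    theta1 = math.atan2(cy2 - cy1, cx2 - cx1) % (2.0 * math.pi)

    # Second angle (assumed): angle subtended at the first center by the
    # second box's width, measured at the second box's center height.
    # It collapses to 0 or pi when both centers share the same height.
    left = math.atan2(cy2 - cy1, second.x - cx1)
    right = math.atan2(cy2 - cy1, second.x + second.w - cx1)
    theta2 = abs(left - right)  # lies in [0, pi]

    q1 = quantize(theta1, 2.0 * math.pi, bins)
    q2 = quantize(theta2, math.pi, bins)

    # Location (assumed): top-left corner of the first box snapped to a grid.
    gx = int(first.x // grid)
    gy = int(first.y // grid)
    return f"x{gx}y{gy}-a{q1}-w{q2}"


first = Box(x=30, y=10, w=40, h=10)   # e.g. a word on one text line
second = Box(x=60, y=30, w=40, h=10)  # e.g. a word on the line below
word = synthetic_word(first, second)
print(word)
```

Quantizing both angles makes the token tolerant of small distortions in an input image, at the cost of more collisions between distinct layouts; the bin count trades these off.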
Specification