Image-based document indexing and retrieval
Abstract
A system that facilitates document retrieval and/or indexing is provided. A component receives an image of a document, and a search component searches data store(s) for a match to the document image. The match is performed over word-level topological properties of images of documents stored in the data store(s).
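As a rough illustration of the word-level topological properties mentioned in the abstract, the sketch below builds a signature from word bounding boxes. The grid of segments, the `bucket` quantization, and the names `Word` and `build_signature` are assumptions made for illustration only; the text here states only that word widths are among the properties used and that a value for each image segment is entered into a hash-table location.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: (x, y, width) of a word's bounding box, in pixels.
Word = Tuple[int, int, int]


def build_signature(words: List[Word], page_w: int, page_h: int,
                    cols: int = 4, rows: int = 4,
                    bucket: int = 8) -> Dict[int, int]:
    """Map each grid segment (table location) to a value derived from word widths.

    Assumption: the segment value is the sum of quantized widths of the words
    whose top-left corner falls in that segment. Quantizing by `bucket` is one
    way to tolerate small measurement noise in a captured image.
    """
    signature = {loc: 0 for loc in range(rows * cols)}
    for x, y, width in words:
        col = min(x * cols // page_w, cols - 1)   # clamp words on the right edge
        row = min(y * rows // page_h, rows - 1)   # clamp words on the bottom edge
        signature[row * cols + col] += width // bucket
    return signature


words = [(10, 10, 40), (60, 10, 24), (300, 350, 80)]
signature = build_signature(words, page_w=400, page_h=400)
```

In this sketch the same function would be applied both to generated images of electronic documents and to the captured image, so that the resulting tables can be compared location by location.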
41 Claims
1. A machine-implemented system for document retrieval and/or indexing comprising:
a processor;
a component that receives a captured image of at least a portion of a physical document;
a search component that locates a match to the physical document, the search is performed over word-level topological properties of generated images, the word-level topological properties comprise at least respective widths of words on the generated images, and the generated images being images of at least a portion of one or more electronic documents; and
a comparison component that iteratively compares a portion of a signature associated with the captured image based at least in part on word-level topological properties with corresponding portions of signatures respectively associated with the generated images based at least in part on word-level topological properties and excludes each generated image whose portion of the signature does not match the portion of the signature of the captured image to facilitate location of a match to the physical document, the portion of the signature associated with the captured image and the portion of the signatures respectively associated with the generated images that are compared become progressively smaller with each iteration, where one or more iterations are performed until a predetermined threshold number of generated images remain,
wherein each portion of signature respectively associated with a generated image is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the generated image is entered into a respective table location for each segment of the generated image, and
wherein the portion of the signature associated with the captured image is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the captured image is entered into a respective table location for each segment of the captured image.
Dependent claims: 2-23, 41
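The iterative exclusion recited in claim 1 can be sketched roughly as follows. This is only one possible reading: the agreement ratio `min_agreement`, the representation of a "portion" as a list of table locations, and the names `portion_matches` and `narrow_candidates` are all assumptions, since the claim does not say how a portion is chosen or what counts as a match.

```python
from typing import Dict, List, Sequence

Signature = Dict[int, int]  # table location -> segment value


def portion_matches(query: Signature, candidate: Signature,
                    locations: Sequence[int],
                    min_agreement: float = 0.5) -> bool:
    """Assumed match rule: enough locations in the portion hold equal values."""
    agree = sum(1 for loc in locations if query.get(loc) == candidate.get(loc))
    return agree >= min_agreement * len(locations)


def narrow_candidates(query: Signature,
                      candidates: Dict[str, Signature],
                      portions: Sequence[Sequence[int]],
                      threshold: int = 1) -> List[str]:
    """Compare progressively smaller portions, excluding non-matching images,
    until at most `threshold` candidates remain or the portions run out."""
    remaining = list(candidates)
    for locations in portions:
        if len(remaining) <= threshold:
            break
        remaining = [name for name in remaining
                     if portion_matches(query, candidates[name], locations)]
    return remaining


query = {0: 5, 1: 3, 2: 7, 3: 1}
candidates = {
    "a": {0: 5, 1: 3, 2: 7, 3: 1},
    "b": {0: 5, 1: 3, 2: 0, 3: 0},
    "c": {0: 9, 1: 9, 2: 9, 3: 9},
}
# Each portion is smaller than the one before it.
portions = [[0, 1, 2, 3], [2, 3], [3]]
survivors = narrow_candidates(query, candidates, portions, threshold=1)
```

In this example the first pass excludes "c", the second pass excludes "b", and the loop stops before the third pass because only one candidate is left.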
24. A method that facilitates indexing and/or retrieval of a document, comprising:
generating a plurality of images of electronic documents, at least one of the images of electronic documents corresponding to a printed document;
capturing an image of a printed document after such document has been printed;
receiving a query requesting retrieval of an electronic document corresponding to the image of the printed document;
generating one or more signatures corresponding to at least a portion of one or more of the generated images, the signatures generated at least in part upon word-layout within the image(s), the one or more signatures is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the generated image is entered into a respective table location for each segment of the generated image;
generating a signature corresponding to at least a portion of the captured image, the signature is generated based at least in part upon word-layout within the captured image, the signature is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the captured image is entered into a respective table location for each segment of the captured image; and
comparing the one or more signatures corresponding to the one or more generated images to the signature corresponding to the captured image; and
identifying a generated image that has a highest number of table locations that have respective values that match values in corresponding table locations associated with the captured image.
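The final identifying step of claim 24 amounts to counting, for each generated image, how many table locations hold the same value as the corresponding location for the captured image, and picking the image with the highest count. A minimal sketch, assuming signatures are plain dictionaries and using the hypothetical names `count_matching_locations` and `best_match`:

```python
from typing import Dict

Signature = Dict[int, int]  # table location -> segment value


def count_matching_locations(query: Signature, candidate: Signature) -> int:
    """Number of table locations whose value equals the query's value there."""
    return sum(1 for loc, value in query.items() if candidate.get(loc) == value)


def best_match(query: Signature, candidates: Dict[str, Signature]) -> str:
    """Name of the generated image with the highest number of matching locations.

    Ties go to the first candidate encountered; an empty `candidates`
    mapping raises ValueError (behavior of the built-in max).
    """
    return max(candidates,
               key=lambda name: count_matching_locations(query, candidates[name]))


captured = {0: 5, 1: 3, 2: 7, 3: 1}
generated = {
    "doc1": {0: 5, 1: 0, 2: 0, 3: 0},
    "doc2": {0: 5, 1: 3, 2: 7, 3: 2},
    "doc3": {0: 1, 1: 1, 2: 1, 3: 1},
}
winner = best_match(captured, generated)
```

Counting matches rather than requiring an exact match lets a captured image with some noisy segments still resolve to the closest generated image.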
25. A method that facilitates indexing and/or retrieval of a document, comprising:
receiving a captured image of at least a portion of a document; and
searching at least one data store for an electronic document corresponding to the captured image, the search performed via comparing topological word properties within the captured image with topological word properties of generated images corresponding to a plurality of electronic documents, the respective topological word properties comprising at least width of each word;
generating signatures corresponding to the generated images, each of the signatures is a hash table that contains a plurality of table locations where a respective value corresponding to a respective portion of a particular generated image is entered into a respective table location for each portion of the particular generated image;
generating a signature corresponding to the captured image of the document, the signature is a hash table that contains a plurality of table locations where a respective value corresponding to a respective portion of the captured image is entered into a respective table location for each portion of the captured image;
comparing a portion of the signatures corresponding to respective generated images to a portion of the signature corresponding to the captured image; and
identifying at least one generated image that has a highest number of table locations that have respective values that match values in corresponding table locations associated with the captured image.
Dependent claims: 26-32
33. A machine-implemented system for indexing and/or retrieval of a document, comprising:
a processor;
means for generating an image of an electronic document when the electronic document is printed;
means for capturing an image of the document after the document has been printed;
means for generating a signature corresponding with the generated image;
means for generating a signature corresponding to the captured image;
means for storing the electronic document;
means for iteratively comparing location of respective words and width of respective words within a portion of a signature associated with the captured image to the location of respective words and width of respective words within respective portions of signatures associated with the generated images and excluding each generated image whose signature portion does not match the signature portion of the captured image, the portion of the signature associated with the captured image and the corresponding portions of the signatures respectively associated with the generated images that are compared become progressively smaller with each iteration, where one or more iterations are performed until a predetermined threshold number of generated images remain,
wherein each portion of signature respectively associated with a generated image is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the generated image is entered into a respective table location for each segment of the generated image, and
wherein the portion of the signature associated with the captured image is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the captured image is entered into a respective table location for each segment of the captured image; and
means for retrieving the electronic document.
Dependent claims: 34-37
38. A machine-implemented system that facilitates indexing and/or retrieval of a document, comprising:
a processor;
a query component that receives an image of a printed document;
a caching component that generates and stores an image corresponding to the image of the printed document prior to the query component receiving the image of the printed document; and
a comparison component that determines and retrieves the stored image via comparing location of words and width of words within the stored image to location of words and width of words within the received image of the printed document, the comparison component iteratively compares a portion of a signature associated with the received image with corresponding portions of signatures respectively associated with the stored images and excludes each stored image whose signature does not match the signature of the received image to facilitate identification of a match to the printed document, the portion of the signature associated with the received image and the portion of the signatures respectively associated with the stored images that are compared become progressively smaller with each iteration, where one or more iterations are performed until a predetermined threshold number of signatures associated with stored images remain,
wherein location of words and width of words within at least a portion of a stored image is represented as a portion of a signature of the stored image, and location of words and width of words within at least a portion of the received image of the printed document is represented as a portion of a signature of the received image,
wherein each portion of signature respectively associated with a stored image is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the stored image is entered into a respective table location for each segment of the stored image, and
wherein the portion of the signature associated with the received image is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the received image is entered into a respective table location for each segment of the received image.
39. A computer storage medium having computer executable instructions stored thereon to facilitate document retrieval and/or indexing comprising:
return at least one stored image of an electronic document to a user based at least in part upon topological word properties of at least one captured image corresponding to the electronic document; and
an iterative comparison of a portion of a signature associated with the at least one captured image with corresponding portions of signatures respectively associated with the at least one stored image and excludes each stored image whose signature does not match the signature of the at least one captured image to facilitate identification of a match to the electronic document,
wherein the portion of the signature associated with the at least one captured image and the portion of the signatures respectively associated with the at least one stored image that are compared become progressively smaller with each iteration, where one or more iterations are performed until a predetermined threshold number of signatures associated with the at least one stored image remains,
wherein the topological word properties comprise at least width of respective words,
wherein the portion of the signature associated with the at least one captured image is based at least in part on topological word properties of a corresponding portion of the at least one captured image, and the portion of signature associated with the at least one stored image is based at least in part on topological word properties of a corresponding portion of the at least one stored image,
wherein each portion of a signature respectively associated with a stored image is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the stored image is entered into a respective table location for each segment of the stored image, and
wherein the portion of the signature associated with the captured image is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the captured image is entered into a respective table location for each segment of the captured image.
40. A computer storage medium having a data structure thereon to facilitate document retrieval and/or indexing comprising, the data structure comprising:
a component that receives image(s) of at least a portion of a printed document; and
a search component that facilitates retrieval of an electronic document, the electronic document corresponding to the image(s) associated with the printed document, the retrieval based at least in part upon corresponding word-level topological properties when comparing the image(s) associated with the printed document and generated image(s) of the electronic document, the word-level topological properties comprise at least width of words; and
a comparison component that is associated with the search component and iteratively compares a portion of a signature associated with the received image associated with the printed document with corresponding portions of signatures respectively associated with the generated images and excludes each generated image whose signature does not match the signature of the received image to facilitate location of a match to the printed document,
wherein the portion of the signature associated with the received image and the portion of the signatures respectively associated with the generated images that are compared become progressively smaller with each iteration, where one or more iterations are performed until a predetermined threshold number of signatures associated with generated images remain,
wherein the portion of the signature associated with the received image is based at least in part on word-level topological properties of a corresponding portion of the received image, and portions of signature respectively associated with the generated images is based at least in part on word-level topological properties of corresponding portions of the generated images,
wherein each portion of a signature respectively associated with a generated image is a hash table that contains a plurality of table locations where respective value corresponding to a respective segment of the generated image is entered into a respective table location for each segment of the generated image, and
wherein the portion of the signature associated with the received image is a hash table that contains a plurality of table locations where a respective value corresponding to a respective segment of the received image is entered into a respective table location for each segment of the received image.
Specification