Using an ID domain to improve searching
First Claim
1. A computer-implemented method comprising:
under control of a computing device having one or more processors with executable instructions, segmenting text in an image of a document into elements, each element representing a character in the text;
grouping similar elements into clusters and assigning each cluster an identifier;
replacing each element in a cluster of similar elements with the identifier allocated to the cluster of similar elements;
ordering the identifiers within the document according to an order of the characters in the text;
creating an index of identifiers in the document;
receiving a text query and converting the text query into an image of the text query;
segmenting the image of the text query into elements and matching each element to at least one cluster using a cluster table, the cluster table comprising mappings between identifiers and element characteristics, at least a first element matching to at least two clusters;
replacing each element in the image of the text query with at least one identifier based on the matching to formulate a query defined in terms of identifiers, replacing each element in the image of the text query including replacing the first element with at least two identifiers based on the matching; and
searching the index of identifiers using the query defined in terms of identifiers.
Abstract
Methods which use an ID domain to improve searching are described. An embodiment describes an index phase in which an image of a document is converted into the ID domain. This is achieved by dividing the text in the image into elements and mapping each element to an identifier. Similar elements are mapped to the same identifier. Each element in the text is then replaced by the appropriate identifier to create a version of the document in the ID domain. This version may be indexed and searched. Another embodiment describes a query phase in which a query is converted into the ID domain and then used to search an index of identifiers which has been created from collections of documents which have been converted into the ID domain. The conversion of the query may use mappings which were created during the index phase or alternatively may use pre-existing mappings.
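As a rough illustration of the index phase described in the abstract, the sketch below treats each segmented element as a tiny binary bitmap, groups similar bitmaps by Hamming distance, replaces each element with its cluster identifier, and indexes the resulting identifier sequence. The bitmap representation, the greedy clustering rule, the `max_distance` threshold, and every function name here are assumptions made for illustration; the source does not specify them.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of differing pixels between two equal-sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))

def cluster_elements(elements, max_distance=1):
    """Greedy clustering (an assumed rule): put each element in the first
    cluster whose representative is within max_distance, else open a new one.
    Returns the identifier sequence and the cluster table."""
    cluster_table = []  # cluster_table[i] is the representative bitmap of identifier i
    ids = []
    for glyph in elements:
        for ident, rep in enumerate(cluster_table):
            if hamming(glyph, rep) <= max_distance:
                ids.append(ident)
                break
        else:
            cluster_table.append(glyph)
            ids.append(len(cluster_table) - 1)
    return ids, cluster_table

def build_index(ids):
    """Positional index: identifier -> positions in reading order."""
    index = defaultdict(list)
    for pos, ident in enumerate(ids):
        index[ident].append(pos)
    return dict(index)

# Four segmented elements; the first and third differ by one noisy pixel.
elements = [
    (1, 1, 0, 0),
    (0, 0, 1, 1),
    (1, 1, 0, 1),
    (0, 0, 1, 1),
]
doc_ids, cluster_table = cluster_elements(elements)
index = build_index(doc_ids)
print(doc_ids)  # [0, 1, 0, 1]
print(index)    # {0: [0, 2], 1: [1, 3]}
```

Note that no character recognition takes place in this sketch: the identifiers are arbitrary labels for groups of similar shapes, which is the sense in which the document has been moved into the ID domain.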
29 Citations
20 Claims
1. A computer-implemented method comprising:
under control of a computing device having one or more processors with executable instructions, segmenting text in an image of a document into elements, each element representing a character in the text;
grouping similar elements into clusters and assigning each cluster an identifier;
replacing each element in a cluster of similar elements with the identifier allocated to the cluster of similar elements;
ordering the identifiers within the document according to an order of the characters in the text;
creating an index of identifiers in the document;
receiving a text query and converting the text query into an image of the text query;
segmenting the image of the text query into elements and matching each element to at least one cluster using a cluster table, the cluster table comprising mappings between identifiers and element characteristics, at least a first element matching to at least two clusters;
replacing each element in the image of the text query with at least one identifier based on the matching to formulate a query defined in terms of identifiers, replacing each element in the image of the text query including replacing the first element with at least two identifiers based on the matching; and
searching the index of identifiers using the query defined in terms of identifiers.
5. A computer-implemented method comprising:
under control of a computing device having one or more processors with executable instructions, receiving a text query;
converting the text query into an image by drawing the text query using a font;
performing a comparison between elements in the image to a cluster table associated with the font, the cluster table defining mappings between image elements and identifiers associated with the image elements; and
creating a query defined in terms of identifiers associated with clusters of elements based on the comparison between elements of the image and the cluster table associated with the font, a first element in the image being matched to two clusters of elements, creating the query including replacing the first element with two identifiers associated with the two clusters of elements in the cluster table.
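The query-side steps of claim 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `FONT_GLYPHS` is a hypothetical stand-in for drawing text with a font, `CLUSTER_TABLE` is a hypothetical cluster table for that font, and the Hamming-distance threshold is an invented matching rule. The point shown is that one query element may match two clusters and is then replaced by both identifiers.

```python
# Hypothetical stand-in for "drawing the text query using a font":
# character -> binary bitmap of the rendered glyph.
FONT_GLYPHS = {
    "r": (1, 0, 1, 0),
    "a": (0, 1, 0, 1),
}

# Hypothetical cluster table for that font: identifier -> representative bitmap.
CLUSTER_TABLE = {
    7: (1, 0, 1, 0),
    8: (1, 1, 1, 0),
    9: (0, 1, 0, 1),
    10: (1, 1, 1, 1),
}

def hamming(a, b):
    """Number of differing pixels between two equal-sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))

def text_to_id_query(text, max_distance=1):
    """Replace each rendered element with every identifier whose cluster
    representative lies within max_distance of it (assumed matching rule)."""
    query = []
    for ch in text:
        glyph = FONT_GLYPHS[ch]
        matches = sorted(
            ident for ident, rep in CLUSTER_TABLE.items()
            if hamming(glyph, rep) <= max_distance
        )
        query.append(matches)
    return query

def search(doc_ids, query):
    """Start positions where each query slot accepts the document identifier."""
    n = len(query)
    return [
        start for start in range(len(doc_ids) - n + 1)
        if all(doc_ids[start + k] in query[k] for k in range(n))
    ]

query = text_to_id_query("ra")
doc_ids = [9, 8, 9, 7, 9, 10]  # a document already converted into the ID domain
hits = search(doc_ids, query)
print(query)  # [[7, 8], [9]]  -> "r" matched two clusters
print(hits)   # [1, 3]
```

Allowing alternatives per position means that a document whose "r" shapes were split across clusters 7 and 8 at index time is still found by a single query.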
14. One or more tangible storage media, the one or more tangible storage media being hardware, having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
receiving a text query;
converting the text query into an image by drawing the text query using a font;
comparing elements in the image to a cluster table associated with the font, the cluster table defining mappings between image elements and identifiers associated with the image elements; and
creating a query defined in terms of identifiers associated with clusters of elements based on the comparison between elements of the image and the cluster table associated with the font, a first element in the image being matched to two clusters of elements, creating the query including replacing the first element with two identifiers associated with the two clusters of elements in the cluster table.
15. The one or more tangible storage media as claimed in 14 having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
searching an index of identifiers created from at least one document image using the query defined in terms of identifiers based on the comparison between elements of the image and the cluster table associated with the font.
16. The one or more tangible storage media as claimed in 14 having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
creating the query by replacing each element in the image with at least one identifier based on the comparison.
17. The one or more tangible storage media as claimed in 16 having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
creating the query by replacing each element in the image with N identifiers corresponding to the N most similar image elements in the cluster table.
18. The one or more tangible storage media as claimed in 17 having device-executable instructions, wherein each of said N identifiers has an associated weight and wherein the search of an index of identifiers uses the query defined in terms of identifiers and the weight associated with each of the identifiers.
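Claims 17 and 18 describe replacing each query element with the N most similar identifiers, each carrying a weight that the search then uses. One possible reading is sketched below. The weight formula `1 / (1 + distance)`, the rule that every query slot must match, and all names are assumptions chosen for illustration; the claims do not say how weights are computed or combined.

```python
import heapq

def hamming(a, b):
    """Number of differing pixels between two equal-sized bitmaps."""
    return sum(x != y for x, y in zip(a, b))

def top_n_identifiers(glyph, cluster_table, n=2):
    """The n closest identifiers, each with a weight that decays with
    distance (assumed formula: 1 / (1 + Hamming distance))."""
    nearest = heapq.nsmallest(
        n, cluster_table.items(),
        key=lambda item: (hamming(glyph, item[1]), item[0]),
    )
    return {ident: 1.0 / (1 + hamming(glyph, rep)) for ident, rep in nearest}

def score_window(window_ids, weighted_query):
    """Sum of matched weights, or 0.0 if any query slot fails to match."""
    weights = [slot.get(ident, 0.0) for ident, slot in zip(window_ids, weighted_query)]
    return sum(weights) if all(weights) else 0.0

def weighted_search(doc_ids, weighted_query):
    """(score, start position) pairs for matching windows, best score first."""
    n = len(weighted_query)
    scored = (
        (score_window(doc_ids[s:s + n], weighted_query), s)
        for s in range(len(doc_ids) - n + 1)
    )
    return sorted((t for t in scored if t[0] > 0), key=lambda t: (-t[0], t[1]))

# Hypothetical cluster table and two rendered query elements.
cluster_table = {7: (1, 0, 1, 0), 8: (1, 1, 1, 0), 9: (0, 1, 0, 1)}
query_glyphs = [(1, 0, 1, 0), (0, 1, 0, 1)]

weighted_query = [top_n_identifiers(g, cluster_table, n=2) for g in query_glyphs]
results = weighted_search([8, 9, 7, 9, 9, 9], weighted_query)
print(weighted_query)  # [{7: 1.0, 8: 0.5}, {9: 1.0, 8: 0.25}]
print(results)         # [(2.0, 2), (1.5, 0)]
```

In this sketch the window at position 2 ranks first because both of its identifiers are the closest matches, while the window at position 0 matches only through a second-choice identifier and scores lower.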
19. The one or more tangible storage media as claimed in 14 having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
converting the text query into another image by drawing the text query using another font different from the font;
creating another query defined in terms of identifiers based on a comparison between elements of the other image and a cluster table associated with the other font different from the font; and
searching another index of identifiers created from at least one document image using the other query defined in terms of identifiers based on the comparison between elements of the other image and the cluster table associated with the other font different from the font.
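The multi-font arrangement of claim 19 might look like the sketch below, in which the same text query is converted once per font and each resulting identifier query is run against the index built with that font's cluster table. The `FONTS` structure, the font names, the exact-match lookup, and the use of a plain identifier sequence as a simplified "index" are all assumptions for illustration.

```python
# Hypothetical per-font data: rendered glyphs, cluster table (bitmap -> identifier),
# and a simplified index (document name -> identifier sequence).
FONTS = {
    "serif": {
        "glyphs": {"a": (0, 1, 0, 1), "b": (1, 1, 0, 0)},
        "cluster_table": {(0, 1, 0, 1): 1, (1, 1, 0, 0): 2},
        "index": {"doc1": [2, 1, 2, 2]},
    },
    "sans": {
        "glyphs": {"a": (0, 1, 1, 1), "b": (1, 0, 0, 0)},
        "cluster_table": {(0, 1, 1, 1): 5, (1, 0, 0, 0): 6},
        "index": {"doc2": [5, 6, 5, 5]},
    },
}

def to_id_query(text, font):
    """Draw the text in the given font, then map each glyph to its identifier."""
    return [font["cluster_table"][font["glyphs"][ch]] for ch in text]

def find(sequence, query):
    """Start positions where the query occurs as a contiguous run."""
    n = len(query)
    return [s for s in range(len(sequence) - n + 1) if sequence[s:s + n] == query]

def search_all_fonts(text):
    """Run one identifier query per font against that font's own index."""
    hits = {}
    for font in FONTS.values():
        query = to_id_query(text, font)
        for doc, sequence in font["index"].items():
            positions = find(sequence, query)
            if positions:
                hits[doc] = positions
    return hits

all_hits = search_all_fonts("ab")
print(all_hits)  # {'doc1': [1], 'doc2': [0]}
```

Identifiers are only meaningful relative to the cluster table that produced them, which is why each font's query is paired with its own index here and never compared across fonts.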
20. The one or more tangible storage media as claimed in 14 having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
creating the query by replacing each element in the image with at least one identifier based on the comparison, and creating a query comprising a restricted sequence of identifiers.
Specification