Scalable indexing for layout based document retrieval and ranking
Abstract
A computer-based method and a system for indexing, querying, and ranking documents based on layout are provided. The method includes providing a plurality of documents to computer memory, extracting layout blocks from the provided documents, clustering the layout blocks into a plurality of layout block clusters, computing a representative block for each of the layout block clusters, generating a document index for each provided document based on the layout blocks of the document and the computed representative blocks, clustering the created document indexes into a plurality of document index clusters, and generating a representative cluster index for each of the document index clusters. The generated indexes, together with the representative blocks and document index clusters, can be stored and used to retrieve documents responsive to a layout query.
25 Claims
1. A computer-implemented method for creating a set of indexes for a collection of documents according to document layout, comprising:
providing a plurality of documents to computer memory;
extracting layout blocks from the provided documents;
using a computer processor, clustering the layout blocks into a plurality of layout block clusters;
computing a representative block for each of the layout block clusters;
generating a document index for each provided document based on the layout blocks of the document and the computed representative blocks;
clustering the created document indexes into a plurality of document index clusters;
generating a representative cluster index for each of the document index clusters; and
outputting the generated document indexes, representative blocks, document index clusters, and representative cluster indexes to memory.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 21.
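The steps of claim 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patented implementation: the claim does not name a block representation, a clustering algorithm, or an index format, so the sketch assumes each layout block is a normalized `(x, y, width, height)` tuple, uses plain k-means for both clustering steps, takes cluster centroids as the representatives, and builds each document index as a normalized histogram over the representative blocks. All function names and the sample data are hypothetical.

```python
from math import dist

def nearest(p, centroids):
    """Position of the centroid closest to point p (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means, seeded with the first k points so runs are repeatable."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, [nearest(p, centroids) for p in points]

def document_index(blocks, reps):
    """Assumed index format: share of a document's blocks nearest to each representative."""
    counts = [0.0] * len(reps)
    for b in blocks:
        counts[nearest(b, reps)] += 1
    total = sum(counts) or 1.0
    return tuple(c / total for c in counts)

def build_indexes(docs, n_block_clusters, n_doc_clusters):
    # Cluster all layout blocks; the centroids act as representative blocks.
    all_blocks = [b for blocks in docs.values() for b in blocks]
    reps, _ = kmeans(all_blocks, n_block_clusters)
    # One index per document, then cluster the indexes themselves.
    doc_ids = list(docs)
    indexes = {d: document_index(docs[d], reps) for d in doc_ids}
    cluster_reps, assign = kmeans([indexes[d] for d in doc_ids], n_doc_clusters)
    clusters = {c: [] for c in range(n_doc_clusters)}
    for d, a in zip(doc_ids, assign):
        clusters[a].append(d)
    return {"reps": reps, "indexes": indexes,
            "clusters": clusters, "cluster_reps": cluster_reps}

# Hypothetical collection: two letters (header + body) and two forms (small fields).
docs = {
    "letter_a": [(0.1, 0.1, 0.8, 0.1), (0.1, 0.3, 0.8, 0.6)],
    "form_a": [(0.1, 0.1, 0.3, 0.05), (0.6, 0.1, 0.3, 0.05),
               (0.1, 0.2, 0.3, 0.05), (0.6, 0.2, 0.3, 0.05)],
    "letter_b": [(0.1, 0.12, 0.8, 0.1), (0.1, 0.32, 0.8, 0.6)],
    "form_b": [(0.1, 0.1, 0.3, 0.05), (0.6, 0.1, 0.3, 0.05), (0.1, 0.25, 0.3, 0.05)],
}
result = build_indexes(docs, n_block_clusters=3, n_doc_clusters=2)
print(result["clusters"])
```

With this sample data the two letters end up in one document index cluster and the two forms in the other. A real system would likely need a richer block descriptor and a more careful choice of seeds and cluster counts.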
17. A computer-implemented method for querying a collection of documents according to document layout, comprising:
providing in computer memory:
a set of representative blocks, each representative block being representative of a respective cluster of layout blocks extracted from documents in the collection,
a set of document indexes, each document index being representative of layout blocks of a respective document in the collection, each of the document indexes being assigned to a respective one of a set of document index clusters, and
a representative document index for each document index cluster, each representative document index being derived from the document indexes assigned to the document index cluster;
extracting layout blocks from an input document layout query;
projecting the extracted layout blocks from the document layout query onto the set of representative blocks to generate a query document index;
with a computer processor, computing a measure of similarity between the query document index and each of the representative document indexes in the set of representative document indexes;
identifying a top document index cluster which includes the representative document index determined to have the greatest similarity to the query document index;
computing a measure of similarity between each document index in the top document index cluster and the query index to identify a set of most similar documents; and
outputting information on the documents in the set of most similar documents.
Dependent claims: 18, 19, 20.
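The two-stage lookup in claim 17 (pick the closest document index cluster first, then rank only the documents inside it) can be illustrated as follows. This is a minimal sketch under assumptions the claim does not state: blocks are `(x, y, width, height)` tuples, "projecting" means building a normalized histogram of nearest representative blocks, and the similarity measure is cosine similarity. The function names and the stored index values are hypothetical.

```python
from math import dist, sqrt

def cosine(u, v):
    """Cosine similarity, used here as an assumed 'measure of similarity'."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def project(blocks, reps):
    """Project query blocks onto the representative blocks (nearest-neighbor histogram)."""
    counts = [0.0] * len(reps)
    for b in blocks:
        counts[min(range(len(reps)), key=lambda i: dist(b, reps[i]))] += 1
    total = sum(counts) or 1.0
    return tuple(c / total for c in counts)

def query(query_blocks, reps, indexes, clusters, cluster_reps, top_n=5):
    q = project(query_blocks, reps)
    # Stage 1: compare the query index only against the cluster representatives.
    top = max(clusters, key=lambda c: cosine(q, cluster_reps[c]))
    # Stage 2: score just the documents assigned to the winning cluster.
    scored = [(cosine(q, indexes[d]), d) for d in clusters[top]]
    scored.sort(key=lambda s: -s[0])
    return [(d, s) for s, d in scored[:top_n]]

# Hypothetical stored data, shaped like the outputs of the indexing method.
reps = [(0.1, 0.11, 0.8, 0.1), (0.1, 0.31, 0.8, 0.6), (0.3, 0.15, 0.3, 0.05)]
indexes = {
    "letter_a": (0.5, 0.5, 0.0), "letter_b": (0.25, 0.75, 0.0),
    "form_a": (0.0, 0.0, 1.0), "form_b": (0.2, 0.0, 0.8),
}
clusters = {0: ["letter_a", "letter_b"], 1: ["form_a", "form_b"]}
cluster_reps = {0: (0.375, 0.625, 0.0), 1: (0.1, 0.0, 0.9)}

# Query layout: one header-like block and one body-like block.
hits = query([(0.1, 0.1, 0.8, 0.1), (0.1, 0.3, 0.8, 0.55)],
             reps, indexes, clusters, cluster_reps)
print([doc for doc, _ in hits])
```

The point of the first stage is scalability: the query is compared against one representative per cluster rather than against every document index, so the per-document comparison in the second stage touches only a fraction of the collection.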
22. A computer-based system for querying a collection of documents according to a document layout, comprising:
(i) a data input module configured to receive into memory: a set of document indexes, a set of representative blocks, a set of document index clusters, a set of representative indexes, and a document layout query;
(ii) a document query module configured to:
extract layout blocks from the document layout query,
project the extracted layout blocks from the document layout query onto the set of representative blocks to generate a query index,
compute the similarity between the query index and each of the representative indexes in the set of document index clusters by using a ranking function,
identify the document index cluster containing the representative index having the greatest similarity as the top document index cluster, and
compute the similarity between each document index in the top document index cluster and the query index by using the ranking function; and
(iii) a document ranking module configured to:
maintain an ordered list of the top k documents based on the computed similarity between each document index in the top document index cluster and the query index, and
output the ordered list.
Dependent claims: 23, 24, 25.
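The ranking module in element (iii) keeps an ordered list of the top k documents as similarities are computed. One common way to do this, shown here purely as an assumed illustration (the claim does not specify a data structure), is a bounded min-heap: the weakest retained document sits at the top of the heap and is evicted when a better one arrives. The class name `TopK` and the sample scores are hypothetical.

```python
import heapq

class TopK:
    """Keeps the k highest-scoring documents seen so far."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap of (score, doc_id); lowest retained score on top

    def offer(self, doc_id, score):
        item = (score, doc_id)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, item)
        elif item > self._heap[0]:
            # New document beats the weakest retained one: swap it in.
            heapq.heapreplace(self._heap, item)

    def ordered(self):
        """Ordered list of (doc_id, score), best first."""
        return [(d, s) for s, d in sorted(self._heap, reverse=True)]

ranking = TopK(k=3)
for doc_id, score in [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]:
    ranking.offer(doc_id, score)
top_docs = ranking.ordered()
print(top_docs)  # [('b', 0.9), ('d', 0.7), ('c', 0.5)]
```

Each `offer` costs O(log k), so the list can be maintained while streaming through the document indexes of the top cluster without sorting them all.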
Specification