System and method for portable document indexing using n-gram word decomposition
First Claim
1. A computer-implemented method for indexing stored documents, each document containing at least one page and containing a plurality of words, and searching for at least one document matching an input search query containing at least one query word, comprising the steps of:
- for each document;
identifying non-stop words on each page of the document;
determining for each non-stop word at least one n-gram;
for each n-gram, storing a map having a plurality of positions, each position corresponding to a page, and each position indicating whether or not the corresponding page contains the n-gram;
determining at least one query word n-gram for the at least one query word; and
is retrieving documents having n-grams that match selected ones of the query word n-grams, by performing the steps of;
determining a map corresponding to the query word n-gram;
determining from the map at least one page containing the query word n-gram; and
retrieving the page, and the document associated therewith.
1 Assignment
0 Petitions
Accused Products
Abstract
A system and method provides for indexing and retrieval of stored documents using a decomposition of words in the documents in n-grams, or linear word subunits. The documents are indexed as pages in a number of banks. For each bank there is a bank index. The individual n-grams are identified for each page are stored in the bank index. Each bank index further contains an entry map that indicates whether a given n-gram is present in any of the pages of the bank, and then provides an index to a page map that further indicates which page in the bank contains the n-gram. When a search query is input, the query words are decomposed into their n-grams. The query word n-grams are compared first with entry maps to determine if the query word n-grams appear on any page in the bank. If so, the associated page map is traversed to determine which page in the bank contains the query word n-grams. The n-grams on the page are compared with the query word n-grams to determine the presence of an match therebetween. Matching pages are flagged. When all pages in all banks have been processed, the pages are consolidated with respect to the documents to which they belong, resulting in a list of documents that match the search query. The results are displayed to a user.
161 Citations
15 Claims
-
1. A computer-implemented method for indexing stored documents, each document containing at least one page and containing a plurality of words, and searching for at least one document matching an input search query containing at least one query word, comprising the steps of:
-
for each document; identifying non-stop words on each page of the document; determining for each non-stop word at least one n-gram; for each n-gram, storing a map having a plurality of positions, each position corresponding to a page, and each position indicating whether or not the corresponding page contains the n-gram; determining at least one query word n-gram for the at least one query word; and is retrieving documents having n-grams that match selected ones of the query word n-grams, by performing the steps of; determining a map corresponding to the query word n-gram; determining from the map at least one page containing the query word n-gram; and retrieving the page, and the document associated therewith.
-
-
2. A computer readable memory including a storage structure for indexing documents by n-grams, each document having a document number, and a document name, and at least one page, each page having a page number, comprising:
-
a bank comprising a list of page entries, each page entry identifying a page by the document number of the document containing the page, and a page number within the document; and
,a bank index associated with the bank comprising; i) a plurality of n-gram entry maps, each n-gram entry map associated with a single n-gram, selected n-gram entry maps having an index to an index entry map where at least one page identified in the bank includes the n-gram associated with the n-gram entry map; ii) a plurality of index entry maps, each index entry map indexed by one of the n-gram entry maps, each index entry map having a plurality of positions, each position corresponding to a page entry in the bank, and each position indicating whether or not the corresponding page entry in the bank identifies a page containing the n-gram associated with the n-gram entry map that indexes the index entry map. - View Dependent Claims (3, 4, 5, 6, 7)
-
-
8. A computer implemented method of indexing a plurality of documents, each document having at least one page, each page having less than a maximum amount of data, and having a plurality of words, comprising:
-
a) storing a list of pages, each page associated with a document; b) determining a list of n-grams; and c) for selected ones of the n-grams, storing a map of pages that contain the n-gram by; i) retrieving a current page from the documents; and ii) for each non-stop word of the current page; 1) determining the n-grams in the word; and 2) for each n-gram in the word; in a map associated with the n-gram and having a plurality of positions, each position corresponding to a page, and each position indicating whether or not the corresponding page contains the n-gram, updating the position for the current page as indicating that the page contains the n-gram. - View Dependent Claims (9, 10)
-
-
11. A computer readable memory for controlling a processor to index a plurality of documents, each document containing at least one page comprising:
-
a list of indexed pages; a set of index maps, each index map associated with one n-gram and having a plurality of positions, each position uniquely associated with a page in the list of indexed pages, and each position indicating whether or not the corresponding page includes the n-gram associated with the index map; and a page indexing module that; i) receives a current page to be indexed; ii) creates an entry for the current page in the list of indexed pages; iii) stores for each non-stop word of the current page a list of n-grams in the word; and iv) for each n-gram, updates in the index map associated with the n-gram, the entry for the current page to indicate that the current page includes the n-gram. - View Dependent Claims (12)
-
-
13. A computer readable memory for controlling a processor to index a plurality of documents, each document containing at least one page, the memory comprising:
-
a list of indexed pages; a set of index maps, each index map associated with one n-gram and having a plurality of positions, each position uniquely associated with a page in the list of indexed pages, and each position indicating whether or not the corresponding page includes the n-gram associated with the index map; and a page indexing module that; i) receives a current page to be indexed; ii) creates an entry for the current page in the list of indexed pages; iii) stores for each non-stop word of the current page a list of n-grams in the word by; iii.1) determining an n-gram number for each n-gram in the word by the equation;
##EQU6## where NG is the n-gram number of the word;
x is an n-gram character number of the ith character of the word;Cmax is total number of indexable characters; and Np is the desired number of letters in the n-gram; iii.2) storing the n-gram number of each n-gram in the word; and iii.3) associating the stored n-grams numbers with the current page; and iv) for each n-gram, updates in the index map associated with the n-gram, the entry for the current page to indicate that the current page includes the n-gram.
-
-
14. A computer readable memory for controlling a processor to index a document including a query term from a plurality of documents, each document containing at least one page, comprising:
-
a list of indexed pages, each page associated with a document; a set of index maps, each index map associated with one n-gram and having a plurality of positions, each position uniquely associated with a page in the list of indexed pages, and each position indicating whether or not the corresponding page includes the n-gram associated with the index map; and a search module that;
receives a query term;for each of a number of n-grams in the query term; determines whether there is an index map associated with the n-gram; and responsive to an existing index map, determines from the index map each page in the list of indexed pages that contains the n-gram associated with the map; for each page in the list of indexed pages, determines whether the page contains a sufficient number of the n-grams in the query term to indicate that the page contains the query term; and responsive to a page containing the query term, retrieves the document containing the page for subsequent query analysis. - View Dependent Claims (15)
-
Specification