Document retrieval system and search method using word set and character look-up tables
First Claim
1. A method of matching a search string according to a predetermined set of matching criteria to a set of words contained in a collection of words, comprising:
- creating and storing a lexicon containing the collection of words and associating each of the stored words with a unique identifying number;
creating and storing a word look-up table identifying sets of word numbers associated with words of the lexicon that have a common set of characteristics;
creating and storing a character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character;
selecting from the word look-up table a target set of word numbers whose associated words have a set of characteristics corresponding to the search string;
refining the target set, the refining comprising selecting a set of characters from the search string, accessing the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and in response to the character identification excluding from the target set those word numbers whose associated words do not contain a predetermined number of the selected characters; and
, comparing each of the words associated with the refined target set directly with the search string and excluding from the target set any word number whose associated word fails to match the search string according to the predetermined set of matching criteria.
1 Assignment
0 Petitions
Accused Products
Abstract
A computer-operated document retrieval system includes a lexicon of words contained in system documents, and a document look-up table that relates words by unique word numbers to the documents. A word look-up table identifies sets of words with common characteristics, specifically prefix value and word length, and a character look-up table identifies whether any word contains a specified character. A target set generator accesses the word look-up table to compose a target word set with characteristics corresponding to the search string. A refining module reduces the target set by selecting a set of characters from the search string, and accessing the character look-up table to identify which target words use the character set. The character look-up table is a boolean array with one bit elements that are processed in groups whose size corresponds to the maximum bit processing count of the computer, effectively culling non-matching words simultaneously. A string comparison module determines whether any word remaining in the target set matches the search string. The system quickly executes various searches, including prefix, exact match, wildcard, and fuzzy searches.
-
Citations
40 Claims
-
1. A method of matching a search string according to a predetermined set of matching criteria to a set of words contained in a collection of words, comprising:
-
creating and storing a lexicon containing the collection of words and associating each of the stored words with a unique identifying number;
creating and storing a word look-up table identifying sets of word numbers associated with words of the lexicon that have a common set of characteristics;
creating and storing a character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character;
selecting from the word look-up table a target set of word numbers whose associated words have a set of characteristics corresponding to the search string;
refining the target set, the refining comprising selecting a set of characters from the search string, accessing the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and in response to the character identification excluding from the target set those word numbers whose associated words do not contain a predetermined number of the selected characters; and
,comparing each of the words associated with the refined target set directly with the search string and excluding from the target set any word number whose associated word fails to match the search string according to the predetermined set of matching criteria. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A method of matching a search string according to a predetermined set of matching criteria to a set of words contained in a collection of words, comprising:
-
creating and storing a lexicon containing the collection of words and associating each of the words with a unique identifying number;
creating and storing a word look-up table identifying sets of word numbers, each of the word number sets identifying words of the lexicon that have a common prefix value and a common word length;
creating and storing a character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character;
accessing the word look-up table to compose a target set of word numbers whose associated words have the same prefix value as the search string and have word lengths corresponding to the length of the search string according to the predetermined set of matching criteria;
refining the target set, the refining comprising selecting a set of characters from the search string, accessing the character lookup table to identify which of the selected characters are contained in each of the words associated with the target set, and in response to the character identification excluding from the target set those word numbers whose associated words do not contain a predetermined number of the selected characters; and
,comparing each of the words associated with the refined target set directly with the search string and excluding from the target set any word number whose associated word fails to match the search string according to the predetermined set of matching criteria. - View Dependent Claims (8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21)
-
-
14. A computer-operated document retrieval system adapted to retrieve documents in response to a search string that specifies a set of words to be found in a document of interest and in response to instructions specifying a predetermined search type, the system comprising:
-
digital storage means storing;
(a) a collection of documents;
(b) a lexicon comprising words contained in the collection of documents and associating each of the words with a unique number;
(c) a word look-up table identifying sets of word numbers associated with words of the lexicon that have a common set of predetermined characteristics;
(d) a character look-up table identifying for a specified word number and a specified character whether the word in the lexicon associated with the specified word number contains the specified character; and
,(e) a document look-up table relating the word numbers of the lexicon to ones of the stored documents containing the words;
set generating means for accessing the word look-up table to compose a target set of word numbers in response to the specified search type and a set of characteristics of the search string;
set refining means for refining the target set, the set refining means programmed to select a character set consisting of characters in the search string, to access the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and to exclude from the target set any word number whose associated word contains less than a predetermined number of the selected characters;
comparison means for comparing the search string with each of the words associated with the target set and excluding from the target set any word number whose associated word does not match the search string according to a set of matching criteria determined by the specified search type; and
,retrieval means for accessing the document look-up table to identify and retrieve documents related to the word numbers of the target set.
-
-
22. A product for enabling a digital processor coupled to a digital storage medium to match a search string according to a predetermined set of matching criteria to a set of words contained in a collection of words, the product comprising a processor-readable medium containing program code for operating the processor, the program code defining means for:
-
creating and storing in the digital storage medium a lexicon containing the collection of words and associating each of the stored words with a unique identifying number;
creating and storing in the digital storage medium a word look-up table identifying word number sets consisting of word numbers associated with words of the lexicon that have a common set of characteristics;
creating and storing in the digital storage medium a character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character;
selecting a target set of word numbers from the word look-up table whose associated words have a set of characteristics corresponding to the search string;
refining the target set, the refining comprising selecting a set of characters from the search string, accessing the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and in response to the character identification excluding from the target set those word numbers whose associated words do not contain a predetermined number of the selected characters; and
,comparing each of the words associated with the refined target set directly with the search string and excluding from the target set any word number whose associated word fails to match the search string according to the predetermined set of matching criteria. - View Dependent Claims (23, 24, 25, 26, 27, 28, 29)
-
-
30. A product for enabling a digital processor to identify among a collection of documents a set of documents that contain a set of words specified by a search string, the digital processor operating under control of program code defining (a) a target set generator for identifying a target set consisting of words with a common set of characteristics corresponding to the search string, (b) a set refining module for excluding from the target set any word that does not contain a predetermined number of characters selected from the search string, (c) a comparison module for comparing each of the words associated with the refined target set directly with the search string and excluding any word that fails to match the search string according to a predetermined set of matching criteria, and (d) a document retrieving module for identifying and retrieving documents containing a specified set of words, each of the modules programmed to identify words with unique numbers pre-assigned to the words, the product comprising a processor-readable medium containing in processor-readable form:
-
a lexicon for use with the comparison module, the lexicon containing words found in the collection of documents, the lexicon associating each of the words of the lexicon with a unique identifying number;
a word look-up table for use with the target set generating module, the word look-up table identifying sets of word numbers associated with words of the lexicon that have a common set of characteristics;
a character look-up table for use with the set refining module, the character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character; and
,a document look-up table for use with the document retrieving module, the document look-up table relating each of the word numbers to those documents of the collection that contain the word associated with the word number. - View Dependent Claims (31, 32, 33, 34, 35, 36, 38, 39, 40)
-
-
37. A product for enabling a digital processor to retrieve among a collection of documents a set of documents containing a set of words that match a search string according to a predetermined set of search criteria, the collection of documents stored together with a lexicon, a word look-up table, a character look-up table and a document look-up table in a digital storage device coupled to the processor, the lexicon containing a collection of words found in the documents and associating each of the collected words with a unique identifying number, the word look-up table identifying sets of word numbers associated with words of the lexicon that have a common set of characteristics, the character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character, and the document look-up table relating each of the word numbers to those documents of the collection of documents that contain the word associated with the word number, the product comprising a processor-readable medium containing program code that defines:
-
set generating means responsive to the search string and to the search criteria for accessing the word look-up table to compose a target set of word numbers whose associated words have a set of characteristics corresponding to the search string;
set refining means for refining the target set, the set refining means programmed to select a character set consisting of characters in the search string, to access the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and to exclude from the target set any word number whose associated word contains less than a predetermined number of the selected characters;
comparison means for comparing the search string with each of the words associated with the target set and excluding from the target set any word number whose associated word does not match the search string according to a set of matching criteria determined by the set of search criteria; and
,retrieval means for accessing the document look-up table to identify and retrieve documents related to the word numbers of the target set.
-
Specification