Document retrieval system and search method using word set and character look-up tables
First Claim
1. A method of matching a search string according to a predetermined set of matching criteria to a set of words contained in a collection of words, comprising:
- creating and storing a lexicon containing the collection of words and associating each of the stored words with a unique identifying number;
creating and storing a word look-up table identifying sets of word numbers associated with words of the lexicon that have a common set of characteristics;
assigning the word numbers to the words of the lexicon such that each of the word number sets identified by the word look-up table consists of consecutive numbers;
creating and storing a character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character and wherein the character look-up table is a two-dimensional boolean array with one dimension corresponding to character values and the other dimension corresponding to word numbers;
selecting from the word look-up table a target set of word numbers whose associated words have a set of characteristics corresponding to the search string;
refining the target set, the refining comprising selecting a set of characters from the search string, accessing the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and in response to the character identification excluding from the target set those word numbers whose associated words do not contain a predetermined number of the selected characters;
comparing each of the words associated with the refined target set directly with the search string and excluding from the target set any word number whose associated word fails to match the search string according to the predetermined set of matching criteria;
each element of the array consists of a single bit, and the elements of the array corresponding to any character value of the one dimension are stored side-by-side in a row;
the selecting of the target set comprises composing a set of consecutive word numbers from one or more word sets identified by the word look-up table;
the accessing of the character look-up table comprises generating a boolean match value for each of the word numbers in the target set which indicates whether all of the selected characters are contained in the word associated with the word number; and
, the generating of the boolean values comprising performing a logical AND operation that combines sections of each of the rows of the character look-up table corresponding to the selected characters and to the word numbers of the target set thereby to produce a resulting bit string in which each bit is associated with one of the word numbers of the target set and contains the boolean value for the associated word number, the performing of the logical AND operation comprising simultaneously combining n-bit segments of the row sections;
whereby, the boolean match values are generated simultaneously for n words associated with the target set.
1 Assignment
0 Petitions
Accused Products
Abstract
A computer-operated document retrieval system includes a lexicon of words contained in system documents, and a document look-up table that relates words by unique word numbers to the documents. A word look-up table identifies sets of words with common characteristics, specifically prefix value and word length, and a character look-up table identifies whether any word contains a specified character. A target set generator accesses the word look-up table to compose a target word set with characteristics corresponding to the search string. A refining module reduces the target set by selecting a set of characters from the search string, and accessing the character look-up table to identify which target words use the character set. The character look-up table is a boolean array with one bit elements that are processed in groups whose size corresponds to the maximum bit processing count of the computer, effectively culling non-matching words simultaneously. A string comparison module determines whether any word remaining in the target set matches the search string. The system quickly executes various searches, including prefix, exact match, wildcard, and fuzzy searches.
66 Citations
13 Claims
-
1. A method of matching a search string according to a predetermined set of matching criteria to a set of words contained in a collection of words, comprising:
-
creating and storing a lexicon containing the collection of words and associating each of the stored words with a unique identifying number;
creating and storing a word look-up table identifying sets of word numbers associated with words of the lexicon that have a common set of characteristics;
assigning the word numbers to the words of the lexicon such that each of the word number sets identified by the word look-up table consists of consecutive numbers;
creating and storing a character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character and wherein the character look-up table is a two-dimensional boolean array with one dimension corresponding to character values and the other dimension corresponding to word numbers;
selecting from the word look-up table a target set of word numbers whose associated words have a set of characteristics corresponding to the search string;
refining the target set, the refining comprising selecting a set of characters from the search string, accessing the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and in response to the character identification excluding from the target set those word numbers whose associated words do not contain a predetermined number of the selected characters;
comparing each of the words associated with the refined target set directly with the search string and excluding from the target set any word number whose associated word fails to match the search string according to the predetermined set of matching criteria;
each element of the array consists of a single bit, and the elements of the array corresponding to any character value of the one dimension are stored side-by-side in a row;
the selecting of the target set comprises composing a set of consecutive word numbers from one or more word sets identified by the word look-up table;
the accessing of the character look-up table comprises generating a boolean match value for each of the word numbers in the target set which indicates whether all of the selected characters are contained in the word associated with the word number; and
,the generating of the boolean values comprising performing a logical AND operation that combines sections of each of the rows of the character look-up table corresponding to the selected characters and to the word numbers of the target set thereby to produce a resulting bit string in which each bit is associated with one of the word numbers of the target set and contains the boolean value for the associated word number, the performing of the logical AND operation comprising simultaneously combining n-bit segments of the row sections;
whereby, the boolean match values are generated simultaneously for n words associated with the target set. - View Dependent Claims (2)
checking the n bits of the segment simultaneously to determine whether the segment defines a number 0; and
,excluding from the target set the n word numbers associated with the bits of the segment if the segment defines a numeric 0 and otherwise checking the state of each of the bits and excluding any word number associated with a 0 bit.
-
-
3. A method of matching a search string according to a predetermined set of matching criteria to a set of words contained in a collection of words, comprising:
-
creating and storing a lexicon containing the collection of words and associating each of the words with a unique identifying number;
creating and storing a word look-up table identifying sets of word numbers, each of the word number sets identifying words of the lexicon that have a common prefix value and a common word length;
assigning the word numbers consecutively to the words of the lexicon ordered primarily according to prefix value and secondarily according to word length whereby each of the word sets identified by the word look-up table consists of consecutive word numbers;
creating and storing a character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character;
accessing the word-look up table to compose a target set of word numbers whose associated words have the same prefix value as the search string and have word lengths corresponding to the length of the search string according to the predetermined set of matching criteria;
identifying the prefix value and the length of the search string;
selecting a set of consecutive word lengths corresponding to the length of the search string;
accessing the word look-up table to identify a set of consecutive word numbers that identify all words in the lexicon having the same prefix value as the search string and having a word length in the selected set of word lengths;
refining the target set, the refining comprising selecting a set of characters from the search string, accessing the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and in response to the character identification excluding from the target set those word numbers whose associated words do not contain a predetermined number of the selected characters; and
,comparing each of the words associated with the refined target set directly with the search string and excluding from the target set any word number whose associated word fails to match the search string according to the predetermined set of matching criteria;
the character look-up table is a two-dimensional boolean array with one dimension corresponding to character values and the other dimension corresponding to the word numbers assigned to the words of the lexicon;
each element of the array consists of a single bit, and the elements corresponding to any character value of the one dimension are stored side-by-side in a row;
the accessing of the character look-up table comprises generating a boolean match value for each of the word numbers in the target set that indicates whether all of the selected characters are contained in the word associated with the word number; and
,the generating of the boolean values comprises performing a logical AND operation that combines sections of each of the rows of the character look-up table corresponding to the selected characters and to the word numbers of the target set thereby to produce a resulting bit string that contains the boolean values, the performing of the logical AND operation comprising simultaneously combining n-bit segments of each of the row sections;
whereby, the boolean match values are generated simultaneously for n words associated with the target set. - View Dependent Claims (4, 5)
checking n-bit sets of the resulting bit string simultaneously to determine whether each of the n-bit sets defines a numeric 0;
excluding from the target set the n word numbers associated with any of the n-bit sets that defines a numeric 0; and
,checking the state of each of the bits in any of the n-bit sets that does not define a numeric 0 and excluding any word number associated with a 0-bit of the n-bit set.
-
-
5. The method of claim 3 in which the common prefix value of the words associated with each of the word number sets identified by the work look-up table consists solely of the value of the first character of each of the words.
-
6. A computer-operated document retrieval system adapted to retrieve documents in response to a search string that specifies a set of words to be found in a document of interest and in response to instructions specifying a predetermined search type, the system comprising:
-
digital storage means storing;
a collection of documents;
a lexicon comprising words contained in the collection of documents and associating each of the words with a unique number;
a word look-up table identifying sets of word numbers associated with words of the lexicon that have a common set of predetermined characteristics;
each of the word number sets identified by the word look-up table, the set of characteristics of the associated words comprises a common prefix value and a common word length;
a character look-up table identifying for a specified word number and a specified character whether the word in the lexicon associated with the specified word number contains the specified character; and
which the character look-up table is a two-dimensional boolean array with one dimension corresponding to character values and the other dimension corresponding to the word numbers assigned to the words of the lexicon;
a document look-up table relating the word numbers of the lexicon to ones of the stored documents containing the words;
set generating means for accessing the word look-up table to compose a target set of word numbers in response to the specified search type and a set of characteristics of the search string;
set refining means for refining the target set, the set refining means programmed to select a character set consisting of characters in the search string, to access the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and to exclude from the target set any word number whose associated word contains less than a predetermined number of the selected characters and wherein the word numbers are assigned consecutively to the words of the lexicon primarily according to the word prefix value and secondarily according to word length such that each of the word number sets identified by the word look-up table consists of consecutive word numbers; and
,the set generating means are programmed to compose in response to a specified prefix value and a specified range of consecutive word lengths, a set of consecutive word numbers whose associated words correspond to the specified prefix value and whose lengths lie in the specified range of word lengths;
comparison means for comparing the search string with each of the words associated with the target set and excluding from the target set any word number whose associated word does not match the search string according to a set of matching criteria determined by the specified search type; and
,retrieval means for accessing the document look-up table to identify and retrieve documents related to the word numbers of the target set; and
each of the elements of the array is a single bit, and the elements of the array corresponding to any character value of the one dimension are stored side-by-side in a row; and
,the set refining means are programmed to generate a boolean value for each of the word numbers in the target set that indicates whether all of the selected characters are contained in the word associated with the word number, the set refining means generating the boolean values by performing a logical AND operation that combines sections of each of the rows of the character look-up table corresponding to the selected characters and to the word numbers of the target set thereby to produce a resulting bit string in which each bit is associated with one of the word numbers of the target set and contains the boolean value for the associated word number, the set refining means performing the logical AND operation by simultaneously combining n-bit segments of the row sections whereby the boolean values are simultaneously generated for n word numbers of the target set. - View Dependent Claims (7)
to check the n bits simultaneously to determine whether the segment defines a number 0; and
,to exclude the n word numbers associated with the bits of the segment from the target set if the segment defines a number 0 and otherwise to check the state of each of the bits of the segment and exclude from the target set any word number associated with a 0 bit.
-
-
8. A product for enabling a digital processor coupled to a digital storage medium to match a search string according to a predetermined set of matching criteria to a set of words contained in a collection of words, the product comprising a processor-readable medium containing program code for operating the processor, the program code defining means for:
-
creating and storing in the digital storage medium a lexicon containing the collection of words and associating each of the stored words with a unique identifying number;
creating and storing in the digital storage medium a word look-up table identifying word number sets consisting of word numbers associated with words of the lexicon that have a common set of characteristics;
for each of the word number sets identified by the word look-up table, the set of characteristics of the associated words comprises a common prefix value and a common word length;
assigning the word numbers consecutively to the words of the lexicon primarily according to prefix value and secondarily according to word length whereby each of the word sets identified by the word look-up table consists of consecutive word numbers;
creating and storing in the digital storage medium a character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character;
selecting a target set of word numbers from the word look-up table whose associated words have a set of characteristics corresponding to the search string;
identifying the prefix value and the length of the search string;
selecting a set of consecutive word lengths corresponding to the length of the search string; and
,accessing the word look-up table to identify a set of consecutive word numbers that identify all words in the lexicon having the same prefix value as the search string and having a word length in the selected set of word lengths;
refining the target set, the refining comprising selecting a set of characters from the search string, accessing the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and in response to the character identification excluding from the target set those word numbers whose associated words do not contain a predetermined number of the selected characters; and
,comparing each of the words associated with the refined target set directly with the search string and excluding from the target set any word number whose associated word fails to match the search string according to the predetermined set of matching criteria;
the character look-up table is a boolean array in which each element of the array consists of a single bit and in which the elements corresponding to any character value of the one dimension are stored side-by-side in a row;
the accessing of the character look-up table comprises generating a boolean match value for each of the word numbers in the target set that indicates whether all of the selected characters are contained in the word associated with the word number; and
,the generating of the boolean values comprises performing a logical AND operation that combines sections of each of the rows of the character look-up table corresponding to the selected characters and to the word numbers of the target set thereby to produce a resulting bit string in which each bit is associated with one of the word numbers of the target set and contains the boolean value for the associated word number, the performing of the logical AND operation comprising simultaneously combining n-bit segments of each of the row sections;
whereby, the boolean match values are generated simultaneously for n words associated with the target set. - View Dependent Claims (9, 10, 11)
checking the n bits of the segment simultaneously to determine whether the segment defines a numeric 0; and
,excluding from the target set the n word numbers associated with the bits of the segment if the segment defines a numeric 0 and otherwise checking the state of each of the bits and excluding any word number associated with a 0 bit.
-
-
10. The product of claim 9 in which the prefix value for each of the words of the lexicon consists solely of the value of the first character of the word.
-
11. The product of claim 8 in which the word-look up table comprises, for each of the word number sets identified by the word look-up table:
-
an entry identifying the prefix value and length of the word set; and
,a pair of entries identifying a lower word number and an upper word number defining the word number set.
-
-
12. A product for enabling a digital processor to retrieve among a collection of documents a set of documents containing a set of words that match a search string according to a predetermined set of search criteria, the collection of documents stored together with a lexicon, a word look-up table, a character look-up table and a document look-up table in a digital storage device coupled to the processor, the lexicon containing a collection of words found in the documents and associating each of the collected words with a unique identifying number, the word look-up table identifying sets of word numbers associated with words of the lexicon that have a common set of characteristics, the character look-up table identifying for a specified word number and a specified character whether the word associated with the specified word number contains the specified character, and the document look-up table relating each of the word numbers to those documents of the collection of documents that contain the word associated with the word number, the product comprising a processor-readable medium containing program code that defines;
-
set generating means responsive to the search string and to the search criteria for accessing the word look-up table to compose a target set of word numbers whose associated words have a set of characteristics corresponding to the search string;
the set generating means are programmed to compose, in response to a specified prefix value and a specified set of consecutive word lengths, a set of consecutive word numbers whose associated words correspond to the specified prefix value and whose lengths are contained in the specified set of word lengths;
set refining means for refining the target set, the set refining means programmed to select a character set consisting of characters in the search string, to access the character look-up table to identify which of the selected characters are contained in each of the words associated with the target set, and to exclude from the target set any word number whose associated word contains less than a predetermined number of the selected characters;
the set refining means are programmed to generate a boolean value for each of the word numbers in the target set that indicates whether all of the selected characters are contained in the word associated with the word number, the set refining means generating the boolean values by performing a logical AND operation that combines sections of each of the rows of the character look-up table corresponding to the selected characters and to the word numbers of the target set thereby to produce a resulting bit string in which each bit is associated with one of the word numbers of the target set and contains the boolean value for the associated word number, the set refining means performing the logical AND operation by simultaneously combining n-bit segments of the row sections whereby the boolean values are simultaneously generated for n word numbers of the target set;
comparison means for comparing the search string with each of the words associated with the target set and excluding from the target set any word number whose associated word does not match the search string according to a set of matching criteria determined by the set of search criteria; and
,retrieval means for accessing the document look-up table to identify and retrieve documents related to the word numbers of the target set. - View Dependent Claims (13)
to check the n bits simultaneously to determine whether the segment defines a numeric 0; and
,to exclude the n word numbers associated with the bits of the segment from the target set if the segment defines a numeric 0 and otherwise to check the state of each of the bits of the segment and exclude from the target set any word number associated with a 0 bit.
-
Specification