System and method for flexible indexing of document content
DCFirst Claim
1. A method for flexible indexing of document content comprising:
- obtaining a collection of documents to be indexed;
storing said collection of documents in a single document information stream;
parsing each one of said documents into constituent words to facilitate indexing;
creating a plurality of stem words to be indexed by stemming each word into a standard prefix;
creating a first record providing an entry point into an index structure, said first record having a plurality of entry blocks, each one of said entry blocks being uniquely associated with a character out of a character set, said collection of documents being formed by characters drawn from said character set;
creating a plurality of additional primary and secondary records providing character-by-character pathways for locating an occurrence of a stem word in said document information stream;
creating a translation vector mapping stem words to document locations; and
creating a plurality of streams for providing locations of occurrences of said stem word in said document information stream.
3 Assignments
Litigations
0 Petitions
Accused Products
Abstract
A system and method for flexible indexing of document content for facilitating the rapid search and retrieval of large collections of documents. The method for flexible indexing of document content includes obtaining a collection of documents to be indexed, storing the collection of documents in a single document information stream, parsing each one of the documents into constituent words to facilitate indexing, creating a plurality of stem words to be indexed by stemming each word into a standard prefix; and indexing each stem word. Stem words may be indexed character-by-character based on character frequency, into either primary or secondary fixed-size records or secondary overflow records. Each stem may be terminated by a translation vector record number. Document locations may be accessible through the translation vector record number. Translation vector entries specify each stem word'"'"'s location by stream identification number and record number. Locations may be grouped by count in streams of fixed-size records.
20 Citations
26 Claims
-
1. A method for flexible indexing of document content comprising:
-
obtaining a collection of documents to be indexed;
storing said collection of documents in a single document information stream;
parsing each one of said documents into constituent words to facilitate indexing;
creating a plurality of stem words to be indexed by stemming each word into a standard prefix;
creating a first record providing an entry point into an index structure, said first record having a plurality of entry blocks, each one of said entry blocks being uniquely associated with a character out of a character set, said collection of documents being formed by characters drawn from said character set;
creating a plurality of additional primary and secondary records providing character-by-character pathways for locating an occurrence of a stem word in said document information stream;
creating a translation vector mapping stem words to document locations; and
creating a plurality of streams for providing locations of occurrences of said stem word in said document information stream. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
creating a plurality of lookup portions, each one of said lookup portions representing a record location, each record location being associated with a first one of said additional records, said first one of said additional records being associated with stem words beginning with said associated character;
creating a plurality of translation vector portions, each one of said translation vector portions specifying a translation entry and defining a location in one of said plurality of streams, each one of said streams providing a cross-reference to occurrences in said document stream of stem words which terminate after said associated character;
pairing each one of said plurality of lookup portions with an associated one of said plurality of translation vector portions to form a plurality of said entry blocks.
-
-
3. The method of claim 2, further comprises using 32-bit representations for each one of said plurality of lookup portions and each one of said plurality of translation vector portions such that each one of said entry blocks comprises 64 bits.
-
4. The method of claim 1, wherein said step of creating a first record further comprises associating each one of said entry blocks with a unique one of said characters in a frequency of usage order such that a first one of said entry blocks is associated with a character most frequently used in said collection of documents and a last one of said entry blocks is associated with a character least frequently used in said collection of documents.
-
5. The method of claim 1, wherein said step of creating a first record further comprises creating 64 entry blocks, each one of said entry blocks being uniquely associated with one of said characters.
-
6. The method of claim 1, wherein said step of creating a plurality of primary additional records further comprises:
-
creating a plurality of presence vectors;
creating a plurality of lookup segments;
pairing each one of said plurality of presence vectors with an associated one of said plurality of lookup segments, each pairing of presence vectors and lookup segments forming one of said plurality of additional records.
-
-
7. The method of claim 6, wherein said step of creating presence vectors further comprises using 64 bits to represent each one of said plurality of presence vectors, each one of said 64 bits being uniquely associated with one of said characters, each one of said 64 bits indicating a continuation of a stem word with said character.
-
8. The method of claim 6, wherein said step of creating a plurality of lookup segments further comprises:
-
creating a plurality of lookup portions, each one of said lookup portions representing a record location for one either of said additional primary and secondary records, either of said additional primary and secondary records being associated with stem words which continue with an associated character;
creating a plurality of translation vector portions, each one of said translation vector portions specifying a translation entry in said translation vector and defining a location in one of said plurality of streams, each of said streams providing a cross-reference to occurrences in said document information stream of stem words which terminate after said associated character;
pairing each one of said plurality of lookup portions with an associated one of said plurality of translation vector portions to form a plurality of lookup segments.
-
-
9. The method of claim 8, further comprises using 32-bit representations for each one of said plurality of lookup portions and each one of said plurality of translation vector portions such that each one of said lookup segments comprises 64 bits.
-
10. The method of claim 9, further comprises:
-
using 4 bits in said translation vector entry to represent a stream ID, said stream ID indicating which one of said plurality of streams contains location information for occurrences of an associated stem word in said document information stream;
using 28 bits in said translation vector entry to represent a stream record number providing a location in said one of said plurality of streams where location information for occurrences of an associated stem word in said document information stream is placed.
-
-
11. The method of claim 1, wherein said step of creating a plurality of streams further comprises:
-
creating a plurality of document ordinals, each document ordinal providing a starting location in said document information stream for a document containing said stem word;
creating a plurality of document locations, each one of said plurality of document locations providing an offset from said starting location, said offset providing a starting point for said stem word in said document;
pairing each one of said plurality of document ordinals with one of said document locations.
-
-
12. The method of claim 11 further comprising using 32 bits to represent each document ordinal and each document location.
-
13. The method of claim 12, further comprising establishing a first pairing of said document ordinal and document location to indicate a quantity of pairs of document ordinals and document locations stored in said stream.
-
14. The method of claim 11, wherein said step of creating a plurality of streams further comprises forming a plurality of stream each having a different length such that a first one of said streams has a relatively fewer pairs of document ordinals and document locations than a second one of said streams and said second one of said streams as relatively fewer pairs of document ordinals and document locations than a last one of said streams.
-
15. The method of claim 14, wherein said step of creating a plurality of streams further comprises creating 16 streams.
-
16. The method of claim 1, wherein said step of creating a plurality of additional secondary records further comprises creating a plurality of lookup segments.
-
17. The method of claim 1, wherein said step of creating a plurality of stem words further comprises:
-
placing each word in a queue for stemming;
obtaining a stop list of terms which are to be excluded from indexing;
comparing each word from said collection of documents with said stop list;
removing each word matching a term on said stop list from said queue.
-
-
18. A system for flexible indexing of document content comprising:
-
a computer system having a storage means for facilitating the retention and recall of a collection of documents to be indexed;
an indexing module for developing character-by-character index of words;
a plurality of records providing character-by-character addressing for said indexing module;
a plurality of streams for recording locations where each indexed word occurs in said collection of documents;
wherein said plurality of records further comprises;
a first record providing an entry point into said index, said first record comprising a plurality of entries, each one of said entries being uniquely associated with a character out of a character set, each one of said collection of documents is formed by characters drawn from said character set;
a plurality of additional primary and secondary records, each one of said additional records provides a character-by-character pathway for locating an occurrence of a stem word in said collection of documents. - View Dependent Claims (19, 20, 21, 22, 23, 24, 25, 26)
a lookup portion representing a record location for a first one of said additional records, said first one of said additional records providing a pathway to a second character in said character-by-character pathway for location occurrences of indexed words in said collection of documents;
a translation vector portion representing a translation vector record number and defining a location in one of said plurality of streams, said location providing a cross-reference to a location of an occurrence of said word in said collection of documents.
-
-
20. The system of claim 19, wherein said entry record further comprises:
-
said lookup portion being implemented as a 32-bit representation of said record location;
said translation vector portion being implemented as a 32-bit representation of said location of said cross-reference in one of said streams.
-
-
21. The system of claim 18, wherein each one of said plurality of additional primary records further comprises:
-
a presence vector having 64 bits, each bit being uniquely associated with a character from said character set, each bit providing an indication of a word continuing with said associated character;
a plurality of lookup segments, each one of said lookup segments providing information for locating occurrences of said word in said collection of documents.
-
-
22. The system of claim 21, wherein said each one of said plurality of lookup segments further comprises:
-
a lookup portion representing a record location for one of said additional records, said additional records providing a pathway to a next character in said character-by-character pathway for location occurrences of indexed words in said collection of documents;
a translation vector portion representing a translation vector record number and defining a location in one of said plurality of streams, said location providing a cross-reference to a location of an occurrence of said word in said collection of documents.
-
-
23. The system of claim 22, wherein said plurality of lookup segments comprises eight pairs of lookup portions and translation vector portions, each of said lookup portions comprising 32 bits,
each of said translation vector segments representing a translation vector record number, each translation vector entry comprising a 4 bit stream ID and a 28 bit record number, said stream ID identifying which one of said plurality of streams records cross-reference information for locating occurrences of said word in said collection of documents, said record number providing a location within said stream of said cross-reference information. -
24. The system of claim 23, wherein said plurality of lookup segments comprises sixty-four pairs of lookup portions and translation vector portions, each of said lookup portions comprising 32 bits, each of said translation vector segments comprising 32 bits and representing a translation vector record number, each translation vector entry comprising a 4 bit stream ID and a 28-bit record number, said stream ID identifying which one of said plurality of streams records cross-reference information for locating occurrences of said word in said collection of documents, said record number providing a location within said stream of said cross-reference information.
-
25. The system of claim 18 additionally comprising a translation vector mapping stem words to locations.
-
26. The system of claim 18, wherein each one of said plurality of additional secondary records further comprises a plurality of lookup segments, each one of said lookup segments providing information for locating occurrences of said word in said collection of documents.
Specification