Inverted index storage structure using subindexes and large objects for tight coupling of information retrieval with database management systems
First Claim
1. An inverted index storage structure comprising:
- a plurality of large objects each for storing a plurality of posting lists, each of said large objects being indexed according to a keyword input when said posting lists therein correspond to said keyword input; and
a plurality of subindexes, each of said subindexes being one-to-one connected to each of said posting lists to index a document identifier thereto.
1 Assignment
0 Petitions
Accused Products
Abstract
This invention relates to an inverted index storage structure that indexes keyword inputs into the storage space for the corresponding posting lists. In particular, the invention relates to the index structure that enables fast retrieval of the posting of the specific document from the posting list and enables efficient arrangement and maintenance of the posting list in document identifier (docID) order, so that fast addition, deletion, modification, and retrieval of documents are possible in environments where a database management system is tightly coupled with information retrieval. The technical solution is to store the posting list in a large object and map to each posting list a subindex that indexes the docID into the postings containing the docID.
-
Citations
9 Claims
-
1. An inverted index storage structure comprising:
-
a plurality of large objects each for storing a plurality of posting lists, each of said large objects being indexed according to a keyword input when said posting lists therein correspond to said keyword input; and
a plurality of subindexes, each of said subindexes being one-to-one connected to each of said posting lists to index a document identifier thereto. - View Dependent Claims (2, 3, 4)
-
-
5. A method of inserting a new posting in an inverted index storage structure which includes a plurality of large objects each for storing a posting list, each of said posting lists having a posting that stores together an object identifier and a document identifier besides extracted position information, and a plurality of subindexes, each of said subindexes being one-to-one connected to each of said posting lists to indirectly index said document identifier thereto using an offset array, comprising the steps of:
-
(a) checking whether said offset array has space insufficient for storage of said new posting;
(b) enlarging the size of said offset array by double if it is checked at said step (a) that said offset array has space insufficient for storage of said new posting;
(c) allocating an empty element of the offset array for said new posting if it is checked at said step (a) that said offset array has space sufficient for storage of said new posting or after the size of said offset array is enlarged by double at said step (b), and checking whether said document identifier of said new posting is larger than the existing document identifiers;
(d) adding said new posting to the end of the corresponding posting list if it is checked at said step (c) that said document identifier of said new posting is larger than the existing document identifiers;
(e) traversing said subindex to determine a new byte offset, if it is checked at said step (c) that said document identifier of said new posting is not larger than the existing document identifiers, and inserting said new posting into the determined byte offset; and
(f) recording a byte offset of said new posting on said allocated element of said offset array after said new posting is added or inserted at said step (d) or (e), correcting byte offsets, recorded on the associated elements of said offset array, of postings with byte offsets changed and inserting said document identifier of said new posting and the array index of said allocated element of said offset array into the corresponding subindex.
-
-
6. A method of deleting a posting in an inverted index storage structure which includes a plurality of large objects each for storing a plurality of posting lists, each of said posting lists having a posting that stores together an object identifier and a document identifier besides extracted position information, and a plurality of subindexes, each of said subindexes being one-to-one connected to each of said posting lists to indirectly index said document identifier thereto using an offset array, comprising the steps of:
-
(a) traversing said subindex to find the storage position of the posting including the identifier of the deleted document and deleting said posting including said identifier of said deleted document from the corresponding posting list;
(b) removing the element of said offset array corresponding to said deleted posting and correcting byte offsets, recorded on the associated elements of said offset array, of postings with byte offsets changed; and
(c) deleting the subindex entry pointing to said offset array element corresponding to said deleted posting.
-
-
7. A method of retrieving a posting or a posting list in an inverted index storage structure which includes a plurality of large objects each for storing a plurality of posting lists, each of said posting lists having a posting that stores together an object identifier and a document identifier besides extracted position information, and a plurality of subindexes, each of said subindexes being one-to-one connected to each of said posting lists to indirectly index said document identifier thereto using an offset array, comprising the steps of:
-
(a) traversing the keyword index and finding the posting list corresponding to the given keyword;
(b) checking whether the identifier of a specific document is given as input;
(c) traversing the subindex connected to the posting list of the given keyword to find the postings of said specific document so as to check whether the given keyword is included in said specific document, if it is checked at said step (b) that said specific document identifier is given as input, and returning the found postings of said specific document; and
(d) sequentially scanning the found posting list stored in a large object to return postings of all documents from which the given keyword is extracted, if it is checked at said step (b) that no specific document identifier is given as input.
-
-
8. A database system structure that has an inverted index storage structure that includes a plurality of large objects each for storing a plurality of posting lists and a plurality of subindexes, each of said posting lists having a posting that stores an object identifier and a document identifier besides extracted position information, and each of said subindexes being one-to-one connected to each of said posting lists to indirectly index said document identifier thereto using an offset array, and that comprises a plurality of database objects each including attributes such as a document identifier, a document number, a creator, a creation date and text, each of said database objects being physically one-to-one matched with a posting by using said document identifier and an object identifier as pointers to reference each other.
-
9. A method of tight coupling information retrieval with a database management system which comprises an inverted index storage structure including a plurality of large objects each for storing a posting list, each of said posting lists having a posting to store an object identifier, a document identifier and extracted position information, and a plurality of subindexes, each of said subindexes being one-to-one connected to each of said posting lists to indirectly index said document identifier thereto using an offset array, and a database system structure including a plurality of database objects each including attributes such as a document identifier, a document number, a creator, a creation date and text, each of said database objects being physically matched with a posting by using said document identifier and an object identifier as pointers to reference each other, comprising:
-
a new posting insertion mode of enlarging the size of said offset array by double if said offset array has space insufficient for storage of a new posting, allocating an empty element of the offset array for said new posting if said offset array has space sufficient for storage of said new posting or after the size of said offset array is enlarged by double, adding said new posting to the end of the corresponding posting list if said document identifier of said new posting is larger than the existing document identifiers, traversing said subindex to determine a new byte offset, if said document identifier of said new posting is not larger than the existing document identifiers, inserting said new posting into the determined byte offset, recording a byte offset of said new posting on said allocated element of said offset array, correcting byte offsets, recorded on the associated elements of said offset array, of postings with byte offsets changed, and inserting said document identifier of said new posting and the array index of said allocated element of said offset array into the corresponding subindex;
a posting deletion mode of traversing said subindex to find the storage position of the posting including the identifier of the deleted document, deleting said posting including said identifier of said deleted document from the corresponding posting list, removing the element of said offset array, corresponding to said deleted posting, correcting byte offsets, recorded on the associated elements of said offset array, of postings with byte offsets changed, and deleting the subindex entry pointing to said offset array element corresponding to said deleted posting;
a posting or posting list retrieval mode of traversing the keyword index, finding the posting list corresponding to the given keyword, checking whether the identifier of a specific document is given as input, traversing the subindex connected to the posting list of the given keyword to find the postings of said specific document so as to check whether the given keyword is included in said specific document, if said specific document identifier is given as input, returning the found postings of said specific document, and sequentially scanning the found posting list stored in a large object to return postings of all documents from which the given keyword is extracted, if no specific document identifier is given as input; and
an integrated query processing method made possible by storing the object identifier together with the document identifier in each posting, where an integrated query is processed either by first processing the information retrieval part of the query and then processing the database part of the query by accessing the desired database object through the object identifier stored in the posting or by first processing the database part of the query and then processing the information retrieval part of the query by looking for the desired posting through the subindex in the posting list of the given keyword.
-
Specification