Indexing databases for efficient relational querying
First Claim
1. An indexing system for structured or semi-structured source data, the source data being capable of being represented by a relational data view, the source data comprising data subsets which in the relational view correspond to attributes in one or more tables, each table comprising columns and rows, the indexing system comprisinga tokenizer for accepting the source data and generating data tokens in a token stream representing the source data, the tokenizer comprising means for generating identifier tokens identifying the table and column of the relational view for the data subsets of the source data, the identifier tokens being inserted in the token stream to precede the data tokens for the data subsets to which the identifier tokens correspond, and an index builder for building indexes based on the token stream, the index builder creating token stream indexes which comprise a set of positional indexes for indicating the position of data tokens in the source data, a set of lexicographical indexes for indicating the lexicographical ordering of all tokens, the set of lexicographical indexes comprising a sort vector index and an associated join bit index, and a set of data structures mapping between the lexicographical indexes and the positional indexes, comprising a lexicographic permutation data structure.
9 Assignments
0 Petitions
Accused Products
Abstract
Provided is an indexing system for structured or semi-structured source data comprising a tokenizer for accepting source data and generating tokens representing the source data, the tokens from the tokenization representing the source data in a relational view, where for tokens representing a subset of the source data, the system generates tokens identifying the table and column of the subset of the data in the relational view of the source data, and an index builder for building index structures based on the tokens generated by the tokenizer, the index builder creating indexes which comprise a set of positional indexes for indicating the position of token data in the source data, a set of lexicographical indexes comprising a sort vector index and a join bit index, associated with the sort vector index, a set of data structures mapping between the lexicographical indexes and the positional indexes.
108 Citations
20 Claims
-
1. An indexing system for structured or semi-structured source data, the source data being capable of being represented by a relational data view, the source data comprising data subsets which in the relational view correspond to attributes in one or more tables, each table comprising columns and rows, the indexing system comprising
a tokenizer for accepting the source data and generating data tokens in a token stream representing the source data, the tokenizer comprising means for generating identifier tokens identifying the table and column of the relational view for the data subsets of the source data, the identifier tokens being inserted in the token stream to precede the data tokens for the data subsets to which the identifier tokens correspond, and an index builder for building indexes based on the token stream, the index builder creating token stream indexes which comprise a set of positional indexes for indicating the position of data tokens in the source data, a set of lexicographical indexes for indicating the lexicographical ordering of all tokens, the set of lexicographical indexes comprising a sort vector index and an associated join bit index, and a set of data structures mapping between the lexicographical indexes and the positional indexes, comprising a lexicographic permutation data structure.
-
5. A method for indexing structured or semi-structured source data, the source data being capable of being represented by a relational data view, the source data comprising data subsets which in the relational view correspond to attributes in one or more tables, each table comprising columns and rows, the method of indexing comprising
accepting the source data and generating data tokens in a token stream representing the source data, generating identifier tokens identifying the table and column of the relational view for the data subsets of the source data, the identifier tokens being inserted in the token stream to precede the data tokens for the data subsets to which the identifier tokens correspond, and creating token stream indexes which comprise a set of positional indexes for indicating the position of data tokens in the source data, a set of lexicographical indexes for indicating the lexicographical ordering of all tokens, the set of lexicographical indexes comprising a sort vector index and an associated join bit index, and a set of data structures mapping between the lexicographical indexes and the positional indexes, comprising a lexicographic permutation data structure.
-
11. An indexing system for structured or semi-structured source data, the source data being capable of being represented by a relational data view, the source data comprising data subsets which in the relational view correspond to attributes in one or more tables, each table comprising columns and rows, the indexing system comprising
a tokenizer for accepting the source data and generating data tokens in a token stream representing the source data, the tokenizer comprising means for generating identifier tokens identifying the table and column of the relational view for the data subsets of the source data, the identifier tokens being inserted in the token stream to be contiguous with the data tokens for the data subsets to which the identifier tokens correspond, and an index builder for building indexes based on the token stream, the index builder creating token stream indexes which comprise a set of positional indexes for indicating the position of data tokens in the source data, a set of lexicographical indexes for indicating the lexicographical ordering of all tokens in the token stream, and a set of data structures mapping between the lexicographical indexes and the positional indexes.
-
15. A computer-implemented method for the indexing of structured or semi-structured source data, the source data being capable of being represented by a relational data view, the source data comprising data subsets which in the relational view correspond to attributes in one or more tables, each table comprising columns and rows, the method comprising
accepting the source data and generating data tokens in a token stream representing the source data, generating identifier tokens identifying the table and column of the relational view for the data subsets of the source data, inserting the identifier tokens in the token stream to be contiguous with the data tokens for the data subsets to which the identifier tokens correspond, and building indexes based on the token stream by creating token stream indexes which comprise a set of positional indexes for indicating the position of data tokens in the source data, a set of lexicographical indexes for indicating the lexicographical ordering of all tokens in the token stream, and a set of data structures mapping between the lexicographical indexes and the positional indexes.
-
19. A computer-implemented method for the indexing of structured or semi-structured source data, the source data being capable of being represented by a relational data view, the source data comprising data subsets which in the relational view correspond to attributes in one or more tables, each table comprising columns and rows, the method comprising
accepting the source data and generating data tokens in a token stream representing the source data, generating identifier tokens identifying the table and column of the relational view for the data subsets of the source data, inserting the identifier tokens in the token stream to be contiguous with the data tokens for the data subsets to which the identifier tokens correspond, and building indexes based on the token stream by creating token stream indexes which comprise a set of positional indexes for indicating the position of data tokens in the source data, the set of positional indexes comprising an inverted file for mapping unique tokens in the token stream to their position in the source data, a word list strings file comprising a sorted list of all unique tokens in the token stream, a word list file for mapping each token from the word list strings file to the inverted file, each entry in the word list file comprising a location in the inverted file and a run length, and a keys file a set of lexicographical indexes for indicating the lexicographical ordering of all tokens in the token stream, comprising a sort vector index and a join bit index, and a set of data structures mapping between the lexicographical indexes and the positional indexes comprising a lexicographic permutation data structure, the step of creating token stream indexes further comprising: -
making a pass over the token stream to produce a temporary sort vector index, sorting the temporary sort vector file on a run by run basis to produce an inverse lexicographic permutation index, making a pass over the temporary sort vector index to generate the sort vector index on a run by run basis by rearranging the entries within each run in the temporary sort vector index according to the permutation, and taking the sort vector index as input and following chains of pointers within the sort vector index to resolve equality of attribute entries to generate the join bit index.
-
Specification