Document compression system and method for use with tokenspace repository
First Claim
1. A document compression method, comprising:
- identifying a set of unique tokens contained in a set of documents, the set of documents comprising a sequence of tokens;
assigning a unique first token identifier from a set of first token identifiers to each unique token based at least in part on the frequency of occurrence of the unique token in the set of documents, wherein high-frequency tokens are assigned smaller valued first token identifiers than low-frequency tokens;
assigning a second token identifier from a set of second token identifiers to each token within a selected range of token positions in the set of documents, wherein each second token identifier corresponds to a first token identifier; and
storing the second token identifiers in a repository for subsequent retrieval.
2 Assignments
0 Petitions
Accused Products
Abstract
The disclosed embodiments enable multi-stage query scoring, including “snippet” generation, through incremental document reconstruction facilitated by a multi-tiered mapping scheme. The mapping scheme includes a first mapping between unique tokens contained in a set of documents and unique global token identifiers (e.g., 32-bit integers) contained in a global-lexicon (i.e., dictionary). The mapping scheme also includes a second mapping between the global token identifiers and a set of fixed-length local token identifiers (e.g., 8-bit integers) contained in one or more mini-lexicons (i.e., sub-dictionaries). Each mini-lexicon is associated with a range of token positions in the tokenized documents. The first and second mappings are used to encode/decode documents into local token identifiers having fixed widths which can be compactly stored in the tokenspace repository. The use of fixed-length local token identifiers allows for fast and efficient decoding of tokenized documents.
-
Citations
62 Claims
-
1. A document compression method, comprising:
-
identifying a set of unique tokens contained in a set of documents, the set of documents comprising a sequence of tokens;
assigning a unique first token identifier from a set of first token identifiers to each unique token based at least in part on the frequency of occurrence of the unique token in the set of documents, wherein high-frequency tokens are assigned smaller valued first token identifiers than low-frequency tokens;
assigning a second token identifier from a set of second token identifiers to each token within a selected range of token positions in the set of documents, wherein each second token identifier corresponds to a first token identifier; and
storing the second token identifiers in a repository for subsequent retrieval. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
-
-
18. A document decompression method, comprising:
-
identifying a range of positions in a set of documents, each position in the range of positions corresponding to a respective token in the set of documents;
recovering the token at each respective position in the range of positions, including;
obtaining a first token identifier from a location within a repository corresponding to the respective position;
mapping the first token identifier to a second token identifier; and
mapping the second token identifier to a corresponding token in the set of documents; and
reconstructing at least a portion of a document in the set of documents using the tokens from the mappings of the second token identifiers to corresponding tokens, and from the positions of the corresponding first token identifiers;
wherein the mapping of each first token identifier is in accordance with a respective first lexicon for a portion of the repository that includes the first token identifier, and the mapping of each second token identifier is in accordance with a second lexicon that maps second token identifiers to unique tokens in the set of documents. - View Dependent Claims (19, 20, 21)
-
-
22. A document compression system, comprising:
-
a first lexicon generator configured for receiving a set of documents, the set of documents comprising a sequence of tokens, and for assigning a unique first token identifier from a set of first token identifiers to each unique token in the set of documents based at least in part on the frequency of occurrence of the unique token in the set of documents, wherein high-frequency tokens are assigned smaller valued first token identifiers than low-frequency tokens;
a second lexicon generator coupled to the first lexicon generator and configured for assigning a second token identifier from a set of second token identifiers to each unique token within a portion of the set of documents; and
a repository configured for storing a sequence of the second token identifiers representing the tokens in the portion of the set of documents for subsequent retrieval. - View Dependent Claims (23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33)
-
-
34. A document decompression system, comprising:
-
a query processor configured for identifying a range of positions in a set of documents, each position in the range of positions corresponding to a respective token in the set of documents that matches a query term;
a first mapping module coupled to the query processor and configured for obtaining a first token identifier from a repository for each position in the range of positions; and
a second mapping module coupled to the first mapping module and configured for mapping the second token identifier to a corresponding token, and for reconstructing the portion of a document in the set of documents using the tokens from the mappings of the second token identifiers and the positions of the corresponding first token identifiers, wherein the mapping of each first token identifier is in accordance with a respective first lexicon for a portion of the set of documents, and the mapping of each second token identifier is in accordance with a second lexicon that maps second token identifiers to unique tokens in the set of documents. - View Dependent Claims (35, 36, 37)
-
-
38. A computer-readable medium having stored thereon instructions, which, when executed by a processor in a document compression system, causes the processor to perform the operations of:
-
identifying a set of unique tokens contained in a set of documents;
assigning a unique first token identifier from a set of first token identifiers to each unique token based at least in part on the frequency of occurrence of the unique token in the set of documents, wherein high-frequency tokens are assigned smaller valued first token identifiers than low-frequency tokens;
assigning a second token identifier from a set of second token identifiers to each unique token having a position within a selected range of token positions in the documents, wherein each second token identifier corresponds to a first token identifier; and
storing the second token identifiers in a repository for subsequent retrieval. - View Dependent Claims (39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54)
-
-
55. A computer-readable medium having stored thereon instructions, which, when executed by a processor in a document decompression system, causes the processor to perform the operations of:
-
identifying a range of positions in a set of documents, each position in the range of positions corresponding to a respective token in the set of documents;
recovering the token at each respective position in the range of positions, including;
obtaining a first token identifier from a location within a repository corresponding to the respective position;
mapping the first token identifier to a second token identifier;
mapping the second token identifier to a corresponding token in the set of documents; and
reconstructing at least a portion of a document in the set of documents using the tokens from the mappings of the second token identifiers and the positions of the corresponding first token identifiers, wherein the mapping of each first token identifier is in accordance with a respective first lexicon for a portion of the set of documents, and the mapping of each second token identifier is in accordance with a second lexicon that maps second token identifiers to unique tokens in the set of documents. - View Dependent Claims (56, 57, 58)
-
-
59. A document compression system, comprising:
-
means for identifying a set of unique tokens contained in a set of documents;
means for assigning a unique first token identifier from a set of first token identifiers to each unique token based at least in part on the frequency of occurrence of the unique token in the set of documents, wherein high-frequency tokens are assigned smaller valued first token identifiers than low-frequency tokens;
means for assigning a second token identifier from a set of second token identifiers to each unique token having a position within a selected range of token positions in the documents, wherein each second token identifier corresponds to a first token identifier; and
means for storing the second token identifiers in a repository for subsequent retrieval.
-
-
60. A document decompression system, comprising:
-
means for identifying a range of positions in a set of documents, each position in the range of positions corresponding to a respective token in the set of documents;
means for recovering the token at each respective position in the range of positions, including means for obtaining a first token identifier from a location within a repository corresponding to the respective position;
means for mapping the first token identifier to a second token identifier; and
means for mapping the second token identifier to a corresponding token in the set of documents; and
means for reconstructing at least a portion of a document in the set of documents using the tokens from the mappings of the second token identifiers and the positions of the corresponding first token identifiers, wherein the mapping of each first token identifier is in accordance with a respective first lexicon for a portion of the set of documents, and the mapping of each second token identifier is in accordance with a second lexicon that maps second token identifiers to unique tokens in the set of documents.
-
-
61. A document compression method, comprising:
-
identifying a set of unique tokens from a plurality of tokens contained in a set of documents;
generating a first mapping between the unique tokens and a first lexicon of variable-length token identifiers, wherein the unique tokens having a high-frequency of occurrence in the set of documents are mapped to smaller valued variable-length token identifiers than the unique tokens having a low-frequency of occurrence in the set of documents;
generating a second mapping between the variable-length token identifiers and one or more second lexicons of fixed-length token identifiers, wherein each second lexicon is valid for a range of token positions in the set of documents;
mapping each token in the set of documents to a respective fixed-length token identifier using the first and second mappings, wherein the respective fixed-length token identifier is selected from a second lexicon that is valid for the position of the token in the set of documents; and
storing the fixed-length token identifiers representing the tokens in a tokenspace repository for subsequent retrieval.
-
-
62. A document decompression method, comprising:
-
receiving a set of first token identifiers from a repository;
applying first mappings to the set of first token identifiers to provide a set of second token identifiers, wherein the first mappings are each valid for a distinct corresponding range of token positions in a portion of a set of documents;
applying a second mapping to the set of second token identifiers to recover a set of tokens, wherein the recovered tokens are associated with positions in the set of documents corresponding to positions of the set of first token identifiers in the repository; and
reconstructing one or more portions of the set of documents using the set of recovered tokens and the respective positions of the set of first token identifiers in the repository.
-
Specification