Document compression system and method for use with tokenspace repository

US 20070220023A1
Filed: 08/13/2004
Published: 09/20/2007
Est. Priority Date: 08/13/2004
Status: Active Grant

Abstract

The disclosed embodiments enable multi-stage query scoring, including “snippet” generation, through incremental document reconstruction facilitated by a multi-tiered mapping scheme. The mapping scheme includes a first mapping between unique tokens contained in a set of documents and unique global token identifiers (e.g., 32-bit integers) contained in a global-lexicon (i.e., dictionary). The mapping scheme also includes a second mapping between the global token identifiers and a set of fixed-length local token identifiers (e.g., 8-bit integers) contained in one or more mini-lexicons (i.e., sub-dictionaries). Each mini-lexicon is associated with a range of token positions in the tokenized documents. The first and second mappings are used to encode/decode documents into local token identifiers having fixed widths which can be compactly stored in the tokenspace repository. The use of fixed-length local token identifiers allows for fast and efficient decoding of tokenized documents.
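
The two-tier mapping described in the abstract can be sketched in a few lines of Python. This is an illustration only, not the patented implementation: the function names (`build_global_lexicon`, `encode_block`, `encode`), the fixed block size, and the in-memory layout are all assumptions. Global identifiers are assigned by descending frequency, and each range of token positions gets its own mini-lexicon so that every token in that range fits in one byte.

```python
from collections import Counter

def build_global_lexicon(tokens):
    """Assign global IDs by descending frequency (ties broken alphabetically)."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: gid for gid, tok in enumerate(ordered)}

def encode_block(block, global_lexicon):
    """Encode one range of token positions with its own mini-lexicon.

    Returns (mini_lexicon, local_ids): mini_lexicon[i] is the global ID that
    local ID i stands for; local_ids holds one byte per token position.
    """
    mini_lexicon = sorted({global_lexicon[t] for t in block})
    if len(mini_lexicon) > 256:
        raise ValueError("block has more than 256 unique tokens")
    local_of = {gid: lid for lid, gid in enumerate(mini_lexicon)}
    local_ids = bytes(local_of[global_lexicon[t]] for t in block)
    return mini_lexicon, local_ids

def encode(tokens, block_size=8):
    """Split the token sequence into position ranges and encode each one.

    A fixed block_size is an assumption made to keep the sketch short.
    """
    lexicon = build_global_lexicon(tokens)
    blocks = [encode_block(tokens[i:i + block_size], lexicon)
              for i in range(0, len(tokens), block_size)]
    return lexicon, blocks

tokens = "the cat sat on the mat and the dog sat on the log".split()
lexicon, blocks = encode(tokens, block_size=8)
for mini_lexicon, local_ids in blocks:
    print(mini_lexicon, list(local_ids))
```

In this toy corpus the most frequent token, "the", receives global identifier 0, and each stored position costs exactly one byte regardless of how large the global identifier is.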

62 Claims

1. A document compression method, comprising:
- identifying a set of unique tokens contained in a set of documents, the set of documents comprising a sequence of tokens;
  
  assigning a unique first token identifier from a set of first token identifiers to each unique token based at least in part on the frequency of occurrence of the unique token in the set of documents, wherein high-frequency tokens are assigned smaller valued first token identifiers than low-frequency tokens;
  
  assigning a second token identifier from a set of second token identifiers to each token within a selected range of token positions in the set of documents, wherein each second token identifier corresponds to a first token identifier; and
  
  storing the second token identifiers in a repository for subsequent retrieval.
  - 2. The method of claim 1, wherein storing the second token identifiers includes mapping the sequence of tokens within the selected range of token positions in the set of documents to a corresponding sequence of second token identifiers, and storing said corresponding sequence of second token identifiers.
  - 3. The method of claim 1, further comprising:
    - generating a mapping of the second token identifiers to corresponding first token identifiers for the selected range of token positions.
  - 4. The method of claim 1, wherein each first token identifier comprises an M bit integer value.
  - 5. The method of claim 4, wherein each second token identifier comprises an N bit integer value, N and M are positive integers and M is greater than N.
  - 6. The method of claim 5, wherein N is equal to 8 and M is equal to 32.
  - 7. The method of claim 1, further comprising encoding the mapping of the second token identifiers to corresponding first token identifiers in a compressed format.
  - 8. The method of claim 7, wherein encoding the mapping comprises:
    - grouping the first token identifiers of the mapping into first groups of N bits; and
      
      converting each first group of N bits into a second group of K bits, wherein K and N are positive integers, K is less than or equal to N, and K is determined for each second group from respective sizes of the first token identifiers in the first group.
  - 9. The method of claim 1, wherein encoding the mapping includes delta encoding the first token identifiers of the mapping.
  - 10. The method of claim 1, further comprising:
    - sorting the unique tokens before assigning the unique tokens to the first set of token identifiers.
  - 11. The method of claim 1, further comprising:
    - sorting the set of documents by one or more sorting criteria.
  - 12. The method of claim 11, wherein the set of documents are sorted by language.
  - 13. The method of claim 11, wherein the set of documents are sorted by domain name.
  - 14. The method of claim 13, wherein portions of the domain names are interchanged prior to sorting the document by domain name.
  - 15. The method of claim 1, further comprising:
    - determining one or more attributes in the set of documents; and
      
      storing the one or more attributes for subsequent retrieval.
  - 16. The method of claim 15, wherein storing the one or more attributes includes encoding the attributes in a compressed format.
  - 17. The method of claim 1, further comprising:
    - associating ranges of token positions with portions of the set of documents; and
      
      storing a mapping of positions to sets of second token identifiers, wherein each set of second token identifiers corresponds to a respective portion of the set of documents.
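
Claims 7 and 9 mention storing the mapping from second to first token identifiers in a compressed format, with delta encoding as one option. A minimal sketch of delta encoding follows, assuming the first token identifiers in a mapping are kept in ascending order; the function names and the sample values are hypothetical.

```python
def delta_encode(sorted_ids):
    """Replace each ID with its difference from the previous ID (first from 0)."""
    deltas = []
    prev = 0
    for gid in sorted_ids:
        deltas.append(gid - prev)
        prev = gid
    return deltas

def delta_decode(deltas):
    """Rebuild the original ascending IDs by taking a running sum."""
    ids = []
    total = 0
    for d in deltas:
        total += d
        ids.append(total)
    return ids

# Hypothetical mapping for one range: local ID i -> global ID mini_lexicon[i].
mini_lexicon = [3, 17, 18, 250, 100000, 100004]
deltas = delta_encode(mini_lexicon)
print(deltas)
```

Because frequent tokens receive small identifiers, most gaps between neighboring identifiers tend to be small, which is what makes the deltas cheaper to store than the raw values.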

18. A document decompression method, comprising:
- identifying a range of positions in a set of documents, each position in the range of positions corresponding to a respective token in the set of documents;
  
  recovering the token at each respective position in the range of positions, including:
  
  obtaining a first token identifier from a location within a repository corresponding to the respective position;
  
  mapping the first token identifier to a second token identifier; and
  
  mapping the second token identifier to a corresponding token in the set of documents; and
  
  reconstructing at least a portion of a document in the set of documents using the tokens from the mappings of the second token identifiers to corresponding tokens, and from the positions of the corresponding first token identifiers;
  
  wherein the mapping of each first token identifier is in accordance with a respective first lexicon for a portion of the repository that includes the first token identifier, and the mapping of each second token identifier is in accordance with a second lexicon that maps second token identifiers to unique tokens in the set of documents.
  - 19. The method of claim 18, wherein each second token identifier comprises an M bit integer value.
  - 20. The method of claim 19, wherein each first token identifier comprises an N bit integer value, N and M are positive integers and M is greater than N.
  - 21. The method of claim 20, wherein N is equal to 8 and M is equal to 32.
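
The decompression path of claim 18 can be illustrated as follows. Note that the naming is reversed relative to claim 1: here the "first" identifier is the 8-bit value read from the repository and the "second" identifier is the wider global one. The repository contents, the `block_starts` table, and the `decode_range` function are hypothetical stand-ins chosen for this sketch.

```python
import bisect

# Hypothetical repository contents for a 13-token corpus split into two ranges.
global_tokens = ["the", "on", "sat", "and", "cat", "dog", "log", "mat"]  # global ID -> token
block_starts = [0, 8]                       # first token position of each range
mini_lexicons = [[0, 1, 2, 3, 4, 7],        # local ID -> global ID, range 0
                 [0, 1, 2, 5, 6]]           # local ID -> global ID, range 1
local_ids = bytes([0, 4, 2, 1, 0, 5, 3, 0,  3, 2, 1, 0, 4])  # one byte per position

def decode_range(start, end):
    """Recover the tokens at positions start..end-1."""
    tokens = []
    for pos in range(start, end):
        block = bisect.bisect_right(block_starts, pos) - 1  # range containing pos
        global_id = mini_lexicons[block][local_ids[pos]]    # local ID -> global ID
        tokens.append(global_tokens[global_id])             # global ID -> token
    return tokens

snippet = " ".join(decode_range(6, 11))
print(snippet)
```

Because every stored identifier has the same width, the byte for a given token position can be read directly by index, so a snippet that spans two ranges is rebuilt without decoding anything outside the requested positions.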

22. A document compression system, comprising:
- a first lexicon generator configured for receiving a set of documents, the set of documents comprising a sequence of tokens, and for assigning a unique first token identifier from a set of first token identifiers to each unique token in the set of documents based at least in part on the frequency of occurrence of the unique token in the set of documents, wherein high-frequency tokens are assigned smaller valued first token identifiers than low-frequency tokens;
  
  a second lexicon generator coupled to the first lexicon generator and configured for assigning a second token identifier from a set of second token identifiers to each unique token within a portion of the set of documents; and
  
  a repository configured for storing a sequence of the second token identifiers representing the tokens in the portion of the set of documents for subsequent retrieval.
  - 23. The system of claim 22, wherein the second lexicon generator generates a mapping of the second token identifiers to corresponding first token identifiers for the portion of the set of documents.
  - 24. The system of claim 23, wherein each first token identifier comprises an M bit integer value.
  - 25. The system of claim 23, wherein each second token identifier comprises an N bit integer value, N and M are positive integers and M is greater than N.
  - 26. The system of claim 25, wherein N is equal to 8 and M is equal to 32.
  - 27. The system of claim 22, further comprising:
    - an encoder for encoding the mapping of the second token identifiers to corresponding first token identifiers in a compressed format.
  - 28. The system of claim 27, wherein the encoder delta encodes the first token identifiers of the mapping.
  - 29. The system of claim 22, further comprising:
    - a sorter for sorting the unique tokens before assigning the unique tokens to the first set of token identifiers.
  - 30. The system of claim 22, further comprising:
    - a sorter for sorting the set of documents by one or more sorting criteria.
  - 31. The system of claim 30, wherein the sorter is configured to sort the set of documents based on language.
  - 32. The system of claim 30, wherein the sorter is configured to sort the set of documents by domain names associated with the set of documents.
  - 33. The system of claim 32, wherein portions of the domain names are interchanged prior to sorting.
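
Claims 14 and 33 state only that portions of the domain names are interchanged before sorting. One plausible reading, assumed here and not confirmed by the claim text, is that the dot-separated labels are reversed so that hosts under the same registered domain end up adjacent in the sorted order. The helper name `reversed_host_key` and the sample hosts are hypothetical.

```python
def reversed_host_key(host):
    """Sort key that reverses the dot-separated labels of a host name."""
    return ".".join(reversed(host.lower().split(".")))

hosts = ["www.example.com", "news.example.org", "mail.example.com", "example.com"]
ordered = sorted(hosts, key=reversed_host_key)
print(ordered)
```

Under this reading, documents from related hosts sit next to each other, which would presumably let neighboring ranges of token positions share much of their vocabulary.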

34. A document decompression system, comprising:
- a query processor configured for identifying a range of positions in a set of documents, each position in the range of positions corresponding to a respective token in the set of documents that matches a query term;
  
  a first mapping module coupled to the query processor and configured for obtaining a first token identifier from a repository for each position in the range of positions; and
  
  a second mapping module coupled to the first mapping module and configured for mapping the second token identifier to a corresponding token, and for reconstructing the portion of a document in the set of documents using the tokens from the mappings of the second token identifiers and the positions of the corresponding first token identifiers, wherein the mapping of each first token identifier is in accordance with a respective first lexicon for a portion of the set of documents, and the mapping of each second token identifier is in accordance with a second lexicon that maps second token identifiers to unique tokens in the set of documents.
  - 35. The system of claim 34, wherein each second token identifier comprises an M bit integer value.
  - 36. The system of claim 35, wherein each first token identifier comprises an N bit integer value, N and M are positive integers and M is greater than N.
  - 37. The system of claim 36, wherein N is equal to 8 and M is equal to 32.

38. A computer-readable medium having stored thereon instructions, which, when executed by a processor in a document compression system, cause the processor to perform the operations of:
- identifying a set of unique tokens contained in a set of documents;
  
  assigning a unique first token identifier from a set of first token identifiers to each unique token based at least in part on the frequency of occurrence of the unique token in the set of documents, wherein high-frequency tokens are assigned smaller valued first token identifiers than low-frequency tokens;
  
  assigning a second token identifier from a set of second token identifiers to each unique token having a position within a selected range of token positions in the documents, wherein each second token identifier corresponds to a first token identifier; and
  
  storing the second token identifiers in a repository for subsequent retrieval.
  - 39. The computer-readable medium of claim 38, wherein storing the second token identifiers includes mapping the sequence of tokens within the selected range of token positions in the set of documents to a corresponding sequence of second token identifiers, and storing said corresponding sequence of second token identifiers.
  - 40. The computer-readable medium of claim 38, wherein the instructions further cause the processor to perform the operations of:
    - generating a mapping of the second token identifiers to corresponding first token identifiers for the selected range of token positions.
  - 41. The computer-readable medium of claim 38, wherein each first token identifier comprises an M bit integer value.
  - 42. The computer-readable medium of claim 41, wherein each second token identifier comprises an N bit integer value, N and M are positive integers and M is greater than N.
  - 43. The computer-readable medium of claim 42, wherein N is equal to 8, and M is equal to 32.
  - 44. The computer-readable medium of claim 38, wherein storing the second token identifiers includes encoding the mapping of the second token identifiers to corresponding first token identifiers in a compressed format.
  - 45. The computer-readable medium of claim 44, wherein encoding the mapping further comprises:
    - grouping the first token identifiers of the mapping into first groups of N bits; and
      
      converting each first group of N bits into a second group of K bits, wherein K and N are positive integers, K is less than or equal to N, and K is determined for each second group from respective sizes of the first token identifiers in the first group.
  - 46. The computer-readable medium of claim 44, wherein encoding the mapping includes delta encoding the first token identifiers of the mapping.
  - 47. The computer-readable medium of claim 38, wherein the instructions further cause the processor to perform the operations of:
    - sorting the unique tokens before assigning the unique tokens to the first set of token identifiers.
  - 48. The computer-readable medium of claim 38, wherein the instructions further cause the processor to perform the operations of:
    - sorting the set of documents by one or more sorting criteria.
  - 49. The computer-readable medium of claim 48, wherein the set of documents are sorted by language.
  - 50. The computer-readable medium of claim 48, wherein the set of documents are sorted by domain name.
  - 51. The computer-readable medium of claim 50, wherein portions of the domain names are interchanged prior to sorting.
  - 52. The computer-readable medium of claim 38, further comprising:
    - determining one or more attributes in the set of documents; and
      
      storing the one or more attributes for subsequent retrieval.
  - 53. The computer-readable medium of claim 52, wherein storing the one or more attributes includes encoding the attributes in a compressed format.
  - 54. The computer-readable medium of claim 38, further comprising:
    - associating ranges of token positions with portions of the set of documents; and
      
      storing a mapping of positions to sets of second token identifiers, wherein each set of second token identifiers corresponds to a respective portion of the set of documents.
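
Claims 8 and 45 describe converting groups of identifiers into a narrower K-bit form, with K chosen per group from the sizes of the identifiers in that group. The claim wording leaves the exact layout open, so the sketch below shows only one possible reading: K is the bit length of the largest value in a non-empty group, and every value in the group is packed with K bits. The function names and sample values are hypothetical.

```python
def pack_group(ids):
    """Pack a non-empty group of non-negative integers using K bits each.

    K is the bit length of the largest value in the group (at least 1).
    Returns (k, packed), where packed is an int holding the concatenated bits.
    """
    k = max(1, max(ids).bit_length())
    packed = 0
    for value in ids:
        packed = (packed << k) | value
    return k, packed

def unpack_group(k, packed, count):
    """Recover count values of k bits each, in their original order."""
    mask = (1 << k) - 1
    ids = [(packed >> (k * i)) & mask for i in range(count)]
    ids.reverse()
    return ids

group = [3, 14, 1, 9]
k, packed = pack_group(group)
print(k, hex(packed), unpack_group(k, packed, len(group)))
```

Here four values that would occupy 4 × 32 bits as fixed-width integers fit in 4 × 4 bits, plus whatever is needed to record K for the group.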

55. A computer-readable medium having stored thereon instructions, which, when executed by a processor in a document decompression system, cause the processor to perform the operations of:
- identifying a range of positions in a set of documents, each position in the range of positions corresponding to a respective token in the set of documents;
  
  recovering the token at each respective position in the range of positions, including:
  
  obtaining a first token identifier from a location within a repository corresponding to the respective position;
  
  mapping the first token identifier to a second token identifier;
  
  mapping the second token identifier to a corresponding token in the set of documents; and
  
  reconstructing at least a portion of a document in the set of documents using the tokens from the mappings of the second token identifiers and the positions of the corresponding first token identifiers, wherein the mapping of each first token identifier is in accordance with a respective first lexicon for a portion of the set of documents, and the mapping of each second token identifier is in accordance with a second lexicon that maps second token identifiers to unique tokens in the set of documents.
  - 56. The computer-readable medium of claim 55, wherein each second token identifier comprises an M bit integer value.
  - 57. The computer-readable medium of claim 56, wherein each first token identifier comprises an N bit integer value, N and M are positive integers and M is greater than N.
  - 58. The computer-readable medium of claim 57, wherein N is equal to 8, and M is equal to 32.

59. A document compression system, comprising:
- means for identifying a set of unique tokens contained in a set of documents;
  
  means for assigning a unique first token identifier from a set of first token identifiers to each unique token based at least in part on the frequency of occurrence of the unique token in the set of documents, wherein high-frequency tokens are assigned smaller valued first token identifiers than low-frequency tokens;
  
  means for assigning a second token identifier from a set of second token identifiers to each unique token having a position within a selected range of token positions in the documents, wherein each second token identifier corresponds to a first token identifier; and
  
  means for storing the second token identifiers in a repository for subsequent retrieval.

60. A document decompression system, comprising:
- means for identifying a range of positions in a set of documents, each position in the range of positions corresponding to a respective token in the set of documents;
  
  means for recovering the token at each respective position in the range of positions, including means for obtaining a first token identifier from a location within a repository corresponding to the respective position;
  
  means for mapping the first token identifier to a second token identifier; and
  
  means for mapping the second token identifier to a corresponding token in the set of documents; and
  
  means for reconstructing at least a portion of a document in the set of documents using the tokens from the mappings of the second token identifiers and the positions of the corresponding first token identifiers, wherein the mapping of each first token identifier is in accordance with a respective first lexicon for a portion of the set of documents, and the mapping of each second token identifier is in accordance with a second lexicon that maps second token identifiers to unique tokens in the set of documents.

61. A document compression method, comprising:
- identifying a set of unique tokens from a plurality of tokens contained in a set of documents;
  
  generating a first mapping between the unique tokens and a first lexicon of variable-length token identifiers, wherein the unique tokens having a high-frequency of occurrence in the set of documents are mapped to smaller valued variable-length token identifiers than the unique tokens having a low-frequency of occurrence in the set of documents;
  
  generating a second mapping between the variable-length token identifiers and one or more second lexicons of fixed-length token identifiers, wherein each second lexicon is valid for a range of token positions in the set of documents;
  
  mapping each token in the set of documents to a respective fixed-length token identifier using the first and second mappings, wherein the respective fixed-length token identifier is selected from a second lexicon that is valid for the position of the token in the set of documents; and
  
  storing the fixed-length token identifiers representing the tokens in a tokenspace repository for subsequent retrieval.
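
Claim 61 refers to variable-length token identifiers in which frequent tokens map to smaller values, but it does not name a byte-level encoding. As an assumption for illustration, the sketch below uses a base-128 varint (7 payload bits per byte, high bit set when more bytes follow) to show why small identifiers are cheaper to store. The function names are hypothetical.

```python
def varint_encode(value):
    """Encode a non-negative integer as a little-endian base-128 varint."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def varint_decode(data):
    """Decode one varint from the start of data; returns (value, bytes_used)."""
    value = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            return value, i + 1
    raise ValueError("truncated varint")

# Encoded size in bytes for a few sample identifiers.
sizes = {gid: len(varint_encode(gid)) for gid in (0, 127, 128, 16383, 16384, 2_000_000)}
print(sizes)
```

With this encoding the 128 most frequent tokens need a single byte each wherever their global identifier is written out, while rare tokens with large identifiers take three bytes or more.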

62. A document decompression method, comprising:
- receiving a set of first token identifiers from a repository;
  
  applying first mappings to the set of first token identifiers to provide a set of second token identifiers, wherein the first mappings are each valid for a distinct corresponding range of token positions in a portion of a set of documents;
  
  applying a second mapping to the set of second token identifiers to recover a set of tokens, wherein the recovered tokens are associated with positions in the set of documents corresponding to positions of the set of first token identifiers in the repository; and
  
  reconstructing one or more portions of the set of documents using the set of recovered tokens and the respective positions of the set of first token identifiers in the repository.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Sercinoglu, Olcan; Ghemawat, Sanjay; Dean, Jeffrey; Gomes, Benedict; Thambidorai, Gautham

Granted Patent: US 7,917,480 B2
US Class Current: 1/1
CPC Class Codes:
- G06F 16/2365: Ensuring data consistency a...
- G06F 16/31: Indexing; Data structures t...
- G06F 16/93: Document management systems
- G06F 16/951: Indexing; Web crawling tech...
