Method and system for compression indexing and efficient proximity search of text data

US 7,433,893 B2
Filed: 03/08/2004
Issued: 10/07/2008
Est. Priority Date: 03/08/2004
Status: Active Grant

First Claim

Patent Images

1. A computer implemented method of compression indexing, comprising the steps of:

selecting at least one data file, said data file having target text;

identifying each and every unique token, each of the unique tokens having a frequency;

counting the frequency of each unique token;

calculating parameters;

ranking the tokens from highest frequency to lowest frequency;

compressing the frequencies;

assigning a position to each instance of each and every unique token;

compressing the positions;

aggregating tokens, frequencies, parameters, and positions to form a compression index, said compression index being exhaustive of all said tokens, such that said target text is compressed 100%;

reconstituting a portion of the data file;

displaying the portion of the data file on a screen, wherein compressed positions point to a compressed text random access memory file, wherein the step of reconstituting the data file further comprising the steps of;

a) creating the compressed text RAM file;

b) selecting a domain to display, the domain being a portion of the data file, the domain having a starting point and an ending point;

c) decompressing successive integers;

d) determining positions of the tokens in the token list;

e) extracting the tokens from the token list; and

f) writing the tokens to the screen;

repeating steps c-f until the ending point of the domain is reached;

wherein the selected domain is part of a domains list, the domains list having a plurality of domains, the domains list having the starting point and the ending point of each domain.

View all claims

1 Assignment

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A system and method of compression indexing and efficient proximity search of text data permits high speed search featuring ranking the relevance of search results according to closeness of desired terms within each portion of text found. The system includes (a) preparing target text, (b) creating a “compression index ebook”, (c) browsing in a compression index ebook, and (d) searching in a compression index ebook. To create the compression index, the method includes the steps of selecting target text, identifying tokens, such as words and punctuation strings, wherein each of the tokens has a frequency. The frequencies of each token are counted. Tokens are ranked from highest frequency to lowest frequency. The frequencies are compressed. The next step is assigning positions to each token frequency and compressing the positions to form a compression index ebook, which is stored in random access memory to eliminate disk seeks during browsing and searching.

119 Citations

22 Claims

1. A computer implemented method of compression indexing, comprising the steps of:
- selecting at least one data file, said data file having target text;
  
  identifying each and every unique token, each of the unique tokens having a frequency;
  
  counting the frequency of each unique token;
  
  calculating parameters;
  
  ranking the tokens from highest frequency to lowest frequency;
  
  compressing the frequencies;
  
  assigning a position to each instance of each and every unique token;
  
  compressing the positions;
  
  aggregating tokens, frequencies, parameters, and positions to form a compression index, said compression index being exhaustive of all said tokens, such that said target text is compressed 100%;
  
  reconstituting a portion of the data file;
  
  displaying the portion of the data file on a screen, wherein compressed positions point to a compressed text random access memory file, wherein the step of reconstituting the data file further comprising the steps of;
  
  a) creating the compressed text RAM file;
  
  b) selecting a domain to display, the domain being a portion of the data file, the domain having a starting point and an ending point;
  
  c) decompressing successive integers;
  
  d) determining positions of the tokens in the token list;
  
  e) extracting the tokens from the token list; and
  
  f) writing the tokens to the screen;
  
  repeating steps c-f until the ending point of the domain is reached;
  
  wherein the selected domain is part of a domains list, the domains list having a plurality of domains, the domains list having the starting point and the ending point of each domain.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 2. The method of claim 1, wherein the tokens include every word, every number, every string of punctuation characters, and every markup tag without exception.
  - 3. The method of claim 1, wherein the token is a string of punctuation characters.
  - 4. The method of claim 1, wherein the token is a markup tag with no blanks.
  - 5. The method of claim 1, further comprising the step of:
    - searching the compression index.
  - 6. The method of claim 5, wherein searching occurs in random access memory.
  - 7. The method of claim 6, wherein the step of searching the compression index is performed without disk seeks.
  - 8. The method of claim 1, wherein the step of compressing the positions to form a compression index, the position of the first instance of each token is absolute, and the position of each subsequent instance of the same token is relative to the preceding position.
  - 9. The method of claim 1, wherein the compression index is structured to derive closeness-of-fit measures, the compression index also structured to reproduce any portion of original text for display.
  - 10. The method of claim 5, further comprising the step of:
    - ranking search results by relevance, wherein relevance is determined by the closeness of fit of search terms, headings, and frequencies of search terms.
  - 11. The method of claim 1, further comprising the step of:
    - browsing the compression index, wherein an associated user is able to sequentially read content created from the compression index starting at any point.

12. A computer implemented method for using a compression index, comprising the steps of:
- a) creating the compression index having the steps of;
  
  (i) providing target text, the target text being at least one data file, the target text having tokens, the tokens having frequencies;
  
  (ii) accumulating parameters;
  
  (iii) building a list of all unique tokens represented in the target text, together with their respective frequencies;
  
  (iv) sorting the list in order of declining token frequencies;
  
  (v) accumulating positions data of each instance of each token; and
  
  (vi) combining steps i-v into the compression index, the compression index being exhaustive of all tokens, such that the target text is 100% compressed;
  
  b) browsing and searching the compression index, wherein the step of accumulating parameters comprises;
  
  accumulating parameters and token frequencies on a single pass through the at least one data file, the data file having lightly marked up text, wherein the single pass through of the at least one data file of lightly marked up text comprises input for the compression index, wherein the step of building a list of all tokens represented in the target text, together with their respective frequencies further comprises the steps of;
  
  sorting all tokens in order of declining frequency;
  
  creating a temporary parameters file to facilitate passing parameters between successive states of the method to create the compression index;
  
  parsing the target text as a means for creating the token list, each of the tokens having a flag byte preceding the token and a null byte following the token; and
  
  compressing and outputting the frequencies of tokens;
  
  wherein the step of accumulating positions data further comprises the steps of;
  
  reserving a block of random access memory to accumulate positions data;
  
  reparsing the entire data file;
  
  recording the position of all tokens in the random access memory block; and
  
  compressing and outputting the positions data, wherein the position of the first instance of each token is absolute, and the position of each subsequent instance of the same token is relative to the preceding position;
  
  wherein the step of combining steps (i)-(v) into one compression index, further comprises the steps of;
  
  outputting public parameters as plain text at a beginning of the compression index;
  
  outputting compressed private parameters; and
  
  appending the tokens, frequencies and positions to complete the compression index;
  
  wherein numeric values assigned to successive tokens from the data file are compressed and successively appended to create a compressed text random access memory file, wherein the random access memory file is computationally equivalent to the target text such that the positions within this compressed text random access memory file are used as the position values of successive tokens.
- View Dependent Claims (13, 14, 15, 16, 17, 18, 19, 20)
- - 13. The method of claim 12, wherein browsing and searching occurs on a personal computer.
  - 14. The method of claim 12 wherein, browsing and searching is server based over an Internet.
  - 15. The method of claim 12, wherein positions are compressed, compressed positions of the data file point to a compressed text random access memory file, the method further comprising the step of reconstituting the data file to browse and search the compression index comprising the steps of:
    - creating the compressed text RAM file;
      
      selecting a domain to display, the domain being a portion of the data file, the domain having a starting point and an ending point, wherein the selected domain is part of a domains list, the domains list having a plurality of domains, the domains list having the starting point and the ending point of each domain;
      
      decompressing successive integers;
      
      determining positions of the tokens in the token list;
      
      extracting the tokens from the token list;
      
      writing the tokens to the screen; and
      
      repeating steps c-f until the ending point of the domain is reached.
  - 16. The method of claim 12, wherein after searching the compression index, search results are ranked by relevance.
  - 17. The method of claim 16, wherein searching the compression index comprises the step of:
    - scoring for closeness of fit of search terms by measuring the number of non-requested words that intervene between the first and last term found within a domain, and subtracting the count of non-requested words from the maximum allowable value.
  - 18. The method of claim 17, wherein searching the compression index comprises the step of:
    - scoring for headings.
  - 19. The method of claim 18, wherein searching the compression index comprises the step of:
    - scoring for frequencies of search terms.
  - 20. The method of claim 18, further comprising the step of:
    - ranking search results from the domain with the highest score to lowest score, wherein scores equal the sum of closeness of fit, frequency of search terms in heading and frequency of terms in the domain, wherein search results are displayed as a list of hits arranged in order of closeness of fit and with frequency of occurrence in headings and body controlling the ranking when the number of intervening non-requested words is the same for multiple domains.

21. A computer readable medium containing instructions for controlling a computer system to perform a method, the method comprising the steps of:
- selecting at least one file having target text;
  
  identifying each and every unique token, each of the unique tokens having a frequency, the tokens including every word, every number, every string of punctuation characters, and every markup tag without exception;
  
  counting the frequency of each unique token;
  
  calculating parameters;
  
  ranking the tokens from highest frequency to lowest frequency;
  
  compressing the frequencies;
  
  assigning a position to each instance of each token;
  
  compressing the positions;
  
  aggregating tokens, frequencies, parameters, and positions to form a compression index, such that the target text is compressed 100%; and
  
  browsing and searching the compression index;
  
  reconstituting a portion of the at least one file;
  
  displaying the portion of the at least one file on a screen;
  
  wherein compressed positions point to a compressed text random access memory file, the step of reconstituting the at least one file further comprising the steps of;
  
  a) creating the compressed text RAM file;
  
  b) selecting a domain to display, the domain being a portion of the at least one file, the domain having a starting point and an ending point;
  
  c) decompressing successive integers;
  
  d) determining positions of the tokens in the token list;
  
  e) extracting the tokens from the token list; and
  
  f) writing the tokens to the screen;
  
  repeating steps c-f until the ending point of the domain is reached;
  
  wherein the selected domain is part of a domains list, the domains list having a plurality of domains, the domains list having the starting point and the ending point of each domain.

22. An apparatus, comprising:
- means for selecting at least one file, the at least one file having target text;
  
  means for identifying each and every unique token, each of the unique tokens having a frequency, the tokens including every word, every number, every string of punctuation characters, and every markup tag without exception;
  
  means for counting the frequency of each unique token;
  
  means for calculating parameters;
  
  means for ranking the tokens from highest frequency to lowest frequency;
  
  means for compressing the frequencies;
  
  means for assigning a position to each instance of each token;
  
  means for compressing the positions; and
  
  means for aggregating tokens, frequencies, parameters, and positions to form a compression index, such that the target text is compressed 100%; and
  
  means for browsing and searching text derived from the compression index;
  
  wherein the compression index is formed by accumulating parameters and token frequencies on a single pass through the at least one data file, the at least one data file having lightly marked up text, wherein the single pass through of the at least one data file of lightly marked up text comprises input for the compression index;
  
  wherein the tokens are all sorted in order of declining frequency;
  
  wherein the compression index further comprises a temporary parameters file to facilitate passing parameters between successive states of the method to create the compression index;
  
  wherein the target text is parsed as a means for creating the token list, each of the tokens having a flag byte preceding the token and a null byte following the token; and
  
  wherein the frequencies of tokens are compressed and outputted;
  
  wherein a block of random access memory is reserved to accumulate positions data;
  
  wherein the entire data file is reparsed;
  
  wherein the position of all tokens are recorded in the random access memory block; and
  
  wherein the compressed and outputted positions data includes the position of the first instance of each token being absolute, and the position of each subsequent instance of the same token being relative to the preceding position;
  
  wherein public parameters are outputted as plain text at a beginning of the compression index;
  
  compressed private parameters are outputted; and
  
  the tokens, frequencies and positions are appended to complete the compression index;
  
  wherein numeric values assigned to successive tokens from the data file are compressed and successively appended to create a compressed text random access memory file, wherein the random access memory file is computationally equivalent to the target text such that the positions within this compressed text random access memory file are used as the position values of successive tokens.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Marpex Incorporated
Original Assignee
Marpex Incorporated
Inventors
Lowry, Douglas B.
Primary Examiner(s)
Rones; Charles
Assistant Examiner(s)
Khoshnoodi; Fariborz

Application Number

US10/795,721
Publication Number

US 20050198070A1
Time in Patent Office

1,674 Days
Field of Search

707/104.1, 707/101, 707/102, 707/5, 707/532, 707/536, 707/3, 707/2, 704/9
US Class Current

1/1
CPC Class Codes

G06F 16/316   Indexing structures

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99945   Object-oriented database st...

Y10S 707/99948   Application of database or ...

Method and system for compression indexing and efficient proximity search of text data

First Claim

1 Assignment

0 Petitions

Accused Products

Abstract

119 Citations

22 Claims

Specification

Solutions

Use Cases

Quick Links

Method and system for compression indexing and efficient proximity search of text data

First Claim

1 Assignment

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

119 Citations

22 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links