Method of indexing and retrieval of electronically-stored documents

US 5,404,514 A
Filed: 09/13/1993
Issued: 04/04/1995
Est. Priority Date: 12/26/1989
Status: Expired due to Term

First Claim

1. A method of indexing and retrieving documents, said method using a digital computer system having a central processing unit, a memory, a display screen, a keyboard, and a large capacity file system, said method comprising the steps of:

(a) storing in said memory a vocabulary of terms, each term consisting of one or more words, and for each term an associated term-code;

(b) storing on said file system a collection of documents each with an associated unique document-number;

(c) creating index files which contain for each said term-code in (a)

(i) the set of document-numbers in (b) such that the corresponding documents contain the corresponding term; and

(ii) for each said document-identifying-number in (i) the frequency-in-document of the corresponding term which is the number of times that said term appears in the corresponding document;

(d) creating a weight-in-document file which contains for each document-number in (c)(i) the weight-in-document of the corresponding term which is calculated using the frequency-in-document in (c) (ii), the number of document-numbers in (c) (i), and the total number of terms in (a) which are in the corresponding document (counted multiple times);

(e) creating a frequent-companion file which contains for each occurring term-code in (a) a ranked set of pairs of numbers where each pair consists of a first element term-code and a second element companion-percentage, where the companion-percentage is calculated by summing the weight-in-document values of said first element term-code over documents that contain both the term corresponding to said first element term-code and the term corresponding to said occurring term-code and then dividing by the sum over all documents of the weight-in-document of said occurring term-code;

(f) creating a relative file which contains for each occurring term-code in (a) a ranked set of pairs of numbers where each pair consists of a first element relative term-code and a second element relative-percentage, where the relative-percentage is calculated by taking a weighted average of the companion-percentage of said first element term-code calculated in step (e) and the companion-percentage of said occurring term-code that was calculated in step (e) when said first element term-code was the occurring term-code and said occurring term-code was the first element term-code;

(g) creating a polysemantic file which contains for each occurring term-code in (a), a polysemantic weight which is calculated using the number of sets of pairs in the relative file created in step (f) that said occurring term-code appears in, the number of documents-numbers for which the weight-in-document of said occurring term-code calculated in step (d) is greater than some threshold value, and the averages for several values of N of the first N relative-percentages of said occurring term-code calculated and ranked in step (f);

(h) accepting a query consisting of a sequence of words entered by a user using said keyboard and creating a parsed-query table of term-codes which consist of the term-codes in said vocabulary that are associated with the terms that are contained in said query;

(i) creating a temporary swap table of pairs of first element term-codes and corresponding second element summed-relative-percentages consisting of those relative term-codes created in step (f) where said corresponding second element summed-relative-percentages are the sum, over all said occurring term-codes that are in said parsed-query table, of the relative percentages of said first element term-codes;

(j) creating a modified swap table by modifying said second element summed-relative-percentages created in step (i) by multiplying them by a function of the polysemantic weight of the corresponding first element term-codes;

(k) sorting said modified swap table by said modified summed-relative-percentages in descending order;

(l) displaying on said display the terms corresponding to the term-codes of said modified swap table;

(m) accepting user keypresses or other actions which identify one or more of the terms displayed in step (l) and adding the corresponding term-codes to the parsed-query-table;

(n) repeating steps (i) through (m) as many times as the user indicates by his input;

(o) accepting an input from the user indicating a command to retrieve documents;

(p) creating a temporary rank table of pairs of first element document-numbers and corresponding second element summed-document-weight×poly values which pairs comprise those document-numbers for which any of the term-codes that are in said parsed-query table have weight-in-document above a threshold value, and summed-document-weight×poly values which are the sums, over all term-codes in said parsed-query table, of a function of the polysemantic weight of the term-code and the weight-in-document of the term-code;

(r) creating a sorted rank table by sorting said temporary rank table by the value of the second elements of the pairs in descending order;

(s) displaying on the display screen some portion of the document corresponding to the first document number in the sorted rank table and some indication of the corresponding summed-document-weight×poly value;

(t) displaying other documents corresponding to other document-numbers in the sorted rank table in response to inputs from the user.
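
The index-building steps (c) through (f) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the weight-in-document formula is not given in claim 1 (it only names the three inputs), so the `weight_in_document` formula below is an assumed stand-in, and the equal weighting `alpha=0.5` in the step (f) average is likewise an assumption. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Step (c): docs maps document-number -> list of term-codes.
    Returns term-code -> {document-number: frequency-in-document}."""
    index = defaultdict(dict)
    for doc_no, terms in docs.items():
        for term, freq in Counter(terms).items():
            index[term][doc_no] = freq
    return dict(index)


def weight_in_document(index, docs):
    """Step (d). ASSUMED formula (the claim does not state one):
    (frequency / terms in document) * log(1 + total docs / docs containing term)."""
    n_docs = len(docs)
    weights = {}
    for term, postings in index.items():
        rarity = math.log(1 + n_docs / len(postings))
        weights[term] = {d: (f / len(docs[d])) * rarity for d, f in postings.items()}
    return weights


def companion_percentages(weights):
    """Step (e): for occurring term t and companion c, sum the weights of c over
    documents containing both, divided by the sum of t's weights over all documents."""
    comp = {}
    for t, t_w in weights.items():
        total = sum(t_w.values())
        row = {}
        for c, c_w in weights.items():
            if c == t:
                continue
            shared = sum(w for d, w in c_w.items() if d in t_w)
            if shared > 0:
                row[c] = shared / total
        comp[t] = row
    return comp


def relative_percentages(comp, alpha=0.5):
    """Step (f): weighted average of the two directed companion-percentages,
    ranked in descending order. alpha is an ASSUMED weighting."""
    rel = {}
    for t, row in comp.items():
        pairs = {c: alpha * p + (1 - alpha) * comp[c].get(t, 0.0) for c, p in row.items()}
        rel[t] = sorted(pairs.items(), key=lambda kv: kv[1], reverse=True)
    return rel


# Tiny corpus: document-number -> term-codes (term-codes 10, 11, 12 are made up).
docs = {1: [10, 11, 10], 2: [10, 12], 3: [11]}
relative = relative_percentages(companion_percentages(weight_in_document(build_index(docs), docs)))
```

With `alpha=0.5` the relation is symmetric, so term 10 is as strongly related to term 11 as 11 is to 10; a different `alpha` would favor one direction.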

Abstract

A document indexing and retrieval system and method which assigns weights to the key words and assigns a relative value to pairs of key words (i.e. defines a relative relation on K×K) based on their frequency of occurrence and co-occurrence in the document data base. In response to a query both the weights and this relative relation are used to suggest additional and/or alternative key words which are very likely to find relevant documents. Documents are then ranked by number of hits adjusted for the weights of hit words and their relative values.
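
The query-time behavior summarized in the abstract (suggesting additional key words, then ranking documents) might look like the sketch below, which follows steps (i) through (k) and (p) through (r) of claim 1. It assumes the relative file, polysemantic weights, and weight-in-document values are already available as plain dictionaries. The function names, the default `fn` arguments, and the exclusion of terms already in the query from the suggestions are assumptions made for illustration. Using the polysemantic weight itself as the multiplier in `suggest_terms` corresponds to the identity function of claim 5; the product used in `rank_documents` is a guess based on the name "summed-document-weight×poly".

```python
def suggest_terms(query_terms, relative, poly, fn=lambda p: p):
    """Steps (i)-(k): build the swap table by summing relative-percentages over
    the query terms, scale each sum by fn(polysemantic weight), sort descending.
    relative: term-code -> list of (relative term-code, relative-percentage)."""
    swap = {}
    for q in query_terms:
        for term, pct in relative.get(q, []):
            if term not in query_terms:  # ASSUMPTION: do not re-suggest query terms
                swap[term] = swap.get(term, 0.0) + pct
    modified = {t: s * fn(poly.get(t, 1.0)) for t, s in swap.items()}
    return sorted(modified.items(), key=lambda kv: kv[1], reverse=True)


def rank_documents(query_terms, weights, poly, threshold=0.0, fn=lambda p, w: p * w):
    """Steps (p)-(r): keep documents in which any query term has a
    weight-in-document above the threshold, score each by summing
    fn(polysemantic weight, weight-in-document) over all query terms,
    and sort descending.
    weights: term-code -> {document-number: weight-in-document}."""
    candidates = {
        d
        for q in query_terms
        for d, w in weights.get(q, {}).items()
        if w > threshold
    }
    rank = {
        d: sum(fn(poly.get(q, 1.0), weights.get(q, {}).get(d, 0.0)) for q in query_terms)
        for d in candidates
    }
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical tables for two query terms (term-codes 1 and 4).
relative = {1: [(2, 0.6), (3, 0.2)], 4: [(2, 0.1), (3, 0.7)]}
weights = {1: {100: 0.5, 101: 0.2}, 4: {101: 0.4}}
suggestions = suggest_terms([1, 4], relative, {2: 1.0, 3: 0.5})
ranking = rank_documents([1, 4], weights, {1: 1.0, 4: 2.0}, threshold=0.1)
```

In the interactive loop of steps (l) through (n), the user would pick terms from `suggestions`, those term-codes would be appended to the query, and `suggest_terms` would be called again before the final `rank_documents` call.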

6 Claims

- - 2. A method as in claim 1 wherein additional steps (j)(1) and (p)(1) are carried out after steps (j) and (p) respectively to implement the soft boolean connector algorithm which consists of the following steps:
    - (A) creating a table of relative penalties for each pair of said term-codes in said parsed-query table where said relative penalty is a function of the relative percentage corresponding to the two term-codes of said pair, the number of documents that each of the term-codes of the pair are contained in with a document-weight above a threshold, and the average over all terms of the number of documents that the term is contained in with a document-weight above said threshold;
      
      (B) modifying said relative penalties by taking the minimum of the relative penalty and some maximum value which depends on the number of terms in the parsed-query table;
      
      (C) summing said modified relative penalties to produce a sum of relative penalties;
      
      (D) modifying said sum of relative penalties by taking the minimum of said sum and some maximum sum value which depends on the number of terms in the parsed-query table to produce a modified sum of penalties;
      
      (E) summing some function of the polysemantic weights of the term-codes in the parsed-query table that are either relatives of a potential SWAPS term (j1) or are contained in a document (p1) to produce a number of hits value;
      
      (F) Calculating some function of the number of hits value and the modified sum of penalties value to produce a power value;
      
      (G) Raising a number approximately equal to 2 to the power value to produce an adjust value;
      
      (H) Multiplying either the modified summed relative percentages calculated in step (j) or the summed document weight×poly values calculated in step (p) by the adjust value.
  - 3. A method as in claim 1 where the formula for calculating the weight-in-document in step (d) is:
    - ##EQU6##
  - 4. A method as in claim 1 where the formula for calculating the polysemantic weight in step (g) is:
    - ##EQU7##
  - 5. A method as in claim 1 where the function in step (j) is the identity function.
  - 6. A method as in claim 1 where the function in step (p) is the identity function.
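
The soft boolean connector adjustment of claim 2 can be illustrated with the short Python sketch below. The claim leaves several functions open: how the relative penalties in step (A) are computed, how the maxima in steps (B) and (D) depend on the number of query terms, and which function combines hits and penalties in step (F). The sketch therefore takes the penalties and the two maxima as inputs and assumes `power = hits - penalties` for step (F). The function names and parameters are hypothetical.

```python
def number_of_hits(matching_terms, poly, fn=lambda p: p):
    """Step (E): sum a function of the polysemantic weights of the query
    term-codes that match (relatives of a candidate term, or contained in a
    document). fn defaults to the identity as an ASSUMPTION."""
    return sum(fn(poly[t]) for t in matching_terms)


def soft_boolean_adjust(penalties, hits, max_penalty, max_sum, base=2.0):
    """Steps (B)-(G) of claim 2.
    penalties: relative penalty for each pair of query term-codes (step A output).
    max_penalty, max_sum: caps that the claim says depend on the number of
    query terms; passed in directly here."""
    capped = [min(p, max_penalty) for p in penalties]  # step (B)
    total = min(sum(capped), max_sum)                  # steps (C) and (D)
    power = hits - total                               # step (F): ASSUMED function
    return base ** power                               # step (G): "approximately 2"


def adjusted_score(score, adjust):
    """Step (H): multiply a swap-table or rank-table value by the adjust value."""
    return score * adjust


# Hypothetical values: two matching query terms and two pairwise penalties.
hits = number_of_hits([1, 4], {1: 1.0, 4: 1.0})
adjust = soft_boolean_adjust([0.5, 3.0], hits, max_penalty=1.0, max_sum=1.2)
final = adjusted_score(0.9, adjust)
```

Under this assumed step (F), each additional matching term roughly doubles the score, while capped penalties for unrelated term pairs pull it back down, which gives behavior between a strict AND and a strict OR.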

Current Assignee: Karl-Erbo G. Kageneck, Ted Young
Original Assignee: Karl-Erbo G. Kageneck, Ted Young
Inventors: Young, Ted; Kageneck, Karl-Erbo G.
Primary Examiner(s): Kulik, Paul V.
Application Number: US08/121,370
Time in Patent Office: 568 Days
Field of Search: 395/600, 364/419.19
US Class Current: 1/1
CPC Class Codes:
G06F 16/30   of unstructured textual dat...
G06F 16/313   Selection or weighting of t...
G06F 16/38   Retrieval characterised by ...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99945   Object-oriented database st...
