Method of indexing and retrieval of electronically-stored documents
First Claim
1. A method of indexing .[.and retrieving.]. documents, said method using a digital computer system having a central processing unit, a memory, .[.a display screen, a keyboard,.]. and a large capacity file system, said method comprising the steps of:
- (a) storing in said memory a vocabulary of terms, each term consisting of one or more words, and for each term an associated term-code;
(b) storing on said file system a collection of documents each with an associated unique document-number;
(c) creating index files which contain for each said term-code in (a)(i) the set of document-numbers in (b) such that the corresponding documents contain the corresponding term; and
(ii) for each said document-identifying-number in (i) the frequency-in-document of the corresponding term which is the number of times that said term appears in the corresponding document;
(d) creating a weight-in-document file which contains for each document-number in (c)(i) the weight-in-document of the corresponding term which is calculated using the frequency-in-document in (c) (ii), the number of document-numbers in (c) (i), and the total number of terms in (a) which are in the corresponding document (counted multiple times);
(e) creating a frequent-companion file which contains for each occurring term-code in (a) a ranked set of pairs of numbers where each pair consists of a first element term-code and a second element companion-percentage, where the companion-percentage is calculated by summing the weight-in-document values of said first element term-code over documents that contain both the term corresponding to said first element term-code and the term corresponding to said occurring term-code and then dividing by the sum over all documents of the weight-in-document of said occurring term-code;
(f) creating a relative file which contains for each occurring term-code in (a) a ranked set of pairs of numbers where each pair consists of a first element relative term-code and a second element relative-percentage, where the relative-percentage is calculated by taking a weighted average of the companion-percentage of said first element term-code calculated in step (e) and the companion-percentage of said occurring term-code that was calculated in step (e) when said first element term-code was the occurring term-code and said occurring term-code was the first element term-code.[.;
(g) creating a polysemantic file which contains for each occurring term-code in (a), a polysemantic weight which is calculated using the number of sets of pairs in the relative file created in step (f) that said occurring term-code appears in, the number of documents-numbers for which the weight-in-document of said occurring term-code calculated in step (d) is greater than some threshold value, and the averages for several values of N of the first N relative-percentages of said occurring term-code calculated and ranked in step (f);
(h) accepting a query consisting of a sequence of words entered by a user using said keyboard and creating a parsed-query table of term-codes which consist of the term-codes in said vocabulary that are associated with the terms that are contained in said query;
(i) creating a temporary swap table of pairs of first element term-codes and corresponding second element summed-relative-percentages consisting of those relative term-codes created in step (f) where said corresponding second element summed-relative-percentages are the sum, over all said occurring term-codes that are in said parsed-query table, of the relative percentages of said first element term-codes;
(j) creating a modified swap table by modifying said second element summed-relative-percentages created in step (i) by multiplying them by a function of the polysemantic weight of the corresponding first element term-codes;
(k) sorting said modified swap table by said modified summed-relative-percentages in descending order;
(l) displaying on said display the terms corresponding to the term-codes of said modified swap table;
(m) accepting user keypresses or other actions which identify one or more of the terms displayed in step (l) and adding the corresponding term-codes to the parsed-query-table;
(n) repeating steps (i) through (m) as many times as the user indicates by his input;
(o) accepting an input from the user indicating a command to retrieve documents;
(p) creating a temporary rank table of pairs of first element document-numbers and corresponding second element summed-document-weight×
poly values which pairs comprise those document-numbers for which any of the term-codes that are in said parsed-query table have weight-in-document above a threshold value, and summed-document-weight×
poly values which are the sums, over all term-codes in said parsed-query table, of a function of me polysemantic weight of the term-code and the weight-in-document of the term-code;
(r) creating a sorted rank table by sorting said temporary rank table by the value of the second elements of the pairs in descending order;
(s) displaying on the display screen some portion of the document corresponding to the first document number in the sorted rank table and some indication of the corresponding summed-document-weight×
poly value;
(t) displaying other documents corresponding to other document-numbers in the sorted rank table in response to inputs from the user.]..
0 Assignments
0 Petitions
Accused Products
Abstract
A document indexing and retrieval system and method which assigns weights to the key words and assigns a relative value to pairs of key words (i.e. defines a relative relation on K×K) based on their frequency of occurrence and co-occurrence in the document data base. In response to a query both the weights and this relative relation are used to suggest additional and/or alternative key words which are very likely to find relevant documents. Documents are then ranked by number of hits adjusted for the weights of hit words and their relative values.
84 Citations
6 Claims
-
1. A method of indexing .[.and retrieving.]. documents, said method using a digital computer system having a central processing unit, a memory, .[.a display screen, a keyboard,.]. and a large capacity file system, said method comprising the steps of:
-
(a) storing in said memory a vocabulary of terms, each term consisting of one or more words, and for each term an associated term-code; (b) storing on said file system a collection of documents each with an associated unique document-number; (c) creating index files which contain for each said term-code in (a) (i) the set of document-numbers in (b) such that the corresponding documents contain the corresponding term; and (ii) for each said document-identifying-number in (i) the frequency-in-document of the corresponding term which is the number of times that said term appears in the corresponding document; (d) creating a weight-in-document file which contains for each document-number in (c)(i) the weight-in-document of the corresponding term which is calculated using the frequency-in-document in (c) (ii), the number of document-numbers in (c) (i), and the total number of terms in (a) which are in the corresponding document (counted multiple times); (e) creating a frequent-companion file which contains for each occurring term-code in (a) a ranked set of pairs of numbers where each pair consists of a first element term-code and a second element companion-percentage, where the companion-percentage is calculated by summing the weight-in-document values of said first element term-code over documents that contain both the term corresponding to said first element term-code and the term corresponding to said occurring term-code and then dividing by the sum over all documents of the weight-in-document of said occurring term-code; (f) creating a relative file which contains for each occurring term-code in (a) a ranked set of pairs of numbers where each pair consists of a first element relative term-code and a second element relative-percentage, where the relative-percentage is calculated by taking a weighted average of the companion-percentage of said first element term-code calculated in step (e) and the companion-percentage of said occurring term-code that was calculated in step (e) when said first element term-code was the occurring term-code and said occurring term-code was the first element term-code.[.; (g) creating a polysemantic file which contains for each occurring term-code in (a), a polysemantic weight which is calculated using the number of sets of pairs in the relative file created in step (f) that said occurring term-code appears in, the number of documents-numbers for which the weight-in-document of said occurring term-code calculated in step (d) is greater than some threshold value, and the averages for several values of N of the first N relative-percentages of said occurring term-code calculated and ranked in step (f); (h) accepting a query consisting of a sequence of words entered by a user using said keyboard and creating a parsed-query table of term-codes which consist of the term-codes in said vocabulary that are associated with the terms that are contained in said query; (i) creating a temporary swap table of pairs of first element term-codes and corresponding second element summed-relative-percentages consisting of those relative term-codes created in step (f) where said corresponding second element summed-relative-percentages are the sum, over all said occurring term-codes that are in said parsed-query table, of the relative percentages of said first element term-codes; (j) creating a modified swap table by modifying said second element summed-relative-percentages created in step (i) by multiplying them by a function of the polysemantic weight of the corresponding first element term-codes; (k) sorting said modified swap table by said modified summed-relative-percentages in descending order; (l) displaying on said display the terms corresponding to the term-codes of said modified swap table; (m) accepting user keypresses or other actions which identify one or more of the terms displayed in step (l) and adding the corresponding term-codes to the parsed-query-table; (n) repeating steps (i) through (m) as many times as the user indicates by his input; (o) accepting an input from the user indicating a command to retrieve documents; (p) creating a temporary rank table of pairs of first element document-numbers and corresponding second element summed-document-weight×
poly values which pairs comprise those document-numbers for which any of the term-codes that are in said parsed-query table have weight-in-document above a threshold value, and summed-document-weight×
poly values which are the sums, over all term-codes in said parsed-query table, of a function of me polysemantic weight of the term-code and the weight-in-document of the term-code;(r) creating a sorted rank table by sorting said temporary rank table by the value of the second elements of the pairs in descending order; (s) displaying on the display screen some portion of the document corresponding to the first document number in the sorted rank table and some indication of the corresponding summed-document-weight×
poly value;(t) displaying other documents corresponding to other document-numbers in the sorted rank table in response to inputs from the user.].. - View Dependent Claims (2, 3, 4, 5)
-
-
6. his input..Iaddend..Iadd.8. A method for retrieving documents identified pursuant to a query processed by the method of claim 7, said method for retrieving comprising:
-
(i) accepting an input from the user indicating a command to retrieve documents; (ii) creating a temporary rank table of pairs of first element document-numbers and corresponding second element summed-document-weight values which pairs comprise those document-numbers for which any of the term-codes that are in said parsed-query table have weight-in-document above a threshold value, and summed-document-weight values which are the sums, over all term-codes in said parsed-query table, of a function of the term-code and the weight-in-document of the term-code; (iii) creating a sorted rank table by sorting said temporary rank table by the value of the second elements of the pairs in descending order; (iv) displaying on the display screen some portion of the document corresponding to the first document number in the sorted rank table and some indication of the corresponding summed-document-weight value; (v) displaying other documents corresponding to other document-numbers in the sorted rank table in response to inputs from the user..Iaddend.
-
Specification