Information retrieval system and method that generates weighted comparison results to analyze the degree of dissimilarity between a reference corpus and a candidate document

US 6,167,398 A
Filed: 05/13/1998
Issued: 12/26/2000
Est. Priority Date: 01/30/1997
Status: Expired due to Term

First Claim

1. A method of information retrieval comprising:

(a) receiving from a user, data identifying a stored reference corpus;

(b) retrieving the identified reference corpus from storage;

(c) generating initial values of respective weights corresponding to a plurality of analysis algorithms by processing the retrieved reference corpus in accordance with a predetermined algorithm;

(d) retrieving from storage another text document as a candidate document;

(e) performing respective comparisons between the candidate document and the reference corpus in accordance with each of said analysis algorithms and producing respective comparison results;

(f) generating corresponding weighted comparison results by multiplying each said comparison result by its respective weight;

(g) summing the weighted comparison results to produce a dissimilarity measure that is indicative of the degree of dissimilarity between the retrieved reference corpus and the retrieved candidate document; and

(h) storing the candidate document in a retained text store if said sum is indicative of a degree of dissimilarity less than a predetermined degree of dissimilarity.
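
Steps (e) to (h) reduce to a weighted sum followed by a threshold test. The Python below is a minimal sketch of that arithmetic under stated assumptions, not the patented implementation: the function names, the example comparison results, the weights and the threshold are all hypothetical, since the claim does not fix any of them.

```python
from typing import Sequence


def dissimilarity_measure(results: Sequence[float], weights: Sequence[float]) -> float:
    """Steps (f) and (g): multiply each comparison result by its weight, then sum."""
    if len(results) != len(weights):
        raise ValueError("one weight is required per analysis algorithm")
    return sum(r * w for r, w in zip(results, weights))


def should_retain(results: Sequence[float], weights: Sequence[float], threshold: float) -> bool:
    """Step (h): retain the candidate only if the sum is below the threshold."""
    return dissimilarity_measure(results, weights) < threshold


# Hypothetical comparison results from three analysis algorithms,
# with hypothetical weights; none of these numbers come from the patent.
results = [0.5, 0.25, 1.0]
weights = [2.0, 4.0, 1.0]
dd = dissimilarity_measure(results, weights)
retained = should_retain(results, weights, threshold=3.5)
```

Note that the comparison in step (h) is strict: a candidate whose measure equals the predetermined degree of dissimilarity is not retained in this sketch.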

Abstract

An internet information agent accepts a reference document, performs an analysis upon it in accordance with metrics defined by its analysis algorithm and obtains respective lists (word, character-level n-gram, word-level n-gram), derives weights corresponding to the metrics, applies the metrics to a candidate document and obtains respective returned values, applies the weights to the returned values and sums the results to obtain a Document Dissimilarity (DD) value. This DD is compared with a Dissimilarity Threshold (DT) and the candidate document is stored if the DD is less than the DT. A user can apply relevance values to the search results and the agent modifies the weights accordingly. The agent can be used to improve a language model for use in speech recognition applications and the like.
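
The three lists named in the abstract (word, character-level n-gram, word-level n-gram) can be illustrated with simple frequency counts. The following Python is a sketch only: the tokenisation rule, the default n-gram sizes and the `relative_frequency_distance` comparison are assumptions chosen for illustration, as the abstract does not say how the lists are built or compared.

```python
import re
from collections import Counter
from typing import List


def word_list(text: str) -> List[str]:
    """Assumed tokenisation: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text: str) -> Counter:
    """Word frequency list."""
    return Counter(word_list(text))


def char_ngram_frequencies(text: str, n: int = 3) -> Counter:
    """Character-level n-gram frequency list (n = 3 is an assumed default)."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def word_ngram_frequencies(text: str, n: int = 2) -> Counter:
    """Word-level n-gram frequency list (n = 2 is an assumed default)."""
    words = word_list(text)
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def relative_frequency_distance(ref: Counter, cand: Counter) -> float:
    """Illustrative comparison, not the patent's metric: half the L1 distance
    between relative frequencies (0.0 for identical lists, 1.0 for disjoint)."""
    ref_total = sum(ref.values())
    cand_total = sum(cand.values())
    if ref_total == 0 or cand_total == 0:
        return 1.0
    keys = set(ref) | set(cand)
    return 0.5 * sum(abs(ref[k] / ref_total - cand[k] / cand_total) for k in keys)
```

Each metric would return one such value for a candidate document; those returned values are what the agent then weights and sums into the DD.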

200 Citations

18 Claims

1. A method of information retrieval comprising:
- (a) receiving from a user, data identifying a stored reference corpus;
  
  (b) retrieving the identified reference corpus from storage;
  
  (c) generating initial values of respective weights corresponding to a plurality of analysis algorithms by processing the retrieved reference corpus in accordance with a predetermined algorithm;
  
  (d) retrieving from storage another text document as a candidate document;
  
  (e) performing respective comparisons between the candidate document and the reference corpus in accordance with each of said analysis algorithms and producing respective comparison results;
  
  (f) generating corresponding weighted comparison results by multiplying each said comparison result by its respective weight;
  
  (g) summing the weighted comparison results to produce a dissimilarity measure that is indicative of the degree of dissimilarity between the retrieved reference corpus and the retrieved candidate document; and
  
  (h) storing the candidate document in a retained text store if said sum is indicative of a degree of dissimilarity less than a predetermined degree of dissimilarity.
  - 2. A method as in claim 1 wherein:
    - a first of said analysis algorithms is arranged to generate and compare word frequency lists and to produce a comparison result; and
      
      said predetermined algorithm includes forming from the reference corpus separate first and second parts, performing a comparison between said first and second parts in accordance with said first analysis algorithm, the resulting comparison result constituting a measure of the homogeneity of said first and second parts.
  - 3. A method as in claim 2 wherein:
    - a second of said analysis algorithms is arranged to generate and compare word-level n-gram frequency lists and to produce a comparison result; and
      
      said predetermined algorithm includes calculating a confidence value by multiplying the said measure of the homogeneity by the total number of words of the word frequency lists generated for obtaining said measure of homogeneity, and, if this confidence value is less than a predetermined threshold, setting to substantially zero the value of the weight corresponding to said second analysis algorithm.
  - 4. A method as in claim 1, including the steps of:
    - presenting to the user, for each of a plurality of candidate documents stored in the retained text store, the respective dissimilarity measures and respective links to said plurality of stored retrieved texts,
      
      receiving from the user an allocated relevance value R in respect of a presented dissimilarity measure, and
      
      modifying said weights by multiplying each weight (Wi) by a respective weight factor (1+ki), where ki is a function of the contribution that the respective weighted comparison result makes to the dissimilarity measure, and the value of R-Rmean, where Rmean is the mean of lowest and highest possible relevance values.
  - 5. A method as in claim 2, including the steps of:
    - presenting to the user, for each of a plurality of candidate documents stored in the retained text store, the respective dissimilarity measures and respective links to said plurality of stored retrieved texts,
      
      receiving from the user an allocated relevance value R in respect of a presented dissimilarity measure, and
      
      modifying said weights by multiplying each weight (Wi) by a respective weight factor (1+ki), where ki is a function of the contribution that the respective weighted comparison result makes to the dissimilarity measure, and the value of R-Rmean, where Rmean is the mean of lowest and highest possible relevance values.
  - 6. A method as in claim 3, including the steps of:
    - presenting to the user, for each of a plurality of candidate documents stored in the retained text store, the respective dissimilarity measures and respective links to said plurality of stored retrieved texts,
      
      receiving from the user an allocated relevance value R in respect of a presented dissimilarity measure, and
      
      modifying said weights by multiplying each weight (Wi) by a respective weight factor (1+ki), where ki is a function of the contribution that the respective weighted comparison result makes to the dissimilarity measure, and the value of R-Rmean, where Rmean is the mean of lowest and highest possible relevance values.
  - 7. A method of generating a language model, said method comprising:
    - (i) providing a reference corpus;
      
      (ii) forming from said reference corpus a training portion and a development portion;
      
      (iii) generating initial values of respective weights corresponding to a plurality of analysis algorithms by processing the training portion in accordance with a predetermined algorithm;
      
      (iv) performing a comparison between the training portion and the development portion in accordance with a first of said analysis algorithms and producing an initial comparison result, said first analysis algorithm being arranged to generate and compare word-level n-gram frequency lists having n-grams from unigrams up to m-grams, where m is a predetermined integer, the word-level n-gram frequency list generated from the training portion constituting a language model;
      
      (v) performing information retrieval in accordance with steps (d) to (h) of the method of claim 1, using the above mentioned plurality of analysis algorithms, and using the training portion as the reference corpus;
      
      (vi) retrieving a document from the retained text store;
      
      (vii) modifying the training portion by combining it with the document retrieved from the retained text store;
      
      (viii) repeating step (iv) in respect of the modified training portion to produce a further comparison result; and
      
      (ix) modifying the weights in accordance with a weight modifying function of said initial and further comparison results.
  - 8. A method as in claim 7 wherein step (v) is stopped when a first candidate document is stored in the retained text store.
  - 9. A method as in claim 7 wherein step (vi) comprises selecting the document for retrieval on the basis of the least degree of dissimilarity.
  - 10. A method as in claim 9 wherein steps (vi) to (ix) are iteratively performed in respect of respective documents successively selected by increasing degree of dissimilarity.
  - 11. A method as in claim 10 wherein the modified training portion produced by step (vii) of one iteration of steps (vi) to (ix) constitutes the training portion of step (vii) of the next iteration thereof.
  - 12. A method as in claim 7, wherein the weight modifying function comprises multiplying each weight (Wi) by a respective weight factor (1+ki), where ki is a function of the contribution that the respective weighted comparison result makes to the dissimilarity measure, and the difference between the initial comparison result and the further comparison result.
  - 13. A method as in claim 8, wherein the weight modifying function comprises multiplying each weight (Wi) by a respective weight factor (1+ki), where ki is a function of the contribution that the respective weighted comparison result makes to the dissimilarity measure, and the difference between the initial comparison result and the further comparison result.
  - 14. A method as in claim 9, wherein the weight modifying function comprises multiplying each weight (Wi) by a respective weight factor (1+ki), where ki is a function of the contribution that the respective weighted comparison result makes to the dissimilarity measure, and the difference between the initial comparison result and the further comparison result.
  - 15. A method as in claim 10, wherein the weight modifying function comprises multiplying each weight (Wi) by a respective weight factor (1+ki), where ki is a function of the contribution that the respective weighted comparison result makes to the dissimilarity measure, and the difference between the initial comparison result and the further comparison result.
  - 16. A method as in claim 11, wherein the weight modifying function comprises multiplying each weight (Wi) by a respective weight factor (1+ki), where ki is a function of the contribution that the respective weighted comparison result makes to the dissimilarity measure, and the difference between the initial comparison result and the further comparison result.
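
Two pieces of arithmetic recur in the dependent claims: the confidence test of claim 3 and the weight factor (1+ki) of claims 4 to 6 and 12 to 16. The Python below is a hedged sketch of both. The claims state only that ki is a function of each weighted result's contribution to the dissimilarity measure and of a second quantity (R-Rmean, or the difference between the initial and further comparison results); the particular form and sign of `k`, the `rate` parameter, the `ngram_index` position and all function names here are assumptions made for illustration.

```python
from typing import List, Sequence


def gate_ngram_weight(weights: Sequence[float], homogeneity: float, total_words: int,
                      confidence_threshold: float, ngram_index: int = 1) -> List[float]:
    """Claim 3: confidence = homogeneity measure x total word count; if it is
    below the threshold, set the word-level n-gram weight to zero.
    ngram_index (position of that weight in the list) is an assumption."""
    confidence = homogeneity * total_words
    gated = list(weights)
    if confidence < confidence_threshold:
        gated[ngram_index] = 0.0
    return gated


def relevance_signal(relevance: float, r_min: float, r_max: float) -> float:
    """R - Rmean, where Rmean is the mean of the lowest and highest possible values."""
    return relevance - (r_min + r_max) / 2.0


def update_weights(weights: Sequence[float], results: Sequence[float],
                   signal: float, rate: float = 0.1) -> List[float]:
    """Multiply each weight Wi by (1 + ki).

    ASSUMED form: ki = -rate * (share of the dissimilarity measure contributed
    by metric i) * signal. The claims do not give this formula."""
    contributions = [w * r for w, r in zip(weights, results)]
    total = sum(contributions)
    if total == 0:
        return list(weights)  # no contribution to apportion; leave weights unchanged
    updated = []
    for w, c in zip(weights, contributions):
        k = -rate * (c / total) * signal
        updated.append(w * (1.0 + k))
    return updated
```

For claims 4 to 6, `signal` would be `relevance_signal(R, r_min, r_max)`; for claims 12 to 16 it would be the difference between the initial and further comparison results. With the sign assumed here, a rating above Rmean lowers the weights of the metrics that contributed most to the dissimilarity measure.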

17. An information agent for use in a communications network including a plurality of databases, the agent comprising;
- means for generating initial values of respective weights corresponding to a plurality of analysis algorithms by processing an identified reference corpus in accordance with a predetermined algorithm;
  
  means for retrieving from storage a text document as a candidate document;
  
  means for performing respective comparisons between the candidate document and the reference corpus in accordance with each of said analysis algorithms and producing respective comparison results;
  
  means for generating corresponding weighted comparison results by multiplying each said comparison result by its respective weight;
  
  means for summing the weighted comparison results to produce a dissimilarity measure that is indicative of the degree of dissimilarity between the retrieved reference corpus and the retrieved candidate document, and
  
  means for storing the candidate document in a retained text store if said sum is indicative of a degree of dissimilarity less than a predetermined degree of dissimilarity.

18. A document access system, for accessing documents stored in a distributed manner and accessible by means of a communications network, the access system comprising at least one software agent for use in accessing documents by means of the network, wherein the agent comprises:
- means for generating initial values of the respective weights corresponding to a plurality of analysis algorithms by processing an identified reference corpus in accordance with a predetermined algorithm;
  
  means for retrieving from storage a text document as a candidate document;
  
  means for performing respective comparisons between the candidate document and the reference corpus in accordance with each of said analysis algorithms and producing respective comparison results;
  
  means for generating corresponding weighted comparison results by multiplying each said comparison result by its respective weight;
  
  means for summing the weighted comparison results to produce a dissimilarity measure that is indicative of the degree of dissimilarity between the retrieved reference corpus and the retrieved candidate document; and
  
  means for storing the candidate document in a retained text store if said sum is indicative of a degree of dissimilarity less than a predetermined degree of dissimilarity.

Current Assignee: British Telecommunications PLC (BT Group PLC)
Original Assignee: British Telecommunications PLC (BT Group PLC)
Inventors: Rose, Tony G; Wyard, Peter J
Primary Examiner(s): Breene, John
Assistant Examiner(s): Robinson, Greta Lee
Application Number: US09/068,452
Time in Patent Office: 958 Days
Field of Search: 707/5, 707/10, 707/2, 707/3, 707/4
US Class Current: 1/1
CPC Class Codes:
G06F 16/3346   using probabilistic model
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
