Information retrieval system and method that generates weighted comparison results to analyze the degree of dissimilarity between a reference corpus and a candidate document
Abstract
An internet information agent accepts a reference document and analyses it in accordance with metrics defined by its analysis algorithm, obtaining respective lists (word, character-level n-gram, word-level n-gram) and deriving weights corresponding to the metrics. The agent then applies the metrics to a candidate document to obtain respective returned values, applies the weights to the returned values, and sums the results to obtain a Document Dissimilarity (DD) value. This DD is compared with a Dissimilarity Threshold (DT), and the candidate document is stored if the DD is less than the DT. A user can assign relevance values to the search results, and the agent modifies the weights accordingly. The agent can be used to improve a language model for use in speech recognition applications and the like.
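The three lists named in the abstract can be illustrated with a minimal Python sketch. It is not the patented analysis algorithm: the function names, the tokenisation rule, the default n-gram sizes, and the Jaccard-based `list_dissimilarity` metric are all assumptions chosen only to make the idea concrete.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def word_list(text):
    """Frequency list of single words."""
    return Counter(tokenize(text))


def char_ngram_list(text, n=3):
    """Frequency list of character-level n-grams (n=3 is an assumption)."""
    chars = " ".join(tokenize(text))
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def word_ngram_list(text, n=2):
    """Frequency list of word-level n-grams (n=2 is an assumption)."""
    words = tokenize(text)
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def list_dissimilarity(reference, candidate):
    """Hypothetical metric: 1 minus the Jaccard overlap of the list entries.

    Returns 0.0 when both lists hold the same entries and 1.0 when they
    share none. Frequencies are ignored in this simplified version.
    """
    ref_keys, cand_keys = set(reference), set(candidate)
    union = ref_keys | cand_keys
    if not union:
        return 0.0
    return 1.0 - len(ref_keys & cand_keys) / len(union)
```

Applying each list builder to the reference document and to a candidate, then passing the two lists to `list_dissimilarity`, gives one returned value per metric. Those values are what the agent would then weight and sum into the DD.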
18 Claims
1. A method of information retrieval comprising:
(a) receiving from a user, data identifying a stored reference corpus;
(b) retrieving the identified reference corpus from storage;
(c) generating initial values of respective weights corresponding to a plurality of analysis algorithms by processing the retrieved reference corpus in accordance with a predetermined algorithm;
(d) retrieving from storage another text document as a candidate document;
(e) performing respective comparisons between the candidate document and the reference corpus in accordance with each of said analysis algorithms and producing respective comparison results;
(f) generating corresponding weighted comparison results by multiplying each said comparison result by its respective weight;
(g) summing the weighted comparison results to produce a dissimilarity measure that is indicative of the degree of dissimilarity between the retrieved reference corpus and the retrieved candidate document; and
(h) storing the candidate document in a retained text store if said sum is indicative of a degree of dissimilarity less than a predetermined degree of dissimilarity.
(Dependent claims: 2 to 16)
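Steps (e) to (h) of claim 1 amount to a weighted sum followed by a threshold test, which the following Python sketch illustrates. The two comparison functions, the equal weights, and the threshold value are hypothetical placeholders. Step (c) is not sketched because this section does not say what the predetermined weight-generating algorithm is.

```python
def vocabulary_gap(reference, candidate):
    """Hypothetical comparison: share of candidate words absent from the reference."""
    ref_words = set(reference.lower().split())
    cand_words = candidate.lower().split()
    if not cand_words:
        return 1.0
    return sum(w not in ref_words for w in cand_words) / len(cand_words)


def length_gap(reference, candidate):
    """Hypothetical comparison: relative difference in word count."""
    ref_len = len(reference.split())
    cand_len = len(candidate.split())
    longest = max(ref_len, cand_len)
    return abs(ref_len - cand_len) / longest if longest else 0.0


def dissimilarity_measure(reference, candidate, comparisons, weights):
    """Steps (e) to (g): compare, weight each result, and sum."""
    if len(comparisons) != len(weights):
        raise ValueError("one weight is required per analysis algorithm")
    results = [compare(reference, candidate) for compare in comparisons]  # (e)
    weighted = [w * r for w, r in zip(weights, results)]                  # (f)
    return sum(weighted)                                                  # (g)


def retain_if_similar(reference, candidate, comparisons, weights,
                      threshold, retained_store):
    """Step (h): keep the candidate only if its measure is below the threshold."""
    measure = dissimilarity_measure(reference, candidate, comparisons, weights)
    if measure < threshold:
        retained_store.append(candidate)
        return True
    return False


# Example run with assumed equal weights and an assumed threshold of 0.3.
comparisons = [vocabulary_gap, length_gap]
weights = [0.5, 0.5]
retained = []
reference = "the cat sat on the mat"
retain_if_similar(reference, "the cat sat on the hat",
                  comparisons, weights, 0.3, retained)
retain_if_similar(reference, "stock markets fell sharply today",
                  comparisons, weights, 0.3, retained)
```

In this run only the first candidate ends up in `retained`: it differs from the reference by a single word, whereas the second shares no vocabulary with it and so scores above the threshold.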
17. An information agent for use in a communications network including a plurality of databases, the agent comprising:
means for generating initial values of respective weights corresponding to a plurality of analysis algorithms by processing an identified reference corpus in accordance with a predetermined algorithm;
means for retrieving from storage a text document as a candidate document;
means for performing respective comparisons between the candidate document and the reference corpus in accordance with each of said analysis algorithms and producing respective comparison results;
means for generating corresponding weighted comparison results by multiplying each said comparison result by its respective weight;
means for summing the weighted comparison results to produce a dissimilarity measure that is indicative of the degree of dissimilarity between the retrieved reference corpus and the retrieved candidate document; and
means for storing the candidate document in a retained text store if said sum is indicative of a degree of dissimilarity less than a predetermined degree of dissimilarity.
18. A document access system, for accessing documents stored in a distributed manner and accessible by means of a communications network, the access system comprising at least one software agent for use in accessing documents by means of the network, wherein the agent comprises:
means for generating initial values of the respective weights corresponding to a plurality of analysis algorithms by processing an identified reference corpus in accordance with a predetermined algorithm;
means for retrieving from storage a text document as a candidate document;
means for performing respective comparisons between the candidate document and the reference corpus in accordance with each of said analysis algorithms and producing respective comparison results;
means for generating corresponding weighted comparison results by multiplying each said comparison result by its respective weight;
means for summing the weighted comparison results to produce a dissimilarity measure that is indicative of the degree of dissimilarity between the retrieved reference corpus and the retrieved candidate document; and
means for storing the candidate document in a retained text store if said sum is indicative of a degree of dissimilarity less than a predetermined degree of dissimilarity.
Specification