Method for characterizing a document set using evaluation surrogates
Abstract
A method is provided for determining the relevance of a document to one or more topics, each of which is specified by a topic profile. The document is tokenized into a stream of document tokens, and compound terms specified in the topic profiles are identified among the document tokens. For each identified compound term, the stream of document tokens is augmented with a tagged compound term token specified in the topic profile. The augmented stream of document tokens is then stopped to eliminate tokens representing common terms, redundant terms, and selected terms associated with tagged tokens. A similarity function is calculated between the resulting document representation and each of the topic profiles to provide an evaluation surrogate that includes measures of relevance of the document to each of the topic profiles.
12 Claims
1. A method for determining measures of relevance of a document to selected topics, wherein the document is represented as a stream of tokens and the selected topics are represented by topic profiles, each of which includes one or more compound term templates that specify the precise forms of terms characteristic of the topic, the method comprising the steps of:
applying the topic profiles to the token stream to identify compound terms in the document;
augmenting the token stream with a compound term token for each compound term identified;
eliminating from the augmented token stream tokens representing common terms, redundant tokens that correspond to repeated instances of a term, and selected tokens representing components of compound terms to provide a compact representation of the document;
calculating a similarity function between the compact document representation of the document and the topic profiles to form an evaluation surrogate of the document for the topic profiles.
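The first three steps of claim 1 (identifying compound terms, augmenting the stream, and eliminating tokens) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the claim does not define how a compound term template is stored, so a template is modeled as an exact token sequence mapped to a hypothetical tagged token such as `CT:heart_attack`, and all component tokens of a matched compound term are dropped, although the claim only requires that "selected" components be eliminated.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical model of a compound term template: an exact token sequence
# mapped to the tagged compound term token to add to the stream.
CompoundTemplates = Dict[Tuple[str, ...], str]


def augment(tokens: Sequence[str],
            templates: CompoundTemplates) -> Tuple[List[str], Set[str]]:
    """Copy the stream, appending a compound term token after each match.

    Also returns the component tokens of every compound term that matched.
    """
    augmented: List[str] = []
    components: Set[str] = set()
    for i, tok in enumerate(tokens):
        augmented.append(tok)
        for pattern, tag in templates.items():
            start = i - len(pattern) + 1
            if start >= 0 and tuple(tokens[start:i + 1]) == pattern:
                augmented.append(tag)
                components.update(pattern)
    return augmented, components


def stop(tokens: Sequence[str], common: Set[str],
         components: Set[str]) -> List[str]:
    """Drop common terms, compound term components, and repeated tokens."""
    seen: Set[str] = set()
    compact: List[str] = []
    for tok in tokens:
        if tok in common or tok in components or tok in seen:
            continue
        seen.add(tok)
        compact.append(tok)
    return compact


def compact_representation(tokens: Sequence[str],
                           templates: CompoundTemplates,
                           common: Set[str]) -> List[str]:
    """Augment the token stream, then stop it."""
    augmented, components = augment(tokens, templates)
    return stop(augmented, common, components)


tokens = "the patient had a heart attack and the heart attack was severe".split()
templates = {("heart", "attack"): "CT:heart_attack"}
common = {"the", "a", "and", "was", "had"}
print(compact_representation(tokens, templates, common))
# prints ['patient', 'CT:heart_attack', 'severe']
```

In this sketch the second occurrence of the compound term contributes nothing to the compact representation, because its tagged token is treated as a redundant token.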
3. A method for characterizing the relevance of a document, represented as a stream of document terms, to one or more topic profiles, the method comprising the steps of, for each of the one or more topic profiles:
identifying compound terms in the stream using compound term data structures specified in the topic profile;
augmenting the stream of document terms with compound term tokens for compound terms identified in the stream;
stopping the augmented stream of document terms, according to selected criteria;
calculating a similarity function between the document and the topic profile to provide a measure of relevance of the document to the topic profile; and
adding the calculated similarity function as an entry in an evaluation surrogate for the document.
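The last two steps of claim 3 can be sketched as follows. The claim does not say which similarity function is used, so this example assumes cosine similarity between a binary document vector (each stopped term counts once) and a hypothetical weighted term vector for the topic profile. The names `similarity`, `add_entry`, and the profile weights are illustrative only.

```python
import math
from typing import Dict, Iterable


def similarity(compact_doc: Iterable[str],
               profile_weights: Dict[str, float]) -> float:
    """Cosine similarity between a binary document vector and profile weights.

    Assumption: the claim leaves the similarity function open; cosine
    similarity is used here only as a familiar example.
    """
    doc_terms = set(compact_doc)
    profile_norm = math.sqrt(sum(w * w for w in profile_weights.values()))
    if not doc_terms or profile_norm == 0.0:
        return 0.0
    dot = sum(w for term, w in profile_weights.items() if term in doc_terms)
    return dot / (math.sqrt(len(doc_terms)) * profile_norm)


def add_entry(surrogate: Dict[str, float], topic: str,
              compact_doc: Iterable[str],
              profile_weights: Dict[str, float]) -> Dict[str, float]:
    """Store the measure of relevance for one topic profile in the surrogate."""
    surrogate[topic] = similarity(compact_doc, profile_weights)
    return surrogate


# Hypothetical stopped document and topic profiles.
compact_doc = ["patient", "CT:heart_attack", "severe"]
profiles = {
    "cardiology": {"CT:heart_attack": 2.0, "cardiac": 1.0, "patient": 1.0},
    "finance": {"stock": 1.0, "CT:interest_rate": 2.0},
}

surrogate: Dict[str, float] = {}
for topic, weights in profiles.items():
    add_entry(surrogate, topic, compact_doc, weights)
print(surrogate)
```

Under these assumptions the evaluation surrogate is simply a mapping from topic name to relevance measure, with one entry added per topic profile.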
8. A method for representing a document as an evaluation surrogate indicating relevance measures of the document to each of a plurality of topic profiles, the method comprising the steps of:
tokenizing the document into a stream of document terms;
applying compound term data structures specified in each of the plurality of topic profiles to the document terms to augment the document terms with compound terms identified in the document through the data structures;
stopping the document terms according to the identified compound terms of each topic profile;
calculating a similarity function between the augmented document terms for each topic profile and the topic profile to determine a measure of relevance of the document to the topic profile; and
collecting the measures of relevance for each of the plurality of topic profiles into an evaluation surrogate for the document.
11. A method for determining measures of relevance of a set of documents to a plurality of topics, the method comprising the steps of:
generating a topic profile for each of the selected topics, each of the topic profiles comprising terms, including compound terms, characteristic of the topic;
comparing each of the topic profiles with each of the documents of the set to identify a set of topic profile terms appearing in each document of the set;
calculating a measure of relevance for each document to each of the topic profiles, using the identified set of topic profile terms of each document; and
generating an evaluation surrogate for each document, the evaluation surrogate of each document comprising the measures of relevance of the document to each of the topic profiles.
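Claim 11 applies the same idea to a whole document set. The sketch below is one possible reading, under loudly stated assumptions: `TopicProfile` is a hypothetical data structure, documents are tokenized by lowercasing and splitting on whitespace, and the measure of relevance is taken to be the fraction of profile terms that appear in the document. The claim itself does not prescribe any of these choices.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class TopicProfile:
    """Hypothetical topic profile: lowercase single terms and compound terms."""
    name: str
    terms: FrozenSet[str]
    compound_terms: FrozenSet[str]  # space-separated, e.g. "heart attack"


def matched_terms(text: str, profile: TopicProfile) -> Set[str]:
    """Identify the profile terms (single and compound) appearing in a document."""
    words = text.lower().split()
    padded = " " + " ".join(words) + " "
    found = {t for t in profile.terms if t in words}
    found |= {c for c in profile.compound_terms if f" {c} " in padded}
    return found


def evaluation_surrogates(documents: Dict[str, str],
                          profiles: List[TopicProfile]) -> Dict[str, Dict[str, float]]:
    """Build one surrogate per document: topic name -> measure of relevance."""
    result: Dict[str, Dict[str, float]] = {}
    for doc_id, text in documents.items():
        surrogate: Dict[str, float] = {}
        for profile in profiles:
            total = len(profile.terms) + len(profile.compound_terms)
            hits = matched_terms(text, profile)
            # Assumed measure: share of the profile's terms found in the document.
            surrogate[profile.name] = len(hits) / total if total else 0.0
        result[doc_id] = surrogate
    return result


profiles = [
    TopicProfile("cardiology", frozenset({"cardiac", "patient"}),
                 frozenset({"heart attack"})),
    TopicProfile("finance", frozenset({"stock"}), frozenset({"interest rate"})),
]
documents = {
    "d1": "The patient had a heart attack",
    "d2": "The interest rate moved the stock market",
}
print(evaluation_surrogates(documents, profiles))
```

Whitespace tokenization ignores punctuation, so a production tokenizer would need to be more careful than this example.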
12. A method for identifying one or more documents in a document set that are relevant to a topic query comprising one or more topic profiles, the method comprising the steps of:
tokenizing each of the one or more documents into a stream of document tokens;
identifying compound terms, specified in the one or more topic profiles, in each stream of document tokens, and adding a compound term token to the stream for each compound term identified;
eliminating common, redundant, and selected tokens from each stream to provide a compact representation of each document of the set; and
determining a similarity function between each document representation and the one or more topic profiles to determine a measure of relevance of each document to each of the one or more topic profiles.
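Once each document has measures of relevance for the topic profiles in a query, relevant documents can be picked out. The following Python sketch assumes the measures are already available as per-document mappings, and it introduces two hypothetical policies that the claim does not state: a document's score for a multi-profile query is its lowest measure across those profiles, and a fixed `threshold` decides relevance.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed shape of precomputed measures: doc id -> {topic name -> relevance}.
Surrogates = Dict[str, Dict[str, float]]


def relevant_documents(surrogates: Surrogates,
                       query_topics: Sequence[str],
                       threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (doc id, score) pairs meeting the threshold, best first.

    Assumption: a document must be relevant to every profile in the query,
    so its score is the minimum measure over the queried topics.
    """
    hits: List[Tuple[str, float]] = []
    for doc_id, surrogate in surrogates.items():
        score = min((surrogate.get(t, 0.0) for t in query_topics), default=0.0)
        if score >= threshold:
            hits.append((doc_id, score))
    # Sort by descending score, breaking ties by document id.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


surrogates = {
    "doc1": {"cardiology": 0.71, "finance": 0.0},
    "doc2": {"cardiology": 0.55, "finance": 0.60},
    "doc3": {"cardiology": 0.90},
}
print(relevant_documents(surrogates, ["cardiology"], threshold=0.6))
# prints [('doc3', 0.9), ('doc1', 0.71)]
print(relevant_documents(surrogates, ["cardiology", "finance"]))
# prints [('doc2', 0.55)]
```

Other combination rules (for example, the maximum or a weighted sum of the measures) would fit the claim language equally well; the minimum is used here only to keep the example short.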
Specification