Method for measuring thresholded relevance of a document to a specified topic
Abstract
A method is provided for specifying the representation of a document and determining the relevance of the document according to an externally defined topic profile. The topic profile includes one or more compound terms having a positive correlation with the topic of interest. Each compound term has a specified form, such as capitalization, punctuation, number, or adjacency relation, that is either ignored by conventional indexing processes or requires substantial data overhead to track. The compound terms of the topic profile are tagged to indicate how corresponding terms are treated when identified in a document being analyzed. Applying the topic profile to a document generates a document representation in which compound terms present in the document are retained in their specified form. A similarity function between the document representation and the topic profile is calculated, and the result is compared to a relevance threshold associated with the topic profile. A document is deemed relevant to the topic when the similarity function meets or exceeds the threshold.
30 Claims
1. A method for determining a thresholded measure of relevance of a document to a topic defined by a topic profile that includes a compound term template, the method comprising the steps of:
converting the document into a stream of tokens;
scanning the stream for one or more tokens matching the compound term template;
augmenting the document tokens with a compound term token to provide a representation of the document, when a match is detected; and
calculating a similarity transform between the document tokens and the topic profile to provide the measure of relevance.
Dependent claims: 2, 3, 4.
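The four steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not say which similarity transform is used, so cosine similarity stands in here, and the template format (a tuple of exact, adjacent, case-sensitive words), the function names, and the example weights and threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Convert the document into a stream of tokens, preserving case and order."""
    return re.findall(r"[A-Za-z0-9]+", text)


def find_compound(tokens, template):
    """Return start positions where the template words occur adjacently.

    Assumption: a template is a tuple of exact, case-sensitive words.
    """
    n = len(template)
    return [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == template]


def represent(tokens, template, compound_token):
    """Augment the lowercased tokens with one compound token per match."""
    matches = find_compound(tokens, template)
    augmented = [t.lower() for t in tokens] + [compound_token] * len(matches)
    return Counter(augmented)


def cosine(doc_vec, profile_vec):
    """Cosine similarity between term-count and term-weight mappings."""
    dot = sum(w * doc_vec.get(term, 0) for term, w in profile_vec.items())
    norm_d = math.sqrt(sum(v * v for v in doc_vec.values()))
    norm_p = math.sqrt(sum(v * v for v in profile_vec.values()))
    return dot / (norm_d * norm_p) if norm_d and norm_p else 0.0


def is_relevant(text, template, compound_token, profile_vec, threshold):
    """Return (score, decision); relevant when the score meets the threshold."""
    doc_vec = represent(tokenize(text), template, compound_token)
    score = cosine(doc_vec, profile_vec)
    return score, score >= threshold


# Hypothetical profile: the compound term is weighted above a plain term.
profile = {"apple_computer": 3.0, "macintosh": 1.0}
template = ("Apple", "Computer")
print(is_relevant("Apple Computer shipped a new Macintosh",
                  template, "apple_computer", profile, 0.3))
print(is_relevant("I ate an apple near my computer",
                  template, "apple_computer", profile, 0.3))
```

In the second call the words "apple" and "computer" both occur, but not capitalized and not adjacent, so no compound token is added and the score is zero. That is the kind of aliasing the specified form is meant to suppress.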
5. A method for identifying a topic to be searched in a document that is to be represented as a plurality of tokens, the method comprising the steps of:
identifying one or more profile terms having a positive correlation with the topic;
selecting from the one or more profile terms a compound term to be identified in a specified form in the document; and
tagging the compound term to indicate a retention status for a token assigned to the compound term in the document representation when an instance of the compound term is identified in the document.
Dependent claims: 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19.
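One way to picture the tagging step of claim 5 is a profile entry that carries its own retention status. The sketch below is only an illustration: the claim text here does not define the possible tag values, so the two statuses (`KEEP_PARTS` and `REPLACE_PARTS`), the `CompoundTerm` record, and `apply_term` are hypothetical names and semantics chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Retention(Enum):
    # Hypothetical tag values; the claim text does not enumerate them.
    KEEP_PARTS = "keep"        # add the compound token, keep the matched words
    REPLACE_PARTS = "replace"  # compound token replaces the matched words


@dataclass(frozen=True)
class CompoundTerm:
    words: tuple          # specified form: exact adjacent, case-sensitive words
    token: str            # token assigned to the compound term
    retention: Retention  # tag consulted when an instance is found


def apply_term(tokens, term):
    """Rewrite a token list according to the term's retention tag."""
    n = len(term.words)
    if n == 0:
        return list(tokens)
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + n]) == term.words:
            if term.retention is Retention.KEEP_PARTS:
                out.extend(tokens[i:i + n])
            out.append(term.token)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = ["the", "White", "House", "said"]
replace = CompoundTerm(("White", "House"), "WHITE_HOUSE", Retention.REPLACE_PARTS)
keep = CompoundTerm(("White", "House"), "WHITE_HOUSE", Retention.KEEP_PARTS)
print(apply_term(tokens, replace))
print(apply_term(tokens, keep))
```

Storing the tag on the profile entry, rather than in the tokenizer, keeps the document processing generic: the same scanning code treats each compound term as its topic profile dictates.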
20. A method for generating a representation of a document as a plurality of tokens tailored for searches on a selected topic, the method comprising the steps of:
identifying one or more profile terms having positive correlations with the selected topic;
selecting a profile term to be identified in a specified form in the one or more documents; and
tagging the selected profile term to indicate a retention status for an associated token in the representation of the one or more documents.
21. A method for determining a measure of relevance of a document, represented as a plurality of tokens, to a specified subject area, the method comprising the steps of:
identifying a plurality of topics within the specified subject area;
for each identified topic, selecting a topic profile that includes one or more compound terms that are characteristic of the topic, have a specified form to be identified in the plurality of tokens representing the document, and an associated retention status tag;
modifying the plurality of tokens representing the document according to associated retention status tags when instances of the one or more compound terms are identified in the plurality of tokens;
stopping and stemming the modified plurality of tokens to generate a document representation; and
calculating a similarity measure between each of the plurality of topic profiles and the document representation.
Dependent claims: 22, 23, 24.
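The last two steps of claim 21 (stopping and stemming the modified tokens, then scoring against every topic profile) might look like the sketch below. It assumes the compound tokens have already been inserted into the token list and are passed in as a `protected` set so that stopping and stemming leave them intact. The stop list, the crude suffix-stripping stemmer (not a Porter stemmer), the cosine measure, and all names are assumptions for illustration only.

```python
import math
from collections import Counter

# Toy stop list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is"}


def naive_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def stop_and_stem(tokens, protected):
    """Drop stop words and stem the rest, leaving protected tokens untouched."""
    out = []
    for tok in tokens:
        if tok in protected:
            out.append(tok)
            continue
        low = tok.lower()
        if low not in STOP_WORDS:
            out.append(naive_stem(low))
    return Counter(out)


def cosine(doc_vec, profile_vec):
    """Cosine similarity between term counts and profile term weights."""
    dot = sum(w * doc_vec.get(term, 0) for term, w in profile_vec.items())
    norm_d = math.sqrt(sum(v * v for v in doc_vec.values()))
    norm_p = math.sqrt(sum(v * v for v in profile_vec.values()))
    return dot / (norm_d * norm_p) if norm_d and norm_p else 0.0


def score_topics(tokens, profiles, protected):
    """Return one similarity score per topic profile."""
    doc_vec = stop_and_stem(tokens, protected)
    return {name: cosine(doc_vec, weights) for name, weights in profiles.items()}


# Hypothetical profiles for two topics within one subject area.
profiles = {
    "politics": {"WHITE_HOUSE": 2.0, "talk": 1.0},
    "sports": {"game": 1.0},
}
tokens = ["The", "WHITE_HOUSE", "is", "hosting", "talks", "on", "trade"]
print(score_topics(tokens, profiles, protected={"WHITE_HOUSE"}))
```

The document is stopped and stemmed once and the resulting representation is reused for every profile, which matches the order of steps in the claim: modify the tokens, build one representation, then compute a similarity per topic.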
25. A topic profile for evaluating the relevance to a topic of a document that has been tokenized into a stream of terms, the topic profile comprising:
one or more profile terms that are characteristic of the topic;
one or more tags associated with selected profile terms to indicate retention/elimination of document terms represented by the selected profile terms in the stream; and
a relevance threshold, wherein the topic profile is applied to tokenized terms of the document to preserve selected profile terms in a document representation and the one or more compound terms
26. A method for generating a topic profile to determine a measure of relevance of a document to a topic characterized by the topic profile, the method comprising the steps of:
identifying a term used to discuss the topic;
eliminating the identified term when it is a common term;
adding the term to the topic profile when the term has a strong, positive correlation with the topic and does not alias other topics; and
when a term aliases other topics, defining a specified form for the term that reduces aliasing and maintains a positive correlation with the topic, and adding the specified form to the topic profile as a compound term.
Dependent claims: 27, 28, 29, 30.
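The decision procedure of claim 26 can be sketched as a filter over candidate terms. This is a hedged illustration, not the claimed method itself: the claim does not say how correlation or aliasing is measured, so the `correlation` score, the `aliases_other_topics` flag, the `min_correlation` cutoff, the common-term list, and the `Candidate` and `TopicProfile` records are all hypothetical inputs supplied by whoever builds the profile.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical list of terms too common to discriminate between topics.
COMMON_TERMS = {"the", "of", "and", "company"}


@dataclass
class Candidate:
    term: str
    correlation: float                    # assumed estimate, higher is stronger
    aliases_other_topics: bool            # assumed to be judged by the author
    specified_form: Optional[str] = None  # form that reduces aliasing, if any


@dataclass
class TopicProfile:
    terms: list = field(default_factory=list)
    compound_terms: list = field(default_factory=list)


def build_profile(candidates, min_correlation=0.5):
    """Apply the eliminate / add / add-as-compound-term decisions in order."""
    profile = TopicProfile()
    for c in candidates:
        if c.term.lower() in COMMON_TERMS:
            continue  # common term: eliminated
        if c.correlation < min_correlation:
            continue  # not strongly correlated with the topic
        if not c.aliases_other_topics:
            profile.terms.append(c.term)
        elif c.specified_form is not None:
            # Aliasing term: keep only its specified form, as a compound term.
            profile.compound_terms.append(c.specified_form)
    return profile


candidates = [
    Candidate("the", 0.9, False),
    Candidate("Macintosh", 0.8, False),
    Candidate("apple", 0.7, True, specified_form="Apple Computer"),
    Candidate("fruit", 0.1, False),
]
print(build_profile(candidates))
```

An aliasing term with no usable specified form is simply left out in this sketch; the claim does not say what should happen in that case.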
Specification