System, method and computer program product for automatic topic identification using a hypertext corpus
First Claim
1. A method of improving accuracy of computerized topic identification comprising:
- a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising;
a) deriving, by at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
b) receiving, by the at least one computer processor, at least one content document;
c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;
d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, wherein said lexically scoring comprises lexically scoring, by the at least one computer processor, based on;
a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set,
ii. evaluating the semantic relatedness score, by the at least one computer processor, comprising evaluating said theoretical information content of an edge of said graph, wherein said theoretical information content comprises at least one of;
A. self-information, or
B. surprisal; and
iii. semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying; and
wherein said at least one data set comprises;
a hypertext corpus comprising at least one graph, wherein said graph comprises;
a plurality V of vertices, wherein each individual vertex v represents a page; and
a plurality of edges, each edge representing a link; and
wherein said each edge comprises;
information content I(u,v),
wherein said information content I comprises
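Steps a) through d) of the claim above can be illustrated with a minimal Python sketch. It assumes the corpus has already been reduced to (link text, target topic) pairs, treats every term as a single lowercase word, and combines occurrence count, weighting factor and prior probability by plain multiplication. The claim names these three inputs but not how they are combined, so the combination rule, the function names `derive_lexicon` and `lexical_scores`, and the default weight of 1.0 are assumptions made here for illustration only.

```python
import re
from collections import Counter, defaultdict


def derive_lexicon(links):
    """Build {term: {topic: prior}} from (link_text, target_topic) pairs.

    The prior of a sense is the fraction of the term's link occurrences
    that point at that topic. Single-word terms are an assumption.
    """
    counts = defaultdict(Counter)
    for term, topic in links:
        counts[term.lower()][topic] += 1
    lexicon = {}
    for term, topic_counts in counts.items():
        total = sum(topic_counts.values())
        lexicon[term] = {topic: n / total for topic, n in topic_counts.items()}
    return lexicon


def lexical_scores(document, lexicon, weights=None):
    """Accumulate a lexical score per candidate topic.

    Each term found contributes count * weight * prior to every topic
    one of its senses refers to (multiplication is an assumption).
    """
    weights = weights or {}
    tokens = re.findall(r"\w+", document.lower())
    occurrences = Counter(t for t in tokens if t in lexicon)
    scores = defaultdict(float)
    for term, count in occurrences.items():
        weight = weights.get(term, 1.0)  # hypothetical default weight
        for topic, prior in lexicon[term].items():
            scores[topic] += count * weight * prior
    return dict(scores)


# Hypothetical link data: "jaguar" links to the animal 3 times out of 4.
links = [
    ("jaguar", "Jaguar (animal)"),
    ("jaguar", "Jaguar (animal)"),
    ("jaguar", "Jaguar Cars"),
    ("jaguar", "Jaguar (animal)"),
    ("rainforest", "Rainforest"),
]
lexicon = derive_lexicon(links)
doc = "The jaguar hunts in the rainforest. A jaguar swims well."
print(lexical_scores(doc, lexicon))
# {'Jaguar (animal)': 1.5, 'Jaguar Cars': 0.5, 'Rainforest': 1.0}
```

Both senses of "jaguar" remain candidate topics after this step; the lexical score only ranks them by prior and frequency, which is why the claim follows it with a semantic scoring step.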
Abstract
A system, method, and/or computer program product for automatic topic identification using a hypertext corpus may include a) receiving a content document(s); b) identifying or lexically scoring candidate topic(s) in the received content document based on label(s) used in a corpus to link to or relate to the candidate topics; c) evaluating or semantically scoring the candidate topic(s) of the received document based on a relationship between two or more candidate topics in the corpus; and d) weighting candidate topics for relevance based on algorithmic or statistical analysis of links or relationships in the corpus.
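The semantic scoring in step c) depends on how candidate topics relate to one another in the corpus, and the claims describe that relatedness in terms of self-information (surprisal), which for an event of probability p is -log p. The sketch below is one hypothetical reading and not the formula from the patent: it assumes the corpus graph is a dict mapping each page to the set of pages it links to, estimates the probability of a link by its target's share of all links, and scores two topics by summing the information of the targets they both link to. The names `link_information` and `relatedness` are illustrative.

```python
import math
from collections import Counter


def link_information(graph):
    """Return {target: self-information in bits} for links into each target.

    Assumption: p(link to target) = in-degree(target) / total link count,
    so links to rarely cited pages carry more information.
    """
    in_degree = Counter(t for targets in graph.values() for t in targets)
    total = sum(in_degree.values())
    return {t: -math.log2(n / total) for t, n in in_degree.items()}


def relatedness(graph, topic_a, topic_b):
    """Sum the information of link targets shared by two topics.

    Treating "shared outgoing link target" as the co-adjacency test is
    an assumption made for this sketch.
    """
    info = link_information(graph)
    shared = graph.get(topic_a, set()) & graph.get(topic_b, set())
    return sum(info[t] for t in shared)


# Hypothetical corpus graph: page -> pages it links to.
graph = {
    "Jaguar (animal)": {"Felidae", "Rainforest"},
    "Ocelot": {"Felidae", "Rainforest"},
    "Amazon basin": {"Rainforest"},
    "Jaguar Cars": {"Coventry"},
}
print(relatedness(graph, "Jaguar (animal)", "Ocelot"))       # about 2.585 bits
print(relatedness(graph, "Jaguar (animal)", "Jaguar Cars"))  # 0
```

Under these assumptions a shared link to a heavily linked page such as "Rainforest" adds less to the score than a shared link to a rarely linked one, which is the intuition behind weighting links by information content rather than counting them.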
68 Citations
17 Claims
15. A computerized data processing analysis system of improving accuracy of computerized topic identification, comprising:
at least one memory; and
at least one topic identification computer processor coupled to said at least one memory, said at least one topic identification computer processor configured to;
execute a domain independent, language independent, computer processor automated topic identification analysis module, wherein said at least one topic identification computer processor is configured to;
derive a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic of said each of said at least one sense;
receive at least one content document;
search for at least one term from the lexicon derived from the at least one hypertext corpus data set, and find the at least one term from the lexicon that appears in the at least one content document to determine at least one candidate topic of the at least one content document;
lexically score each of said at least one candidate topic found that appears in the at least one content document based on the at least one term found in said search of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and wherein said at least one topic identification computer processor is configured to;
accumulate said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic;
lexically score based on;
a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
semantically score the at least one candidate topic found in the at least one content document based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
wherein said at least one topic identification computer processor is configured to;
quantify a semantic relatedness score of said any plurality of topics based on theoretical information content of co-adjacent links of a graph representation of data,
evaluate the semantic relatedness score comprising wherein said at least one topic identification computer processor is configured to evaluate the theoretical information content of an edge of said graph, wherein said theoretical information content comprises at least one of;
self-information, or
surprisal; and
semantically score based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set and based on said semantic relatedness score from said quantify, wherein said at least one topic identification computer processor is configured to;
weight said at least one candidate topic for relevance based on at least one of algorithmic analysis, or statistical analysis of semantic relatedness, and
wherein said each edge comprises;
information content I(u,v),
wherein said information content I comprises
17. A nontransitory computer program product embodied on a nontransitory computer readable medium, said computer program product comprising program logic, which when executed on at least one computer processor performs a computerized data processing method of improving accuracy of computerized topic identification comprising:
a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising;
a) deriving, by the at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
b) receiving, by the at least one computer processor, at least one content document;
c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;
d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, wherein said lexically scoring comprises lexically scoring, by the at least one computer processor, based on;
a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set,
ii. evaluating the semantic relatedness score, by the at least one computer processor, comprising evaluating said theoretical information content of an edge of said graph, wherein said theoretical information content comprises at least one of;
A. self-information, or
B. surprisal; and
f) semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic derived from the hypertext corpus data set, wherein the data processing method comprises wherein said lexically scoring comprises at least one of;
i) accumulating, by the at least one computer processor, at least one lexical score for a given candidate topic across at least one occurrence of the at least one term, which has at least one sense that refers to said given candidate topic;
ii) lexically scoring, by the at least one computer processor, based on a probability that the at least one term has a particular sense that refers to a particular topic;
iii) lexically scoring, by the at least one computer processor, based on a frequency of the at least one term in at least one of;
the at least one data set; or
an external linguistic corpus; or
iv) lexically scoring, by the at least one computer processor, based on a relative position of each of said at least one occurrence of the at least one term in said at least one content document, and
wherein said each edge comprises;
information content I(u,v),
wherein said information content I comprises
Specification