System, method and computer program product for automatic topic identification using a hypertext corpus
Abstract
A system, method, and/or computer program product for automatic topic identification using a hypertext corpus may include a) receiving a content document(s); b) identifying or lexically scoring candidate topic(s) in the received content document based on label(s) used in a corpus to link to or relate to the candidate topics; c) evaluating or semantically scoring the candidate topic(s) of the received document based on a relationship between two or more candidate topics in the corpus; and d) weighting candidate topics for relevance based on algorithmic or statistical analysis of links or relationships in the corpus.
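The abstract's use of link labels can be pictured with a small lexicon-building sketch. It follows one reading of the prior probability defined in the independent claims: for a given term, the fraction of its occurrences as link text that point to a particular topic. The function name `derive_lexicon`, the `(anchor_text, target_topic)` input shape, the lowercase normalization, and the sample links are all illustrative assumptions, not details taken from the patent.

```python
from collections import Counter, defaultdict


def derive_lexicon(links):
    """Build term -> {topic: prior} from (anchor_text, target_topic) pairs.

    Assumption: the prior of a sense is the share of a term's link
    occurrences that point at that sense's topic.
    """
    counts = defaultdict(Counter)
    for anchor_text, target_topic in links:
        # Assumed normalization; a real system may need language-aware rules.
        term = anchor_text.strip().lower()
        counts[term][target_topic] += 1

    lexicon = {}
    for term, senses in counts.items():
        total = sum(senses.values())
        lexicon[term] = {topic: n / total for topic, n in senses.items()}
    return lexicon


# Hypothetical hyperlinks harvested from a corpus.
links = [
    ("jaguar", "Jaguar (animal)"),
    ("jaguar", "Jaguar (animal)"),
    ("Jaguar", "Jaguar (animal)"),
    ("jaguar", "Jaguar Cars"),
    ("big cat", "Felidae"),
]
lexicon = derive_lexicon(links)
print(lexicon["jaguar"])
```

In this toy data the term "jaguar" has two senses, and three of its four link occurrences point to the animal, so that sense receives the larger prior.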
33 Claims
1. A method of improving accuracy of computerized topic identification comprising:
a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising;
a) deriving, by at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
b) receiving, by the at least one computer processor, at least one content document;
c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;
d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, wherein said lexically scoring comprises lexically scoring, by the at least one computer processor, based on;
a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics of the at least one hypertext corpus data set, and wherein said quantifying of the semantic relatedness score is based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set;
evaluating the semantic relatedness score during quantifying, by the at least one computer processor, comprising evaluating the theoretical information content of an edge of said graph representation of the at least one hypertext corpus data set, wherein a vertex of said graph representation of the at least one hypertext corpus data set represents one of said plurality of topics, and an edge of said co-adjacent edges represents a relationship between two topics;
wherein the theoretical information content of an edge is based on information theory and comprises at least one of;
A. self-information, or
B. surprisal
ii. semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying.
Dependent claims: 2-26.
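Step d) of claim 1 names three inputs to the lexical score: the number of occurrences of a term, a weighting factor for the term, and the prior probability of the sense. The claim does not state how they are combined, so the sketch below simply assumes a product accumulated per occurrence. The names `lexical_scores` and `term_weight`, the default weight of 1.0, and the whitespace tokenization are hypothetical; multi-word terms would need a phrase matcher that is not shown here.

```python
from collections import defaultdict


def lexical_scores(tokens, lexicon, term_weight):
    """Accumulate a lexical score per candidate topic.

    Assumption: each occurrence of a term adds weight * prior to every
    topic that one of the term's senses refers to.
    """
    scores = defaultdict(float)
    for token in tokens:
        senses = lexicon.get(token)
        if senses is None:
            continue  # token is not a lexicon term
        weight = term_weight.get(token, 1.0)  # assumed default weight
        for topic, prior in senses.items():
            scores[topic] += weight * prior
    return dict(scores)


# Hypothetical lexicon and weights for illustration only.
lexicon = {
    "jaguar": {"Jaguar (animal)": 0.75, "Jaguar Cars": 0.25},
    "engine": {"Engine": 1.0},
}
term_weight = {"jaguar": 2.0}

tokens = "the jaguar engine and the jaguar".split()
scores = lexical_scores(tokens, lexicon, term_weight)
print(scores)
```

Because the score is accumulated once per occurrence, the occurrence count enters implicitly: "jaguar" appears twice, so each of its senses is credited twice.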
27. A computerized data processing analysis system of improving accuracy of computerized topic identification comprising:
at least one memory; and
at least one topic identification computer processor coupled to said at least one memory, said at least one topic identification computer processor configured to;
execute a domain independent, language independent, computer processor automated topic identification analysis module, wherein said at least one topic identification computer processor is configured to;
derive a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
receive at least one content document;
search for at least one term from the lexicon derived from the at least one hypertext corpus data set, and find the at least one term from the lexicon that appears in the at least one content document to determine at least one candidate topic of the at least one content document;
lexically score each of said at least one candidate topic found that appears in the at least one content document based on the at least one term found in said search of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulate said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, lexically score based on;
a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
semantically score the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
wherein said at least one topic identification computer processor is configured to;
quantify a semantic relatedness score of said any plurality of topics of the at least one hypertext corpus data set, and wherein said quantifying of the semantic relatedness score is based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set;
evaluating the semantic relatedness score during quantifying, by the at least one computer processor, comprising evaluating the theoretical information content of an edge of said graph representation of the at least one hypertext corpus data set, wherein a vertex of said graph representation of the at least one hypertext corpus data set represents one of said plurality of topics, and an edge of said co-adjacent edges represents a relationship between two topics;
wherein the theoretical information content of an edge is based on information theory and comprises at least one of;
A. self-information, or
B. surprisal
ii. semantically score based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying.
Dependent claims: 28-30.
31. A nontransitory computer program produce embodied on a nontransitory computer readable medium, said computer program product comprising program logic, which when executed on at least one computer processor performs a computerized data processing method of improving accuracy of computerized topic identification comprising:
a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising;
a) deriving, by at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
b) receiving, by the at least one computer processor, at least one content document;
c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;
d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, wherein said lexically scoring comprises lexically scoring, by the at least one computer processor, based on;
a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics of the at least one hypertext corpus data set, and wherein said quantifying of the semantic relatedness score is based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set;
evaluating the semantic relatedness score during quantifying, by the at least one computer processor, comprising evaluating the theoretical information content of an edge of said graph representation of the at least one hypertext corpus data set, wherein a vertex of said graph representation of the at least one hypertext corpus data set represents one of said plurality of topics, and an edge of said co-adjacent edges represents a relationship between two topics;
wherein the theoretical information content of an edge is based on information theory and comprises at least one of;
A. self-information, or
B. surprisal
ii. semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying.
Dependent claims: 32.
33. A method of improving accuracy of computerized topic identification comprising:
a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising;
a) deriving, by at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
b) receiving, by the at least one computer processor, at least one content document;
c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;
d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, wherein said lexically scoring comprises lexically scoring, by the at least one computer processor, based on;
a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics of the at least one hypertext corpus data set, and
ii. semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying; and
said method further comprising;
analyzing semantic relatedness by the at least one computer processor, wherein said analyzing said semantic relatedness comprises;
f) quantifying, by the at least one computer processor, semantic relatedness of a plurality of topics in a semantic network based on theoretical information content of co-adjacent edges as a portion of total information content of all or a defined subset of edges of a graph representation of data, wherein a vertex of said graph representation of data represents one of said plurality of topics, and an edge of said co-adjacent edges represents a relationship between two topics; and wherein said quantifying comprises at least one of;
i. calculating, by the at least one computer processor, information content of each said edge in said graph based on any of;
a total in- or out-degree of edges across an entirety of said graph;
an in- or out-degree of co-adjacent edges; or
an explicit type of semantic relationship represented by said edge;
ii. calculating, by the at least one computer processor, semantic relatedness between at least two topics based on at least one of;
the theoretical information content of edges that connect vertices associated with each topic to the topics associated with co-adjacent vertices;
the theoretical information content of edges that connects vertices associated with each topic to each other topic; or
the theoretical information content of edges, optionally subject to a cost function, along indirect paths that connect vertices associate with each topic to each other topic; and
g) evaluating, by the at least one computer processor, the semantic relatedness of said plurality of topics in a semantic network, comprising;
evaluating, by the at least one computer processor, theoretical information content of an edge in a graph;
wherein the theoretical information content comprises at least one of;
self-information; or
surprisal.
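Steps f) and g) of claim 33 can be illustrated with a small sketch, under stated assumptions. Here the probability of an edge is taken to be the in-degree of its target divided by the total number of edges, its self-information (surprisal) is the negative base-2 logarithm of that probability, and the relatedness of two topics is the information carried by their shared (co-adjacent) links as a portion of the information carried by all of their links. These formulas, the function names, and the toy graph are illustrative choices and are not stated in the claims.

```python
import math
from collections import Counter


def edge_information(graph):
    """Self-information, in bits, of an edge pointing at each target vertex.

    Assumption: P(edge -> v) = in_degree(v) / total number of edges.
    """
    in_degree = Counter(t for targets in graph.values() for t in targets)
    total_edges = sum(in_degree.values())
    return {v: -math.log2(d / total_edges) for v, d in in_degree.items()}


def relatedness(graph, info, topic_a, topic_b):
    """Information of shared links as a portion of all links of both topics."""
    links_a = graph.get(topic_a, set())
    links_b = graph.get(topic_b, set())
    shared = sum(info[v] for v in sorted(links_a & links_b))
    total = sum(info[v] for v in sorted(links_a | links_b))
    return shared / total if total > 0 else 0.0


# Hypothetical topic graph: vertex -> set of topics it links to.
graph = {
    "Jaguar (animal)": {"Felidae", "South America"},
    "Leopard": {"Felidae", "Africa"},
    "Lion": {"Felidae", "Africa"},
    "Jaguar Cars": {"Automobile", "United Kingdom"},
}
info = edge_information(graph)
print(relatedness(graph, info, "Jaguar (animal)", "Leopard"))
print(relatedness(graph, info, "Jaguar (animal)", "Jaguar Cars"))
```

Under these assumptions a link to a rarely linked topic carries more information than a link to a heavily linked one, so two topics that share an unusual neighbor come out as more related than two that share only a very common one.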
Specification