
System, method and computer program product for automatic topic identification using a hypertext corpus

  • US 9,442,930 B2
  • Filed: 03/14/2013
  • Issued: 09/13/2016
  • Est. Priority Date: 09/07/2011
  • Status: Expired due to Fees
First Claim

1. A method of improving accuracy of computerized topic identification comprising:

  • a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising;

    a) deriving, by at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;

    b) receiving, by the at least one computer processor, at least one content document;

    c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;

    d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, wherein said lexically scoring comprises lexically scoring, by the at least one computer processor, based on;

    a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and

    e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising;

    i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set,

    ii. evaluating the semantic relatedness score, by the at least one computer processor, comprising evaluating said theoretical information content of an edge of said graph, wherein said theoretical information content comprises at least one of;



    A. self-information, or 

    B. surprisal; and

    iii. semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying; and

    wherein said at least one data set comprises;

    a hypertext corpus comprising at least one graph, wherein said graph comprises;

    a plurality V of vertices, wherein each individual vertex v represents a page; and

    a plurality of edges, each edge representing a link; and

    wherein said each edge comprises;

    information content I(u,v), 

    wherein said information content I comprises
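
For illustration only, steps (a) and (d) of the claim and the general notion of self-information can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the representation of hypertext links as (anchor text, target topic) pairs, the single-word terms, and the default weight of 1.0 are all hypothetical. The self_information function shows only the standard definition, the negative logarithm of a probability; the claim's own definition of I(u,v) is cut off in the text above and is not reconstructed here.

```python
import math
from collections import Counter, defaultdict


def build_lexicon(links):
    """Derive {term: {topic: prior}} from (anchor_text, target_topic) pairs.

    The prior of a sense is the fraction of the term's occurrences as a
    link that point to that topic. The input format is an assumption.
    """
    counts = defaultdict(Counter)
    for term, topic in links:
        counts[term.lower()][topic] += 1
    lexicon = {}
    for term, topic_counts in counts.items():
        total = sum(topic_counts.values())
        lexicon[term] = {t: n / total for t, n in topic_counts.items()}
    return lexicon


def lexical_scores(tokens, lexicon, weights=None):
    """Accumulate a lexical score per candidate topic.

    Each occurrence of a lexicon term adds weight(term) * prior(sense) to
    the topic its sense refers to, so the occurrence count enters through
    the accumulation. Terms are single tokens here for simplicity; the
    weights mapping is hypothetical and defaults to 1.0 per term.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for token in tokens:
        term = token.lower()
        for topic, prior in lexicon.get(term, {}).items():
            scores[topic] += weights.get(term, 1.0) * prior
    return dict(scores)


def self_information(p):
    """Standard self-information (surprisal) of probability p, in bits."""
    return -math.log2(p)


# Hypothetical link data: the anchor text "jaguar" links to two topics.
links = [
    ("jaguar", "Jaguar_(animal)"),
    ("jaguar", "Jaguar_Cars"),
    ("jaguar", "Jaguar_Cars"),
    ("jaguar", "Jaguar_Cars"),
    ("engine", "Engine"),
]
lexicon = build_lexicon(links)
scores = lexical_scores(["Jaguar", "engine", "jaguar"], lexicon, {"engine": 2.0})
```

In this toy data the term "jaguar" has two senses with priors 0.75 and 0.25, and every topic reached through a matched term becomes a candidate topic whose score grows with each occurrence. The semantic scoring of step (e) is left out, because it depends on the edge information content whose definition is not shown in this excerpt.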
