System, method and computer program product for automatic topic identification using a hypertext corpus

US 9,442,930 B2
Filed: 03/14/2013
Issued: 09/13/2016
Est. Priority Date: 09/07/2011
Status: Expired due to Fees

0 Assignments

0 Petitions

Abstract

A system, method, and/or computer program product for automatic topic identification using a hypertext corpus may include a) receiving a content document(s); b) identifying or lexically scoring candidate topic(s) in the received content document based on label(s) used in a corpus to link to or relate to the candidate topics; c) evaluating or semantically scoring the candidate topic(s) of the received document based on a relationship between two or more candidate topics in the corpus; and d) weighting candidate topics for relevance based on algorithmic or statistical analysis of links or relationships in the corpus.
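
The link labels mentioned in the abstract can be turned into a lexicon in which each term maps to its candidate topics (senses), each with a prior probability, along the lines of claim 1(a). The following is a minimal Python sketch under stated assumptions, not the patented implementation: the input format (a list of `(anchor_text, target_topic)` pairs), the function name `derive_lexicon`, the lowercasing step, and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def derive_lexicon(links):
    """Build {term: {topic: prior}} from (anchor_text, target_topic) pairs.

    ASSUMPTION: the prior of a sense is taken to be the fraction of a term's
    link occurrences that point at that topic.
    """
    counts = defaultdict(Counter)
    for anchor, topic in links:
        # Hypothetical normalization: trim and lowercase the link label.
        counts[anchor.strip().lower()][topic] += 1

    lexicon = {}
    for term, topic_counts in counts.items():
        total = sum(topic_counts.values())
        lexicon[term] = {topic: n / total for topic, n in topic_counts.items()}
    return lexicon


# Hypothetical link data harvested from a hypertext corpus.
links = [
    ("jaguar", "Jaguar_(animal)"),
    ("jaguar", "Jaguar_Cars"),
    ("Jaguar", "Jaguar_Cars"),
    ("big cat", "Felidae"),
]
lexicon = derive_lexicon(links)
```

In this sketch an ambiguous label such as "jaguar" ends up with two senses whose priors sum to one, while an unambiguous label gets a single sense with prior 1.0.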

68 Citations

17 Claims

1. A method of improving accuracy of computerized topic identification comprising:
- a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising:
  
  a) deriving, by at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic,
  wherein said at least one term comprises at least one word,
  wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term,
  wherein each of said at least one sense refers to a single topic of said at least one term,
  wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime,
  wherein a prior probability is associated with each of said at least one sense of said each said term,
  wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
  
  b) receiving, by the at least one computer processor, at least one content document;
  
  c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;
  
  d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and
  accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic,
  wherein said lexically scoring comprises
  lexically scoring, by the at least one computer processor, based on:
  
  a number of occurrences of the at least one term found in the at least one content document,
  a weighting factor representing a relative importance of the at least one term found, and
  the prior probability of the sense of the at least one term found; and
  
  e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising:
  
  i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set,
  ii. evaluating the semantic relatedness score, by the at least one computer processor, comprising
  evaluating said theoretical information content of an edge of said graph,
  wherein said theoretical information content comprises at least one of:
  
  A. self-information, or
  
  B. surprisal; and
  
  iii. semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying; and
  
  wherein said at least one data set comprises:
  
  a hypertext corpus comprising
  at least one graph, wherein said graph comprises:
  
  a plurality V of vertices, wherein each individual vertex v represents a page; and
  
  a plurality of edges, each edge representing a link; and
  
  wherein said each edge comprises:
  
  information content I(u,v),
  
  wherein said information content I comprises
- - 2. The method according to claim 1, further comprising:
    - iterating through at least one of:
      
      each candidate topic, or
      a subset of candidate topics, identified by a lexical stage; and
      
      calculating a semantic score S, for each link from a first said candidate topic being evaluated that couples to another said candidate topic,
      wherein S = Σ (L_x × H),
      wherein L_x describes the lexical score for each said candidate topic in a collection of said candidate topics.
  - 3. The method according to claim 1 further comprising:
    - f) weighting, by the at least one computer processor, said at least one candidate topic identified in said at least one received content document for relevance based on at least one of an algorithmic analysis or a statistical analysis of at least one relationship between said plurality of candidate topics in the at least one data set.
  - 4. The method according to claim 3, wherein said weighting comprises:
    - calculating, by the at least one computer processor, a weighting u, for a given term based on an occurrence of said given term in an input document,
      wherein
  - 5. The method according to claim 3, wherein an occurrence of a term in an early or a late portion of said content document is indicative of a greater significance than otherwise.
  - 6. The method according to claim 3, wherein the at least one relationship comprises a hypertext link.
  - 7. The method according to claim 1, wherein a significance of said degree of relatedness is weighted based on the information content of said degree of relatedness.
  - 8. The method according to claim 1 further comprising:
    - calculating, by the at least one computer, similarity between a plurality of topics based on the information content of edges that connect vertices associated with each topic to topics associated with co-adjacent vertices of said each topic.
  - 9. The method according to claim 1 further comprising:
    - f) calculating similarity between two or more topics based on the information content of edges that couple vertices associated with each topic to topics associated with co-adjacent vertices of the each topic.
  - 10. The method according to claim 1, wherein said lexically scoring comprises at least one of:
    - accumulating, by the at least one computer processor, at least one lexical score for a given candidate topic across at least one occurrence of the at least one term, which has at least one sense that refers to said given candidate topic;
      
      lexically scoring, by the at least one computer processor, based on a probability that the at least one term has a particular sense that refers to a particular topic;
      
      lexically scoring, by the at least one computer processor, based on a frequency of the at least one term in at least one of:
      
      the at least one data set; or
      
      an external linguistic corpus; or
      
      lexically scoring, by the at least one computer processor, based on a relative position of each of said at least one occurrence of the at least one term in the at least one content document.
  - 11. The method according to claim 1, further comprising taking action, comprising at least one of:
    - filtering content based on the at least one analyzed topic;
      
      or highlighting content based on the at least one analyzed topic.
  - 12. The method according to claim 1, further comprising analyzing comprising:
    - iv) weighting, by the at least one computer processor, the at least one candidate topic for relevance based on at least one of:
      
      an algorithmic analysis, or
      a statistical analysis
      of the at least one relationship in the at least one data set.
  - 13. The method according to claim 1, wherein the method further takes into account at least one of:
    - at least one sense of meaning;
      
      at least one tense;
      
      or at least one other dimension.
  - 14. The method according to claim 1, wherein said deriving, by the at least one computer processor, of said lexicon from said at least one data set to associate said at least one term with said at least one topic, comprises:
    - wherein said at least one data set comprises at least one hypertext corpora as a source of lexical knowledge about said at least one term and meaning of said term, and
      wherein said at least one term comprises a link label or a link anchor.
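
The lexical scoring of claim 1(d) accumulates, for every occurrence of a lexicon term in the content document, a contribution that depends on a weighting factor and on the prior probability of each sense. The sketch below illustrates that accumulation in Python. It is an illustration under stated assumptions rather than the claimed method: the function names, the simple regular-expression term search, and the position-based weight (a hypothetical reading of claim 5, where early or late occurrences count more) are not taken from the patent, and the weighting formula of claim 4 is not reproduced because it is cut off in this record.

```python
import re
from collections import defaultdict


def position_weight(offset, length, edge=0.2, boost=2.0):
    """Hypothetical weighting factor: occurrences in the first or last
    `edge` fraction of the document count `boost` times as much."""
    if length == 0:
        return 1.0
    rel = offset / length
    return boost if rel < edge or rel > 1 - edge else 1.0


def lexical_scores(document, lexicon):
    """Accumulate a lexical score per candidate topic.

    `lexicon` maps term -> {topic: prior probability of that sense}.
    Each occurrence of a term adds weight * prior to every topic
    referred to by one of the term's senses.
    """
    text = document.lower()
    scores = defaultdict(float)
    for term, senses in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text):
            weight = position_weight(match.start(), len(text))
            for topic, prior in senses.items():
                scores[topic] += weight * prior
    return dict(scores)


# Hypothetical lexicon and content document.
lexicon = {
    "jaguar": {"Jaguar_Cars": 0.75, "Jaguar_(animal)": 0.25},
    "engine": {"Engine": 1.0},
}
doc = "The jaguar has a new engine. Reviewers praised the price of the jaguar."
scores = lexical_scores(doc, lexicon)
```

Because the score is accumulated per occurrence, the number of occurrences enters implicitly; a term that appears more often, or in a more heavily weighted position, pushes its candidate topics higher.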

15. A computerized data processing analysis system of improving accuracy of computerized topic identification, comprising:
- at least one memory; and
  
  at least one topic identification computer processor coupled to said at least one memory, said at least one topic identification computer processor configured to:
  
  execute a domain independent, language independent, computer processor automated topic identification analysis module,
  wherein said at least one topic identification computer processor is configured to:
  
  derive a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic,
  wherein said at least one term comprises at least one word,
  wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term,
  wherein each of said at least one sense refers to a single topic of said at least one term,
  wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime,
  wherein a prior probability is associated with each of said at least one sense of said each said term,
  wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic of said each of said at least one sense;
  
  receive at least one content document;
  
  search for at least one term from the lexicon derived from the at least one hypertext corpus data set, and
  find the at least one term from the lexicon that appears in the at least one content document to determine at least one candidate topic of the at least one content document;
  
  lexically score each of said at least one candidate topic found that appears in the at least one content document based on the at least one term found in said search of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and
  wherein said at least one topic identification computer processor is configured to:
  
  accumulate said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic;
  
  lexically score based on:
  
  a number of occurrences of the at least one term found in the at least one content document,
  a weighting factor representing a relative importance of the at least one term found, and
  the prior probability of the sense of the at least one term found; and
  
  semantically score the at least one candidate topic found in the at least one content document based on a degree to which any plurality of candidate topics are semantically related to each other comprising:
  
  wherein said at least one topic identification computer processor is configured to:
  
  quantify a semantic relatedness score of said any plurality of topics based on theoretical information content of co-adjacent links of a graph representation of data,
  evaluate the semantic relatedness score comprising wherein said at least one topic identification computer processor is configured to evaluate the theoretical information content of an edge of said graph,
  wherein said theoretical information content comprises at least one of:
  
  self-information, or
  
  surprisal; and
  
  semantically score based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set and based on said semantic relatedness score from said quantify, wherein said at least one topic identification computer processor is configured to:
  
  weight said at least one candidate topic for relevance based on at least one of algorithmic analysis, or statistical analysis of semantic relatedness, and
  wherein said each edge comprises:
  
  information content I(u,v),
  wherein said information content I comprises
- - 16. The system according to claim 15, wherein said at least one topic identification computer processor is configured to at least one of:
    - accumulate at least one lexical score for a given candidate topic found appearing in the at least one content document, across at least one occurrence of the at least one term, which has at least one sense that refers to said given candidate topic;
      
      lexically score based on a probability that the at least one term has a particular sense that refers to a particular topic;
      
      lexically score based on a frequency of the at least one term in at least one of:
      
      the at least one data set; or
      
      an external linguistic corpus; or
      
      lexically score based on a relative position of each of said at least one occurrence of the at least one term in said at least one content document.
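
Claims 1 and 15 base semantic relatedness on the self-information (surprisal) of edges in the link graph, and claim 2 gives the semantic score the form S = Σ (L_x × H). The formula for I(u,v) is cut off in this record and H is not defined here, so the Python sketch below fills both gaps with explicit assumptions: it uses the standard definition of self-information, -log2(p), takes p to be the fraction of pages that link to the target page, and treats H as that edge information. It also simplifies the "co-adjacent links" of the claims to direct links between candidate topics. All function names and data are hypothetical.

```python
import math


def edge_information(graph, u, v):
    """Self-information, in bits, of the link u -> v.

    ASSUMPTION: p = in-degree(v) / |V|, so links to rarely linked pages
    carry more information. `graph` maps every page to the set of pages
    it links to, and every vertex must appear as a key.
    """
    if v not in graph.get(u, ()):
        raise ValueError(f"no link from {u!r} to {v!r}")
    in_degree = sum(1 for targets in graph.values() if v in targets)
    p = in_degree / len(graph)
    return -math.log2(p)


def semantic_scores(graph, lexical):
    """For each candidate topic, sum L_x * H over its links to other
    candidate topics, where L_x is the linked topic's lexical score and
    H is assumed to be the information content of the connecting edge."""
    scores = {}
    for topic in lexical:
        total = 0.0
        for other in graph.get(topic, ()):
            if other != topic and other in lexical:
                total += lexical[other] * edge_information(graph, topic, other)
        scores[topic] = total
    return scores


# Hypothetical four-page link graph and lexical scores.
graph = {
    "Jaguar_Cars": {"Engine", "Car"},
    "Engine": {"Car"},
    "Car": {"Engine"},
    "Jaguar_(animal)": set(),
}
lexical = {"Jaguar_Cars": 3.0, "Jaguar_(animal)": 1.0, "Engine": 1.0}
semantic = semantic_scores(graph, lexical)
```

In this toy graph only "Jaguar_Cars" links to another candidate topic, so it is the only candidate that receives a non-zero semantic score; an unrelated sense such as "Jaguar_(animal)" gets no support from the other candidates.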

17. A nontransitory computer program product embodied on a nontransitory computer readable medium, said computer program product comprising program logic, which when executed on at least one computer processor performs a computerized data processing method of improving accuracy of computerized topic identification comprising:
- a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising:
  
  a) deriving, by the at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic,
  wherein said at least one term comprises at least one word,
  wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term,
  wherein each of said at least one sense refers to a single topic of said at least one term,
  wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime,
  wherein a prior probability is associated with each of said at least one sense of said each said term,
  wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
  
  b) receiving, by the at least one computer processor, at least one content document;
  
  c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;
  
  d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and
  accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic,
  wherein said lexically scoring comprises
  lexically scoring, by the at least one computer processor, based on:
  
  a number of occurrences of the at least one term found in the at least one content document,
  a weighting factor representing relative importance of the at least one term found, and
  the prior probability of the sense of the at least one term found; and
  
  e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document based on a degree to which any plurality of candidate topics are semantically related to each other comprising:
  
  i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set,
  ii. evaluating the semantic relatedness score, by the at least one computer processor, comprising
  evaluating said theoretical information content of an edge of said graph,
  wherein said theoretical information content comprises at least one of:
  
  A. self-information, or
  
  B. surprisal; and
  
  f) semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic derived from the hypertext corpus data set, wherein the data processing method comprises wherein said lexically scoring comprises at least one of:
  
  i) accumulating, by the at least one computer processor, at least one lexical score for a given candidate topic across at least one occurrence of the at least one term, which has at least one sense that refers to said given candidate topic;
  
  ii) lexically scoring, by the at least one computer processor, based on a probability that the at least one term has a particular sense that refers to a particular topic;
  
  iii) lexically scoring, by the at least one computer processor, based on a frequency of the at least one term in at least one of:
  
  the at least one data set; or
  
  an external linguistic corpus; or
  
  iv) lexically scoring, by the at least one computer processor, based on a relative position of each of said at least one occurrence of the at least one term in said at least one content document, and
  wherein said each edge comprises:
  
  information content I(u,v),
  wherein said information content I comprises

Current Assignee: Venio, Inc.
Original Assignee: Venio, Inc.
Inventors: Szucs, John Joseph; Warner, Kurtis Lee; Paris, Thomas Carl; Moye, Charles David
Primary Examiner(s): Badawi, Sherief
Assistant Examiner(s): Brooks, David T
Application Number: US13/829,472
Publication Number: US 20130204876A1
Time in Patent Office: 1,279 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
- G06F 16/313 Selection or weighting of t...
- G06F 16/345 Summarisation for human users
- G06F 16/374 Thesaurus
- G06F 16/94 Hypermedia Hyperlinking G06...
