System, method and computer program product for automatic topic identification using a hypertext corpus

US 9,442,928 B2
Filed: 09/07/2012
Issued: 09/13/2016
Est. Priority Date: 09/07/2011
Status: Expired due to Fees

Abstract

A system, method, and/or computer program product for automatic topic identification using a hypertext corpus may include a) receiving a content document(s); b) identifying or lexically scoring candidate topic(s) in the received content document based on label(s) used in a corpus to link to or relate to the candidate topics; c) evaluating or semantically scoring the candidate topic(s) of the received document based on a relationship between two or more candidate topics in the corpus; and d) weighting candidate topics for relevance based on algorithmic or statistical analysis of links or relationships in the corpus.

65 Citations

33 Claims

1. A method of improving accuracy of computerized topic identification comprising:
- a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising;
  
  a) deriving, by at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
  
  b) receiving, by the at least one computer processor, at least one content document;
  
  c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;
  
  d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, wherein said lexically scoring comprises lexically scoring, by the at least one computer processor, based on;
  
  a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
  
  e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
  
  i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics of the at least one hypertext corpus data set, and wherein said quantifying of the semantic relatedness score is based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set;
  
  evaluating the semantic relatedness score during quantifying, by the at least one computer processor, comprising evaluating the theoretical information content of an edge of said graph representation of the at least one hypertext corpus data set, wherein a vertex of said graph representation of the at least one hypertext corpus data set represents one of said plurality of topics, and an edge of said co-adjacent edges represents a relationship between two topics;
  
  wherein the theoretical information content of an edge is based on information theory and comprises at least one of;
  
  A. self-information, or
  
  B. surprisal
  
  ii. semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying.
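
Steps (a) through (d) of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `(anchor_text, target_topic)` input format, whitespace tokenization of single-word terms, and the choice to multiply the three factors (occurrence count, term weight, prior probability) are all assumptions made for illustration. The claim only says the score is "based on" those factors.

```python
from collections import Counter, defaultdict


def build_lexicon(links):
    """Derive {term: {topic: prior}} from (anchor_text, target_topic) pairs.

    The prior of a sense is the fraction of the term's link occurrences
    that point at that topic. The input format is hypothetical.
    """
    counts = defaultdict(Counter)
    for anchor, topic in links:
        counts[anchor.lower()][topic] += 1
    lexicon = {}
    for term, topic_counts in counts.items():
        total = sum(topic_counts.values())
        lexicon[term] = {topic: n / total for topic, n in topic_counts.items()}
    return lexicon


def lexical_scores(document, lexicon, term_weight=None):
    """Accumulate a lexical score per candidate topic.

    Assumption: each term contributes count * weight * prior to every
    topic that one of its senses refers to.
    """
    term_weight = term_weight or {}
    tokens = document.lower().split()
    occurrences = Counter(t for t in tokens if t in lexicon)
    scores = defaultdict(float)
    for term, n in occurrences.items():
        weight = term_weight.get(term, 1.0)
        for topic, prior in lexicon[term].items():
            scores[topic] += n * weight * prior
    return dict(scores)
```

With three of four "jaguar" links pointing at a car-maker topic, a document that mentions "jaguar" twice gives that topic a score of 1.5 and the animal topic 0.5 before any semantic scoring is applied.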
- - 2. The method according to claim 1, wherein said at least one hypertext corpus data set comprises at least one of:
    - a source of at least one of lexical information or semantic information;
      
      a Wiki;
      
      a WIKIPEDIA or an online encyclopedia with collaborative editing;
      
      at least one online encyclopedia;
      
      an electronic encyclopedia;
      
      at least a portion of the web pages of a world wide web (WWW);
      
      a hyperlinked group of web pages;
      
      an enterprise intranet;
      
      a dictionary or lexicon;
      
      a semantic database;
      
      a professional term database;
      
      a source of lexical or semantic information;
      
      a dictionary;
      
      a lexicon;
      
      a database;
      
      corpora;
      
      a personalized corpus;
      
      a field specific corpus;
      
      or at least one online encyclopedia supplemented by at least one dictionary.
  - 3. The method according to claim 1, wherein said at least one content document comprises text content.
  - 4. The method according to claim 3, wherein said text content comprises at least one of:
    - unstructured text content;
      
      structured text content;
      
      marked-up text content;
      
      converted text content;
      
      an Atom feed;
      
      a comment;
      
      audio content;
      
      an electronic communication;
      
      an electronic mail (email) message;
      
      a facsimile;
      
      a forum posting;
      
      hypermedia content or document;
      
      an image or image content;
      
      an information stream;
      
      an Internet content page;
      
      an intranet content page;
      
      a microblog;
      
      a multimedia message;
      
      a multimedia message system (MMS) message;
      
      optical character recognized (OCR) content, optically recognized characters (OCR) from content, or a recognized speech converted document;
      
      a Really Simple Syndication (RSS) feed;
      
      recognized data;
      
      a recognized speech;
      
      a social-networking or social media communication, post, posting, or comment;
      
      a simple message system (SMS) message;
      
      syndicated content, or a syndicated content stream;
      
      a text message or document;
      
      alphanumeric text-based content;
      
      video content;
      
      or a web page.
  - 5. The method according to claim 1, wherein said at least one term comprises an entry in said lexicon, wherein said lexicon is derived from at least one label in said at least one hypertext corpus data set, wherein said lexicon comprises said at least one term, and at least one sense;
    - wherein said at least one term comprises a word or phrase;
      
      wherein each of said at least one sense references a concept;
      
      wherein each of said at least one sense has a prior probability associated therewith;
      
      wherein said prior probability comprises a probability in said at least one hypertext corpus data set that an occurrence of a term has a corresponding one of said at least one sense; and
      
      wherein said at least one term comprises at least one of;
      
      i) an anchor term in the at least one hypertext corpus data set;
      
      ii) a single word;
      
      iii) a phrase including a plurality of words;
      
      iv) a word or phrase in said lexicon;
      
      v) an identifying name, word or phrase associated with an entity in a database, wherein said database comprises at least one of;
      
      a contact list, an inventory, a product database, another database, or a hypertext link;
      
      or vi) a label.
  - 6. The method according to claim 1, wherein said searching for said lexically scoring comprises at least one of:
    - i) finding a plurality of said at least one term used in the lexicon, appearing in said at least one received content document in a lexical stage,
      
      ii) calculating a calculated lexical score for each of said at least one term appearing in said at least one received content document; and
      
      iii) identifying said at least one candidate topic based on at least one topic to which a given term refers in the lexicon.
  - 7. The method according to claim 6, wherein said (i) comprises:
    - wherein said finding comprises at least one of;
      
      1) searching for said at least one term using case insensitive matching;
      
      2) searching for said at least one term using case sensitive matching;
      
      3) searching for said at least one term using a full text search algorithm;
      
      4) indexing of keywords;
      
      5) synonyms;
      
      6) word stemming or conjugating;
      
      or 7) searching for said at least one term using a text search algorithm, comprising at least one of;
      
      indexing, word stemming, conjugating, stop-word filtering, case folding, pattern matching, or regular expression matching.
  - 8. The method according to claim 6, further comprising at least one of:
    - a) filtering said plurality of said at least one term appearing in said at least one received content document;
      
      b) excluding based on criteria;
      
      or c) excluding based on a most commonly used words list for a given language.
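
The term-finding options of claim 7 and the exclusions of claim 8 can be sketched together. The example below is only an illustration: `find_terms`, its parameters, and the use of Python's `re` module with word boundaries are assumptions, and it covers just case-folded or case-sensitive regular expression matching plus a stop-word list, not stemming or indexing.

```python
import re


def find_terms(text, lexicon_terms, stop_words=frozenset(), case_sensitive=False):
    """Count whole-word occurrences of each lexicon term in text.

    Terms on the stop-word list (compared in lower case) are excluded,
    mirroring exclusion by a most-commonly-used-words list.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    found = {}
    for term in lexicon_terms:
        if term.lower() in stop_words:
            continue
        # re.escape keeps multi-word phrases and punctuation literal.
        pattern = r"\b" + re.escape(term) + r"\b"
        count = len(re.findall(pattern, text, flags))
        if count:
            found[term] = count
    return found
```

Scanning the text once per term is simple but slow for a large lexicon; a production system would more plausibly use an index or a multi-pattern matcher, which the claim language also allows.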
  - 9. The method according to claim 6, wherein said calculating of said (ii) comprises:
    - counting a number of occurrences of said at least one term in said at least one content document; and
      
      optionally adjusting the calculated lexical score of a given one of said at least one term based on the number of occurrences of said given one of said at least one term in a given language.
  - 10. The method according to claim 6, wherein said identifying of said (iii) comprises at least one of:
    - 1) selecting at least one candidate topic in the at least one hypertext corpus data set to which a given term refers;
      
      or 2) calculating a lexical score for each said at least one candidate topic from at least one factor, wherein said calculating said lexical score based on said at least one factor comprises at least one of;
      
      i) calculating said lexical score based on said calculated lexical score for said at least one term associated with said at least one candidate topic;
      
      or ii) calculating said lexical score based on a probability that a given term refers to a given topic in the at least one hypertext corpus data set.
  - 11. The method according to claim 1, wherein said semantically scoring of said (e) comprises:
    - semantically scoring based on said degree to which said plurality of candidate topics are semantically related is based on at least one of;
      
      at least one link, at least one hypertext link, at least one link in a database, at least one link in a semantic database, at least one relationship in a relational database, at least one relationship in a social network, or at least one relationship between data elements.
  - 12. The method according to claim 1, wherein said semantically scoring of said (e) comprises:
    - wherein said semantic relatedness comprises semantic relatedness, which is different from semantic similarity, wherein said semantic relatedness comprises a metric of how connected a plurality of entities are;
      
      wherein one or more links of said semantic relatedness are derived from the at least one hypertext corpus data set as a pre-processing step, and wherein said one or more links of said semantic relatedness represent persistent, enduring, semantic relationships, and at least one of;
      
      adjusting said semantic score based on existence of a relationship between said plurality of candidate topics;
      
      adjusting said semantic score based on a weighted strength of a relationship between said plurality of candidate topics;
      
      adjusting said semantic score based on a statistical analysis of relatedness between said plurality of candidate topics;
      
      or adjusting said semantic score based on algorithmic analysis of the relatedness between said plurality of candidate topics.
  - 13. The method, according to claim 12, wherein said algorithmic analysis comprises at least one of:
    - 1) evaluating, by the at least one computer, the semantic relatedness between a plurality of said at least one topic in a hypermedia database;
      
      2) evaluating, by the at least one computer, the semantic relatedness between a plurality of adjacent vertices in the graph representation of a hypermedia database;
      
      3) evaluating, by the at least one computer, the semantic relatedness between a plurality of adjacent vertices based on a number of other vertices to which said plurality of vertices are coupled;
      
      4) evaluating, by the at least one computer, the theoretical information content of an edge in the graph;
      
      wherein the theoretical information content comprises at least one of;
      
      i) self-information;
      
      or ii) surprisal;
      
      or 5) evaluating, by the at least one computer, the semantic relatedness between a plurality of non-adjacent vertices;
      
      wherein said evaluating comprises at least one of;
      
      i) finding a path between said plurality of non-adjacent vertices;
      
      or ii) incorporating a cost function into said path finding of (i) based on said (2), (3), or (4).
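
Option 5 of claim 13 (finding a path between non-adjacent vertices, with a cost function folded into the path finding) maps naturally onto a cheapest-path search. The sketch below uses Dijkstra's algorithm, which the claim does not name; the graph format `{vertex: {neighbor: cost}}` and the function name are hypothetical, and the edge costs are assumed to be non-negative values supplied by whichever of options (2), (3), or (4) is used.

```python
import heapq


def cheapest_path_cost(graph, source, target):
    """Return the minimum total edge cost from source to target, or None.

    graph maps each vertex to {neighbor: non-negative cost}.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, vertex = heapq.heappop(heap)
        if vertex == target:
            return cost
        if cost > best.get(vertex, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, edge_cost in graph.get(vertex, {}).items():
            new_cost = cost + edge_cost
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return None  # target is unreachable from source
```

How a path cost is then converted into a relatedness value (for example, by treating a lower cost as a stronger relationship) is left open here, since the claim does not fix that mapping.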
  - 14. The method according to claim 1, wherein said at least one sense comprises at least one of a plurality of alternative meanings or related topics of said at least one term.
  - 15. The method according to claim 1 further comprising:
    - f) weighting, by the at least one computer processor, said at least one candidate topic identified in said at least one received content document for relevance based on at least one of;
      
      an algorithmic analysis;
      
      or a statistical analysis;
      
      of at least one relationship between said plurality of candidate topics in the at least one hypertext corpus data set.
  - 16. The method according to claim 15, wherein said weighting comprises at least one of:
    - wherein said algorithmic analysis comprises at least one of;
      
      i) a page ranking relevancy analysis;
      
      ii) a link analysis algorithm;
      
      or iii) an algorithm evaluating relative importance of at least two candidate topics in the corpus;
      
      or wherein said statistical analysis comprises at least one of;
      
      i) vector space models;
      
      or ii) probabilistic models.
  - 17. The method according to claim 1, wherein the at least one sense comprises at least one of:
    - at least one sense of meaning of said at least one topic;
      
      at least one temporal tense of said at least one topic;
      
      or at least one other dimension of meaning of said at least one topic.
  - 18. The method according to claim 1, wherein said deriving, by the at least one computer processor, of the lexicon from said at least one hypertext corpus data set to associate said at least one term with said at least one topic, comprises:
    - wherein said at least one hypertext corpus data set comprises at least one hypertext corpora as a source of lexical knowledge about said at least one term and meaning of said term;
      
      wherein said at least one term comprises a link label or a link anchor in said at least one hypertext corpus data set; and
      
      wherein the lexicon applies a logarithmic scaling function to the prior probability associated with each of said at least one sense associated with each of said at least one term, and referring to said single topic.
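
Claim 18 states that a logarithmic scaling function is applied to each prior probability but does not give the function. One possible form, shown purely as an assumption, maps a prior in [0, 1] back onto [0, 1] while boosting rare senses relative to a linear scale. The name `log_scaled_prior` and the constant `k` are hypothetical.

```python
import math


def log_scaled_prior(prior, k=9.0):
    """Scale a prior probability logarithmically onto [0, 1].

    Assumed form: log(1 + k * p) / log(1 + k). It is 0 at p = 0,
    1 at p = 1, and increases monotonically in between.
    """
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must be a probability in [0, 1]")
    return math.log1p(k * prior) / math.log1p(k)
```

With `k=9.0`, a sense seen in 10% of a term's links is scaled to roughly 0.28, so less common senses are not crowded out as sharply by a dominant sense.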
  - 19. The computer-implemented method according to claim 1 further comprising:
    - a computer-implemented data processing method for performing the automated workflow process of improving accuracy of computerized topic identification on the at least one content document comprising;
      
      f) receiving, by the at least one computer processor, the at least one content document;
      
      g) processing, by the at least one computer processor, the at least one content document according to said (a) through said (f) to analyze at least one topic, said processing comprising at least one of;
      
      organizing, by the at least one computer processor, the at least one content document, which is determined to be associated with the at least one content document;
      
      discovering, by the at least one computer processor, the at least one content document, which is determined to be associated with the at least one content document;
      
      indexing, by the at least one computer processor, the at least one content document, which is determined to be associated with the at least one content document;
      
      searching, by the at least one computer processor, the at least one content document, which is determined to be associated with the at least one content document;
      
      linking, by the at least one computer processor, the at least one content document, which is determined to be associated with the at least one content document;
      
      tagging, by the at least one computer processor, the at least one content document, which is determined to be associated with the at least one content document;
      
      filtering, by the at least one computer processor, the at least one content document, which is determined to be associated with the at least one content document;
      
      prioritizing, by the at least one computer processor, the at least one content document, which is determined to be associated with the at least one content document;
      
      or ranking, by the at least one computer processor, the at least one content document, which is determined to be associated with the at least one content document; and
      
      h) taking action, by the at least one computer processor, on the at least one content document, based on the at least one topic analyzed.
  - 20. The method according to claim 19, wherein said lexically scoring comprises at least one of:
    - accumulating, by the at least one computer processor, the at least one lexical score for a given candidate topic found appearing in the at least one content document, across at least one occurrence of the at least one term, which has at least one sense that refers to said given candidate topic;
      
      lexically scoring, by the at least one computer processor, based on a probability that the at least one term has a particular sense that refers to a particular topic;
      
      lexically scoring, by the at least one computer processor, based on a frequency of the at least one term in at least one of;
      
      the at least one hypertext corpus data set;
      
      or an external linguistic corpus;
      
      or lexically scoring, by the at least one computer processor, based on a relative position of each of said at least one occurrence of the at least one term in the at least one content document.
  - 21. The method according to claim 19, wherein said taking action, comprises at least one of:
    - storing or processing the at least one content document based on the at least one analyzed topic;
      
      filtering the at least one content document based on the at least one analyzed topic;
      
      or highlighting the at least one content document based on the at least one analyzed topic.
  - 22. The method according to claim 19, wherein said processing further comprises:
    - i) weighting, by the at least one computer processor, the at least one candidate topic for relevance based on at least one of;
      
      an algorithmic analysis, or a statistical analysis of the at least one relationship in the at least one hypertext corpus data set.
  - 23. The method according to claim 19, wherein the at least one content document comprises text content.
  - 24. The method according to claim 1, wherein said semantically scoring comprises:
    - selecting a top plurality of said at least one candidate topics from said lexically scoring;
      
      initially setting said semantic relatedness score of each of said top plurality of said at least one candidate topics selected, to said lexical score for each of said at least one candidate topic;
      
      calculating said semantic relatedness score for each of said top plurality of said at least one candidate topics, to each other of said top plurality of said at least one candidate topics; and
      
      multiplying said lexical score of a second topic, by said semantic relatedness score of said given second topic to obtain a product, and adding said product to said semantic relatedness score of a first topic.
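
The four steps of claim 24 translate almost directly into code. The sketch below is one reading of the claim, not a confirmed implementation: `semantic_scores`, `top_n`, and the `relatedness(first, second)` callable are hypothetical names, and the relatedness function is assumed to return a value for any ordered pair of topics.

```python
def semantic_scores(lexical, relatedness, top_n=10):
    """Re-score the top lexical candidates using pairwise relatedness.

    lexical: {topic: lexical score}
    relatedness: callable (first, second) -> relatedness score
    """
    # Select the top candidates from the lexical stage.
    top = sorted(lexical, key=lexical.get, reverse=True)[:top_n]
    # Initialise each topic's semantic score to its lexical score.
    scores = {topic: lexical[topic] for topic in top}
    # Add lexical(second) * relatedness(first, second) for every other topic.
    for first in top:
        for second in top:
            if first != second:
                scores[first] += lexical[second] * relatedness(first, second)
    return scores
```

The effect is that a candidate supported by several related candidates can overtake one with a higher lexical score but no related neighbours, which is how an ambiguous term can be resolved by its context.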
  - 25. The method of improving accuracy of computerized topic identification of claim 1, further comprising:
    - analyzing semantic relatedness by the at least one computer processor coupled to at least one computer memory, wherein said analyzing relatedness comprises;
      
      f) quantifying, by the at least one computer processor, semantic relatedness of a plurality of topics in a semantic network based on theoretical information content of co-adjacent edges as a portion of total information content of all or a defined subset of edges of a graph representation of data, wherein a vertex of said graph representation of data represents one of said plurality of topics, and an edge of said co-adjacent edges represents a relationship between two topics; and
      
      wherein said quantifying comprises at least one of;
      
      i. calculating, by the at least one computer processor, information content of each said edge in said graph based on any of;
      
      a total in- or out-degree of edges across an entirety of said graph;
      
      an in- or out-degree of co-adjacent edges;
      
      or an explicit type of semantic relationship represented by said edge;
      
      ii. calculating, by the at least one computer processor, semantic relatedness between at least two topics based on at least one of;
      
      the theoretical information content of edges that connect vertices associated with each topic to the topics associated with co-adjacent vertices;
      
      the theoretical information content of edges that connect vertices associated with each topic to each other topic;
      
      or the theoretical information content of edges, optionally subject to a cost function, along indirect paths that connect vertices associated with each topic to each other topic; and
      
      g) evaluating, by the at least one computer processor, the semantic relatedness of said plurality of topics in a semantic network, comprising at least one of;
      
      i. evaluating, by the at least one computer processor, semantic relatedness between a plurality of vertices in a graph representation of a database, wherein said vertices represent concepts, topics, or entities;
      
      ii. evaluating, by the at least one computer processor, semantic relatedness between a plurality of adjacent vertices based on a number of other vertices to which said plurality of vertices are coupled;
      
      iii. evaluating, by the at least one computer processor, theoretical information content of an edge in a graph;
      
      wherein the theoretical information content comprises at least one of;
      
      self-information;
      
      or surprisal;
      
      or iv. evaluating, by the at least one computer processor, semantic relatedness between a plurality of non-adjacent vertices, wherein said evaluating comprises at least one of;
      
      finding a path between said plurality of non-adjacent vertices;
      
      or incorporating a cost function into said path based on at least one of said quantifying, or said evaluating.
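
Claim 25 describes relatedness as the information content of co-adjacent edges taken as a portion of a larger total, with edge information derived from in- or out-degree. The sketch below is one hedged interpretation and does not reproduce the patent's own expression. It assumes that an edge into a vertex carries self-information of minus log base 2 of that vertex's share of all edges, and that relatedness is the information on edges to shared neighbours divided by the information on edges to all neighbours of either topic. The graph format and both function names are hypothetical.

```python
import math


def edge_information(graph):
    """Self-information (bits) of an edge pointing at each target vertex.

    graph: {vertex: set of out-neighbors}. Assumption: an edge's
    probability is the target's in-degree divided by the edge total,
    so links to heavily linked hub pages carry little information.
    """
    in_degree = {}
    for targets in graph.values():
        for target in targets:
            in_degree[target] = in_degree.get(target, 0) + 1
    total = sum(in_degree.values())
    return {t: -math.log2(d / total) for t, d in in_degree.items()}


def relatedness(graph, u, v):
    """Information on shared out-links as a portion of all out-link information."""
    info = edge_information(graph)
    neighbors_u = graph.get(u, set())
    neighbors_v = graph.get(v, set())
    shared = sum(info[t] for t in neighbors_u & neighbors_v)
    union = sum(info[t] for t in neighbors_u | neighbors_v)
    return shared / union if union else 0.0
```

Under these assumptions two topics that share only a hub link score lower than two topics that share a rarely linked page, which matches the intuition behind using surprisal rather than a raw count of common links.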
  - 26. The method according to claim 1, wherein said quantifying the semantic relatedness score, comprises:
    - calculating an informational content of the hypertext link between the any plurality of candidate topics (u and v) of said (e), comprising evaluating an expression comprising;

27. A computerized data processing analysis system of improving accuracy of computerized topic identification comprising:
- at least one memory; and
  
  at least one topic identification computer processor coupled to said at least one memory, said at least one topic identification computer processor configured to;
  
  execute a domain independent, language independent, computer processor automated topic identification analysis module,wherein said at least one topic identification computer processor is configured to;
  
  derive a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic,wherein said at least one term comprises at least one word,wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term,wherein each of said at least one sense refers to a single topic of said at least one term,wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime,wherein a prior probability is associated with each of said at least one sense of said each said term,wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
  
  receive at least one content document;
  
  search for at least one term from the lexicon derived from the at least one hypertext corpus data set, and find the at least one term from the lexicon that appears in the at least one content document to determine at least one candidate topic of the at least one content document;
  
  lexically score each of said at least one candidate topic found that appears in the at least one content document based on the at least one term found in said search of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulate said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, lexically score based on;
  
  a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
  
  semantically score the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
  
  wherein said at least one topic identification computer processor is configured to;
  
  quantify a semantic relatedness score of said any plurality of topics of the at least one hypertext corpus data set, and wherein said quantifying of the semantic relatedness score is based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set;
  
  evaluating the semantic relatedness score during quantifying, by the at least one computer processor, comprising evaluating the theoretical information content of an edge of said graph representation of the at least one hypertext corpus data set, wherein a vertex of said graph representation of the at least one hypertext corpus data set represents one of said plurality of topics, and an edge of said co-adjacent edges represents a relationship between two topics;
  
  wherein the theoretical information content of an edge is based on information theory and comprises at least one of;
  
  A. self-information, or
  
  B. surprisal
  
  ii. semantically score based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying.
- - 28. The system according to claim 27, wherein said at least one topic identification computer processor is configured to:
    - weight said at least one candidate topic for relevance based on at least one of;
      
      algorithmic analysis of relatedness, or statistical analysis of relatedness.
  - 29. The system according to claim 27, wherein said at least one topic identification computer processor is configured to:
    - accumulate at least one lexical score for a given candidate topic across at least one occurrence of the at least one term, which has at least one sense of meaning that refers to said given candidate topic;
      
      lexically score based on a probability that the at least one term has a particular sense of meaning that refers to a particular topic;
      
      lexically score based on a frequency of the at least one term in at least one of;
      
      the at least one hypertext corpus data set;
      
      or an external linguistic corpus; and
      
      lexically score based on a relative position of each of said at least one occurrence of the at least one term in said at least one content document.
  - 30. The system according to claim 27, wherein said quantify relatedness score, comprises:
    - wherein said at least one topic identification computer processor is configured to calculate an informational content of the hypertext link between the plurality of candidate topics (u and v) comprising;
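Returning to the lexicon derivation recited in claim 27 (not the claim 30 calculation, whose formula does not appear in this text), the prior probability of a sense is described as the fraction of a term's hyperlink occurrences that point to a given topic. A minimal sketch of that idea follows, assuming the corpus has already been reduced to `(anchor_text, target_topic)` pairs. The function name `derive_lexicon`, the lowercase normalization, and the sample links are illustrative assumptions, not taken from the patent.

```python
from collections import Counter, defaultdict


def derive_lexicon(links):
    """Map each term to {topic: prior probability}.

    `links` is an iterable of (anchor_text, target_topic) pairs, an assumed
    pre-extracted form of the hypertext corpus. Each distinct target topic
    of a term is treated as one sense of that term.
    """
    counts = defaultdict(Counter)
    for anchor_text, target_topic in links:
        term = anchor_text.lower()  # assumed normalization, for illustration only
        counts[term][target_topic] += 1

    lexicon = {}
    for term, topic_counts in counts.items():
        total = sum(topic_counts.values())
        # Prior = share of this term's link occurrences that point to the topic.
        lexicon[term] = {topic: n / total for topic, n in topic_counts.items()}
    return lexicon


# Hypothetical sample data.
links = [
    ("Java", "Java (programming language)"),
    ("java", "Java (programming language)"),
    ("Java", "Java (island)"),
    ("Python", "Python (programming language)"),
]
lexicon = derive_lexicon(links)
```

In this sample the term "java" ends up with two senses whose priors sum to one, and each sense names a single candidate topic.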

31. A nontransitory computer program product embodied on a nontransitory computer readable medium, said computer program product comprising program logic, which when executed on at least one computer processor performs a computerized data processing method of improving accuracy of computerized topic identification comprising:
- a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising;
  
  a) deriving, by at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
  
  b) receiving, by the at least one computer processor, at least one content document;
  
  c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;
  
  d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, wherein said lexically scoring comprises lexically scoring, by the at least one computer processor, based on;
  
  a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
  
  e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
  
  i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics of the at least one hypertext corpus data set, and wherein said quantifying of the semantic relatedness score is based on theoretical information content of co-adjacent links of a graph representation of the at least one hypertext corpus data set;
  
  evaluating the semantic relatedness score during quantifying, by the at least one computer processor, comprising evaluating the theoretical information content of an edge of said graph representation of the at least one hypertext corpus data set, wherein a vertex of said graph representation of the at least one hypertext corpus data set represents one of said plurality of topics, and an edge of said co-adjacent edges represents a relationship between two topics;
  
  wherein the theoretical information content of an edge is based on information theory and comprises at least one of;
  
  A. self-information, or
  
  B. surprisal
  
  ii. semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying.
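Step e) ii. bases the semantic score on both the lexical score of each candidate topic and the semantic relatedness score, but the claim does not fix how the two are combined. The sketch below assumes one simple additive rule: a topic keeps its own lexical score and gains support from every other candidate in proportion to their relatedness. The function `semantic_scores`, the topic labels, and the relatedness values are hypothetical.

```python
def semantic_scores(lexical, relatedness):
    """Combine lexical scores with pairwise relatedness (assumed additive rule).

    lexical:     dict mapping topic -> lexical score
    relatedness: dict mapping frozenset({topic_a, topic_b}) -> value in [0, 1];
                 missing pairs are treated as unrelated (0.0)
    """
    result = {}
    for topic, own_score in lexical.items():
        # Support contributed by the other candidates found in the document.
        support = sum(
            relatedness.get(frozenset((topic, other)), 0.0) * other_score
            for other, other_score in lexical.items()
            if other != topic
        )
        result[topic] = own_score + support
    return result


# Hypothetical inputs.
lexical = {"PL": 1.5, "Island": 0.5, "JVM": 2.0}
relatedness = {frozenset(("PL", "JVM")): 0.5}
scores = semantic_scores(lexical, relatedness)
```

With these inputs the two mutually related topics reinforce each other, while the unrelated "Island" candidate keeps only its lexical score.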
- - 32. The computer program product according to claim 31, wherein the data processing method comprises wherein said lexically scoring comprises at least one of:
    - accumulating, by the at least one computer processor, at least one lexical score for a given candidate topic across at least one occurrence of the at least one term, which has at least one sense of meaning that refers to said given candidate topic;
      
      lexically scoring, by the at least one computer processor, based on a probability that the at least one term has a particular sense of meaning that refers to a particular topic;
      
      lexically scoring, by the at least one computer processor, based on a frequency of the at least one term in at least one of;
      
      the at least one hypertext corpus data set;
      
      or an external linguistic corpus;
      
      or lexically scoring, by the at least one computer processor, based on a relative position of each of said at least one occurrence of the at least one term in said at least one content document.
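The lexical scoring recited in claims 31 and 32 accumulates, for each candidate topic, a contribution from every occurrence of a term whose sense refers to that topic, using a term weight and the prior probability of the sense. The sketch below is one possible reading under simplifying assumptions: terms are single tokens, the weight is a plain per-term lookup, and the corpus-frequency and relative-position factors of claim 32 are not modeled. The names `lexical_scores` and `term_weight` and the sample data are hypothetical.

```python
from collections import defaultdict


def lexical_scores(tokens, lexicon, term_weight):
    """Accumulate a lexical score per candidate topic.

    tokens:      document tokens, in order
    lexicon:     term -> {topic: prior probability of that sense}
    term_weight: term -> relative importance (defaults to 1.0 if absent)
    """
    scores = defaultdict(float)
    for token in tokens:
        senses = lexicon.get(token)
        if senses is None:
            continue  # token is not a lexicon term
        weight = term_weight.get(token, 1.0)
        # Adding once per occurrence makes the occurrence count implicit.
        for topic, prior in senses.items():
            scores[topic] += weight * prior
    return dict(scores)


# Hypothetical inputs.
lexicon = {
    "java": {"Java (programming language)": 0.75, "Java (island)": 0.25},
    "jvm": {"Java virtual machine": 1.0},
}
term_weight = {"java": 1.0, "jvm": 2.0}
tokens = "java runs on the jvm and java is popular".split()
scores = lexical_scores(tokens, lexicon, term_weight)
```

Every sense of an ambiguous term contributes here, scaled by its prior, so disambiguation is left to the later semantic scoring step.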

33. A method of improving accuracy of computerized topic identification comprising:
- a domain independent, language independent, computer processor automated topic identification analysis method, the method comprising;
  
  a) deriving, by at least one computer processor, a lexicon from at least one hypertext corpus data set to associate at least one term with at least one topic, wherein said at least one term comprises at least one word, wherein at least one sense is derived from at least one hypertext link of the at least one hypertext corpus data set, and is associated with each of said at least one term, wherein each of said at least one sense refers to a single topic of said at least one term, wherein said at least one topic referred to by said at least one sense is to be used as a candidate topic at runtime, wherein a prior probability is associated with each of said at least one sense of said each said term, wherein each said prior probability is a fraction of occurrences of a given one of said at least one term as a relationship comprising a hypertext link that links to said single topic from said each of said at least one sense;
  
  b) receiving, by the at least one computer processor, at least one content document;
  
  c) searching for, by the at least one computer processor, at least one term from the lexicon derived from the at least one hypertext corpus data set, and finding the at least one term from the lexicon appearing in the at least one content document to determine at least one candidate topic of the at least one content document;
  
  d) lexically scoring, by the at least one computer processor, each of said at least one candidate topic found appearing in the at least one content document based on the at least one term found in said search of said (c) of the at least one content document to obtain a lexical score for each of said at least one candidate topic, and accumulating said lexical score for each of said at least one candidate topic for each occurrence in the at least one content document of a term, wherein the term has an associated sense, wherein the associated sense refers to said each said candidate topic, wherein said lexically scoring comprises lexically scoring, by the at least one computer processor, based on;
  
  a number of occurrences of the at least one term found in the at least one content document, a weighting factor representing a relative importance of the at least one term found, and the prior probability of the sense of the at least one term found; and
  
  e) semantically scoring, by the at least one computer processor, the at least one candidate topic found in the at least one content document, based on a degree to which any plurality of candidate topics are semantically related to each other comprising;
  
  i. quantifying a semantic relatedness score, by the at least one computer processor, of said any plurality of topics of the at least one hypertext corpus data set, and
  
  ii. semantically scoring, by the at least one computer processor, based on said lexical score of said each of said at least one candidate topic found appearing in the at least one content document, wherein said lexical score of said each of said at least one candidate topic was determined from the at least one term of said lexicon derived from the hypertext corpus data set, and based on said semantic relatedness score from said quantifying; and
  
  said method further comprising;
  
  analyzing semantic relatedness by the at least one computer processor, wherein said analyzing said semantic relatedness comprises;
  
  f) quantifying, by the at least one computer processor, semantic relatedness of a plurality of topics in a semantic network based on theoretical information content of co-adjacent edges as a portion of total information content of all or a defined subset of edges of a graph representation of data, wherein a vertex of said graph representation of data represents one of said plurality of topics, and an edge of said co-adjacent edges represents a relationship between two topics; and
  
  wherein said quantifying comprises at least one of;
  
  i. calculating, by the at least one computer processor, information content of each said edge in said graph based on any of;
  
  a total in- or out-degree of edges across an entirety of said graph;
  
  an in- or out-degree of co-adjacent edges;
  
  or an explicit type of semantic relationship represented by said edge;
  
  ii. calculating, by the at least one computer processor, semantic relatedness between at least two topics based on at least one of;
  
  the theoretical information content of edges that connect vertices associated with each topic to the topics associated with co-adjacent vertices;
  
  the theoretical information content of edges that connects vertices associated with each topic to each other topic;
  
  or the theoretical information content of edges, optionally subject to a cost function, along indirect paths that connect vertices associate with each topic to each other topic; and
  
  g) evaluating, by the at least one computer processor, the semantic relatedness of said plurality of topics in a semantic network, comprising;
  
  evaluating, by the at least one computer processor, theoretical information content of an edge in a graph;
  
  wherein the theoretical information content comprises at least one of;
  
  self-information;
  
  or surprisal.
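Steps f) and g) of claim 33 can be illustrated with the standard definition of self-information (surprisal), I = -log2(p). The claim leaves the probability model open, so the sketch below assumes one: the probability of a link into a topic is that topic's in-degree divided by the total number of edges, which makes links to heavily linked topics carry less information. Relatedness of two topics is then taken as the information of their co-adjacent (shared-target) edges as a portion of the information of all edges leaving either topic. The function names and the toy graph are hypothetical.

```python
import math
from collections import Counter, defaultdict


def target_information(edges):
    """Self-information in bits of a link into each target vertex.

    Assumed model: p(link into v) = in_degree(v) / total number of edges.
    `edges` is a list of distinct directed (source, target) pairs.
    """
    in_degree = Counter(target for _, target in edges)
    total = len(edges)
    return {v: -math.log2(d / total) for v, d in in_degree.items()}


def relatedness(a, b, edges):
    """Information of co-adjacent edges as a share of all edges of a and b."""
    bits = target_information(edges)
    out = defaultdict(set)
    for source, target in edges:
        out[source].add(target)
    shared = out[a] & out[b]   # targets both topics link to
    union = out[a] | out[b]    # targets either topic links to
    total_bits = sum(bits[v] for v in union)
    if total_bits == 0:
        return 0.0
    return sum(bits[v] for v in shared) / total_bits


# Hypothetical toy graph of topic-to-topic links.
edges = [("A", "X"), ("B", "X"), ("A", "Y"), ("B", "Z"), ("C", "X"), ("C", "Z")]
bits = target_information(edges)
r_ab = relatedness("A", "B", edges)
r_bc = relatedness("B", "C", edges)
```

In the toy graph, topics B and C link to exactly the same targets and so score higher than A and B, which share only the commonly linked target X.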


Current Assignee: Venio, Inc.
Original Assignee: Venio, Inc.
Inventors: Szucs, John J.; Warner, Kurtis L.; Paris, Thomas C.; Moye, Charles D.
Primary Examiner(s): Badawi, Sherief
Assistant Examiner(s): Brooks, David T
Application Number: US13/607,639
Publication Number: US 20130246430A1
Time in Patent Office: 1,467 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
G06F 16/3334   Selection or weighting of t...
G06F 16/3344   using natural language anal...
G06F 16/345   Summarisation for human users
G06F 16/374   Thesaurus
G06F 16/93   Document management systems
G06F 16/94   Hypermedia Hyperlinking G06...
