Systems and methods for characterizing linked documents using a latent topic model

  • US 8,234,274 B2
  • Filed: 12/01/2009
  • Issued: 07/31/2012
  • Est. Priority Date: 12/18/2008
  • Status: Active Grant
First Claim

1. A method for characterizing a corpus of documents each having one or more links, comprising:

  • forming a Bayesian network using the documents;

    determining a Bayesian network structure using the one or more links;

    generating a content link model where the model is a generative probabilistic model of the corpus along with citation information among documents, each document represented as a mixture over latent topics, and each relationship among documents is modeled by another generative process with a topic distribution of each document being a mixture of distributions associated with related documents;

    using a citation-topic (CT) model with a generative process for each word w in the document d in the corpus, with document probabilities Ξ, topic distribution matrix Θ and word probabilities matrix Ψ, including;

    choosing a related document c from p(c|d,Ξ), a multinomial probability conditioned on the document d;

    choosing a topic z from the topic distribution of the document c, p(z|c,Θ);

    choosing a word w which follows the multinomial distribution p(w|z,Ψ) conditioned on the topic z; and

    determining one or more topics in the corpus and topic distribution for each document wherein the content link model captures direct and indirect relationships represented by the links.
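
The three sampling steps recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. The matrix orientations (`XI[d][c]` = p(c|d,Ξ), `THETA[c][z]` = p(z|c,Θ), `PSI[z][w]` = p(w|z,Ψ)), the toy values, and the function names `sample_word` and `word_distribution` are all assumptions made for illustration. The second function computes the marginal p(w|d) = Σ_c Σ_z p(c|d) p(z|c) p(w|z), which follows from the generative process but is not itself recited in the claim.

```python
import random

# Toy parameters (assumed values): 3 documents, 2 topics, 4 vocabulary words.
# XI[d][c]    = p(c | d, Xi)    : probability that document d draws on document c
# THETA[c][z] = p(z | c, Theta) : topic distribution of document c
# PSI[z][w]   = p(w | z, Psi)   : word distribution of topic z
XI = [[0.6, 0.3, 0.1],
      [0.2, 0.7, 0.1],
      [0.1, 0.1, 0.8]]
THETA = [[0.9, 0.1],
         [0.5, 0.5],
         [0.2, 0.8]]
PSI = [[0.4, 0.4, 0.1, 0.1],
       [0.1, 0.1, 0.4, 0.4]]


def sample_word(d, xi, theta, psi, rng):
    """Draw one (c, z, w) triple for document d using the three claimed steps."""
    # Step 1: choose a related document c from the multinomial p(c | d, Xi).
    c = rng.choices(range(len(xi[d])), weights=xi[d])[0]
    # Step 2: choose a topic z from the topic distribution of c, p(z | c, Theta).
    z = rng.choices(range(len(theta[c])), weights=theta[c])[0]
    # Step 3: choose a word w from the multinomial p(w | z, Psi).
    w = rng.choices(range(len(psi[z])), weights=psi[z])[0]
    return c, z, w


def word_distribution(d, xi, theta, psi):
    """Marginal p(w | d), summing over related documents c and topics z."""
    n_words = len(psi[0])
    p = [0.0] * n_words
    for c, p_c in enumerate(xi[d]):
        for z, p_z in enumerate(theta[c]):
            for w in range(n_words):
                p[w] += p_c * p_z * psi[z][w]
    return p
```

Calling `sample_word(0, XI, THETA, PSI, random.Random(0))` returns index triples for document 0. Because each row of the toy matrices sums to one, `word_distribution` returns a vector that also sums to one, and its topic mixture for document d is a mixture of the topic distributions of the documents related to d, as the claim describes.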
