Systems and methods for characterizing linked documents using a latent topic model
Abstract
Systems and methods are disclosed for extracting characteristics from a corpus of linked documents by deriving a content link model that explicitly captures direct and indirect relations represented by the links, and extracting document topics and the topic distributions for all the documents in the corpus using the content-link model.
21 Claims
1. A method for characterizing a corpus of documents each having one or more links, comprising:
forming a Bayesian network using the documents;
determining a Bayesian network structure using the one or more links;
generating a content link model where the model is a generative probabilistic model of the corpus along with citation information among documents, each document represented as a mixture over latent topics, and each relationship among documents is modeled by another generative process with a topic distribution of each document being a mixture of distributions associated with related documents;
using a citation-topic (CT) model with a generative process for each word w in the document d in the corpus, with document probabilities Ξ, topic distribution matrix Θ and word probabilities matrix Ψ, including;
choosing a related document c from p(c|d,Ξ), a multinomial probability conditioned on the document d;
choosing a topic z from the topic distribution of the document c, p(z|c,Θ);
choosing a word w which follows the multinomial distribution p(w|z,Ψ) conditioned on the topic z; and
determining one or more topics in the corpus and topic distribution for each document wherein the content link model captures direct and indirect relationships represented by the links.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
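The three-step generative process recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. The function names, the toy values, and the row layout (xi[d][c] = p(c|d,Ξ), theta[c][z] = p(z|c,Θ), psi[z][w] = p(w|z,Ψ)) are assumptions made here for illustration; the claim fixes only the order of the three draws.

```python
import random

def sample_word(d, xi, theta, psi, rng):
    """Draw (c, z, w) for one word position of document d.

    Assumed layout (not specified in the claim):
    xi[d][c] = p(c|d), theta[c][z] = p(z|c), psi[z][w] = p(w|z).
    """
    # Step 1: choose a related document c from p(c|d, Xi).
    c = rng.choices(range(len(xi[d])), weights=xi[d])[0]
    # Step 2: choose a topic z from the topic distribution of c, p(z|c, Theta).
    z = rng.choices(range(len(theta[c])), weights=theta[c])[0]
    # Step 3: choose a word w from p(w|z, Psi).
    w = rng.choices(range(len(psi[z])), weights=psi[z])[0]
    return c, z, w

def generate_document(d, n_words, xi, theta, psi, seed=0):
    """Generate n_words word indices for document d (hypothetical helper)."""
    rng = random.Random(seed)
    return [sample_word(d, xi, theta, psi, rng)[2] for _ in range(n_words)]

# Toy parameters (made up): 2 documents, 2 topics, 3 vocabulary words.
XI = [[0.2, 0.8], [0.6, 0.4]]
THETA = [[0.9, 0.1], [0.3, 0.7]]
PSI = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]

words = generate_document(0, 10, XI, THETA, PSI)
print(words)  # ten word indices, each in {0, 1, 2}
```

Because the topic is drawn from the related document c rather than from d itself, the words of d are tied to the topics of the documents it is related to.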
15. A method for extracting characteristics from a corpus of linked documents, comprising:
deriving a content link model that explicitly captures direct and indirect relations represented by the links where the model is a generative probabilistic model of the corpus along with citation information among documents, each document represented as a mixture over latent topics, and each relationship among documents is modeled by another generative process, with a topic distribution of each document being a mixture of distributions associated with related documents, using a citation-topic (CT) model with a generative process for each word w in the document d in the corpus, with document probabilities Ξ, topic distribution matrix Θ and word probabilities matrix Ψ, including;
choosing a related document c from p(c|d,Ξ), a multinomial probability conditioned on the document d;
choosing a topic z from the topic distribution of the document c, p(z|c,Θ);
choosing a word w which follows the multinomial distribution p(w|z,Ψ) conditioned on the topic z; and
extracting document topics and the topic distributions for all the documents in the corpus using the content-link model, wherein the content link model captures direct and indirect relationships represented by the links.
Dependent claims: 16, 17, 18, 19
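The phrase "a topic distribution of each document being a mixture of distributions associated with related documents" can be made concrete by marginalizing over the related document c: p(z|d) = Σ_c p(c|d,Ξ) p(z|c,Θ), and then p(w|d) = Σ_z p(z|d) p(w|z,Ψ). The sketch below assumes the parameters have already been estimated; the claim does not name an estimation procedure, and none is shown here. The function names, the list-of-lists layout, and the toy values are illustrative assumptions.

```python
def topic_mixture(d, xi, theta):
    """p(z|d) = sum over c of p(c|d) * p(z|c).

    Assumed layout: xi[d][c] = p(c|d), theta[c][z] = p(z|c).
    """
    n_topics = len(theta[0])
    return [
        sum(xi[d][c] * theta[c][z] for c in range(len(theta)))
        for z in range(n_topics)
    ]

def word_distribution(d, xi, theta, psi):
    """p(w|d) = sum over z of p(z|d) * p(w|z), with psi[z][w] = p(w|z)."""
    p_z = topic_mixture(d, xi, theta)
    n_words = len(psi[0])
    return [
        sum(p_z[z] * psi[z][w] for z in range(len(psi)))
        for w in range(n_words)
    ]

# Toy parameters (made up): 2 documents, 2 topics, 3 vocabulary words.
XI = [[0.5, 0.5], [0.0, 1.0]]
THETA = [[1.0, 0.0], [0.0, 1.0]]
PSI = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]

print(topic_mixture(0, XI, THETA))           # [0.5, 0.5]
print(word_distribution(0, XI, THETA, PSI))  # [0.25, 0.5, 0.25]
```

In this toy case document 0 splits its weight evenly between two related documents, so its topic distribution is the even mixture of their topic distributions.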
20. A system for extracting characteristics from a corpus of linked documents, comprising:
a processor and a data storage device coupled to the processor;
computer readable code to derive a content link model that explicitly captures direct and indirect relations represented by the links where the model is a generative probabilistic model of the corpus along with citation information among documents, each document represented as a mixture over latent topics, and each relationship among documents is modeled by another generative process with a topic distribution of each document being a mixture of distributions associated with related documents, using a citation-topic (CT) model with a generative process for each word w in the document d in the corpus, with document probabilities Ξ, topic distribution matrix Θ and word probabilities matrix Ψ, including;
choosing a related document c from p(c|d,Ξ), a multinomial probability conditioned on the document d;
choosing a topic z from the topic distribution of the document c, p(z|c,Θ);
choosing a word w which follows the multinomial distribution p(w|z,Ψ) conditioned on the topic z; and
computer readable code to extract document topics and the topic distributions for all the documents in the corpus using the content-link model, wherein the content link model captures direct and indirect relationships represented by the links.
21. A system for characterizing a corpus of documents each having one or more links, comprising:
a processor and a data storage device coupled to the processor;
computer readable code to form a Bayesian network using the documents;
computer readable code to determine Bayesian network structure using the one or more links;
computer readable code to generate a content link model; and
computer readable code to apply the content link model to determine one or more topics in the corpus and topic distribution for each document; and
where the model is a generative probabilistic model of the corpus along with citation information among documents, each document represented as a mixture over latent topics, and each relationship among documents is modeled by another generative process with a topic distribution of each document being a mixture of distributions associated with related documents;
using a citation-topic (CT) model with a generative process for each word w in the document d in the corpus, with document probabilities Ξ, topic distribution matrix Θ and word probabilities matrix Ψ, including;
choosing a related document c from p(c|d,Ξ), a multinomial probability conditioned on the document d;
choosing a topic z from the topic distribution of the document c, p(z|c,Θ);
choosing a word w which follows the multinomial distribution p(w|z,Ψ) conditioned on the topic z;
wherein the content link model captures direct and indirect relationships represented by the links.
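The claims state that the Bayesian network structure is determined using the links, but they do not say how. One hypothetical reading, offered only as a sketch, is that the links fix which entries of Ξ may be non-zero: p(c|d,Ξ) is restricted to the documents that d links to and, before any fitting, initialized uniformly over them. The function name, the dictionary input format, and the uniform initialization are all assumptions made for illustration.

```python
def init_citation_probs(links, n_docs):
    """Build an initial xi with xi[d][c] = p(c|d), supported only on links.

    links maps a document index d to the list of documents it links to.
    Assumption: every document has at least one link, as the claims
    require; a document without links would get an all-zero row here.
    """
    xi = []
    for d in range(n_docs):
        targets = sorted(set(links.get(d, [])))
        row = [0.0] * n_docs
        for c in targets:
            # Uniform starting weight over linked documents (assumed).
            row[c] = 1.0 / len(targets)
        xi.append(row)
    return xi

# Toy link structure (made up): 0 links to 1 and 2, 1 links to 2, 2 links to 0.
LINKS = {0: [1, 2], 1: [2], 2: [0]}
XI0 = init_citation_probs(LINKS, 3)
for row in XI0:
    print(row)
```

Entries left at zero correspond to document pairs with no link, so the link graph determines the sparsity pattern of this initial Ξ.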
Specification