Systems and methods for characterizing linked documents using a latent topic model

US 8,234,274 B2
Filed: 12/01/2009
Issued: 07/31/2012
Est. Priority Date: 12/18/2008
Status: Active Grant

Abstract

Systems and methods are disclosed for extracting characteristics from a corpus of linked documents by deriving a content link model that explicitly captures direct and indirect relations represented by the links, and extracting document topics and the topic distributions for all the documents in the corpus using the content-link model.

21 Claims

1. A method for characterizing a corpus of documents each having one or more links, comprising:
- forming a Bayesian network using the documents;
  
  determining a Bayesian network structure using the one or more links;
  
  generating a content link model where the model is a generative probabilistic model of the corpus along with citation information among documents, each document represented as a mixture over latent topics, and each relationship among documents is modeled by another generative process with a topic distribution of each document being a mixture of distributions associated with related documents;
  
  using a citation-topic (CT) model with a generative process for each word w in the document d in the corpus, with document probabilities Ξ, topic distribution matrix Θ and word probabilities matrix Ψ, including;
  
  choosing a related document c from p(c|d,Ξ), a multinomial probability conditioned on the document d;
  
  choosing a topic z from the topic distribution of the document c, p(z|c,Θ);
  
  choosing a word w which follows the multinomial distribution p(w|z,Ψ) conditioned on the topic z; and
  
  determining one or more topics in the corpus and topic distribution for each document wherein the content link model captures direct and indirect relationships represented by the links.
- - 2. The method of claim 1, comprising applying a Bayesian inference to extract document topics.
  - 3. The method of claim 1, comprising applying non-negative matrix factorization to the content link model.
  - 4. The method of claim 1, comprising performing a maximum likelihood estimation in the content link model.
  - 5. The method of claim 1, comprising performing a Bayesian inference in a BPT (Bernoulli process topic) model.
  - 6. The method of claim 1, wherein the Bayesian network encodes direct and indirect relations, and wherein relationships are derived explicitly from links or implicitly from similarities among documents.
  - 7. The method of claim 1, comprising applying the topics for characterizing, representing, summarizing, visualizing, indexing, ranking, or searching the documents in the corpus.
  - 8. The method of claim 1, comprising applying the topics and the topic distributions to derive features for document clustering or classification.
  - 9. The method of claim 1, comprising applying EM to estimate the parameters in a citation-topic model and a Bernoulli-process topic model.
  - 10. The method of claim 1, comprising applying a nonnegative matrix factorization (NMF) to estimate parameters for the Bayesian network.
  - 11. The method of claim 10, comprising applying the NMF to perform maximum likelihood estimation for a citation topic model.
  - 12. The method of claim 10, comprising applying the NMF to perform Bayesian inference for a Bernoulli-process topic model.
  - 13. The system of claim 1, comprising a BPT (Bernoulli process topic) model with computer readable code to process an input document D = {d₁, d₂, . . . , d_N} to generate variational parameters {γ_i, 1 ≦ i ≦ N} and regenerative model parameters α, Λ, further comprising code to;
  - 14. The system of claim 1, comprising applying a nonnegative matrix factorization (NMF) method and a sparse LU-decomposition-based method.
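
The generative process recited in claim 1 (choose a related document c, then a topic z, then a word w) can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the toy parameter values, and the row-stochastic layout (`Xi[d][c]` = p(c|d), `Theta[c][z]` = p(z|c), `Psi[z][w]` = p(w|z)) are assumptions made for illustration only.

```python
import random

def sample_words(d, n_words, Xi, Theta, Psi, rng):
    """Draw n_words word ids for document d by following the three CT steps."""
    words = []
    for _ in range(n_words):
        # Step 1: choose a related document c from p(c | d, Xi).
        c = rng.choices(range(len(Xi[d])), weights=Xi[d])[0]
        # Step 2: choose a topic z from the topic distribution of c, p(z | c, Theta).
        z = rng.choices(range(len(Theta[c])), weights=Theta[c])[0]
        # Step 3: choose a word w from p(w | z, Psi).
        w = rng.choices(range(len(Psi[z])), weights=Psi[z])[0]
        words.append(w)
    return words

def word_distribution(d, Xi, Theta, Psi):
    """Marginal p(w | d): sum over c and z of p(c|d) * p(z|c) * p(w|z)."""
    vocab_size = len(Psi[0])
    probs = [0.0] * vocab_size
    for c, p_c in enumerate(Xi[d]):
        for z, p_z in enumerate(Theta[c]):
            for w in range(vocab_size):
                probs[w] += p_c * p_z * Psi[z][w]
    return probs

# Toy parameters (hypothetical): 3 documents, 2 topics, 4 vocabulary words.
Xi = [[0.7, 0.3, 0.0],   # document 0 draws on itself and on document 1
      [0.0, 0.6, 0.4],   # document 1 draws on itself and on document 2
      [0.0, 0.0, 1.0]]   # document 2 draws only on itself
Theta = [[0.9, 0.1],
         [0.5, 0.5],
         [0.1, 0.9]]
Psi = [[0.4, 0.4, 0.1, 0.1],
       [0.1, 0.1, 0.4, 0.4]]

rng = random.Random(0)
words = sample_words(0, 20, Xi, Theta, Psi, rng)
p_w_given_d2 = word_distribution(2, Xi, Theta, Psi)
```

Under these assumptions, a zero entry in `Xi` means the two documents are treated as unrelated, so the word distribution of document 0 mixes the topic distributions of documents 0 and 1 only.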

15. A method for extracting characteristics from a corpus of linked documents, comprising:
- deriving a content link model that explicitly captures direct and indirect relations represented by the links where the model is a generative probabilistic model of the corpus along with citation information among documents, each document represented as a mixture over latent topics, and each relationship among documents is modeled by another generative process, with a topic distribution of each document being a mixture of distributions associated with related documents, using a citation-topic (CT) model with a generative process for each word w in the document d in the corpus, with document probabilities Ξ, topic distribution matrix Θ and word probabilities matrix Ψ, including;
  
  choosing a related document c from p(c|d,Ξ), a multinomial probability conditioned on the document d;
  
  choosing a topic z from the topic distribution of the document c, p(z|c,Θ);
  
  choosing a word w which follows the multinomial distribution p(w|z,Ψ) conditioned on the topic z; and
  
  extracting document topics and the topic distributions for all the documents in the corpus using the content-link model, wherein the content link model captures direct and indirect relationships represented by the links.
- - 16. The method of claim 15, comprising applying inference algorithms based on non-negative matrix factorization to the content model.
  - 17. The method of claim 15, comprising generating a Bayesian network for the documents that encodes the direct and indirect relations, wherein the relations are derived either explicitly from links or implicitly from the similarities among documents.
  - 18. The method of claim 15, comprising applying the topics for characterizing, representing, summarizing, visualizing, indexing, ranking, or searching the documents in the corpus.
  - 19. The method of claim 15, comprising applying the topics and the topic distributions to derive features for document clustering or classification.

20. A system for extracting characteristics from a corpus of linked documents, comprising:
- a processor and a data storage device coupled to the processor;
  
  computer readable code to derive a content link model that explicitly captures direct and indirect relations represented by the links where the model is a generative probabilistic model of the corpus along with citation information among documents, each document represented as a mixture over latent topics, and each relationship among documents is modeled by another generative process with a topic distribution of each document being a mixture of distributions associated with related documents, using a citation-topic (CT) model with a generative process for each word w in the document d in the corpus, with document probabilities Ξ, topic distribution matrix Θ and word probabilities matrix Ψ, including;
  
  choosing a related document c from p(c|d,Ξ), a multinomial probability conditioned on the document d;
  
  choosing a topic z from the topic distribution of the document c, p(z|c,Θ);
  
  choosing a word w which follows the multinomial distribution p(w|z,Ψ) conditioned on the topic z; and
  
  computer readable code to extract document topics and the topic distributions for all the documents in the corpus using the content-link model, wherein the content link model captures direct and indirect relationships represented by the links.

21. A system for characterizing a corpus of documents each having one or more links, comprising:
- a processor and a data storage device coupled to the processor;
  
  computer readable code to form a Bayesian network using the documents;
  
  computer readable code to determine Bayesian network structure using the one or more links;
  
  computer readable code to generate a content link model; and
  
  computer readable code to apply the content link model to determine one or more topics in the corpus and topic distribution for each document; and
  
  where the model is a generative probabilistic model of the corpus along with citation information among documents, each document represented as a mixture over latent topics, and each relationship among documents is modeled by another generative process with a topic distribution of each document being a mixture of distributions associated with related documents;
  
  using a citation-topic (CT) model with a generative process for each word w in the document d in the corpus, with document probabilities Ξ, topic distribution matrix Θ and word probabilities matrix Ψ, including;
  
  choosing a related document c from p(c|d,Ξ), a multinomial probability conditioned on the document d;
  
  choosing a topic z from the topic distribution of the document c, p(z|c,Θ);
  
  choosing a word w which follows the multinomial distribution p(w|z,Ψ) conditioned on the topic z;
  
  wherein the content link model captures direct and indirect relationships represented by the links.
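
The three sampling steps recited in the independent claims imply a marginal word distribution for each document, obtained by summing over the unobserved related document c and topic z. The derivation below is a sketch that follows from those steps. The count n(d, w) (occurrences of word w in document d) and the corpus log-likelihood are notation introduced here for illustration and do not appear in the claims.

```latex
\begin{align}
p(w \mid d) &= \sum_{c} \sum_{z} p(c \mid d, \Xi)\, p(z \mid c, \Theta)\, p(w \mid z, \Psi), \\
\mathcal{L}(\Xi, \Theta, \Psi) &= \sum_{d} \sum_{w} n(d, w)\, \log p(w \mid d).
\end{align}
```

Assuming each matrix stores its conditional distribution row-wise, the first line states that the document-by-word probability matrix is the product of the three parameter matrices in the order Ξ, Θ, Ψ.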

Current Assignee: NEC Corporation
Original Assignee: NEC Laboratories America Inc (NEC Corporation)
Inventors: Guo, Zhen; Zhu, Shenghuo; Chi, Yun
Primary Examiner(s): Vital, Pierre
Assistant Examiner(s): Obisesan, Augustine K
Application Number: US12/629,043
Publication Number: US 20100161611A1
Time in Patent Office: 973 Days
Field of Search: None
US Class Current: 707/726
CPC Class Codes: G06F 16/94 Hypermedia Hyperlinking G06...; G06N 7/01 Probabilistic graphical mod...
