Knowledge discovery from citation networks

US 9,269,051 B2
Filed: 12/29/2014
Issued: 02/23/2016
Est. Priority Date: 12/06/2010
Status: Active Grant



Abstract

In a corpus of scientific articles such as a digital library, documents are connected by citations, and each document plays two different roles in the corpus: as a document itself and as a citation of other documents. A Bernoulli Process Topic (BPT) model is provided which models the corpus at two levels: document level and citation level. In the BPT model, each document has two different representations in the latent topic space, one for each of its roles. Moreover, the multi-level hierarchical structure of the citation network is captured by a generative process involving a Bernoulli process. The distribution parameters of the BPT model are estimated by a variational approximation approach.

20 Claims

1. A method of modeling a set of documents within an automated computer system, each document within the set of documents comprising content, and at least a portion of the documents being linked to other documents through citations, comprising:
- automatically defining a Bernoulli Process Topic model with the automated computer system, for the set of documents, representing each document as a mixture over latent topics, dependent on (1) an intrinsic content of each respective document, and (2) a content of other documents related to the respective document through a multi-level citation structure of direct and indirect linkages to the other documents;
  
  for each respective document, automatically determining with the automated computer system, a first set of latent topic characteristics based on an intrinsic content of the respective document;
  
  for each document, automatically determining with the automated computer system, a second set of latent topic characteristics based on a respective content of the other documents which are directly and indirectly linked, the indirectly linked documents contributing transitively to the latent topic characteristics of the respective document;
  
  automatically representing with the automated computer system, a third set of latent topics for the respective document based on a joint probability distribution of at least the first and second sets of latent topic characteristics, according to the Bernoulli Process Topic model; and
  
  outputting, by the automated computer system, at least one document in response to an input, based on at least the third set of latent topics.
  - 2. The method according to claim 1, further comprising automatically ranking with the automated computer system, at least a portion of the set of documents, in dependence on at least the input and the third set of latent topics.
  - 3. The method according to claim 1, further comprising automatically determining with the automated computer system, a Bayesian network structure based on at least one linkage, wherein each linkage implies a content relationship of the linked documents.
  - 4. The method according to claim 1, further comprising automatically iteratively determining for each location in each document, with the automated computer system, a topic from a topic distribution of the document having a Dirichlet probability distribution, and a word which follows a multinomial distribution conditioned on the determined topic for the respective location.
  - 5. The method according to claim 1, further comprising, for each document at a document level, and for each location in each document, using the automated computer system, iteratively until convergence within a convergence criterion is achieved:
    - choosing a document linked to the respective document based on the location in the document and a multinomial distribution conditioned on the respective document,
    - choosing a topic from a topic distribution of the linked document based on the respective location in the respective document, and
    - choosing a word which follows the multinomial distribution conditioned on the chosen topic.
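
The two sampling procedures recited in claims 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the toy sizes, and the probability tables are all hypothetical, and the link probabilities are stored one row per document purely for convenience.

```python
import random

def sample_index(probs, rng):
    """Draw one index from a discrete distribution (a single multinomial trial)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dir(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def generate_citation_level(theta_d, word_given_topic, n_words, rng):
    """Claim 4 (sketch): per location, pick a topic from the document's own
    topic distribution, then a word conditioned on that topic."""
    words = []
    for _ in range(n_words):
        z = sample_index(theta_d, rng)
        words.append(sample_index(word_given_topic[z], rng))
    return words

def generate_document_level(s, link_probs, theta, word_given_topic, n_words, rng):
    """Claim 5 (sketch): per location, pick a linked document, then a topic
    from that linked document's topic distribution, then a word."""
    words = []
    for _ in range(n_words):
        c = sample_index(link_probs[s], rng)   # linked document for this location
        t = sample_index(theta[c], rng)        # topic drawn from the linked document
        words.append(sample_index(word_given_topic[t], rng))
    return words

# Toy setup (all values hypothetical): K topics, M vocabulary words, N documents.
rng = random.Random(0)
K, M, N = 2, 4, 3
theta = [sample_dirichlet([0.5] * K, rng) for _ in range(N)]       # one row per document
word_given_topic = [[0.7, 0.3, 0.0, 0.0], [0.0, 0.0, 0.4, 0.6]]    # one row per topic
link_probs = [[0.6, 0.3, 0.1], [0.0, 1.0, 0.0], [0.2, 0.2, 0.6]]   # row s: p(linked doc | d_s)
doc0 = generate_document_level(0, link_probs, theta, word_given_topic, 10, rng)
```

Here `doc0` is a list of ten word indices. The claims also recite iterating until a convergence criterion is met, which belongs to parameter estimation and is not shown in this sampling sketch.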

6. A method for characterizing a corpus of documents, each having one or more semantically significant citation linkages between respective documents, comprising:
- identifying, using at least one automated processor, a multilevel hierarchy of directly cited documents and indirectly cited documents;
  
  for each respective document, determining using the at least one automated processor, latent topic characteristics based on an intrinsic semantic content of the respective document, semantic content associated with directly cited documents, and semantic content associated with indirectly cited documents, wherein a semantic content significance of a citation has a transitive property;
  
  representing latent topics for documents within the corpus using the at least one automated processor, based on a joint probability distribution of the latent topic characteristics; and
  
  storing, in a memory device, by the at least one automated processor, the represented set of latent topics.
  - 7. The method according to claim 6, wherein relationships among the documents of the corpus of documents are automatically modeled by the at least one automated processor according to a Bernoulli random process, such that a topic distribution of each respective document is a mixture of distributions associated with the directly linked documents and the indirectly linked documents.
  - 8. The method according to claim 7, wherein a joint probability distribution is estimated by an iterative process implemented by the at least one automated processor, at a citation level, comprising, for each document $d_j$, and for the $i$-th location in document $d_j$:
    - choosing a topic $z_{ji}$ from a topic distribution of document $d_j$, $p(z \mid d_j, \theta_{d_j})$, where the distribution parameter $\theta_{d_j}$ is drawn from a Dirichlet distribution $\mathrm{Dir}(\alpha)$,
    - choosing a word $w_{ji}$ which follows a multinomial distribution $p(w \mid z_{ji}, \Lambda)$ conditioned on the topic $z_{ji}$, and
    - respectively incrementing the locations $i$ and documents $d_j$ to process each location in each document.
  - 9. The method according to claim 8, wherein a joint probability distribution is estimated by an iterative process implemented by the at least one automated processor, at a document level, comprising, for each document $d_s$, and for the $i$-th location in each document $d_s$:
    - choosing a cited document $c_{si}$ from $p(c \mid d_s, \Xi)$, a multinomial distribution conditioned on the document $d_s$,
    - choosing a topic $t_{si}$ from a topic distribution of the document $c_{si}$ at a citation level, and
    - choosing a word $w_{si}$ which follows the multinomial distribution $p(w \mid t_{si}, \Lambda)$ conditioned on the topic $t_{si}$,

    wherein $\Xi$ is a mixing coefficient matrix which represents how much of a content of a respective document is from direct citations and how much of the content is from indirect citations, and a composition of $\Xi$ and $\theta$ models the topic distribution at the document level.
  - 10. The method according to claim 9, wherein:
    - the mixing coefficient matrix $\Xi$ is an $N \times N$ matrix, where $\Xi_{js} = p(c_{si} = d_j \mid d_s)$, which is treated as a fixed quantity computed from the citation information of the corpus of documents;
    - $K$ is a number of latent topics;
    - matrix $\Theta$ is a $K \times N$ matrix representing a set of topic distributions at a citation level, where $\Theta_{lj} = p(z_{ji} = l \mid d_j)$,
    - matrix $\Lambda$ is a $M \times K$ matrix representing word probability, where $\Lambda_{hl} = p(w_{si}^{h} = 1 \mid t_{si} = l)$, and
    - each document $d_s$ has a set of citations $Q_{d_s}$,

    the method further comprising constructing by the at least one automated processor, a matrix $S$ to denote direct relationships among the documents $d_s$ wherein
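
The "composition of Ξ and θ" recited in claims 9 and 10 can be read as a matrix product. The following is a sketch derived only from the definitions of Ξ, Θ, and Λ given in those claims, by marginalizing over the cited document chosen at each location; it assumes that the topic for a chosen cited document $d_j$ is drawn from column $j$ of Θ, and it is not language from the patent itself.

```latex
\begin{aligned}
p(t_{si} = l \mid d_s)
  &= \sum_{j=1}^{N} p(t_{si} = l \mid c_{si} = d_j)\, p(c_{si} = d_j \mid d_s)
   = \sum_{j=1}^{N} \Theta_{lj}\, \Xi_{js}
   = (\Theta\Xi)_{ls}, \\
p(w_{si}^{h} = 1 \mid d_s)
  &= \sum_{l=1}^{K} \Lambda_{hl}\, (\Theta\Xi)_{ls}
   = (\Lambda\Theta\Xi)_{hs}.
\end{aligned}
```

Under this reading, column $s$ of the $K \times N$ product ΘΞ is the document-level topic distribution of $d_s$, while column $s$ of Θ is its citation-level distribution.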

11. An automated method for determining a set of latent topics of a Bayesian network of multilevel hierarchically related documents having direct and indirect references associated with content relationships, comprising:
- for each respective document, automatically determining using at least one automated processor, a set of latent topic characteristics based on an intrinsic content of the respective document and a set of latent topic characteristics based on a respective content of other documents which are directly referenced and indirectly referenced through at least one other document to the respective document;
  
  automatically representing by the at least one automated processor, a set of latent topics for each respective document based on a joint probability distribution of at least the latent topic characteristics based on the intrinsic content and the respective content of other documents which are directly referenced and indirectly referenced through at least one other document to the respective document, dependent on the identified network and a random process; and
  
  automatically communicating, using the at least one automated processor, information dependent on the represented set of latent topics for at least one respective document.
  - 12. The method according to claim 11, wherein relationships among the Bayesian network of multilevel hierarchically related documents are automatically modeled by the at least one automated processor according to a Bernoulli process, such that a latent topic distribution of each respective document is a mixture of distributions associated with the related document.
  - 13. The method according to claim 11, wherein the Bayesian network of multilevel hierarchically related documents is automatically modeled by the at least one automated processor according to a generative probabilistic model of a latent topic content of the set of documents along with the references among the Bayesian network of multilevel hierarchically related documents.
  - 14. The method according to claim 11, wherein relationships among the Bayesian network of multilevel hierarchically related documents are automatically modeled by the at least one automated processor according to a Bernoulli process such that a topic distribution of each respective document is a mixture of distributions associated with documents linked to the respective documents.
  - 15. The method according to claim 11, wherein the latent topics are automatically modeled using the at least one automated processor, at both a document level and a citation level, and distinctions in the multilevel hierarchical network are captured by a Bernoulli random process.
  - 16. The method according to claim 11, wherein the Bayesian network of multilevel hierarchically related documents is automatically modeled by the at least one automated processor using a generative probabilistic model of a topic content of each document of the Bayesian network of multilevel hierarchically related documents along with the linkages among members of the Bayesian network of multilevel hierarchically related documents.
  - 17. The method according to claim 11, further comprising:
    - automatically performing an iterative process using the at least one automated processor for each document referenced by a respective document, and for each location in each document, comprising choosing a topic from a Dirichlet topic distribution of a respective document, choosing a word which follows a multinomial distribution conditioned on the topic; and
      
      automatically performing an iterative process using the at least one automated processor for each document, and for each location in each document, comprising choosing a referenced document from a multinomial distribution conditioned on the respective document, choosing a topic from the topic distribution of the document at the referenced document level, and choosing a word which follows a multinomial distribution conditioned on the topic.
  - 18. The method according to claim 11, further comprising:
    - automatically performing using the at least one automated processor, an iterative process for each document $d_j$ referenced by a respective document, and for each location $i$ in each document $d_j$, comprising:
      - choosing a topic $z_{ji}$ from a topic distribution of document $d_j$, $p(z \mid d_j, \theta_{d_j})$, where the distribution parameter $\theta_{d_j}$ is drawn from a Dirichlet distribution $\mathrm{Dir}(\alpha)$,
      - choosing a word $w_{ji}$ which follows a multinomial distribution $p(w \mid z_{ji}, \Lambda)$ conditioned on the topic $z_{ji}$,

      wherein a composition of $\theta$ models the topic distribution at the document level;
    - performing an iterative process for each document $d_s$, and for each location $i$ in each document $d_s$, comprising:
      - choosing a referenced document $c_{si}$ from $p(c \mid d_s, \Xi)$, a multinomial distribution conditioned on the document $d_s$,
      - choosing a topic $t_{si}$ from the topic distribution of the document $c_{si}$ at the referenced document level, and
      - choosing a word $w_{si}$ which follows a multinomial distribution $p(w \mid t_{si}, \Lambda)$ conditioned on the topic $t_{si}$,

      wherein $\Xi$ is a mixing coefficient matrix which represents contributions to the content of the respective document from direct references and indirect references, and a composition of $\Xi$ models the topic distribution at the document level,

      wherein a number of latent topics is $K$ and the mixing coefficients are parameterized by an $N \times N$ matrix $\Xi$ where $\Xi_{js} = p(c_{si} = d_j \mid d_s)$, which are treated as a fixed quantity computed from the reference information of the network of multilevel hierarchically related documents, and

      wherein topic distributions at a level of the referenced documents are parameterized by a $K \times N$ matrix $\Theta$ where $\Theta_{lj} = p(z_{ji} = l \mid d_j)$, and an $M \times K$ word probability matrix $\Lambda$, where $\Lambda_{hl} = p(w_{si}^{h} = 1 \mid t_{si} = l)$,

      wherein each document $d_s$ has a set of references $Q_d$,

      further comprising automatically constructing using the at least one automated processor a matrix $S$ to denote direct relationships among the documents wherein
  - 19. The method according to claim 11, further comprising using the set of latent topics to automatically rank using the at least one automated processor, at least a portion of the Bayesian network of multilevel hierarchically related documents.
  - 20. The method according to claim 11, further comprising using, by the at least one automated processor, the set of latent topics as an index for the Bayesian network of multilevel hierarchically related documents.
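
Claims 2, 19, and 20 recite ranking and indexing documents by their latent topics but do not fix a scoring function. The sketch below shows one possible approach under stated assumptions: document-level topic mixtures are taken as the columns of the product of Θ and Ξ (following the matrix definitions in claims 10 and 18), and documents are ranked by cosine similarity to a query topic vector. The cosine score, the query vector, the function names, and all numeric values are hypothetical choices made for illustration.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of a (r x n) and b (n x c)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def rank_documents(theta, xi, query_topics):
    """Return document indices, best match first, scored by the similarity of
    each document-level topic mixture (column s of Theta * Xi) to the query."""
    doc_topics = matmul(theta, xi)                       # K x N
    n_docs = len(doc_topics[0])
    columns = [[row[s] for row in doc_topics] for s in range(n_docs)]
    scores = [cosine(col, query_topics) for col in columns]
    return sorted(range(n_docs), key=lambda s: scores[s], reverse=True)

# Toy values (hypothetical): K = 2 topics, N = 3 documents.
theta = [[0.9, 0.2, 0.5],
         [0.1, 0.8, 0.5]]        # theta[l][j] = p(z = l | d_j), columns sum to 1
xi = [[1.0, 0.3, 0.0],
      [0.0, 0.7, 0.0],
      [0.0, 0.0, 1.0]]           # xi[j][s] = p(c = d_j | d_s), columns sum to 1
ranking = rank_documents(theta, xi, query_topics=[1.0, 0.0])
```

In this toy corpus the second document draws 70% of its content from its own topics and 30% from the first document, so its document-level mixture differs from its citation-level one. Any other similarity measure over the same columns could be substituted.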


Current Assignee
The Research Foundation for The State University of New York (State University of New York)
Original Assignee
The Research Foundation for The State University of New York (State University of New York)
Inventors
Guo, Zhen; Zhang, Mark
Primary Examiner(s)
HO, BINH VAN

Application Number

US14/583,925
Publication Number

US 20150186789A1
Time in Patent Office

421 Days
Field of Search

707/608
US Class Current

1/1
CPC Class Codes

G06F 16/24573   using data annotations, e.g...

G06F 16/24578   using ranking

G06F 16/93   Document management systems

G06F 16/9558   Details of hyperlinks; Mana...

G06N 20/00   Machine learning

G06N 7/01   Probabilistic graphical mod...

G06Q 10/10   Office automation; Time man...
