Knowledge discovery from citation networks
Abstract
In a corpus of scientific articles such as a digital library, documents are connected by citations, and each document plays two different roles in the corpus: as a document in its own right and as a citation of other documents. A Bernoulli Process Topic (BPT) model is provided which models the corpus at two levels: the document level and the citation level. In the BPT model, each document has two different representations in the latent topic space, one for each of its roles. Moreover, the multi-level hierarchical structure of the citation network is captured by a generative process involving a Bernoulli process. The distribution parameters of the BPT model are estimated by a variational approximation approach.
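One way to picture how a Bernoulli process can capture multi-level citation structure is a walk over the citation graph in which a Bernoulli trial at each cited document decides whether to stop there or to follow one of its own citations. The sketch below is only an illustration under stated assumptions, not the model described in the specification. The function name `citation_walk_weights`, the stopping probability `stop_prob`, the uniform choice among citations, and the `max_depth` cutoff are all hypothetical, and the variational parameter estimation is not shown.

```python
from collections import defaultdict


def citation_walk_weights(cites, start, stop_prob=0.5, max_depth=10):
    """Return {doc: probability that a citation walk from `start` stops at doc}.

    `cites` maps each document id to the list of documents it cites.
    All names and the uniform-choice assumption are illustrative only.
    """
    weights = defaultdict(float)

    # Start by spreading unit mass uniformly over the directly cited documents.
    direct = cites.get(start, [])
    frontier = {}
    for c in direct:
        frontier[c] = frontier.get(c, 0.0) + 1.0 / len(direct)

    for depth in range(max_depth):
        if not frontier:
            break
        next_frontier = {}
        for doc, mass in frontier.items():
            onward = cites.get(doc, [])
            # Forced stop: nothing further to follow, or depth limit reached.
            if not onward or depth == max_depth - 1:
                weights[doc] += mass
                continue
            # Bernoulli trial: stop here with probability stop_prob ...
            weights[doc] += mass * stop_prob
            # ... otherwise move on to one of this document's citations.
            share = mass * (1.0 - stop_prob) / len(onward)
            for c in onward:
                next_frontier[c] = next_frontier.get(c, 0.0) + share
        frontier = next_frontier

    return dict(weights)


# Toy graph: A cites B and C; B cites D, so D is indirectly cited by A.
toy = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}
weights_for_a = citation_walk_weights(toy, "A", stop_prob=0.5)
```

In this toy graph, document D receives nonzero weight for A even though A never cites it directly, which is the kind of indirect influence the abstract attributes to the multi-level structure.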
17 Claims
1. A method for characterizing a set of documents, comprising:
identifying a network of multilevel hierarchically related documents having direct and indirect references associated with content relationships;
for each respective document, determining a set of latent topic characteristics captured by a Bernoulli process, based on at least both an intrinsic content of the respective document and a set of latent topic characteristics based on a respective content of other documents which are directly referenced and indirectly referenced through at least one other document to the respective document, such that a topic distribution of each respective document is a mixture of distributions associated with at least the at least one other document;
representing a set of latent topics for the respective document based on a joint probability distribution of at least the latent topic characteristics based on the intrinsic content and the respective content of other documents which are directly referenced and indirectly referenced through at least one other document to the respective document, dependent on the identified network and a random process; and
storing, in a memory, the represented set of latent topics for the respective document.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9)
10. A method for characterizing a set of documents, comprising:
identifying a multilevel hierarchy of documents having direct references and indirect references, having citation linkages with semantic significance;
for each respective document, determining latent topic characteristics captured by a Bernoulli process, based on at least an intrinsic semantic content of the respective document, semantic content associated with directly cited documents, and semantic content associated with documents referenced by directly cited documents, wherein a semantic content significance of a citation has a transitive property, such that a topic distribution of each respective document is a mixture of distributions associated with at least the cited documents;
representing latent topics for documents within the set of documents based on a joint probability distribution of the latent topic characteristics, wherein distinctions in the multilevel hierarchical network are captured by a random process; and
storing, in a memory, the represented set of latent topics.
(Dependent claims: 11, 12, 13, 14, 15)
16. A non-transitory computer readable medium storing instructions for controlling a programmable processor to characterize a set of documents, to perform a method comprising:
for each respective document in a multilevel hierarchy of documents having direct references and indirect references, having citation linkages with semantic significance, determining latent topic characteristics captured by a Bernoulli process, based on at least an intrinsic semantic content of the respective document, semantic content associated with directly cited documents, and semantic content associated with documents referenced by directly cited documents, wherein a semantic content significance of a citation has a transitive property, such that a topic distribution of each respective document is a mixture of distributions associated with at least the cited documents;
representing latent topics for documents within the set of documents based on a joint probability distribution of the latent topic characteristics, wherein distinctions in the multilevel hierarchical network are captured by a random process; and
storing the represented set of latent topics.
(Dependent claims: 17)
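The independent claims each recite a topic distribution that "is a mixture of distributions" associated with cited documents. As a rough numerical illustration of that phrase only, the sketch below forms a weighted average of per-document topic vectors. It assumes the mixing weights and the cited documents' topic distributions are already available; the function name `mix_topic_distributions` and the example values are hypothetical and are not taken from the patent.

```python
def mix_topic_distributions(weights, topic_dists):
    """Return the weighted mixture of the cited documents' topic distributions.

    `weights` maps cited document id -> nonnegative mixing weight.
    `topic_dists` maps document id -> list of topic probabilities.
    Both inputs are assumed to be given; this is an illustration only.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one cited document must have positive weight")

    n_topics = len(next(iter(topic_dists.values())))
    mixed = [0.0] * n_topics
    for doc, w in weights.items():
        # Normalize the weight so the mixture is itself a distribution.
        for k, p in enumerate(topic_dists[doc]):
            mixed[k] += (w / total) * p
    return mixed


# Hypothetical two-topic example with one indirectly cited document, D.
example_weights = {"B": 0.25, "C": 0.5, "D": 0.25}
example_topics = {"B": [1.0, 0.0], "C": [0.0, 1.0], "D": [0.5, 0.5]}
mixture = mix_topic_distributions(example_weights, example_topics)
```

Because the weights are normalized and each input vector sums to one, the resulting mixture also sums to one.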
Specification