Knowledge discovery from citation networks
Abstract
In a corpus of scientific articles such as a digital library, documents are connected by citations, and each document plays two different roles in the corpus: as a document itself and as a citation of other documents. A Bernoulli Process Topic (BPT) model is provided which models the corpus at two levels: the document level and the citation level. In the BPT model, each document has two different representations in the latent topic space, one associated with each of its roles. Moreover, the multi-level hierarchical structure of the citation network is captured by a generative process involving a Bernoulli process. The distribution parameters of the BPT model are estimated by a variational approximation approach.
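The generative process described in the abstract can be illustrated with a toy sampler. This is a minimal sketch under stated assumptions, not the patent's formulation: the function `sample_word`, the parameter `stop_prob`, the uniform choice among citations, and the toy corpus are all hypothetical. At each step a Bernoulli trial decides whether the walk stops at the current document or follows one of its citations, so indirectly cited documents can contribute topics.

```python
import random

def sample_word(doc, citations, doc_topics, topic_words, stop_prob, rng):
    """Draw one (source document, topic, word) triple for `doc`.

    Assumption: the walk stops at the current document with probability
    `stop_prob`, otherwise it follows a uniformly chosen citation.
    """
    current = doc
    # Keep walking while the Bernoulli trial says "continue"; a document
    # without citations forces a stop. With stop_prob == 0 and a citation
    # cycle this loop would never end, so use stop_prob > 0 on cyclic graphs.
    while citations.get(current) and rng.random() >= stop_prob:
        current = rng.choice(citations[current])
    # Draw a topic from the reached document's mixture, then a word from it.
    topics = list(doc_topics[current])
    topic = rng.choices(topics, weights=[doc_topics[current][t] for t in topics])[0]
    words = list(topic_words[topic])
    word = rng.choices(words, weights=[topic_words[topic][w] for w in words])[0]
    return current, topic, word

# Hypothetical toy corpus: A cites B, and B cites C.
citations = {"A": ["B"], "B": ["C"], "C": []}
doc_topics = {
    "A": {"ml": 1.0},
    "B": {"ml": 0.5, "db": 0.5},
    "C": {"db": 1.0},
}
topic_words = {
    "ml": {"kernel": 0.5, "bayes": 0.5},
    "db": {"index": 0.5, "query": 0.5},
}

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(sample_word("A", citations, doc_topics, topic_words, 0.5, rng))
```

In this sketch a word generated for document A may come from a "db" topic even though A's own mixture contains only "ml", because the walk can reach B or C through the citation chain.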
20 Claims
1. A method of modeling a set of documents within an automated computer system, each document within the set of documents comprising content, and at least a portion of the documents being linked to other documents through citations, comprising:
automatically defining a Bernoulli Process Topic model with the automated computer system, for the set of documents representing each document as a mixture over latent topics, dependent on (1) an intrinsic content of each respective document, and (2) a content of other documents related to the respective document through a multi-level citation structure of direct and indirect linkages to the other documents;
for each respective document, automatically determining with the automated computer system, a first set of latent topic characteristics based on an intrinsic content of the respective document;
for each document, automatically determining with the automated computer system, a second set of latent topic characteristics based on a respective content of the other documents which are directly and indirectly linked, the indirectly linked documents contributing transitively to the latent topic characteristics of the respective document;
automatically representing with the automated computer system, a third set of latent topics for the respective document based on a joint probability distribution of at least the first and second sets of latent topic characteristics, Bernoulli Process Topic model; and
outputting, by the automated computer system, at least one document in response to an input, based on at least the third set of latent topics.
Dependent claims: 2, 3, 4, 5.
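The steps of claim 1 can be illustrated with a deterministic toy computation. This is a simplified sketch, not the claimed method: it replaces the joint probability distribution and its estimation with a plain weighted blend. The names `combined_topics`, `retrieve`, `stop_prob`, and `levels`, as well as the toy data, are hypothetical. The intrinsic mixtures stand in for the first set of characteristics, the mass propagated through direct and indirect citations for the second, and the blended mixture for the third.

```python
def combined_topics(citations, intrinsic, stop_prob, levels):
    """Blend each document's own topic mixture with those of cited documents.

    Assumption: influence decays along citation chains like a walk that
    stops at each document with probability `stop_prob`.
    """
    combined = {}
    for doc in intrinsic:
        weight = {}            # document -> probability the walk stops there
        frontier = {doc: 1.0}  # probability mass that is still walking
        for _ in range(levels + 1):
            next_frontier = {}
            for node, mass in frontier.items():
                cited = citations.get(node, [])
                # A document without citations absorbs all remaining mass.
                stay = mass if not cited else mass * stop_prob
                weight[node] = weight.get(node, 0.0) + stay
                for target in cited:
                    share = (mass - stay) / len(cited)
                    next_frontier[target] = next_frontier.get(target, 0.0) + share
            frontier = next_frontier
        # Mass still walking after `levels` hops is dropped; renormalize.
        total = sum(weight.values())
        mix = {}
        for node, w in weight.items():
            for topic, p in intrinsic[node].items():
                mix[topic] = mix.get(topic, 0.0) + (w / total) * p
        combined[doc] = mix
    return combined

def retrieve(query, combined, k=1):
    """Return the k documents whose blended mixture best matches the query."""
    def score(doc):
        return sum(query.get(t, 0.0) * p for t, p in combined[doc].items())
    return sorted(combined, key=score, reverse=True)[:k]

# Hypothetical toy corpus: A cites B, and B cites C.
citations = {"A": ["B"], "B": ["C"], "C": []}
intrinsic = {
    "A": {"ml": 1.0},
    "B": {"ml": 0.5, "db": 0.5},
    "C": {"db": 1.0},
}
combined = combined_topics(citations, intrinsic, stop_prob=0.5, levels=2)

if __name__ == "__main__":
    print(combined["A"])                      # {'ml': 0.625, 'db': 0.375}
    print(retrieve({"db": 1.0}, combined))    # ['C']
```

Document A's own content is purely "ml", yet its blended mixture gives "db" a weight of 0.375, contributed by B directly and by C transitively.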
6. A method for characterizing a corpus of documents, each having one or more semantically significant citation linkages between respective documents, comprising:
identifying, using at least one automated processor, a multilevel hierarchy of directly cited documents and indirectly cited documents;
for each respective document, determining using the at least one automated processor, latent topic characteristics based on an intrinsic semantic content of the respective document, semantic content associated with directly cited documents, and semantic content associated with indirectly cited documents, wherein a semantic content significance of a citation has a transitive property;
representing latent topics for documents within the corpus using the at least one automated processor, based on a joint probability distribution of the latent topic characteristics; and
storing, in a memory device, by the at least one automated processor, the represented set of latent topics.
Dependent claims: 7, 8, 9, 10.
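The first step of claim 6, identifying a multilevel hierarchy of directly and indirectly cited documents, can be illustrated with a breadth-first traversal. This is a minimal sketch assuming the citation network is given as an adjacency mapping; the function `citation_levels` and the toy graph are hypothetical and are not taken from the patent.

```python
from collections import deque

def citation_levels(citations, root):
    """Group documents reachable from `root` by citation depth.

    Level 1 holds directly cited documents; levels 2 and above hold
    documents cited indirectly through at least one other document.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for target in citations.get(node, []):
            # The first visit gives the shortest citation chain to `target`.
            if target not in depth:
                depth[target] = depth[node] + 1
                queue.append(target)
    levels = {}
    for doc, d in depth.items():
        if d > 0:
            levels.setdefault(d, set()).add(doc)
    return levels

# Hypothetical toy graph: A cites B and C, both cite D, and D cites E.
citations = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}

if __name__ == "__main__":
    print(citation_levels(citations, "A"))
```

A document reachable by several chains is assigned to the shallowest level, which is one possible convention among several.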
11. An automated method for determining a set of latent topics of a Bayesian network of multilevel hierarchically related documents having direct and indirect references associated with content relationships, comprising:
for each respective document, automatically determining using at least one automated processor, a set of latent topic characteristics based on an intrinsic content of the respective document and a set of latent topic characteristics based on a respective content of other documents which are directly referenced and indirectly referenced through at least one other document to the respective document;
automatically representing by the at least one automated processor, a set of latent topics for each respective document based on a joint probability distribution of at least the latent topic characteristics based on the intrinsic content and the respective content of other documents which are directly referenced and indirectly referenced through at least one other document to the respective document, dependent on the identified network and a random process; and
automatically communicating, using the at least one automated processor, information dependent on the represented set of latent topics for at least one respective document.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20.
Specification