KNOWLEDGE DISCOVERY FROM CITATION NETWORKS
Abstract
In a corpus of scientific articles such as a digital library, documents are connected by citations, and each document plays two different roles in the corpus: as a document in its own right and as a citation of other documents. A Bernoulli Process Topic (BPT) model is provided which models the corpus at two levels: the document level and the citation level. In the BPT model, each document has two different representations in the latent topic space, one for each of its roles. Moreover, the multi-level hierarchical structure of the citation network is captured by a generative process involving a Bernoulli process. The distribution parameters of the BPT model are estimated by a variational approximation approach.
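The idea of a Bernoulli process over a multi-level citation structure can be illustrated with a minimal sketch. It is not the patented procedure: it assumes a stop probability `beta` at each step, a uniform choice among direct citations, and a hypothetical depth limit `max_depth`, none of which are stated in the abstract. The names `multilevel_weights`, `mix_topics`, `cites`, and `citation_topics` are illustrative only.

```python
def multilevel_weights(cites, beta, max_depth):
    """For each start document, return how much weight each directly or
    indirectly cited document receives (assumed process: at every step,
    stop with probability beta, else follow one citation uniformly)."""
    weights = {}
    for start in cites:
        reach = {start: 1.0}  # probability mass currently sitting at each document
        total = {}            # mass that has stopped at each document
        for depth in range(max_depth + 1):
            nxt = {}
            for doc, mass in reach.items():
                out = cites.get(doc, [])
                can_follow = bool(out) and depth < max_depth
                # Stop with probability beta, or with certainty at a dead end
                stop = mass * beta if can_follow else mass
                total[doc] = total.get(doc, 0.0) + stop
                if can_follow:
                    share = (mass - stop) / len(out)
                    for cited in out:
                        nxt[cited] = nxt.get(cited, 0.0) + share
            reach = nxt
            if not reach:
                break
        weights[start] = total
    return weights


def mix_topics(weights, citation_topics):
    """Document-level topic mixture as the weighted sum of the
    citation-level mixtures of every document the process can stop at."""
    num_topics = len(next(iter(citation_topics.values())))
    mixed = {}
    for doc, reached in weights.items():
        vec = [0.0] * num_topics
        for target, mass in reached.items():
            for i, p in enumerate(citation_topics[target]):
                vec[i] += mass * p
        mixed[doc] = vec
    return mixed


# Toy chain: A cites B, B cites C (hypothetical data)
cites = {"A": ["B"], "B": ["C"], "C": []}
citation_topics = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.0, 1.0]}
weights = multilevel_weights(cites, beta=0.5, max_depth=2)
document_topics = mix_topics(weights, citation_topics)
```

In this toy chain, document A keeps half of the weight for itself and passes a quarter each to B (direct) and C (indirect), so the indirectly cited document still contributes to A's mixture.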
20 Claims
1. A method of modeling a set of documents, each document within the set of documents comprising content, and at least a portion of the documents being linked to other documents through citations, comprising:
defining a Bernoulli Process Topic model for the set of documents, representing each document as a mixture over latent topics, dependent on (1) an intrinsic content of each respective document, and (2) a content of other documents related to the respective document through a multi-level citation structure of direct and indirect linkages to the other documents;
for each respective document, determining a first set of latent topic characteristics based on an intrinsic content of the respective document;
for each document, determining a second set of latent topic characteristics based on a respective content of the other documents which are directly and indirectly linked, the indirectly linked documents contributing transitively to the latent topic characteristics of the respective document;
representing a third set of latent topics for the respective document based on a joint probability distribution of at least the first and second sets of latent topic characteristics, under the Bernoulli Process Topic model; and
outputting at least one document in response to an input, based on at least the third set of latent topics.
Dependent claims: 2, 3, 4, 5
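The final step of claim 1, outputting at least one document in response to an input, might be sketched as follows. This is only an illustration under stated assumptions: each document is assumed to already have a topic mixture, the input is assumed to have been mapped to a topic mixture of the same length, and cosine similarity is a stand-in scoring choice that the claim does not specify. The names `cosine`, `rank_documents`, and `doc_topics` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length topic vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_topics, doc_topics, top_k=1):
    """Return the ids of the top_k documents whose topic mixtures
    are most similar to the query's topic mixture."""
    scored = sorted(
        doc_topics.items(),
        key=lambda item: cosine(query_topics, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]


# Hypothetical topic mixtures over two latent topics
doc_topics = {"A": [0.9, 0.1], "B": [0.2, 0.8]}
best = rank_documents([0.1, 0.9], doc_topics, top_k=1)
```

Any other similarity or probabilistic score over the topic representation could be substituted for `cosine` without changing the overall shape of the step.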
6. A method for characterizing a corpus of documents, each having one or more semantically significant citation linkages between respective documents, comprising:
identifying a multilevel hierarchy of directly cited documents and indirectly cited documents;
for each respective document, determining latent topic characteristics based on an intrinsic semantic content of the respective document, semantic content associated with directly cited documents, and semantic content associated with indirectly cited documents, wherein a semantic content significance of a citation has a transitive property;
representing latent topics for documents within the corpus based on a joint probability distribution of the latent topic characteristics; and
storing, in a memory, the represented set of latent topics.
Dependent claims: 7, 8, 9, 10
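The first step of claim 6, identifying a multilevel hierarchy of directly and indirectly cited documents, can be illustrated with a breadth-first traversal. This is a minimal sketch, assuming the citation network is available as a mapping from each document to the documents it cites; the function name `citation_levels` and the sample data are hypothetical, and the claim itself does not prescribe a traversal method.

```python
from collections import deque


def citation_levels(cites, root):
    """Map each document reachable from root to its citation level:
    1 for directly cited documents, 2 or more for indirectly cited ones."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        doc = queue.popleft()
        for cited in cites.get(doc, []):
            if cited not in level:  # keep the shortest citation path only
                level[cited] = level[doc] + 1
                queue.append(cited)
    del level[root]  # the root is not a citation of itself
    return level


# Hypothetical network: A cites B and C, which both cite D
cites = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
levels = citation_levels(cites, "A")
```

Here B and C sit at level 1 (direct citations of A) and D at level 2 (an indirect citation), which is the kind of hierarchy the later steps draw semantic content from.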
11. A method for determining a set of latent topics of a Bayesian network of multilevel hierarchically related documents having direct and indirect references associated with content relationships, comprising:
for each respective document, determining a set of latent topic characteristics based on an intrinsic content of the respective document and a set of latent topic characteristics based on a respective content of other documents which are directly referenced and indirectly referenced through at least one other document to the respective document; and
representing a set of latent topics for each respective document based on a joint probability distribution of at least the latent topic characteristics based on the intrinsic content and the respective content of other documents which are directly referenced and indirectly referenced through at least one other document to the respective document, dependent on the identified network and a random process.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
Specification