Knowledge discovery from citation networks
First Claim
1. A method for characterizing a corpus of documents each having one or more references, comprising:
identifying a network of multilevel hierarchically related documents having direct references, and indirect references, wherein the references are associated with content relationships;
for each respective document, determining a first set of latent topic characteristics based on an intrinsic content of the respective document;
for each document, determining a second set of latent topic characteristics based on a respective content of other documents which are referenced directly and indirectly through at least one other document to the respective document, the indirectly referenced documents contributing transitively to the latent topic characteristics of the respective document;
representing a set of latent topics for the respective document based on a joint probability distribution of at least the first and second sets of latent topic characteristics, dependent on the identified network, wherein the contributions of at least the second set of latent topic characteristics are determined by an iterative process, wherein the represented set of latent topics is modeled at both a document level and a reference level, by differentiating the two different levels and the multilevel hierarchical network which is captured by a Bernoulli random process; and
storing, in a memory, the represented set of latent topics for the respective document.
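The iterative, transitive contribution of indirectly referenced documents recited in this claim can be illustrated with a small fixed-point computation. The following is only a minimal sketch under stated assumptions, not the patented estimation procedure: the function name `propagate_topics`, the stop probability `beta`, the per-document topic vectors `phi`, and the citation weights `weights` are all hypothetical names introduced for illustration. It assumes each cited document contributes its own topics with probability `beta` and otherwise passes on the topics it inherits from its own references.

```python
def propagate_topics(phi, weights, beta, tol=1e-9, max_iter=1000):
    """Iteratively mix topic vectors along citation links (illustrative only).

    phi:     {doc: [topic proportions]}, assumed given for every document
    weights: {doc: {cited_doc: weight}}, weights of each doc summing to 1
    beta:    assumed probability of stopping at a cited document
    Returns  {doc: [topic proportions inherited through direct and
             indirect references]}.
    """
    mixed = {d: list(v) for d, v in phi.items()}
    for _ in range(max_iter):
        change = 0.0
        updated = {}
        for d, vec in phi.items():
            cited = weights.get(d)
            if not cited:
                # A document without references keeps its own topics.
                updated[d] = list(vec)
                continue
            new = [0.0] * len(vec)
            for c, w in cited.items():
                for k in range(len(vec)):
                    # Direct share from c plus the share c inherits in turn.
                    new[k] += w * (beta * phi[c][k] + (1.0 - beta) * mixed[c][k])
            change = max(change, max(abs(a - b) for a, b in zip(new, mixed[d])))
            updated[d] = new
        mixed = updated
        if change < tol:  # fixed point reached
            break
    return mixed


phi = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 0.0]}
weights = {"A": {"B": 1.0}, "B": {"C": 1.0}}
inherited = propagate_topics(phi, weights, beta=0.5)
```

In this toy chain A cites B and B cites C, so the vector computed for A blends B's topics (direct reference) with C's topics (indirect reference through B). The sketch covers only the second set of characteristics; combining it with the intrinsic content in a joint distribution is outside its scope.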
Abstract
In a corpus of scientific articles such as a digital library, documents are connected by citations, and each document plays two different roles in the corpus: as a document in its own right and as a citation of other documents. A Bernoulli Process Topic (BPT) model is provided which models the corpus at two levels: the document level and the citation level. In the BPT model, each document has two different representations in the latent topic space, one for each of its roles. Moreover, the multi-level hierarchical structure of the citation network is captured by a generative process involving a Bernoulli process. The distribution parameters of the BPT model are estimated by a variational approximation approach.
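One way to picture how a Bernoulli process could capture the multi-level citation structure is a random walk that, at each cited document, flips a coin to decide whether to stop or to descend one more level. The sketch below is an assumption made for illustration, not the model's actual generative process: the function name `sample_topic_source`, the stop probability `beta`, and the uniform choice among citations are all hypothetical (a learned citation distribution would normally replace the uniform choice).

```python
import random


def sample_topic_source(doc, citations, beta, rng):
    """Return the document whose citation-level topics would be drawn on.

    doc:       starting document identifier
    citations: {doc: [documents it cites]}
    beta:      assumed success probability of the Bernoulli stop trial
    rng:       a random.Random instance, passed in for reproducibility
    """
    current = doc
    while citations.get(current):
        # Move one level down the citation hierarchy (uniform choice assumed).
        current = rng.choice(citations[current])
        # Bernoulli trial: stop at this level with probability beta.
        if rng.random() < beta:
            break
    # A document with no citations ends the walk by itself.
    return current


citations = {"A": ["B"], "B": ["C"], "C": []}
rng = random.Random(0)
always_stop = sample_topic_source("A", citations, beta=1.0, rng=rng)
never_stop = sample_topic_source("A", citations, beta=0.0, rng=rng)
```

With `beta=1.0` the walk always stops at a directly cited document, and with `beta=0.0` it always continues to the deepest reachable document; intermediate values spread influence across levels.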
68 Citations
22 Claims
1. A method for characterizing a corpus of documents each having one or more references (independent claim, recited in full under First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A method for characterizing a corpus of documents each having one or more citation linkages, comprising:
identifying a multilevel hierarchy of linked documents having direct references, and indirect references, wherein the citation linkages have semantic significance;
for each respective document, determining latent topic characteristics based on an intrinsic semantic content of the respective document, semantic content associated with directly cited documents, and semantic content associated with documents referenced by directly cited documents, wherein a semantic content significance of a citation has a transitive property;
representing latent topics for documents within the corpus based on a joint probability distribution of the latent topic characteristics, wherein the latent topics are modeled at both a document level and a citation level, and distinctions in the multilevel hierarchical network are captured by a Bernoulli random process; and
storing, in a memory, the represented set of latent topics.
Dependent claims: 15, 16, 17, 18, 19, 20, 21, 22
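The first step of claim 14, identifying direct and indirect references in a multilevel hierarchy, can be sketched as a breadth-first traversal of the citation graph. This is a minimal illustration under assumptions: the function name `reference_levels` and the dictionary-of-lists graph representation are hypothetical and are not taken from the specification.

```python
from collections import deque


def reference_levels(doc, citations):
    """Map every document reachable from `doc` to its citation depth.

    Depth 1 means a direct reference; depth 2 or more means an indirect
    reference reached through at least one other document.
    """
    levels = {}
    queue = deque((c, 1) for c in citations.get(doc, ()))
    while queue:
        node, depth = queue.popleft()
        # Skip the starting document and anything already seen at a
        # shallower or equal depth (this also guards against cycles).
        if node == doc or node in levels:
            continue
        levels[node] = depth
        queue.extend((c, depth + 1) for c in citations.get(node, ()))
    return levels


citations = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
levels = reference_levels("A", citations)
```

Here B and C are direct references of A, while D and E are indirect references at depths 2 and 3. Breadth-first order ensures each document is recorded at its shallowest depth.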
Specification