INTEGRATING AND EXTRACTING TOPICS FROM CONTENT OF HETEROGENEOUS SOURCES
Abstract
Examples relate to integrating and extracting topics from content of heterogeneous sources. Observed words are identified in documents, which are received from the heterogeneous sources. Next, document metadata and source metadata are obtained from the heterogeneous sources. The document metadata is used to calculate word topic probabilities for the observed words, and the source metadata is used to calculate source topic probabilities for the observed words. A latent topic is then determined for one of the documents based on the observed words, the word topic probabilities, and the source topic probabilities.
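As a rough illustration of the final step, the sketch below scores each candidate topic for one document by combining per-word topic probabilities with a per-source topic probability, then picks the highest-scoring topic. The function name, the toy probability tables, the probability floor, and the log-product scoring rule are all assumptions made for illustration; the abstract does not state how the probabilities are combined, so this should not be read as the claimed method.

```python
import math


def most_likely_topic(words, word_topic_prob, source_topic_prob, source):
    """Return (best_topic, scores) for one document.

    Assumed scoring rule (not taken from the source text):
        score(topic) = log P(topic | source) + sum over words of log P(word | topic)
    """
    floor = 1e-9  # hypothetical floor for words missing from a topic's table
    scores = {}
    for topic, p_source in source_topic_prob[source].items():
        score = math.log(max(p_source, floor))
        for word in words:
            p_word = word_topic_prob[topic].get(word, floor)
            score += math.log(max(p_word, floor))
        scores[topic] = score
    best = max(scores, key=scores.get)
    return best, scores


# Toy, made-up probabilities standing in for the calculated values.
word_topic_prob = {
    "sports": {"match": 0.30, "goal": 0.25, "market": 0.01},
    "finance": {"match": 0.02, "goal": 0.03, "market": 0.40},
}
source_topic_prob = {
    "news_site": {"sports": 0.5, "finance": 0.5},
    "trading_blog": {"sports": 0.1, "finance": 0.9},
}

topic, scores = most_likely_topic(
    ["match", "goal"], word_topic_prob, source_topic_prob, "news_site"
)
```

Working in log space avoids numeric underflow when a document contains many observed words; the source-dependent term is what lets the same words lean toward different topics depending on where the document came from.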
15 Claims
1. A system for integrating and extracting topics from content of heterogeneous sources, the system comprising:
- a processor to:
  - identify a plurality of observed words in documents that are received from the heterogeneous sources;
  - obtain document metadata and source metadata from the heterogeneous sources;
  - use the document metadata to calculate a plurality of word topic probabilities for the plurality of observed words;
  - use the source metadata to calculate a plurality of source topic probabilities for the plurality of observed words; and
  - determine a latent topic for one of the documents based on the plurality of observed words, the plurality of word topic probabilities, and the plurality of source topic probabilities.
Dependent claims: 2, 3, 4, 5, 6
7. A method, implemented at least in part by a computing device, for integrating and extracting topics from content of heterogeneous sources, the method comprising:
- identifying a plurality of observed words in documents that are received from the heterogeneous sources;
- presenting document metadata and source metadata from the heterogeneous sources;
- using the document metadata to calculate a plurality of word topic probabilities for the plurality of observed words;
- using the source metadata to calculate a plurality of source topic probabilities for the plurality of observed words; and
- using a Discriminative Dirichlet Allocation (DDA) modeling technique to determine a latent topic for one of documents based on the plurality of observed words, the plurality of word topic probabilities, and the plurality of source topic probabilities.
Dependent claims: 8, 9, 10, 11
12. A non-transitory machine-readable storage medium encoded with instructions executable by a processor for integrating and extracting topics from content of heterogeneous sources, the machine-readable storage medium comprising instructions to:
- identify a plurality of observed words in documents that are received from the heterogeneous sources;
- obtain document metadata and source metadata from the heterogeneous sources;
- use the document metadata to calculate a plurality of word topic probabilities for the plurality of observed words;
- use a global vocabulary and a global Dirichlet prior parameter to determine a plurality of global word topic probabilities;
- use the source metadata to calculate a plurality of source topic probabilities for the plurality of observed words; and
- determine a latent topic for one of the documents based on the plurality of observed words, the plurality of word topic probabilities, the plurality of global word topic probabilities, and the plurality of source topic probabilities.
Dependent claims: 13, 14, 15
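
To make the "global vocabulary and global Dirichlet prior parameter" element of claim 12 more concrete, the sketch below computes smoothed word-given-topic probabilities over a shared vocabulary. It uses the standard posterior-mean estimate for a multinomial with a symmetric Dirichlet prior, (count + beta) / (total + beta * V); the choice of that estimator, the function name, the parameter name `beta`, and the toy counts are assumptions for illustration and are not taken from the claim text.

```python
from collections import Counter


def global_word_topic_probs(topic_word_counts, vocabulary, beta):
    """Dirichlet-smoothed P(word | topic) over a shared (global) vocabulary.

    Assumption: a symmetric Dirichlet prior with parameter `beta`, giving
        P(word | topic) = (count(word, topic) + beta) / (total(topic) + beta * V)
    where V is the size of the global vocabulary.
    """
    vocab = sorted(vocabulary)
    probs = {}
    for topic, counts in topic_word_counts.items():
        total = sum(counts[w] for w in vocab) + beta * len(vocab)
        probs[topic] = {w: (counts[w] + beta) / total for w in vocab}
    return probs


# Toy, made-up counts of words assigned to each topic across all sources.
counts = {
    "sports": Counter({"match": 3, "goal": 1}),
    "finance": Counter({"market": 4}),
}
vocabulary = {"match", "goal", "market"}

probs = global_word_topic_probs(counts, vocabulary, beta=0.5)
```

Because every word in the global vocabulary receives the prior mass `beta`, a word that one source never uses for a topic still has a nonzero probability, which is one plausible reason to share a vocabulary across heterogeneous sources.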
Specification