Integrating and extracting topics from content of heterogeneous sources

US 9,176,969 B2
Filed: 08/29/2013
Issued: 11/03/2015
Est. Priority Date: 08/29/2013
Status: Active Grant

Abstract

Examples relate to integrating and extracting topics from content of heterogeneous sources. Observed words are identified in documents, which are received from the heterogeneous sources. Next, document metadata and source metadata are obtained from the heterogeneous sources. The document metadata is used to calculate word topic probabilities for the observed words, and the source metadata is used to calculate source topic probabilities for the observed words. A latent topic is then determined for one of the documents based on the observed words, the word topic probabilities, and the source topic probabilities.
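
The first two steps of the abstract (identifying observed words and obtaining document and source metadata) can be illustrated with a minimal sketch. The `Document` class, the regex tokenizer, and the sample texts are hypothetical and are not taken from the patent; the point is only that each word stays attached to the source it came from.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    source: str                                    # source metadata, e.g. "news"
    metadata: dict = field(default_factory=dict)   # document metadata

def observed_words(doc):
    """Lowercase alphabetic tokens; a stand-in for real tokenization."""
    return re.findall(r"[a-z]+", doc.text.lower())

def counts_by_source(docs):
    """Per-source word counts, keeping source identity alongside the words."""
    counts = {}
    for doc in docs:
        counts.setdefault(doc.source, Counter()).update(observed_words(doc))
    return counts

# Hypothetical sample documents from two heterogeneous sources.
docs = [
    Document("Stocks rally as markets open", "news", {"date": "2013-08-29"}),
    Document("my phone battery died again", "social", {"author": "user1"}),
    Document("Markets fall on weak data", "news", {"date": "2013-08-30"}),
]
by_source = counts_by_source(docs)
```

Keeping counts keyed by source, rather than pooling all text, is what later allows source-specific statistics to be computed.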

8 Claims

1. A system for integrating and extracting topics from content of heterogeneous sources, the system comprising:
- a processor to:
  
  identify a plurality of observed words in documents that are received from the heterogeneous sources;
  
  obtain document metadata and source metadata from the heterogeneous sources;
  
  use the document metadata to calculate a plurality of word topic probabilities for the plurality of observed words;
  
  use the source metadata to calculate a plurality of source topic probabilities for the plurality of observed words; and
  
  determine a latent topic for one of the documents based on the plurality of observed words, the plurality of word topic probabilities, and the plurality of source topic probabilities, wherein the latent topic is determined using a Discriminative Dirichlet Allocation (DDA) modeling technique comprising:
  
  in response to determining that a number of occurrences of related observed words assigned to the latent topic has reached a dynamic threshold, adjusting a word topic probability based on pre-determined user-defined features; and
  
  adjusting the word topic probability of an observed word based on a source topic probability of the source topic probabilities associated with the observed word, wherein the adjusting the word topic probability of the observed word comprises using Gibbs sampling to apply a bicriterion that maximizes the plurality of word topic probabilities and uses the dynamic threshold to monitor the number of occurrences of the related observed words.
- 2. The system of claim 1, wherein the processor is further to use a global vocabulary and a global Dirichlet prior parameter to determine a plurality of global word topic probabilities, wherein the latent topic is further based on the plurality of global word topic probabilities.
- 3. The system of claim 1, wherein the heterogeneous sources comprise a news source, a blog source, a social media source, a document repository source, an online retailer source, an email source, and a discussion forum source.
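
Claim 1 names Gibbs sampling as the inference step, but this record does not spell out DDA's bicriterion or its dynamic threshold. As a rough point of reference only, the sketch below runs collapsed Gibbs sampling for plain latent Dirichlet allocation (not DDA) and then aggregates topic counts by source. The function names, hyperparameters `alpha` and `beta`, and the toy documents are all assumptions.

```python
import random

def gibbs_lda(docs, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for plain LDA (an assumed stand-in, not DDA).

    docs: list of token lists. Returns (assignments, doc_topic, topic_word).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                       # n_dk
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # n_kw
    topic_total = [0] * num_topics                                     # n_k
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return assignments, doc_topic, topic_word

def source_topic_proportions(doc_topic, sources, num_topics):
    """Aggregate per-document topic counts by source and normalize."""
    totals = {}
    for counts, src in zip(doc_topic, sources):
        acc = totals.setdefault(src, [0] * num_topics)
        for t, c in enumerate(counts):
            acc[t] += c
    return {s: [c / sum(acc) for c in acc] for s, acc in totals.items()}

# Hypothetical toy corpus: two news documents and one social media post.
docs = [["stocks", "markets", "rally"],
        ["markets", "stocks", "fall"],
        ["phone", "battery", "phone"]]
sources = ["news", "news", "social"]
z, doc_topic, topic_word = gibbs_lda(docs, num_topics=2)
props = source_topic_proportions(doc_topic, sources, 2)
```

The per-source proportions here are computed after sampling; in the claimed technique the source topic probabilities feed back into the word topic probabilities during sampling, which this sketch does not attempt to reproduce.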

4. A method, implemented at least in part by a computing device, for integrating and extracting topics from content of heterogeneous sources, the method comprising:
- identifying, by using the computing device, a plurality of observed words in documents that are received from the heterogeneous sources;
  
  preserving document metadata and source metadata from the heterogeneous sources;
  
  using the document metadata to calculate a plurality of word topic probabilities for the plurality of observed words;
  
  using the source metadata to calculate a plurality of source topic probabilities for the plurality of observed words; and
  
  using a Discriminative Dirichlet Allocation (DDA) modeling technique to determine a latent topic for one of documents based on the plurality of observed words, the plurality of word topic probabilities, and the plurality of source topic probabilities, wherein the DDA modeling technique comprises:
  
  in response to determining that a number of occurrences of related observed words assigned to the latent topic has reached a dynamic threshold, adjusting a word topic probability based on pre-determined user-defined features; and
  
  adjusting the word topic probability of an observed word based on a source topic probability of the source topic probabilities associated with the observed word, wherein the adjusting the word topic probability of the observed word comprises using Gibbs sampling to apply a bicriterion that maximizes the plurality of word topic probabilities and uses the dynamic threshold to monitor the number of occurrences of the related observed words.
- 5. The method of claim 4, further comprising using a global vocabulary and a global Dirichlet prior parameter to determine a plurality of global word topic probabilities, wherein the latent topic is further based on the plurality of global word topic probabilities.
- 6. The method of claim 4, wherein the heterogeneous sources comprise a news source, a blog source, a social media source, a document repository source, an online retailer source, an email source, and a discussion forum source.
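
Claims 2 and 5 refer to a global vocabulary and a global Dirichlet prior parameter. One common way to turn topic-word counts into probabilities under a symmetric Dirichlet prior is the smoothed estimate (n_kw + beta) / (n_k + V * beta), where V is the size of the shared vocabulary. The sketch below assumes that form; the function name, the `beta` value, and the sample counts are hypothetical, and the record does not say this is the exact computation used.

```python
from collections import Counter

def global_word_topic_probs(topic_word_counts, beta=0.01):
    """Dirichlet-smoothed p(word | topic) over a vocabulary shared by all sources.

    topic_word_counts: one Counter per topic, mapping word -> count.
    """
    # The global vocabulary is the union of words seen under any topic.
    vocab = sorted(set().union(*topic_word_counts))
    V = len(vocab)
    probs = []
    for counts in topic_word_counts:
        total = sum(counts.values())
        probs.append({w: (counts[w] + beta) / (total + V * beta) for w in vocab})
    return probs

# Hypothetical counts for two topics.
counts = [Counter({"markets": 3, "stocks": 1}), Counter({"phone": 2})]
phi = global_word_topic_probs(counts, beta=1.0)
```

Because every word in the global vocabulary receives the pseudo-count `beta`, a word that a topic never produced in one source still gets a small nonzero probability.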

7. A non-transitory machine-readable storage medium encoded with instructions executable by a processor for integrating and extracting topics from content of heterogeneous sources, the machine-readable storage medium comprising instructions to:
- identify a plurality of observed words in documents that are received from the heterogeneous sources;
  
  obtain document metadata and source metadata from the heterogeneous sources;
  
  use the document metadata to calculate a plurality of word topic probabilities for the plurality of observed words;
  
  use a global vocabulary and a global Dirichlet prior parameter to determine a plurality of global word topic probabilities;
  
  use the source metadata to calculate a plurality of source topic probabilities for the plurality of observed words; and
  
  determine a latent topic for one of the documents based on the plurality of observed words, the plurality of word topic probabilities, the plurality of global word topic probabilities, and the plurality of source topic probabilities, wherein the latent topic is determined using a Discriminative Dirichlet Allocation (DDA) modeling technique that comprises:
  
  in response to determining that a number of occurrences of related observed words assigned to the latent topic has reached a dynamic threshold, adjusting word topic probability based on pre-determined user-defined features; and
  
  adjusting the word topic probability of an observed word based on a source topic probability of the source topic probabilities associated with the observed word, wherein the adjusting the word topic probability of the observed word comprises using Gibbs sampling to apply a bicriterion that maximizes the plurality of word topic probabilities and uses the dynamic threshold to monitor the number of occurrences of the related observed words.
- 8. The non-transitory machine-readable storage medium of claim 7, wherein the heterogeneous sources comprise a news source, a blog source, a social media source, a document repository source, an online retailer source, an email source, and a discussion forum source.

Current Assignee: Hewlett Packard Enterprise Development LP (Hewlett-Packard Enterprise Company)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Asur, Sitaram; Ghosh, Rumi
Primary Examiner(s): TRUONG, CAM Y T
Application Number: US14/014,122
Publication Number: US 20150066904A1
Time in Patent Office: 796 Days
Field of Search: 707/722
US Class Current: 1/1
CPC Class Codes:
G06F 16/14 Details of searching files ...
G06F 16/951 Indexing; Web crawling tech...
