System, method and apparatus for increasing speed of hierarchical latent Dirichlet allocation model
First Claim
1. A data processing method, comprising:
- sending, by a master node, global initial statistical information to a plurality of slave nodes, wherein the global initial statistical information comprises:
text subset information divided in advance according to a text set, preset initial hyper-parameter information of a hierarchical Latent Dirichlet Allocation model, a pre-established nested Chinese restaurant process prior of the text set, hierarchical topic path information of a document, document-topic count matrix information, and topic-word count matrix information;
receiving local statistical information from each of the plurality of slave nodes;
merging the received local statistical information of each slave node, to obtain new global statistical information, wherein the local statistical information comprises:
a document-topic count matrix, a topic-word count matrix and a document hierarchical topic path of each slave node, and the new global statistical information comprises:
global text-topic count matrix information, topic-word count matrix information, topic-word count matrix information of each slave node, and a global document hierarchical topic path;
after judging that a Gibbs sampling performed by a slave node has ended, calculating a probability distribution between the document and a topic and a probability distribution between the topic and a word according to the new global statistical information, wherein the Gibbs sampling is used to allocate a topic for each word of each document, and allocate a hierarchical topic path for each document;
according to the probability distributions obtained through calculation, establishing a likelihood function of the text set, and maximizing the likelihood function, to obtain a new hierarchical Latent Dirichlet Allocation model hyper-parameter; and
after judging that an iteration of solving for a hierarchical Latent Dirichlet Allocation model hyper-parameter has converged, and according to the new hierarchical Latent Dirichlet Allocation model hyper-parameter, calculating and outputting the probability distribution between the document and topic and the probability distribution between the topic and word.
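The merging step recited above can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the function name, the dict layout, and the use of NumPy are all ours. Topic-word counts from different slaves describe the same global topics, so they merge by element-wise summation; document-topic counts and hierarchical topic paths are partitioned by document, so they merge by concatenation in document order.

```python
import numpy as np

def merge_local_statistics(local_stats):
    """Merge per-slave statistics into new global statistics (sketch).

    Each element of `local_stats` is assumed to be a dict holding that
    slave's document-topic count matrix, topic-word count matrix, and the
    hierarchical topic paths of its documents.
    """
    # Topic-word counts refer to shared global topics: element-wise sum.
    global_topic_word = sum(s["topic_word"] for s in local_stats)
    # Documents are partitioned across slaves: concatenate in order.
    global_doc_topic = np.vstack([s["doc_topic"] for s in local_stats])
    global_paths = [p for s in local_stats for p in s["paths"]]
    return {
        "doc_topic": global_doc_topic,
        "topic_word": global_topic_word,
        # Also retain each slave's own topic-word matrix, mirroring the
        # claim's "topic-word count matrix information of each slave node".
        "topic_word_per_slave": [s["topic_word"] for s in local_stats],
        "paths": global_paths,
    }
```

Keeping the per-slave topic-word matrices alongside the global sum lets the master later subtract one slave's contribution when redistributing updated global statistics.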
Abstract
Embodiments of the present invention disclose a data processing method including: sending global initial statistical information to each slave node; merging received local statistical information of each slave node to obtain new global statistical information; if Gibbs sampling performed by a slave node has ended, calculating a probability distribution between a document and a topic and a probability distribution between the topic and a word according to the new global statistical information; according to the probability distributions obtained through calculation, establishing a likelihood function of a text set and maximizing the likelihood function to obtain a new hLDA hyper-parameter; and if iteration of solving for an hLDA hyper-parameter has converged, calculating and outputting, according to the new hLDA hyper-parameter, the probability distribution between the document and the topic and the probability distribution between the topic and the word.
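Taken together, the abstract describes an outer hyper-parameter loop wrapped around distributed Gibbs sweeps. A minimal control-flow sketch of that protocol follows; every method name here is an illustrative stand-in, not terminology from the patent.

```python
def run_hlda(master, slaves, gibbs_iters, hyper_iters):
    """Sketch of the master-slave protocol; `master` and `slaves` are
    assumed to expose the illustrative methods called below."""
    global_stats = master.initial_global_statistics()
    theta = phi = None
    for _ in range(hyper_iters):
        for _ in range(gibbs_iters):
            # Each slave resamples topics and paths on its own text subset.
            local = [s.gibbs_sweep(global_stats) for s in slaves]
            global_stats = master.merge(local)
        # Gibbs sampling has ended: compute document-topic and topic-word
        # distributions, then re-fit the hLDA hyper-parameter.
        theta, phi = master.posterior_distributions(global_stats)
        new_hyper = master.maximize_likelihood(theta, phi)
        if master.converged(new_hyper):
            break
        global_stats = master.with_hyperparameter(global_stats, new_hyper)
    return theta, phi
```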
6 Citations
17 Claims
1. A data processing method, comprising the elements set forth in full above under “First Claim”. - View Dependent Claims (2, 3, 4, 5, 6)
7. A data processing method, comprising:
receiving, at a plurality of slave nodes, global initial statistical information sent by a master node, wherein the global initial statistical information comprises:
text subset information divided in advance according to a text set, preset initial hyper-parameter information of a hierarchical Latent Dirichlet Allocation model, a pre-established nested Chinese restaurant process prior of the text set, hierarchical topic path information of a document, document-topic count matrix information, and topic-word count matrix information;
according to a hierarchical topic path of each document, reallocating a topic for each word in each document through Gibbs sampling;
according to the nested Chinese restaurant process prior, and an updated document-topic count matrix and topic-word count matrix, reallocating a hierarchical topic path for each document through Gibbs sampling; and
sending local statistical information to the master node, wherein the local statistical information comprises:
updated document-topic count matrix information, topic-word count matrix information, and hierarchical topic path information of each document of the present slave node. - View Dependent Claims (8, 9, 10)
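One common concrete form of the per-word Gibbs step in hLDA resamples, for each word, a level on the document's current path, with probability proportional to the document-level count (plus `alpha`) times the smoothed topic-word probability. The sketch below assumes symmetric priors `alpha` and `eta` and standard collapsed-Gibbs bookkeeping; the claim does not specify this exact conditional, so treat every name and formula here as an assumption.

```python
import numpy as np

def resample_word_levels(doc_words, path, doc_level_counts,
                         topic_word, topic_totals, alpha, eta, rng):
    """For one document, resample the level (hence the topic on its path)
    assigned to each word — a sketch of the per-word Gibbs step.

    `doc_words` is a list of (word_id, current_level) pairs; all count
    arrays are updated in place, yielding the "updated document-topic
    count matrix and topic-word count matrix" used for path reallocation.
    """
    V = topic_word.shape[1]
    L = len(path)
    new_levels = []
    for w, old_level in doc_words:
        old_topic = path[old_level]
        # Remove the word's current assignment from all counts.
        doc_level_counts[old_level] -= 1
        topic_word[old_topic, w] -= 1
        topic_totals[old_topic] -= 1
        # Full conditional over the levels of the document's path.
        p = np.empty(L)
        for l in range(L):
            t = path[l]
            p[l] = (doc_level_counts[l] + alpha) * \
                   (topic_word[t, w] + eta) / (topic_totals[t] + V * eta)
        l_new = rng.choice(L, p=p / p.sum())
        t_new = path[l_new]
        # Add the new assignment back.
        doc_level_counts[l_new] += 1
        topic_word[t_new, w] += 1
        topic_totals[t_new] += 1
        new_levels.append(l_new)
    return new_levels
```

Because the counts are decremented before and re-incremented after each draw, total counts are conserved across a sweep, which is what makes the later merge at the master a simple sum.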
11. A master node configured as a computer accessible to a data network, the master node comprising:
a sending unit, configured to send global initial statistical information over the data network to a plurality of slave nodes, wherein the global initial statistical information comprises:
text subset information divided in advance according to a text set, preset initial hyper-parameter information of a hierarchical Latent Dirichlet Allocation model, a pre-established nested Chinese restaurant process prior of the text set, hierarchical topic path information of a document, document-topic count matrix information, and topic-word count matrix information;
further configured to, if Gibbs sampling performed by a slave node does not end, send new global statistical information to the slave node; and
configured to, if iteration of solving for a hierarchical Latent Dirichlet Allocation model hyper-parameter does not converge, send the slave node the new global statistical information with the updated hierarchical Latent Dirichlet Allocation model hyper-parameter;
a merging unit, configured to merge local statistical information received from the plurality of slave nodes, to obtain new global statistical information, wherein the local statistical information comprises:
a document-topic count matrix, a topic-word count matrix and a document hierarchical topic path of each slave node, and the new global statistical information comprises:
global text-topic count matrix information, topic-word count matrix information, topic-word count matrix information of each slave node, and a global document hierarchical topic path;
a calculating unit, configured to, after judging that a Gibbs sampling performed by the slave node has ended, calculate a probability distribution between the document and a topic and a probability distribution between the topic and a word according to the new global statistical information;
further configured to, according to the probability distributions obtained through calculation, establish a likelihood function of the text set, and maximize the likelihood function to obtain a new hierarchical Latent Dirichlet Allocation model hyper-parameter; and
configured to, after judging that an iteration of solving for a hierarchical Latent Dirichlet Allocation model hyper-parameter converges, and according to the new hierarchical Latent Dirichlet Allocation model hyper-parameter, calculate and output the probability distribution between the document and topic and the probability distribution between the topic and word. - View Dependent Claims (12, 13)
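The calculating unit's hyper-parameter step — build a likelihood function from the sampled counts, maximize it, and iterate until the solve converges — can be sketched with a symmetric Dirichlet-multinomial likelihood over the document-topic counts and a simple grid-search maximizer. The likelihood form, the grid search, and all names are our assumptions; a production implementation would more likely use a fixed-point or Newton update.

```python
from math import lgamma

def log_likelihood(doc_topic, alpha):
    """Dirichlet-multinomial log-likelihood of document-topic counts under
    a symmetric Dirichlet(alpha) prior — one possible "likelihood function
    of the text set"."""
    K = len(doc_topic[0])   # number of topics
    ll = 0.0
    for counts in doc_topic:
        n = sum(counts)
        ll += lgamma(K * alpha) - lgamma(K * alpha + n)
        ll += sum(lgamma(c + alpha) - lgamma(alpha) for c in counts)
    return ll

def solve_hyperparameter(doc_topic, grid, tol=1e-9, max_iter=20):
    """Maximize the likelihood over candidate alphas, iterating until the
    hyper-parameter stops moving — the convergence the master judges."""
    alpha = grid[0]
    for _ in range(max_iter):
        new_alpha = max(grid, key=lambda a: log_likelihood(doc_topic, a))
        if abs(new_alpha - alpha) < tol:
            break
        alpha = new_alpha
    return alpha
```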
14. A slave node configured as a computer accessible to a data network, the slave node comprising:
an information receiving unit, configured to receive global initial statistical information sent over the data network by a master node, wherein the global initial statistical information comprises:
text subset information divided in advance according to a text set, preset initial hyper-parameter information of a hierarchical Latent Dirichlet Allocation model, a pre-established nested Chinese restaurant process prior of the text set, hierarchical topic path information of a document, document-topic count matrix information, and topic-word count matrix information;
a topic allocating unit, configured to, according to a hierarchical topic path of each document, reallocate a topic for each word in each document through Gibbs sampling;
a path allocating unit, configured to, according to the nested Chinese restaurant process prior, and an updated document-topic count matrix and topic-word count matrix, reallocate a hierarchical topic path for each document through Gibbs sampling; and
an information sending unit, configured to send local statistical information to the master node, wherein the local statistical information comprises:
updated document-topic count matrix information, topic-word count matrix information, and hierarchical topic path information of each document of the present slave node. - View Dependent Claims (15, 16, 17)
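The path allocating unit's nested Chinese restaurant process draw can be sketched as a level-by-level choice between existing child topics (proportional to how many documents already chose them) and a brand-new child (proportional to a concentration parameter `gamma`). The tree encoding, node ids, and names below are illustrative assumptions, not the patent's representation.

```python
import random

def sample_ncrp_path(tree, depth, gamma, rng=random):
    """Draw one root-to-leaf path of length `depth` from a nested Chinese
    restaurant process prior (sketch). `tree` maps a parent node id to a
    dict {child_id: customer_count}; node 0 is taken to be the root."""
    ids = set(tree) | {c for kids in tree.values() for c in kids}
    next_id = max(ids) + 1
    node, path = 0, [0]
    for _ in range(depth - 1):
        children = tree.setdefault(node, {})
        # Existing child k is chosen w.p. count_k / (total + gamma);
        # a new child is opened w.p. gamma / (total + gamma).
        r = rng.random() * (sum(children.values()) + gamma)
        chosen = None
        for child, count in children.items():
            r -= count
            if r < 0:
                chosen = child
                break
        if chosen is None:       # open a new table: a new topic node
            chosen, next_id = next_id, next_id + 1
            children[chosen] = 0
        children[chosen] += 1    # seat this document at the chosen table
        path.append(chosen)
        node = chosen
    return path
```

Seating counts are updated in place, so repeated draws exhibit the rich-get-richer behavior that concentrates documents on popular subtrees while `gamma` controls how often new topic branches open.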
Specification