System, method and apparatus for increasing speed of hierarchical latent Dirichlet allocation model
First Claim
1. A data processing method, comprising:
- sending, by a master node, global initial statistical information to a plurality of slave nodes, wherein the global initial statistical information comprises:
text subset information divided in advance according to a text set, preset initial hyper-parameter information of a hierarchical Latent Dirichlet Allocation model, a pre-established nested Chinese restaurant process prior of the text set, hierarchical topic path information of a document, document-topic count matrix information, and topic-word count matrix information;
receiving local statistical information from each of the plurality of slave nodes;
merging the received local statistical information of each slave node, to obtain new global statistical information, wherein the local statistical information comprises:
a document-topic count matrix, a topic-word count matrix and a document hierarchical topic path of each slave node, and the new global statistical information comprises:
global text-topic count matrix information, topic-word count matrix information, topic-word count matrix information of each slave node, and a global document hierarchical topic path;
after judging that a Gibbs sampling performed by a slave node has ended, calculating a probability distribution between the document and a topic and a probability distribution between the topic and a word according to the new global statistical information, wherein the Gibbs sampling is used to allocate a topic for each word of each document, and allocate a hierarchical topic path for each document;
according to the probability distributions obtained through calculation, establishing a likelihood function of the text set, and maximizing the likelihood function, to obtain a new hierarchical Latent Dirichlet Allocation model hyper-parameter; and
after judging that an iteration of solving for a hierarchical Latent Dirichlet Allocation model hyper-parameter has converged, and according to the new hierarchical Latent Dirichlet Allocation model hyper-parameter, calculating and outputting the probability distribution between the document and topic and the probability distribution between the topic and word.
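The merging step recited above can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the function name, the dict layout, and the use of NumPy are all ours. Topic-word counts from different slaves describe the same global topics, so they merge by element-wise summation; document-topic counts and hierarchical topic paths are partitioned by document, so they merge by concatenation in document order.

```python
import numpy as np

def merge_local_statistics(local_stats):
    """Merge per-slave statistics into new global statistics (sketch).

    Each element of `local_stats` is assumed to be a dict holding that
    slave's document-topic count matrix, topic-word count matrix, and the
    hierarchical topic paths of its documents.
    """
    # Topic-word counts refer to shared global topics: element-wise sum.
    global_topic_word = sum(s["topic_word"] for s in local_stats)
    # Documents are partitioned across slaves: concatenate in order.
    global_doc_topic = np.vstack([s["doc_topic"] for s in local_stats])
    global_paths = [p for s in local_stats for p in s["paths"]]
    return {
        "doc_topic": global_doc_topic,
        "topic_word": global_topic_word,
        # Also retain each slave's own topic-word matrix, mirroring the
        # claim's "topic-word count matrix information of each slave node".
        "topic_word_per_slave": [s["topic_word"] for s in local_stats],
        "paths": global_paths,
    }
```

Keeping the per-slave topic-word matrices alongside the global sum lets the master later subtract one slave's contribution when redistributing updated global statistics.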
Abstract
Embodiments of the present invention disclose a data processing method including: sending global initial statistical information to each slave node; merging received local statistical information of each slave node to obtain new global statistical information; if Gibbs sampling performed by a slave node has ended, calculating a probability distribution between a document and a topic and a probability distribution between the topic and a word according to the new global statistical information; according to the probability distributions obtained through calculation, establishing a likelihood function of a text set and maximizing the likelihood function to obtain a new hLDA hyper-parameter; and if iteration of solving for an hLDA hyper-parameter has converged, calculating and outputting, according to the new hLDA hyper-parameter, the probability distribution between the document and the topic and the probability distribution between the topic and the word.
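Taken together, the abstract describes an outer hyper-parameter loop wrapped around distributed Gibbs sweeps. A minimal control-flow sketch of that protocol follows; every method name here is an illustrative stand-in, not terminology from the patent.

```python
def run_hlda(master, slaves, gibbs_iters, hyper_iters):
    """Sketch of the master-slave protocol; `master` and `slaves` are
    assumed to expose the illustrative methods called below."""
    global_stats = master.initial_global_statistics()
    theta = phi = None
    for _ in range(hyper_iters):
        for _ in range(gibbs_iters):
            # Each slave resamples topics and paths on its own text subset.
            local = [s.gibbs_sweep(global_stats) for s in slaves]
            global_stats = master.merge(local)
        # Gibbs sampling has ended: compute document-topic and topic-word
        # distributions, then re-fit the hLDA hyper-parameter.
        theta, phi = master.posterior_distributions(global_stats)
        new_hyper = master.maximize_likelihood(theta, phi)
        if master.converged(new_hyper):
            break
        global_stats = master.with_hyperparameter(global_stats, new_hyper)
    return theta, phi
```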
6 Citations
17 Claims
1. A data processing method, comprising the elements set forth in full above under “First Claim”. - View Dependent Claims (2, 3, 4, 5, 6)
7. A data processing method, comprising:
receiving, at a plurality of slave nodes, global initial statistical information sent by a master node, wherein the global initial statistical information comprises:
text subset information divided in advance according to a text set, preset initial hyper-parameter information of a hierarchical Latent Dirichlet Allocation model, a pre-established nested Chinese restaurant process prior of the text set, hierarchical topic path information of a document, document-topic count matrix information, and topic-word count matrix information;
according to a hierarchical topic path of each document, reallocating a topic for each word in each document through Gibbs sampling;
according to the nested Chinese restaurant process prior, and an updated document-topic count matrix and topic-word count matrix, reallocating a hierarchical topic path for each document through Gibbs sampling; and
sending local statistical information to the master node, wherein the local statistical information comprises:
updated document-topic count matrix information, topic-word count matrix information, and hierarchical topic path information of each document of the present slave node. - View Dependent Claims (8, 9, 10)
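One common concrete form of the per-word Gibbs step in hLDA resamples, for each word, a level on the document's current path, with probability proportional to the document-level count (plus `alpha`) times the smoothed topic-word probability. The sketch below assumes symmetric priors `alpha` and `eta` and standard collapsed-Gibbs bookkeeping; the claim does not specify this exact conditional, so treat every name and formula here as an assumption.

```python
import numpy as np

def resample_word_levels(doc_words, path, doc_level_counts,
                         topic_word, topic_totals, alpha, eta, rng):
    """For one document, resample the level (hence the topic on its path)
    assigned to each word — a sketch of the per-word Gibbs step.

    `doc_words` is a list of (word_id, current_level) pairs; all count
    arrays are updated in place, yielding the "updated document-topic
    count matrix and topic-word count matrix" used for path reallocation.
    """
    V = topic_word.shape[1]
    L = len(path)
    new_levels = []
    for w, old_level in doc_words:
        old_topic = path[old_level]
        # Remove the word's current assignment from all counts.
        doc_level_counts[old_level] -= 1
        topic_word[old_topic, w] -= 1
        topic_totals[old_topic] -= 1
        # Full conditional over the levels of the document's path.
        p = np.empty(L)
        for l in range(L):
            t = path[l]
            p[l] = (doc_level_counts[l] + alpha) * \
                   (topic_word[t, w] + eta) / (topic_totals[t] + V * eta)
        l_new = rng.choice(L, p=p / p.sum())
        t_new = path[l_new]
        # Add the new assignment back.
        doc_level_counts[l_new] += 1
        topic_word[t_new, w] += 1
        topic_totals[t_new] += 1
        new_levels.append(l_new)
    return new_levels
```

Because the counts are decremented before and re-incremented after each draw, total counts are conserved across a sweep, which is what makes the later merge at the master a simple sum.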
11. A master node configured as a computer accessible to a data network, the master node comprising:
a sending unit, configured to send global initial statistical information over the data network to a plurality of slave nodes, wherein the global initial statistical information comprises:
text subset information divided in advance according to a text set, preset initial hyper-parameter information of a hierarchical Latent Dirichlet Allocation model, a pre-established nested Chinese restaurant process prior of the text set, hierarchical topic path information of a document, document-topic count matrix information, and topic-word count matrix information;
further configured to, if Gibbs sampling performed by a slave node does not end, send new global statistical information to the slave node; and
configured to, if iteration of solving for a hierarchical Latent Dirichlet Allocation model hyper-parameter does not converge, send the slave node the new global statistical information with the updated hierarchical Latent Dirichlet Allocation model hyper-parameter;
a merging unit, configured to merge local statistical information received from the plurality of slave nodes, to obtain new global statistical information, wherein the local statistical information comprises:
a document-topic count matrix, a topic-word count matrix and a document hierarchical topic path of each slave node, and the new global statistical information comprises:
global text-topic count matrix information, topic-word count matrix information, topic-word count matrix information of each slave node, and a global document hierarchical topic path;
a calculating unit, configured to, after judging that a Gibbs sampling performed by the slave node has ended, calculate a probability distribution between the document and a topic and a probability distribution between the topic and a word according to the new global statistical information;
further configured to, according to the probability distributions obtained through calculation, establish a likelihood function of the text set, and maximize the likelihood function to obtain a new hierarchical Latent Dirichlet Allocation model hyper-parameter; and
configured to, after judging that an iteration of solving for a hierarchical Latent Dirichlet Allocation model hyper-parameter converges, and according to the new hierarchical Latent Dirichlet Allocation model hyper-parameter, calculate and output the probability distribution between the document and topic and the probability distribution between the topic and word. - View Dependent Claims (12, 13)
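The calculating unit's hyper-parameter step — build a likelihood function from the sampled counts, maximize it, and iterate until the solve converges — can be sketched with a symmetric Dirichlet-multinomial likelihood over the document-topic counts and a simple grid-search maximizer. The likelihood form, the grid search, and all names are our assumptions; a production implementation would more likely use a fixed-point or Newton update.

```python
from math import lgamma

def log_likelihood(doc_topic, alpha):
    """Dirichlet-multinomial log-likelihood of document-topic counts under
    a symmetric Dirichlet(alpha) prior — one possible "likelihood function
    of the text set"."""
    K = len(doc_topic[0])   # number of topics
    ll = 0.0
    for counts in doc_topic:
        n = sum(counts)
        ll += lgamma(K * alpha) - lgamma(K * alpha + n)
        ll += sum(lgamma(c + alpha) - lgamma(alpha) for c in counts)
    return ll

def solve_hyperparameter(doc_topic, grid, tol=1e-9, max_iter=20):
    """Maximize the likelihood over candidate alphas, iterating until the
    hyper-parameter stops moving — the convergence the master judges."""
    alpha = grid[0]
    for _ in range(max_iter):
        new_alpha = max(grid, key=lambda a: log_likelihood(doc_topic, a))
        if abs(new_alpha - alpha) < tol:
            break
        alpha = new_alpha
    return alpha
```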
14. A slave node configured as a computer accessible to a data network, the slave node comprising:
an information receiving unit, configured to receive global initial statistical information sent over the data network by a master node, wherein the global initial statistical information comprises:
text subset information divided in advance according to a text set, preset initial hyper-parameter information of a hierarchical Latent Dirichlet Allocation model, a pre-established nested Chinese restaurant process prior of the text set, hierarchical topic path information of a document, document-topic count matrix information, and topic-word count matrix information;
a topic allocating unit, configured to, according to a hierarchical topic path of each document, reallocate a topic for each word in each document through Gibbs sampling;
a path allocating unit, configured to, according to the nested Chinese restaurant process prior, and an updated document-topic count matrix and topic-word count matrix, reallocate a hierarchical topic path for each document through Gibbs sampling; and
an information sending unit, configured to send local statistical information to the master node, wherein the local statistical information comprises:
updated document-topic count matrix information, topic-word count matrix information, and hierarchical topic path information of each document of the present slave node. - View Dependent Claims (15, 16, 17)
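The path allocating unit's nested Chinese restaurant process draw can be sketched as a level-by-level choice between existing child topics (proportional to how many documents already chose them) and a brand-new child (proportional to a concentration parameter `gamma`). The tree encoding, node ids, and names below are illustrative assumptions, not the patent's representation.

```python
import random

def sample_ncrp_path(tree, depth, gamma, rng=random):
    """Draw one root-to-leaf path of length `depth` from a nested Chinese
    restaurant process prior (sketch). `tree` maps a parent node id to a
    dict {child_id: customer_count}; node 0 is taken to be the root."""
    ids = set(tree) | {c for kids in tree.values() for c in kids}
    next_id = max(ids) + 1
    node, path = 0, [0]
    for _ in range(depth - 1):
        children = tree.setdefault(node, {})
        # Existing child k is chosen w.p. count_k / (total + gamma);
        # a new child is opened w.p. gamma / (total + gamma).
        r = rng.random() * (sum(children.values()) + gamma)
        chosen = None
        for child, count in children.items():
            r -= count
            if r < 0:
                chosen = child
                break
        if chosen is None:       # open a new table: a new topic node
            chosen, next_id = next_id, next_id + 1
            children[chosen] = 0
        children[chosen] += 1    # seat this document at the chosen table
        path.append(chosen)
        node = chosen
    return path
```

Seating counts are updated in place, so repeated draws exhibit the rich-get-richer behavior that concentrates documents on popular subtrees while `gamma` controls how often new topic branches open.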
Specification